In our research group we often have people creating statistical models that end up in publications but, most of the time, the practical implementation of those models is lacking. I mean, we have a bunch of barely functioning code that is very difficult to use in a reliable way in operations of the breeding programs. I was very keen on continue using one of the models in our research, enough to rewrite and document the model fitting, and then create another package for using the.model in operations.
Unfortunately, neither the data nor the model are mine to give away, so I can’t share them (yet). But I hope these notes will help you in you are in the same boat and need to use your models (or ‘you’ are in fact future me, who tend to forget how or why I wrote code in a specific way).
A basic motivational example
Let’s start with a simple example: linear regression. We want to predict a response
using a predictor
variable, and then we can predict the response for new values of the predictor contained in new_data
with:
my_model <- lm(response ~ predictor, data = my_data) predictions <- predict(my_model, newdat = new_data) # Saving the model object save(my_model, 'model_file.Rda')
The model coefficients needed to predict new values are stored in the my_model
object. If we want to use the model elsewhere, we can save the object as an .Rda
file, in this case model_file.Rda
.
We can later read the model file in, say, a different project and get new predictions using:
load('model_file.Rda') more_predictions <- predict(my_model, newdat = yet_another_new_data)
Near-infrared Spectroscopy
Near-infrared spectroscopy is the stuff of CSI and other crime shows. We measure the reflection at different wavelengths and run a regression analysis using what we want to predict as Y in the model. The number of predictors (wavelengths) is much larger than in the previous example—1,296 for the NIR machine we are using—so it is not unusual to have more predictors than observations. NIR spectra are often trained using pls()
(from the pls
package) with help from functions from the prospectr
package.
I could still use the save/load
approach from the motivational example to store and reuse the model object created with pls
but, instead, I wanted to implement the model, plus some auxiliary functions, in a package to make the functions easier to use in our lab.
I had two issues/struggles/learning opportunities that I needed to sort out to get this package working:
1. How to automatically load the model object when attaching the package?
Normally, datasets and other objects go in the data
folder, where they are made available to the user. Instead, I wanted to make the object internally available. The solution turned out to be quite straightforward: save the model object to a file called sysdata.rda
(either in the R
or data
folders of the package). This file is automatically loaded when we run library(package_name)
. We just need to create that file with something like:
save(my_model, 'sysdata.rda')
2. How to make predict.pls work in the package?
I was struggling to use the predict
function, as in my head it was being provided by the pls
package. However, pls
is only extending the predict
function, which comes with the default R installation but is part of the stats
package. At the end, sorted it out with the following Imports
, Depends
and LazyData
in the DESCRIPTION
file:
Imports: prospectr, stats Depends: pls Encoding: UTF-8 LazyData: Yes
Now it is possible to use predict, just remember to specify the package where it is coming from, as in:
stats::predict(my_model, ncomp = n_components, newdata = spectra, interval = 'confidence')
Nothing groundbreaking, I know, but spent a bit of time sorting out that couple of annoyances before everything fell into place. Right now we are using the models in a much easier and reproducible way.