3.1 Basics of {recipes}
The {recipes} package is designed to replace the stats::model.matrix function that you’re probably familiar with. For example, suppose you fit a model with lm(bill_length_mm ~ species, data = penguins) and print its summary:
##
## Call:
## lm(formula = bill_length_mm ~ species, data = penguins)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7.9338 -2.2049  0.0086  2.0662 12.0951
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)       38.7914     0.2409  161.05   <2e-16 ***
## speciesChinstrap  10.0424     0.4323   23.23   <2e-16 ***
## speciesGentoo      8.7135     0.3595   24.24   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.96 on 339 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7078, Adjusted R-squared: 0.7061
## F-statistic: 410.6 on 2 and 339 DF, p-value: < 2.2e-16
You can see that our species column, which has the values Adelie, Gentoo, and Chinstrap, is automatically dummy-coded for us, with the first level of the factor (Adelie) set as the reference group.
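To make that hidden step visible, the sketch below calls stats::model.matrix directly with the same formula. It uses a small made-up data frame named toy as a stand-in for penguins (the values are hypothetical, chosen only so the example runs without extra packages), so treat it as an illustration rather than a reproduction of the model above.

```r
# Hypothetical stand-in for the penguins data (values are made up)
toy <- data.frame(
  bill_length_mm = c(39.1, 46.5, 47.3, 38.2, 49.0, 45.8),
  species = factor(c("Adelie", "Chinstrap", "Gentoo",
                     "Adelie", "Chinstrap", "Gentoo"))
)

# Build the design matrix that lm() would construct internally
X <- model.matrix(bill_length_mm ~ species, data = toy)

# One intercept column plus one indicator column per non-reference level
colnames(X)

# The first factor level gets no column of its own: it is the reference group
levels(toy$species)[1]
```

Rows for the reference level have zeros in both indicator columns, which is why its mean shows up as the intercept in the summary output.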
The {recipes} package forces you to be a bit more explicit in these decisions, but it also offers a much wider range of modifications you can make to the data. Another difference is that, in the example above, you may not have even realized stats::model.matrix was doing anything for you, because it is called inside the stats::lm modeling code. With {recipes}, you make the modifications to your data first, then conduct your analysis.
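As a rough sketch of what "being explicit" can look like, the code below requests the same dummy coding through {recipes}. It assumes the recipes package is installed and again uses a hypothetical toy data frame in place of penguins. The step is declared first and only carried out when prep() and bake() are called.

```r
library(recipes)

# Hypothetical stand-in for the penguins data (values are made up)
toy <- data.frame(
  bill_length_mm = c(39.1, 46.5, 47.3, 38.2, 49.0, 45.8),
  species = factor(c("Adelie", "Chinstrap", "Gentoo",
                     "Adelie", "Chinstrap", "Gentoo"))
)

# Declare the roles of the variables, then explicitly ask for dummy coding
rec <- recipe(bill_length_mm ~ species, data = toy)
rec <- step_dummy(rec, species)

# prep() estimates what the steps need; bake() applies them.
# new_data = NULL returns the processed training data.
prepped <- prep(rec, training = toy)
baked <- bake(prepped, new_data = NULL)

names(baked)
```

The baked data can then be passed to a modeling function in place of the original data frame; the dummy coding is no longer something the modeling code does behind the scenes.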
The {recipes} package allows you to create a blueprint (or recipe) to apply to a given dataset, without actually applying those operations. We can then use this blueprint iteratively across sets of data (e.g., folds) as well as on new (potentially unseen) data that has the same structure (variables). This process helps avoid data leakage because all operations are carried forward and applied together, and no operations are conducted until explicitly requested.
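The blueprint idea can be sketched as follows, assuming the recipes package is installed. The data frames train and new_obs are hypothetical, and step_normalize is used here only as a convenient example of a step that needs estimates (a mean and a standard deviation). The point is that those estimates come from the training data and are then reused, unchanged, on the new data.

```r
library(recipes)

# Hypothetical training data and new observations (values are made up)
train <- data.frame(
  bill_length_mm = c(39.1, 46.5, 47.3, 38.2, 49.0, 45.8),
  body_mass_g    = c(3750, 3500, 5000, 3800, 3900, 5400)
)
new_obs <- data.frame(
  bill_length_mm = c(40.3, 50.1),
  body_mass_g    = c(3600, 5200)
)

# Define the blueprint; nothing is computed at this point
rec <- recipe(bill_length_mm ~ body_mass_g, data = train)
rec <- step_normalize(rec, all_numeric_predictors())

# Estimate the mean and standard deviation from the training data only
prepped <- prep(rec, training = train)

# Apply the training-based estimates to data the recipe has not seen
baked_new <- bake(prepped, new_data = new_obs)

# Check by hand: new values are scaled with the training mean and sd
manual <- (new_obs$body_mass_g - mean(train$body_mass_g)) / sd(train$body_mass_g)
all.equal(baked_new$body_mass_g, manual)
```

Because the new observations never contribute to the estimated mean or standard deviation, information from them cannot leak into the preprocessing. The same prepped recipe could be re-estimated within each fold when resampling.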