3.2 Creating a recipe

Let’s read in some data and begin creating a basic recipe. We’ll work with the simulated statewide testing data introduced previously. This is a fairly decent-sized dataset, and since we’re just illustrating concepts here, we’ll pull a random sample of 2% of the total data to make everything run a bit quicker. We’ll also remove the classification variable, which is just a categorical version of score, our outcome.

In the chunk below, we read in the data, sample a random 2% of the data (being careful to set a seed first so our results are reproducible), split it into training and test sets, and extract just the training dataset. We’ll hold off on splitting it into CV folds for now.

library(tidyverse)
library(tidymodels)

set.seed(8675309)
full_train <- read_csv("https://github.com/uo-datasci-specialization/c4-ml-fall-2020/raw/master/data/train.csv") %>% 
  slice_sample(prop = 0.02) %>% 
  select(-classification)

splt <- initial_split(full_train)
train <- training(splt)

As a quick reminder, here’s what the data look like.
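(The printed preview is omitted here, but if you’re following along you can reproduce it by printing the training tibble.)

train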

And you can see the full data dictionary on the Kaggle website.

When creating recipes, we can still use the formula interface to define how the data will be modeled. In this case, we’ll say that the score column is predicted by everything else in the data frame.

rec <- recipe(score ~ ., data = train)

Notice that I still declare the dataset, even though this is just a blueprint. The recipe uses the data I provide to get the column names and types, but it doesn’t actually do anything with the data (unless we ask it to). Let’s look at what this recipe looks like

rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         38

Notice it just states that this is a data recipe in which we have specified 1 outcome variable and 38 predictors.
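If you want a variable-by-variable view, calling summary() on a recipe returns a tibble listing each variable’s name, type, and role (output omitted here):

summary(rec)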

We can prep this recipe to learn more

prep(rec)
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         38
## 
## Training data contained 2841 data points and 2841 incomplete rows.

Notice we now get an additional message about how many rows are in the data, and how many of those rows contain missing (incomplete) data. So the recipe is the blueprint, and we prep the recipe to get it to actually go into the data and conduct the operations. The dataset it has now, however, is just a placeholder that can be substituted for any other dataset with an equivalent structure.
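To make that substitution explicit, prep() also has a training argument; the following is equivalent to prep(rec), but spells out which data the recipe is trained on:

prep(rec, training = train)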

But of course modeling score as the outcome with everything else predicting it (as is) is not reasonable, for multiple reasons. We have many ID variables, for one, and we also have multiple categorical variables. For some methods (like tree-based models) it might be okay to leave these as they are, but for others (like any model in the linear regression family) we’ll want to encode them somehow (e.g., dummy code).

We can do these operations by adding steps to our recipe. In the first step, we’ll update the role of all the ID variables so they are not included among the predictors. In the second, we will dummy code all nominal variables.

rec <- recipe(score ~ ., train) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_dummy(all_nominal())

When updating the roles, we can change the variable label (the text passed to the new_role argument) to be anything we want, so long as it’s not "predictor" or "outcome".

Notice in the above I am also using helper functions to apply the operations to all variables of a specific type. There are five main helper functions: all_predictors(), all_outcomes(), all_nominal(), all_numeric(), and has_role(). You can use these together, including with negation (e.g., -all_outcomes() to specify that the operation should not apply to the outcome variable(s)), to select any set of variables you want to apply the operation to, as in the sketch below.
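For instance, a step that normalizes every numeric variable except the outcome could be written with these selectors (this step isn’t part of our actual recipe; it’s just here to illustrate the selector syntax):

recipe(score ~ ., data = train) %>% 
  step_normalize(all_numeric(), -all_outcomes())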

Let’s try prepping this recipe

prep(rec)
## Error: Only one factor level in lang_cd

Uh oh! We have an error. Our recipe is trying to dummy code the lang_cd variable, but it has only one level. It’s kind of hard to dummy-code a constant!

Luckily, we can expand our recipe to first remove any zero-variance predictors, like so

rec <- recipe(score ~ ., train) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_zv(all_predictors()) %>% 
  step_dummy(all_nominal())

The zv part stands for “zero variance” and should take care of this problem. Let’s try again.

prep(rec)
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    id vars          6
##    outcome          1
##  predictor         32
## 
## Training data contained 2841 data points and 2841 incomplete rows. 
## 
## Operations:
## 
## Zero variance filter removed calc_admn_cd, lang_cd [trained]
## Dummy variables from gndr, ethnic_cd, tst_bnch, tst_dt, migrant_ed_fg, ind_ed_fg, sp_ed_fg, tag_ed_fg, ... [trained]

Beautiful! Note we do still get a warning here, but I’ve omitted it in the text (we’ll take care of it later). Our recipe says we now have 6 ID variables, 1 outcome, and 32 predictors, with 2841 data points (rows of data). The calc_admn_cd and lang_cd variables have been removed because they have zero variance, and several variables have been dummy coded, including gndr and ethnic_cd, among others.

Let’s dig just a bit deeper here though. What’s going on with these zero-variance variables? Let’s look back at the training data.

train %>% 
  count(calc_admn_cd)
## # A tibble: 1 x 2
##   calc_admn_cd     n
##   <lgl>        <int>
## 1 NA            2841
train %>% 
  count(lang_cd)
## # A tibble: 2 x 2
##   lang_cd     n
##   <chr>   <int>
## 1 S          51
## 2 <NA>     2790

So at least in our sample, calc_admn_cd really is just fully missing, which means it might as well be dropped because it’s providing us exactly nothing. But that’s not the case with lang_cd. It has two values, NA and S. This variable represents the language the test was administered in, and the NA values are actually meaningful here because they represent the “default” administration, meaning English. So rather than dropping these, let’s mutate them to transform the NA values to "E" for English. We could reasonably do this inside or outside the recipe, but a good rule of thumb is: if it can go in the recipe, put it in the recipe. It can’t hurt, and doing operations outside of the recipe risks data leakage.

rec <- recipe(score ~ ., train) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_mutate(lang_cd = ifelse(is.na(lang_cd), "E", lang_cd)) %>% 
  step_zv(all_predictors()) %>% 
  step_dummy(all_nominal())

Let’s take a look at what our data would actually look like when applying this recipe now. First, we’ll prep the recipe

prepped <- prep(rec)
prepped
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    id vars          6
##    outcome          1
##  predictor         32
## 
## Training data contained 2841 data points and 2841 incomplete rows. 
## 
## Operations:
## 
## Variable mutation for lang_cd [trained]
## Zero variance filter removed calc_admn_cd [trained]
## Dummy variables from gndr, ethnic_cd, tst_bnch, tst_dt, migrant_ed_fg, ind_ed_fg, sp_ed_fg, tag_ed_fg, ... [trained]

And we see that lang_cd is no longer being caught by the zero variance filter. Next we’ll bake the recipe to actually apply it to our data. If we specify new_data = NULL, bake() will apply the operations to the data we specified in the recipe. But we can also pass new data as an additional argument, and it will apply the operations to that data instead of the data specified in the recipe.

bake(prepped, new_data = NULL)
## # A tibble: 2,841 x 104
##        id attnd_dist_inst… attnd_schl_inst… enrl_grd partic_dist_ins… partic_schl_ins… lang_cd ncessch   lat   lon
##     <dbl>            <dbl>            <dbl>    <dbl>            <dbl>            <dbl> <fct>     <dbl> <dbl> <dbl>
##  1 154420             2057              481        7             2057              481 E       4.11e11  42.2 -122.
##  2 248739             1978              264        3             1978              264 E       4.11e11  44.3 -122.
##  3 126093             2243             1160        4               NA               NA E       4.10e11  45.5 -123.
##  4 225721             2180              834        5             2180              834 E       4.11e11  45.5 -123.
##  5 110026             2183              942        6             2183              942 E       4.11e11  45.5 -122.
##  6 149751             2084              563        3             2084              563 E       4.10e11  44.1 -123.
##  7 116315             2206             3426        4             2206             3426 E       4.11e11  45.8 -119.
##  8 145793             2180              883        5             2180              883 E       4.11e11  45.5 -123.
##  9  73617             1976             5292        7             1976             5292 E       4.10e11  44.1 -121.
## 10 226521             2043              394        3             2043              394 E       4.10e11  42.4 -123.
## # … with 2,831 more rows, and 94 more variables: score <dbl>, gndr_M <dbl>, ethnic_cd_B <dbl>, ethnic_cd_H <dbl>,
## #   ethnic_cd_I <dbl>, ethnic_cd_M <dbl>, ethnic_cd_P <dbl>, ethnic_cd_W <dbl>, tst_bnch_X2B <dbl>,
## #   tst_bnch_X3B <dbl>, tst_bnch_G4 <dbl>, tst_bnch_G6 <dbl>, tst_bnch_G7 <dbl>, tst_dt_X3.19.2018.0.00.00 <dbl>,
## #   tst_dt_X3.20.2018.0.00.00 <dbl>, tst_dt_X3.21.2018.0.00.00 <dbl>, tst_dt_X3.23.2018.0.00.00 <dbl>,
## #   tst_dt_X4.10.2018.0.00.00 <dbl>, tst_dt_X4.11.2018.0.00.00 <dbl>, tst_dt_X4.12.2018.0.00.00 <dbl>,
## #   tst_dt_X4.13.2018.0.00.00 <dbl>, tst_dt_X4.16.2018.0.00.00 <dbl>, tst_dt_X4.17.2018.0.00.00 <dbl>,
## #   tst_dt_X4.18.2018.0.00.00 <dbl>, tst_dt_X4.19.2018.0.00.00 <dbl>, tst_dt_X4.20.2018.0.00.00 <dbl>,
## #   tst_dt_X4.23.2018.0.00.00 <dbl>, tst_dt_X4.24.2018.0.00.00 <dbl>, tst_dt_X4.25.2018.0.00.00 <dbl>,
## #   tst_dt_X4.26.2018.0.00.00 <dbl>, tst_dt_X4.27.2018.0.00.00 <dbl>, tst_dt_X4.3.2018.0.00.00 <dbl>,
## #   tst_dt_X4.30.2018.0.00.00 <dbl>, tst_dt_X4.5.2018.0.00.00 <dbl>, tst_dt_X4.6.2018.0.00.00 <dbl>,
## #   tst_dt_X4.9.2018.0.00.00 <dbl>, tst_dt_X5.1.2018.0.00.00 <dbl>, tst_dt_X5.10.2018.0.00.00 <dbl>,
## #   tst_dt_X5.11.2018.0.00.00 <dbl>, tst_dt_X5.14.2018.0.00.00 <dbl>, tst_dt_X5.15.2018.0.00.00 <dbl>,
## #   tst_dt_X5.16.2018.0.00.00 <dbl>, tst_dt_X5.17.2018.0.00.00 <dbl>, tst_dt_X5.18.2018.0.00.00 <dbl>,
## #   tst_dt_X5.2.2018.0.00.00 <dbl>, tst_dt_X5.21.2018.0.00.00 <dbl>, tst_dt_X5.22.2018.0.00.00 <dbl>,
## #   tst_dt_X5.23.2018.0.00.00 <dbl>, tst_dt_X5.24.2018.0.00.00 <dbl>, tst_dt_X5.25.2018.0.00.00 <dbl>,
## #   tst_dt_X5.29.2018.0.00.00 <dbl>, tst_dt_X5.3.2018.0.00.00 <dbl>, tst_dt_X5.30.2018.0.00.00 <dbl>,
## #   tst_dt_X5.31.2018.0.00.00 <dbl>, tst_dt_X5.4.2018.0.00.00 <dbl>, tst_dt_X5.7.2018.0.00.00 <dbl>,
## #   tst_dt_X5.8.2018.0.00.00 <dbl>, tst_dt_X5.9.2018.0.00.00 <dbl>, tst_dt_X6.1.2018.0.00.00 <dbl>,
## #   tst_dt_X6.4.2018.0.00.00 <dbl>, tst_dt_X6.5.2018.0.00.00 <dbl>, tst_dt_X6.6.2018.0.00.00 <dbl>,
## #   tst_dt_X6.7.2018.0.00.00 <dbl>, tst_dt_X6.8.2018.0.00.00 <dbl>, migrant_ed_fg_Y <dbl>, ind_ed_fg_Y <dbl>,
## #   sp_ed_fg_Y <dbl>, tag_ed_fg_Y <dbl>, econ_dsvntg_Y <dbl>, ayp_lep_B <dbl>, ayp_lep_E <dbl>, ayp_lep_F <dbl>,
## #   ayp_lep_M <dbl>, ayp_lep_N <dbl>, ayp_lep_W <dbl>, ayp_lep_X <dbl>, ayp_lep_Y <dbl>, stay_in_dist_Y <dbl>,
## #   stay_in_schl_Y <dbl>, dist_sped_Y <dbl>, trgt_assist_fg_Y <dbl>, ayp_dist_partic_Y <dbl>,
## #   ayp_schl_partic_Y <dbl>, ayp_dist_prfrm_Y <dbl>, ayp_schl_prfrm_Y <dbl>, rc_dist_partic_Y <dbl>,
## #   rc_schl_partic_Y <dbl>, rc_dist_prfrm_Y <dbl>, rc_schl_prfrm_Y <dbl>, tst_atmpt_fg_Y <dbl>,
## #   grp_rpt_dist_partic_Y <dbl>, grp_rpt_schl_partic_Y <dbl>, grp_rpt_dist_prfrm_Y <dbl>,
## #   grp_rpt_schl_prfrm_Y <dbl>

And now we can actually see the dummy-coded categorical variables, along with the other operations we requested. For example, calc_admn_cd is not in the dataset. Notice the ID variables are output, though, which makes sense because they are often necessary for joining with other data sources. But it’s important to realize that they are output (i.e., all variables are returned, regardless of role), because if we passed this directly to a model they would be included as predictors. There may be reasons you would want to include a school- and/or district-level ID variable in your modeling, but you certainly would not want redundant variables; see the sketch below for one way to drop them at the bake stage.
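Note that bake() accepts the same selector helpers, and new_data can be any data frame with the matching structure. For example (output omitted), we could apply the trained recipe to our held-out test set, or return only the modeling variables and drop the ID variables:

bake(prepped, new_data = testing(splt))

bake(prepped, new_data = NULL, all_predictors(), all_outcomes())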

We do still have one minor issue with this recipe though, which is pretty evident when looking at the column names of our baked dataset. The tst_dt variable, which records the date the test was taken, was treated as a categorical variable because it was read in as a character vector. That means all the dates are being dummy coded! Let’s fix this by transforming it to a date within our step_mutate().

rec <- recipe(score ~ ., train) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_mutate(lang_cd = factor(ifelse(is.na(lang_cd), "E", lang_cd)),
              tst_dt = lubridate::mdy_hms(tst_dt)) %>% 
  step_zv(all_predictors()) %>% 
  step_dummy(all_nominal())

And now when we prep/bake the dataset it’s still a date variable, which is probably what we want (it will be modeled as a numeric variable).

rec %>% 
  prep() %>% 
  bake(new_data = NULL)
## # A tibble: 2,841 x 54
##        id attnd_dist_inst… attnd_schl_inst… enrl_grd tst_dt              partic_dist_ins… partic_schl_ins… ncessch
##     <dbl>            <dbl>            <dbl>    <dbl> <dttm>                         <dbl>            <dbl>   <dbl>
##  1 154420             2057              481        7 2018-05-29 00:00:00             2057              481 4.11e11
##  2 248739             1978              264        3 2018-05-14 00:00:00             1978              264 4.11e11
##  3 126093             2243             1160        4 2018-05-30 00:00:00               NA               NA 4.10e11
##  4 225721             2180              834        5 2018-05-30 00:00:00             2180              834 4.11e11
##  5 110026             2183              942        6 2018-06-05 00:00:00             2183              942 4.11e11
##  6 149751             2084              563        3 2018-05-18 00:00:00             2084              563 4.10e11
##  7 116315             2206             3426        4 2018-05-18 00:00:00             2206             3426 4.11e11
##  8 145793             2180              883        5 2018-05-09 00:00:00             2180              883 4.11e11
##  9  73617             1976             5292        7 2018-05-24 00:00:00             1976             5292 4.10e11
## 10 226521             2043              394        3 2018-04-26 00:00:00             2043              394 4.10e11
## # … with 2,831 more rows, and 46 more variables: lat <dbl>, lon <dbl>, score <dbl>, gndr_M <dbl>,
## #   ethnic_cd_B <dbl>, ethnic_cd_H <dbl>, ethnic_cd_I <dbl>, ethnic_cd_M <dbl>, ethnic_cd_P <dbl>,
## #   ethnic_cd_W <dbl>, tst_bnch_X2B <dbl>, tst_bnch_X3B <dbl>, tst_bnch_G4 <dbl>, tst_bnch_G6 <dbl>,
## #   tst_bnch_G7 <dbl>, migrant_ed_fg_Y <dbl>, ind_ed_fg_Y <dbl>, sp_ed_fg_Y <dbl>, tag_ed_fg_Y <dbl>,
## #   econ_dsvntg_Y <dbl>, ayp_lep_B <dbl>, ayp_lep_E <dbl>, ayp_lep_F <dbl>, ayp_lep_M <dbl>, ayp_lep_N <dbl>,
## #   ayp_lep_W <dbl>, ayp_lep_X <dbl>, ayp_lep_Y <dbl>, stay_in_dist_Y <dbl>, stay_in_schl_Y <dbl>,
## #   dist_sped_Y <dbl>, trgt_assist_fg_Y <dbl>, ayp_dist_partic_Y <dbl>, ayp_schl_partic_Y <dbl>,
## #   ayp_dist_prfrm_Y <dbl>, ayp_schl_prfrm_Y <dbl>, rc_dist_partic_Y <dbl>, rc_schl_partic_Y <dbl>,
## #   rc_dist_prfrm_Y <dbl>, rc_schl_prfrm_Y <dbl>, lang_cd_E <dbl>, tst_atmpt_fg_Y <dbl>,
## #   grp_rpt_dist_partic_Y <dbl>, grp_rpt_schl_partic_Y <dbl>, grp_rpt_dist_prfrm_Y <dbl>,
## #   grp_rpt_schl_prfrm_Y <dbl>
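As an aside, if you would rather engineer explicit date features than model the raw date numerically, the recipes package also provides step_date(), which can extract features like the day of the week or the month from a date or date-time variable; a minimal sketch, not part of our actual recipe:

recipe(score ~ ., train) %>% 
  step_mutate(tst_dt = lubridate::mdy_hms(tst_dt)) %>% 
  step_date(tst_dt, features = c("dow", "month"))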

3.2.1 Order matters

It’s important to realize that the order of the steps matters. In our recipe, we first declare ID variables as having a different role than predictors or outcomes, we then modify two variables, remove zero-variance predictors, and finally dummy code all categorical (nominal) variables. What happens if we instead dummy code and then remove zero-variance predictors?

rec <- recipe(score ~ ., train) %>% 
  step_dummy(all_nominal()) %>% 
  step_zv(all_predictors()) 

prep(rec)
## Error: Only one factor level in lang_cd

We end up with the error, whereas we don’t if we remove zero-variance predictors and then dummy code

rec <- recipe(score ~ ., train) %>% 
  step_zv(all_predictors()) %>% 
  step_dummy(all_nominal()) 

prep(rec)
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         38
## 
## Training data contained 2841 data points and 2841 incomplete rows. 
## 
## Operations:
## 
## Zero variance filter removed calc_admn_cd, lang_cd [trained]
## Dummy variables from gndr, ethnic_cd, tst_bnch, tst_dt, migrant_ed_fg, ind_ed_fg, sp_ed_fg, tag_ed_fg, ... [trained]

This is true for all steps, and it may occasionally mean you need to apply the same operation at multiple points in the recipe (e.g., a near-zero variance filter could be applied both before and after dummy coding, as in the sketch below).
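A minimal sketch of that last point, assuming a near-zero variance filter (step_nzv()) is what you want: the first filter catches (near-)constant raw variables, and the second catches any rare dummy variables created by step_dummy().

rec <- recipe(score ~ ., train) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_nzv(all_predictors()) %>% 
  step_dummy(all_nominal()) %>% 
  step_nzv(all_predictors())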

All of the above serves as a basic introduction to developing a recipe, and what follows goes into more detail on specific feature engineering pieces. For complete documentation on all possible recipe steps, please see the recipes package documentation.