3.4 Dealing with low variance predictors
Occasionally you have (or can create) variables that are highly imbalanced. A common example is a gender variable that takes on the values “male”, “female”, “non-binary”, “other”, and “refused to answer”. Once you dummy-code a variable like this, one or more of the categories may be so infrequent that modeling that category becomes difficult. This is not to say that these categories are unimportant, particularly when considering how well your training dataset represents the real-world settings in which the model will be applied (and any demographic variable is going to be associated with issues of ethics). Ignoring this variation may lead to systematic biases in model predictions. However, you also regularly have to make compromises to get models to work and be useful. One of those compromises often includes (for many types of variables, not just demographics) dropping highly imbalanced predictors.
Let’s look back at our statewide testing data. We’ll bake the final recipe from our Creating a recipe section on the training data (the same data we fed to the recipe) and look at the dummy variables that are created.
rec <- recipe(score ~ ., train) %>%
  update_role(contains("id"), ncessch, new_role = "id vars") %>%
  step_mutate(lang_cd = factor(ifelse(is.na(lang_cd), "E", lang_cd)),
              tst_dt = lubridate::mdy_hms(tst_dt)) %>%
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal())

baked <- rec %>%
  prep() %>%
  bake(new_data = NULL)
Below is a table of just the categorical variables and the frequency of each value.
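One way to compute counts like these yourself is to tabulate each dummy-coded column directly; a minimal sketch, assuming the baked object created above:

library(dplyr)
library(purrr)

# Keep only the 0/1 (dummy-coded) columns and tabulate each one
baked %>%
  select(where(~ all(.x %in% c(0, 1)))) %>%
  map(table)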
The relative frequency of many of these looks fine, but for some variables one category has very low frequency. For example, ayp_lep_M has 576 observations (from our random 2% sample) that were \(0\), and only 2 that were \(1\). The same is true for ayp_lep_S. We may therefore consider applying a near-zero variance filter to drop these columns. Let’s try this, and then we’ll talk a bit more about what the filter is actually doing.
rec_nzv <- rec %>%
  step_nzv(all_predictors())

baked_rm_nzv <- rec_nzv %>%
  prep() %>%
  bake(new_data = NULL)
Let’s look at which columns are in baked but have been removed from baked_rm_nzv.
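One way to find them is to compare the column names of the two baked data frames (mirroring the removed_columns2 comparison we use again below):

# Columns in baked that were dropped by the near-zero variance filter
removed_columns <- names(baked)[!(names(baked) %in% names(baked_rm_nzv))]
removed_columns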
## [1] "ethnic_cd_B" "ethnic_cd_I" "ethnic_cd_P" "migrant_ed_fg_Y"
## [5] "ind_ed_fg_Y" "ayp_lep_B" "ayp_lep_M" "ayp_lep_W"
## [9] "stay_in_dist_Y" "stay_in_schl_Y" "dist_sped_Y" "trgt_assist_fg_Y"
## [13] "ayp_dist_partic_Y" "ayp_schl_partic_Y" "ayp_dist_prfrm_Y" "ayp_schl_prfrm_Y"
## [17] "rc_dist_partic_Y" "rc_schl_partic_Y" "rc_dist_prfrm_Y" "rc_schl_prfrm_Y"
## [21] "lang_cd_E" "tst_atmpt_fg_Y" "grp_rpt_dist_partic_Y" "grp_rpt_schl_partic_Y"
## [25] "grp_rpt_dist_prfrm_Y" "grp_rpt_schl_prfrm_Y"
As you can see, the near-zero variance filter has been quite aggressive here, removing 26 columns. Looking back at our table of variables, we can see that, for example, there are 55 students coded Black out of 2841, and it could be reasonably argued that this column is worth keeping in the model.
So how is step_nzv working, and how can we adjust it to be not quite so aggressive? Variables are flagged as near-zero variance if they
- Have very few unique values, and
- Have a large ratio of the frequency of the most common value to the frequency of the second most common value
These criteria are implemented in step_nzv through the unique_cut and freq_cut arguments, respectively. The former is estimated as the number of unique values divided by the total number of samples (the length of the column), times 100 (i.e., it is a percent), while the latter is estimated as the frequency of the most common level divided by the frequency of the second most common level. The default for unique_cut is 10, while the default for freq_cut is \(95/5 = 19\). For a column to be “caught” by the near-zero variance filter, and removed from the training set, it must be below the specified unique_cut and above the specified freq_cut.
In the case of ethnic_cd_B, we see that there are two unique values, \(0\) and \(1\) (because it’s a dummy-coded variable). There are 2841 rows, so the unique_cut value is \((2 / 2841) \times 100 = 0.07\). The frequency ratio is \(2786/55 = 50.65\). It therefore meets both of the default criteria (below unique_cut and above freq_cut) and is removed.
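To make that arithmetic concrete, here is a minimal sketch computing both criteria by hand for ethnic_cd_B from the baked data (the object names are just for illustration):

# unique_cut criterion: percent of unique values in the column
x <- baked$ethnic_cd_B
length(unique(x)) / length(x) * 100

# freq_cut criterion: frequency of the most common value divided by the
# frequency of the second most common value
counts <- sort(table(x), decreasing = TRUE)
counts[1] / counts[2]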
If you’re applying a near-zero variance filter to dummy variables, there will always be only 2 unique values, leading to a small unique_cut value. This might encourage you to raise freq_cut to a higher value. Let’s try this approach.
rec_nzv2 <- rec %>%
  step_nzv(all_predictors(),
           freq_cut = 99/1)

baked_rm_nzv2 <- rec_nzv2 %>%
  prep() %>%
  bake(new_data = NULL)
removed_columns2 <- names(baked)[!(names(baked) %in% names(baked_rm_nzv2))]
removed_columns2
## [1] "ind_ed_fg_Y" "ayp_lep_M" "ayp_lep_W" "dist_sped_Y"
## [5] "ayp_dist_partic_Y" "rc_dist_partic_Y" "tst_atmpt_fg_Y" "grp_rpt_dist_partic_Y"
## [9] "grp_rpt_dist_prfrm_Y"
Removing near-zero variance dummy variables can be a bit tricky because they will essentially always meet the unique_cut criterion. But it can be achieved by fiddling with the freq_cut argument and, actually, could be considered part of your model tuning process. In this case, we’ve set it so variables will be removed only if more than 99 out of every 100 cases share the same value. This led to only nine variables being flagged and removed. But we could go even further, specifying, for example, that 499 out of every 500 cases must be the same for the variable to be flagged, as sketched below. At some point, however, you’ll end up with variables that have such low frequency that model estimation becomes difficult, which is the purpose of applying the near-zero variance filter in the first place.
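As a sketch of what that even more permissive version might look like (rec_nzv3 is an illustrative name, and 499/1 simply mirrors the 499-out-of-500 example above):

# Illustrative only: flag a column when the most common value outnumbers
# the second most common value by more than 499 to 1
rec_nzv3 <- rec %>%
  step_nzv(all_predictors(),
           freq_cut = 499/1)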