3.5 Missing data
If we look closely at our statewide testing data, we will see that there is a considerable amount of missingness. In fact, every row of the data frame has at least one missing value. The extent to which missing data impacts your work varies by field, but in most fields you’re likely to run into situations where you have to handle missing data in some way. The purpose of this section is to discuss a few approaches (as implemented through the {recipes} package) to handling missingness for predictive modeling purposes. Note that this is not a comprehensive discussion of the topic (for that, we recommend Little and Rubin (2002)), but is instead an applied discussion of what you can do. As with many aspects of data analysis, there is generally no single approach that will always work best, and it’s worth trying a few different approaches in your model development to see how different choices impact your model performance.
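A quick way to verify that claim, assuming the train data frame from earlier in the chapter:

sum(complete.cases(train))  # rows with no missing values at all
colSums(is.na(train))       # number of missing values in each column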
There are three basic ways of handling missing data:
- Omit rows of the data frame that include missing values
- Encode or impute the missing data
- Ignore the missing data and estimate from the available data
The last option is not always feasible and will depend on the modeling framework you’re working within. Some estimation procedures lend themselves to efficient handling of missing data (for example, imputation via the posterior distribution with Bayesian estimation). In this section, we’ll mostly focus on the first two. Additionally, we will only really be concerned with missingness on the predictor variables, rather than the outcome. Generally, missing data on the predictors is a much more difficult problem than missingness on the outcome, because most models assume you have complete data across your predictors.
3.5.1 Missing data via {recipes}
3.5.1.1 Omission
We can omit missing data with step_naomit(), which removes any row that has any missing data. Let’s see how this impacts our data, working with the same recipe we finished up with in the Creating a recipe section. I’ve placed the recipe here again so we don’t have to go back to remind ourselves what we did previously.
rec <- recipe(score ~ ., train) %>%
  update_role(contains("id"), ncessch, new_role = "id vars") %>%
  step_mutate(lang_cd = factor(ifelse(is.na(lang_cd), "E", lang_cd)),
              tst_dt = lubridate::mdy_hms(tst_dt)) %>%
  step_zv(all_predictors())
na_omit_data <- rec %>%
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal()) %>%
  prep() %>%
  bake(new_data = NULL)
nrow(na_omit_data)
## [1] 561
As can be seen above, when we omit any row with missing data we end up with only 561 of the original 2841 rows in the training data (approximately 20% of the original data). This level of data omission is highly likely to introduce systematic bias into your model predictions. Generally, step_naomit() should only be used when developing preliminary models, where you’re just trying to get code to run. When you get to the point where you’re actually trying to improve performance, you should consider alternative means of handling the missingness.
3.5.1.2 Encoding and simple imputation
Encoding missing data is similar to imputation. In imputation, we replace the missing value with something we think could reasonably have been the real value, had it been observed. When we encode missing data, we instead create values that flag the missingness and will be included in the modeling process. For example, with categorical variables, we could replace the missingness with a “missing” level, which would then get its own dummy code (if we were using dummy coding to encode the categorical variables).
I mentioned in the Creating a recipe section that we were getting warnings but I was omitting them in the text. The reason is that some of these columns have missing data. If we want to avoid this warning, we have to add an additional step to our recipe to encode the missing data in the categorical variables. This step is called step_unknown(), and it replaces missing values with "unknown". Let’s do this for all categorical variables, and omit any rows that are missing on numeric columns.
na_encode_data <- rec %>%
  step_unknown(all_nominal()) %>%
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal()) %>%
  prep() %>%
  bake(new_data = NULL)
nrow(na_encode_data)
## [1] 2784
Notice in the above that when I call step_naomit() I state that it should be applied to all_predictors(), because I’ve already encoded the nominal predictors in the previous step. This approach allows us to capture 98% of the original data. And as a bonus, we’ve rid ourselves of the warnings (note - we might also want to apply step_novel() for any future data that has levels outside of our training data - see Handling new levels).
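As a rough sketch of what that could look like, assuming we simply slot step_novel() in ahead of step_unknown() in the same recipe:

rec %>%
  step_novel(all_nominal()) %>%   # levels unseen in training become "new"
  step_unknown(all_nominal()) %>% # missing values become "unknown"
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal()) %>%
  prep() %>%
  bake(new_data = NULL)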
Just a slight step up in complexity from omitting rows with missing data is to impute the missing values with sample descriptive statistics, such as the mean or the median. Generally, I’ve found median imputation works better than mean imputation, but that could be related to the types of data I work with most frequently. Let’s switch datasets so we can see what’s happening more directly.
Let’s look at the airquality dataset, which ships with R.
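The output below comes from printing the first six rows:

head(airquality)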
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
As we can see, Solar.R is missing for observations 5 and 6. Let’s compute the sample mean and median for this column.
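First, the mean, dropping the missing values:

mean(airquality$Solar.R, na.rm = TRUE)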
## [1] 185.9315
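And the median:

median(airquality$Solar.R, na.rm = TRUE)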
## [1] 205
If we use mean or median imputation, we just replace the missing values with these sample statistics. Let’s do this with {recipes}, assuming we’ll be fitting a model where Ozone is the outcome, predicted by all other variables in the dataset.
recipe(Ozone ~ ., data = airquality) %>%
  step_impute_mean(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)
## # A tibble: 153 x 6
## Solar.R Wind Temp Month Day Ozone
## <int> <dbl> <int> <int> <int> <int>
## 1 190 7.4 67 5 1 41
## 2 118 8 72 5 2 36
## 3 149 12.6 74 5 3 12
## 4 313 11.5 62 5 4 18
## 5 186 14.3 56 5 5 NA
## 6 186 14.9 66 5 6 28
## 7 299 8.6 65 5 7 23
## 8 99 13.8 59 5 8 19
## 9 19 20.1 61 5 9 8
## 10 194 8.6 69 5 10 NA
## # … with 143 more rows
As we can see, the value \(186\) has been imputed for rows 5 and 6, which is the integer version of the sample mean (an integer was imputed because the column was already an integer, and not a double).
Let’s try the same thing with median imputation:
recipe(Ozone ~ ., data = airquality) %>%
  step_impute_median(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)
## # A tibble: 153 x 6
## Solar.R Wind Temp Month Day Ozone
## <int> <dbl> <int> <int> <int> <int>
## 1 190 7.4 67 5 1 41
## 2 118 8 72 5 2 36
## 3 149 12.6 74 5 3 12
## 4 313 11.5 62 5 4 18
## 5 205 14.3 56 5 5 NA
## 6 205 14.9 66 5 6 28
## 7 299 8.6 65 5 7 23
## 8 99 13.8 59 5 8 19
## 9 19 20.1 61 5 9 8
## 10 194 8.6 69 5 10 NA
## # … with 143 more rows
And as we would expect, the missingness has now been replaced with values of \(205\).
Sometimes you have time series data, or there is a date variable in the dataset that accounts for a meaningful proportion of the variance. In these cases, you might consider step_impute_roll(), which provides a conditional median imputation based on time, and lets you set the size of the window from which to calculate the median. In still other cases, it may make sense to impute with the lowest observed value (i.e., assume a very small amount of the predictor), which can be accomplished with step_impute_lower().
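For illustration, a minimal sketch of rolling imputation with the airquality data (the rows are already in chronological order; we convert Solar.R to a double first because the rolling step expects double columns, and the window width of 7 is an arbitrary choice here):

recipe(Ozone ~ ., data = airquality) %>%
  step_mutate(Solar.R = as.numeric(Solar.R)) %>% # rolling imputation expects doubles
  step_impute_roll(Solar.R, window = 7) %>%      # median of a 7-row window around each NA
  prep() %>%
  bake(new_data = NULL)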
These simple imputation techniques are fine to use when developing models. However, it’s an area that may be worth returning to as you start to refine your model to see if you can improve performance.
3.5.1.3 Modeling the missingness
Another alternative for imputation is to fit a statistical model with the column you want to impute modeled as the outcome, and all other columns (minus the actual outcome) predicting it. We then use that model for the imputation. Let’s first consider a linear regression model, using the airquality data: we’ll predict Solar.R from everything except Ozone, our actual outcome.
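The summary below comes from fitting that regression directly. I’ll store the fit as m (a name of our choosing) so we can reuse it for prediction in a moment:

# model the column to be imputed, dropping Ozone (the actual outcome, column 1)
m <- lm(Solar.R ~ ., data = airquality[, -1])
summary(m)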
##
## Call:
## lm(formula = Solar.R ~ ., data = airquality[, -1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -182.945 -67.348 5.295 73.781 170.068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.5960 84.4424 -0.173 0.863016
## Wind 2.1661 2.2633 0.957 0.340171
## Temp 3.7023 0.9276 3.991 0.000105 ***
## Month -13.2640 5.4525 -2.433 0.016242 *
## Day -1.0631 0.8125 -1.308 0.192875
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 85.14 on 141 degrees of freedom
## (7 observations deleted due to missingness)
## Multiple R-squared: 0.131, Adjusted R-squared: 0.1063
## F-statistic: 5.313 on 4 and 141 DF, p-value: 0.0005145
Notice that I’ve dropped the first column here, which is Ozone, our actual outcome. The model above has been fit using the equivalent of step_naomit(), otherwise known as listwise deletion, where any row with any missing data is removed.
We can now use the coefficients from this model to impute the missing values in Solar.R. For example, row 6 in the dataset had a missing value on Solar.R and the following values for all other variables: Wind = 14.9, Temp = 66, Month = 5, and Day = 6.
Using our model, we would predict the following score for this missing value:
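A minimal sketch of that prediction, reusing the m object fit above (columns in the new data that aren’t model terms are simply ignored):

predict(m, newdata = airquality[6, ])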
## 1
## 189.3325
Let’s try this using {recipes}.
recipe(Ozone ~ ., data = airquality) %>%
  step_impute_linear(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)
## # A tibble: 153 x 6
## Solar.R Wind Temp Month Day Ozone
## <int> <dbl> <int> <int> <int> <int>
## 1 190 7.4 67 5 1 41
## 2 118 8 72 5 2 36
## 3 149 12.6 74 5 3 12
## 4 313 11.5 62 5 4 18
## 5 152 14.3 56 5 5 NA
## 6 189 14.9 66 5 6 28
## 7 299 8.6 65 5 7 23
## 8 99 13.8 59 5 8 19
## 9 19 20.1 61 5 9 8
## 10 194 8.6 69 5 10 NA
## # … with 143 more rows
And we see two important things here. First, the imputed value for Solar.R in row 6 is indeed what we expected (albeit in integer form). Second, the imputed values for rows 5 and 6 now differ from one another, which is the first time we’ve seen that happen via imputation.
The same basic approach can be used for essentially any statistical model. The {recipes} package has currently implemented linear imputation (as above), \(k\)-nearest neighbor imputation, and bagged imputation (via bagged trees). Let’s see how rows 5 and 6 differ with these approaches.
recipe(Ozone ~ ., data = airquality) %>%
  step_impute_knn(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)
## # A tibble: 153 x 6
## Solar.R Wind Temp Month Day Ozone
## <int> <dbl> <int> <int> <int> <int>
## 1 190 7.4 67 5 1 41
## 2 118 8 72 5 2 36
## 3 149 12.6 74 5 3 12
## 4 313 11.5 62 5 4 18
## 5 159 14.3 56 5 5 NA
## 6 220 14.9 66 5 6 28
## 7 299 8.6 65 5 7 23
## 8 99 13.8 59 5 8 19
## 9 19 20.1 61 5 9 8
## 10 194 8.6 69 5 10 NA
## # … with 143 more rows
recipe(Ozone ~ ., data = airquality) %>%
  step_impute_bag(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)
## # A tibble: 153 x 6
## Solar.R Wind Temp Month Day Ozone
## <int> <dbl> <int> <int> <int> <int>
## 1 190 7.4 67 5 1 41
## 2 118 8 72 5 2 36
## 3 149 12.6 74 5 3 12
## 4 313 11.5 62 5 4 18
## 5 129 14.3 56 5 5 NA
## 6 244 14.9 66 5 6 28
## 7 299 8.6 65 5 7 23
## 8 99 13.8 59 5 8 19
## 9 19 20.1 61 5 9 8
## 10 194 8.6 69 5 10 NA
## # … with 143 more rows
These models are quite a bit more flexible than linear regression, and can potentially overfit. You can, however, control some of the parameters of the models through additional arguments (e.g., the neighbors argument for \(k\)-nearest neighbor imputation, which defaults to 5; see the sketch below). The benefit of these models is that they may provide better estimates of what the imputed value would have been, were it not missing, which may then improve model performance. The downside is that they are quite a bit more computationally intensive. Generally, you use recipes within processes like \(k\)-fold cross-validation, with the recipe being applied to each fold. In that setting, a computationally expensive approach may significantly bog down hyperparameter tuning.
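For example, a minimal sketch that doubles the number of neighbors used in the \(k\)-nearest neighbor imputation (10 is an arbitrary choice for illustration):

recipe(Ozone ~ ., data = airquality) %>%
  step_impute_knn(all_predictors(), neighbors = 10) %>%
  prep() %>%
  bake(new_data = NULL)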
3.5.2 A few words of caution
Missing data is a highly complex topic. This section was meant to provide a basic overview of some of the options you can choose from when building a predictive model. None of these approaches, however, will “fix” data that are missing not at random (MNAR). Unfortunately, it is usually impossible to know whether your data are MNAR, and we therefore assume that they are MAR, or missing at random conditional on the observed data. For example, if boys were more likely to have missing data on the outcome than girls, we could account for this by including a gender variable in the model, and the resulting data would be MAR.
If you have substantial missing data, this section alone is surely insufficient. We recommended Little and Rubin (2002) previously, and there are a number of other good resources, including a chapter in Kuhn and Johnson (2019).