3.5 Missing data

If we look closely at our statewide testing data, we will see that there is a considerable amount of missingness. In fact, every row of the data frame has at least one value that is missing. The amount to which missing data impacts your work varies by field, but in most fields you’re likely to run into situations where you have to handle missing data in some way. The purpose of this section is to discuss a few approaches (as implemented through the {recipes} package) to handling missingness for predictive modeling purposes. Note that this is not a comprehensive discussion on the topic (for which, we recommend Little and Rubin (2002)), but is instead an applied discussion of what you can do. As with many aspects of data analysis, generally, there is no single approach that will always work best, and it’s worth trying a few different approaches in your model development to see how different choices impact your model performance.

There are three basic ways of handling missing data:

Omit rows of the data frame that include missing values
Encode or Impute the missing data
Ignore the missing data and estimate from the available data

The last option is not always feasible and will depend the modeling framework you’re working within. Some estimation procedures can also lend themselves to efficient handling of missing data (for example, imputation via the posterior distribution with Bayesian estimation). In this section, we’ll mostly focus on the first three. Additionally, we will only really be concerned with missingness on the predictor variables here, rather than the outcome. Generally, missing data on the predictors is a much more difficult problem than missingness on the outcome, because most models assume you have complete data across your predictors.

3.5.1 Missing data via {recipes}

3.5.1.1 Omission

We can omit missing data with step_naomit. This will remove any row that has any missing data. Let’s see how this impacts our data, working with our same recipe we finished up with in the Creating a recipe section. I’ve placed the recipe here again so we don’t have to go back to remind ourselves what we did previously.

rec <- recipe(score ~ ., train) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_mutate(lang_cd = factor(ifelse(is.na(lang_cd), "E", lang_cd)),
              tst_dt = lubridate::mdy_hms(tst_dt)) %>% 
  step_zv(all_predictors())

na_omit_data <- rec %>% 
  step_naomit(all_predictors()) %>% 
  step_dummy(all_nominal()) %>% 
  prep() %>% 
  bake(new_data = NULL)

nrow(na_omit_data)

## [1] 561

As can be seen above, when we omit any row with missing data we end up with only 561 rows out of the original 2841 rows in the training data (or approximately 20% of the original data). This level of data omission is highly likely to introduce systematic biases into your model prediction. Generally, step_naomit should only be used when developing preliminary models, where you’re just trying to get code to run. When you get to the point where you’re actually trying to improve performance, you should consider alternative means

3.5.1.2 Encoding and simple imputation

Encoding missing data is similar to imputation. In imputation, we replace the missing value with something we thing could have reasonably been the real value, if it were observed. When we encode missing data we are creating values that will be included in the modeling process. For example, with categorical variables, we could replace the missingess with a “missing” level, which would then get its own dummy code (if we were using dummy coding to encode the categorical variables).

I mentioned in the Creating a recipe section that we were getting warnings but I was omitting them in the text. The reason is because some of these columns have missing data. If we want to avoid this warning, we have to add an additional step to our recipe to encode the missing data in the categorical variables. This step is called step_unknown and it replaces missing values with "unknown". Let’s do this for all categorical variables, and omit any rows that are missing on numeric columns.

na_encode_data <- rec %>% 
  step_unknown(all_nominal()) %>% 
  step_naomit(all_predictors()) %>% 
  step_dummy(all_nominal()) %>% 
  prep() %>% 
  bake(new_data = NULL)

nrow(na_encode_data)

## [1] 2784

Notice in the above that when I call step_naomit I state that it should be applied to all_predictors because I’ve already encoded the nominal predictors in the previous step. This approach allows us to capture 98% of the original data. And as a bonus, we’ve removed ourselves of the warnings (note - we might also want to apply step_novel for any future data that had levels outside of our training data - see Handling new levels).

Just a slight step up in complexity from omission of rows with missing data is to impute them with sample descriptive statistics, such as the mean or the median. Generally, I’ve found median imputation works better than mean imputation, but that could be related to the types of data I work with most frequently. Let’s switch datasets so we can see what’s happening more directly.

Let’s look at the airquality dataset, which ships with R.

head(airquality)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

As we can see, Solar.R is missing for observations 5 and 6. Let’s compute the sample mean and median for this column.

mean(airquality$Solar.R, na.rm = TRUE)

## [1] 185.9315

median(airquality$Solar.R, na.rm = TRUE)

## [1] 205

If we use mean or median imputation, we just replace the missing values with these sample statistics. Let’s do this with {recipes}, assuming we’ll be fitting a model where Ozone is the outcome, predicted by all other variables in the dataset.

recipe(Ozone ~ ., data = airquality) %>% 
  step_meanimpute(all_predictors()) %>% 
  prep() %>% 
  bake(new_data = NULL)

## # A tibble: 153 x 6
##    Solar.R  Wind  Temp Month   Day Ozone
##      <int> <dbl> <int> <int> <int> <int>
##  1     190   7.4    67     5     1    41
##  2     118   8      72     5     2    36
##  3     149  12.6    74     5     3    12
##  4     313  11.5    62     5     4    18
##  5     186  14.3    56     5     5    NA
##  6     186  14.9    66     5     6    28
##  7     299   8.6    65     5     7    23
##  8      99  13.8    59     5     8    19
##  9      19  20.1    61     5     9     8
## 10     194   8.6    69     5    10    NA
## # … with 143 more rows

As we can see, the value \(186\) has been imputed for rows 5 and 6, which is the integer version of the sample mean (an integer was imputed because the column was already an integer, and not a double).

Let’s try the same thing with median imputation

recipe(Ozone ~ ., data = airquality) %>% 
  step_medianimpute(all_predictors()) %>% 
  prep() %>% 
  bake(new_data = NULL)

## # A tibble: 153 x 6
##    Solar.R  Wind  Temp Month   Day Ozone
##      <int> <dbl> <int> <int> <int> <int>
##  1     190   7.4    67     5     1    41
##  2     118   8      72     5     2    36
##  3     149  12.6    74     5     3    12
##  4     313  11.5    62     5     4    18
##  5     205  14.3    56     5     5    NA
##  6     205  14.9    66     5     6    28
##  7     299   8.6    65     5     7    23
##  8      99  13.8    59     5     8    19
##  9      19  20.1    61     5     9     8
## 10     194   8.6    69     5    10    NA
## # … with 143 more rows

And as we would expect, the missingness has now been replaced with values of \(205\).

Sometimes you have time series data, or there is a date variable in the dataset that accounts for a meaningful proportion of the variance. In these cases, you might consider step_rollimpute, which provides a conditional median imputation based on time, and you can set the size of the window from which to calculate the median. In still other cases it may make sense to just impute with the lowest observed value (i.e., assume a very small amount of the predictor), which can be accomplished with step_lowerimpute.

These simple imputation techniques are fine to use when developing models. However, it’s an area that may be worth returning to as you start to refine your model to see if you can improve performance.

3.5.1.3 Modeling the missingness

Another alternative for imputation is to fit a statistical model with the column you want to impute modeled as the outcome, with all other columns (minus the actual outcome) predicting it. We then use that model for the imputation. Let’s first consider a linear regression model. We’ll fit the same model we specified in our recipe, using the airquality data.

m <- lm(Solar.R ~ ., data = airquality[ ,-1])
summary(m)

## 
## Call:
## lm(formula = Solar.R ~ ., data = airquality[, -1])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -182.945  -67.348    5.295   73.781  170.068 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -14.5960    84.4424  -0.173 0.863016    
## Wind          2.1661     2.2633   0.957 0.340171    
## Temp          3.7023     0.9276   3.991 0.000105 ***
## Month       -13.2640     5.4525  -2.433 0.016242 *  
## Day          -1.0631     0.8125  -1.308 0.192875    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 85.14 on 141 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.131,  Adjusted R-squared:  0.1063 
## F-statistic: 5.313 on 4 and 141 DF,  p-value: 0.0005145

Notice that I’ve dropped the first column here, which is Ozone, our actual outcome. The model above has been fit using the equivalent of step_naomit, otherwise known as listwise deletion, where any row with any missing data is removed.

We can now use the coefficients from this model to impute the missing values in Solar.R. For example, row 6 in the dataset had a missing value on Solar.R and following values for all other variables

row6 <- data.frame(Wind = 14.9, 
                   Temp = 66, 
                   Month = 5, 
                   Day = 6)

Using our model, we would predict the following score for this missing value

predict(m, newdata = row6)

##        1 
## 189.3325

Let’s try this using {recipes}.

recipe(Ozone ~ ., data = airquality) %>% 
  step_impute_linear(all_predictors()) %>% 
  prep() %>% 
  bake(new_data = NULL)

## # A tibble: 153 x 6
##    Solar.R  Wind  Temp Month   Day Ozone
##      <int> <dbl> <int> <int> <int> <int>
##  1     190   7.4    67     5     1    41
##  2     118   8      72     5     2    36
##  3     149  12.6    74     5     3    12
##  4     313  11.5    62     5     4    18
##  5     152  14.3    56     5     5    NA
##  6     189  14.9    66     5     6    28
##  7     299   8.6    65     5     7    23
##  8      99  13.8    59     5     8    19
##  9      19  20.1    61     5     9     8
## 10     194   8.6    69     5    10    NA
## # … with 143 more rows

And we see two important things here. First, row 6 for Solar.R is indeed as we expected it to be (albeit, in integer form). Second, the imputed values for rows 5 and 6 are now different, which is the first time we’ve seen this via imputation.

The same basic approach can be used for essentially any statistical model. The {recipes} package has currently implemented linear imputation (as above), \(k\)-nearest neighbor imputation, and bagged imputation (via bagged trees). Let’s see how rows 5 and 6 differ with these approaches.

recipe(Ozone ~ ., data = airquality) %>% 
  step_knnimpute(all_predictors()) %>% 
  prep() %>% 
  bake(new_data = NULL)

## # A tibble: 153 x 6
##    Solar.R  Wind  Temp Month   Day Ozone
##      <int> <dbl> <int> <int> <int> <int>
##  1     190   7.4    67     5     1    41
##  2     118   8      72     5     2    36
##  3     149  12.6    74     5     3    12
##  4     313  11.5    62     5     4    18
##  5     159  14.3    56     5     5    NA
##  6     220  14.9    66     5     6    28
##  7     299   8.6    65     5     7    23
##  8      99  13.8    59     5     8    19
##  9      19  20.1    61     5     9     8
## 10     194   8.6    69     5    10    NA
## # … with 143 more rows

recipe(Ozone ~ ., data = airquality) %>% 
  step_bagimpute(all_predictors()) %>% 
  prep() %>% 
  bake(new_data = NULL)

## # A tibble: 153 x 6
##    Solar.R  Wind  Temp Month   Day Ozone
##      <int> <dbl> <int> <int> <int> <int>
##  1     190   7.4    67     5     1    41
##  2     118   8      72     5     2    36
##  3     149  12.6    74     5     3    12
##  4     313  11.5    62     5     4    18
##  5     129  14.3    56     5     5    NA
##  6     244  14.9    66     5     6    28
##  7     299   8.6    65     5     7    23
##  8      99  13.8    59     5     8    19
##  9      19  20.1    61     5     9     8
## 10     194   8.6    69     5    10    NA
## # … with 143 more rows

These models are quite a bit more flexible than linear regression, and can potentially overfit. You can, however, control some of the parameters to the models through additional arguments (e.g., \(k\) for \(knn\), which defaults to 5). The benefit of these models is that they may provide better estimates of what the imputed value would have been, were it not missing, which may then improve model performance. The downside is that they are quite a bit more computationally intensive. Generally, you use recipes within processes like \(k\)-fold cross-validation, with the recipe being applied to each fold. In this case, a computationally expensive approach may significantly bog down hyperparameter tuning.

3.5.2 A few words of caution

Missing data is a highly complex topic. This section was meant to provide a basic overview of some of the options you can choose from when building a predictive model. None of these approaches, however, will “fix” data that are missing not at random (MNAR). Unfortunately, it is usually impossible to know if your data are MNAR, and we therefore assume that that are MAR, or missing at random conditional on the observed data. For example, if boys were more likely to have missing data on the outcome than girls, we could account for this by including a gender variable in the model, and the resulting data would be MAR.

If you have significant missing data, this section is surely incomplete. We recommended Little and Rubin (2002) previously, and there are a number of other good resources, including a chapter in Kuhn and Johnson (2019).