
R: Handling missing values

Missing values are a common phenomenon in real-world data. You need to know how to handle them effectively if the goal is to reduce error and build an accurate model. Let's consider different options for handling missing values and their implementation.

Data set and preparation


We will use the BostonHousing dataset from the mlbench package to illustrate different approaches to handling missing values. Although there are no missing values in the original BostonHousing data, I will introduce them at random. Thanks to this, we will be able to compare the imputed values with the actual ones in order to evaluate how well each recovery approach works. Let's start by importing the data from the mlbench package and randomly injecting missing values (NA).
 #   data ("BostonHousing", package="mlbench") original <- BostonHousing #    #    set.seed(100) BostonHousing[sample(1:nrow(BostonHousing), 40), "rad"] <- NA BostonHousing[sample(1:nrow(BostonHousing), 40), "ptratio"] 

 #>      crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat medv
 #> 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
 #> 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
 #> 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
 #> 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
 #> 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
 #> 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7

The missing values have been injected. Although we know where they are, let's double-check with mice::md.pattern, which tabulates the patterns of missing values across rows.
 # pattern of missing values
 library(mice)
 md.pattern(BostonHousing)

 #>     crim zn indus chas nox rm age dis tax b lstat medv rad ptratio   
 #> 431    1  1     1    1   1  1   1   1   1 1     1    1   1       1  0
 #> 35     1  1     1    1   1  1   1   1   1 1     1    1   0       1  1
 #> 35     1  1     1    1   1  1   1   1   1 1     1    1   1       0  1
 #> 5      1  1     1    1   1  1   1   1   1 1     1    1   0       0  2
 #>        0  0     0    0   0  0   0   0   0 0     0    0  40      40 80

Basically, there are four ways to handle missing values.

1. Delete data


If your data set is relatively large and all the required classes remain sufficiently represented in the training data, try deleting the rows that contain missing values (or ignoring them when building the model, for example with na.action=na.omit ). After deleting the data, make sure that:
  1. enough observations remain for the model to stay reliable;
  2. no bias has been introduced (i.e., no class has become disproportionate or disappeared entirely).

 # fit the model, dropping rows with NA
 lm(medv ~ ptratio + rad, data=BostonHousing, na.action=na.omit)
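The same row dropping can be done explicitly with complete.cases(), which is convenient when the cleaned data is needed for more than one model. A minimal sketch on a toy data frame (the column names merely echo the Boston data and are illustrative):

```r
# toy data with NAs in two predictors
df <- data.frame(medv    = c(24.0, 21.6, 34.7, 33.4),
                 ptratio = c(15.3, NA,   17.8, 18.7),
                 rad     = c(1,    2,    NA,   3))

# keep only the rows with no NA in any column
df_cc <- df[complete.cases(df), ]
nrow(df_cc)
#> [1] 2
```

With na.action=na.omit, lm() performs exactly this row dropping internally at fit time.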

2. Deleting a variable


If one variable contains far more missing values than the others, and dropping it would let you keep much more of the data, I would suggest deleting that variable; of course, only if it is not a truly significant predictor. In essence, the decision is a trade-off between losing a variable and losing part of the data.
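To see which variable loses you the most rows, count the share of NAs per column. A minimal sketch on a toy data frame; the 50% threshold is an arbitrary illustration, not a rule:

```r
# toy data frame: x2 is mostly missing
df <- data.frame(x1 = c(1, 2, NA, 4, 5),
                 x2 = c(NA, NA, NA, NA, 7),
                 y  = c(3, 1, 4, 1, 5))

# share of missing values in each column
na_share <- colSums(is.na(df)) / nrow(df)
na_share
#>  x1  x2   y
#> 0.2 0.8 0.0

# dropping x2 (80% missing) keeps 4 of 5 complete rows instead of just 1
df_reduced <- df[, na_share <= 0.5]
sum(complete.cases(df_reduced))
#> [1] 4
```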

3. Imputing the mean, median, or mode


Replacing the missing values with the mean, median, or mode is a crude way of handling them. Depending on the situation, for example when the variance of the data is small or the variable has little effect on the output, such a rough approximation may be acceptable and give satisfactory results.
 library(Hmisc)
 impute(BostonHousing$ptratio, mean)    # replace with the mean
 impute(BostonHousing$ptratio, median)  # replace with the median
 impute(BostonHousing$ptratio, 20)      # replace with a specific value
 # or do the same by hand:
 BostonHousing$ptratio[is.na(BostonHousing$ptratio)] <- mean(BostonHousing$ptratio, na.rm = T)
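The mode is mentioned above but not shown; base R has no built-in mode function for this purpose, so a small helper is needed. The name Mode below is hypothetical, not part of Hmisc or base R:

```r
# most frequent non-NA value of a vector (hypothetical helper, not in any package)
Mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

x <- c(17.8, 17.8, 18.7, NA, 17.8, NA)
x[is.na(x)] <- Mode(x)  # replace NAs with the mode
x
#> [1] 17.8 17.8 18.7 17.8 17.8 17.8
```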

Let's calculate the accuracy in the case of replacement with the average:
 library(DMwR)
 actuals <- original$ptratio[is.na(BostonHousing$ptratio)]
 predicteds <- rep(mean(BostonHousing$ptratio, na.rm=T), length(actuals))
 regr.eval(actuals, predicteds)

 #>        mae        mse       rmse       mape 
 #> 1.62324034 4.19306071 2.04769644 0.09545664

4. Prediction


Prediction is the most sophisticated way of replacing missing values. It includes the following approaches: kNN imputation, rpart, and mice.
4.1. kNN imputation

DMwR::knnImputation uses the k nearest neighbors method to replace missing values. Put simply, kNN imputation does the following: for each value that needs to be replaced, the k nearest points are found based on the Euclidean distance, and their distance-weighted average is computed.

The advantage is that all missing values in all variables can be replaced with a single function call. It takes the entire data set as an argument, and you do not even need to specify which variable you want to impute. However, you must make sure the output variable is not included in the computation.
 # kNN imputation
 library(DMwR)
 knnOutput <- knnImputation(BostonHousing[, !names(BostonHousing) %in% "medv"])
 anyNA(knnOutput)

 #> FALSE 

Let's rate the accuracy:
 actuals <- original$ptratio[is.na(BostonHousing$ptratio)]
 predicteds <- knnOutput[is.na(BostonHousing$ptratio), "ptratio"]
 regr.eval(actuals, predicteds)

 #>        mae        mse       rmse       mape 
 #> 1.00188715 1.97910183 1.40680554 0.05859526

The mean absolute percentage error (mape) improved by about 39% compared to mean imputation. Not bad.

4.2 rpart

The limitation of DMwR::knnImputation is that it sometimes cannot be used when the values of a factor variable are missing. Both rpart and mice handle that case. An advantage of rpart is that it only needs one predictor variable that is free of NA values.

Now let's use rpart instead of kNN to replace the missing values. To handle a factor variable, set method="class" when calling rpart() ; for numeric values we will use method="anova" . You also need to make sure the output variable ( medv ) is not used when training rpart.
 library(rpart)
 # rad is a factor variable
 class_mod <- rpart(rad ~ . - medv, data=BostonHousing[!is.na(BostonHousing$rad), ], method="class", na.action=na.omit)
 # ptratio is numeric
 anova_mod <- rpart(ptratio ~ . - medv, data=BostonHousing[!is.na(BostonHousing$ptratio), ], method="anova", na.action=na.omit)
 rad_pred <- predict(class_mod, BostonHousing[is.na(BostonHousing$rad), ])
 ptratio_pred <- predict(anova_mod, BostonHousing[is.na(BostonHousing$ptratio), ])

Calculate the accuracy for ptratio:
 actuals <- original$ptratio[is.na(BostonHousing$ptratio)]
 predicteds <- ptratio_pred
 regr.eval(actuals, predicteds)

 #>        mae        mse       rmse       mape 
 #> 0.71061673 0.99693845 0.99846805 0.04099908

The mean absolute percentage error (mape) improved by another 30% compared with kNN imputation. Very good.

Accuracy for rad:
 # misclassification error
 actuals <- original$rad[is.na(BostonHousing$rad)]
 predicteds <- as.numeric(colnames(rad_pred)[apply(rad_pred, 1, which.max)])
 mean(actuals != predicteds)

 #> 0.25 

The misclassification error is 25%. Not bad for a factor variable!

4.3 mice

mice, short for Multivariate Imputation by Chained Equations, is an R package providing advanced functions for working with missing values. It uses a slightly unusual two-step approach: mice() builds the model and complete() generates the data. The mice(df) call creates several complete copies of df, each with its own estimates of the missing data. The complete() function returns one or several of these data sets; by default it returns the first. Let's see how to impute rad and ptratio:
 library(mice)
 # impute with mice using the random forest method
 miceMod <- mice(BostonHousing[, !names(BostonHousing) %in% "medv"], method="rf")
 miceOutput <- complete(miceMod)  # generate the completed data
 anyNA(miceOutput)

 #> FALSE 

Calculate the accuracy for ptratio:
 actuals <- original$ptratio[is.na(BostonHousing$ptratio)]
 predicteds <- miceOutput[is.na(BostonHousing$ptratio), "ptratio"]
 regr.eval(actuals, predicteds)

 #>        mae        mse       rmse       mape 
 #> 0.36500000 0.78100000 0.88374204 0.02121326

The mean absolute percentage error (mape) improved by another 48% compared to rpart. Excellent!

Calculate the accuracy for rad:
 # misclassification error
 actuals <- original$rad[is.na(BostonHousing$rad)]
 predicteds <- miceOutput[is.na(BostonHousing$rad), "rad"]
 mean(actuals != predicteds)

 #> 0.15 

The misclassification error dropped to 15%, i.e. 6 of 40 observations. This is a significant improvement over the 25% of rpart.

Although we can see how well each method performed here, that is not enough to say definitively which one is better or worse in general. But all of them certainly deserve your attention the next time you need to deal with missing values.

Source: https://habr.com/ru/post/283168/

