
Incomplete data sets, in which some values are undefined, turn up quite often. In R such values are denoted NA, for “Not Available”. The question naturally arises of how to deal with them: should they be ignored, or corrected in some way?
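A quick illustration of how NA propagates through arithmetic:

x <- c(1, 2, NA)
mean(x)                # NA: any computation involving NA yields NA
mean(x, na.rm = TRUE)  # 1.5 once the NA is dropped
is.na(x)               # FALSE FALSE TRUE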
Let us examine some aspects of this problem using a practically classic data set from the Kaggle competition Titanic: Machine Learning from Disaster. The data can be downloaded manually from the Kaggle website or with the same R tools (here via wget, so under Linux):
download.file("https://bitbucket.org/kailexx/fixnas/raw/ae65f7939974e709f10aa50c96c368120487a7f2/train.csv",
              destfile = "train.csv", method = "wget")
train <- read.csv("train.csv", na.strings = c(NA, ""))
Let's see what the file contains:
str(train)
'data.frame':   891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 416 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 525 596 662 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
 $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
sum(is.na(train))
[1] 866
In the set there are 891 rows and as many as 866 missing values, spread very unevenly across the variables.
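A quick per-column count shows exactly where they sit:

colSums(is.na(train))
# Age (177), Cabin (687) and Embarked (2) account for all 866 NAs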

Two easy solutions
The first solution is for the lazy: there is practically nothing to do, since many functions in R have an na.rm argument; setting it to TRUE makes R simply drop the NAs before doing anything else. Other functions have an na.action parameter, which can take the following values (a small sketch of their behaviour follows the list):
na.fail - fail with an error if the data contains NA.
na.omit, na.exclude - drop all observations (rows) containing NA; na.exclude additionally lets modelling functions pad residuals and predictions back to the original length.
na.pass - leave the data as it is.
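A minimal sketch of what these do to a vector with a missing value:

x <- c(1, 2, NA, 4)
# na.fail(x)  # would stop with an error: missing values in object
na.omit(x)    # 1 2 4, with an attribute recording the dropped position
na.pass(x)    # 1 2 NA 4: the data passes through untouched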
So, if we need to calculate the median age of the Titanic's passengers, we can do this:
median(train$Age, na.rm = TRUE)
[1] 28
The second solution essentially follows from the first: purge the data of all missing values in advance and never think about them again:
train.nopain <- na.omit(train)
nrow(train.nopain)
[1] 183
sum(is.na(train.nopain))
[1] 0
median(train.nopain$Age)
[1] 36
As you can see, 708 rows containing NA have vanished, and the median age has jumped from 28 to 36. That is harmless if we are not particularly interested in the result; otherwise it is worth considering other options.
We use the means at hand
An acceptable option is to replace each NA with some preselected value, usually the mean or the median. To do this, we write a simple function:
# replace NAs in x with the value of imputeFn over the non-missing elements
simpleFix <- function(x, imputeFn = mean) {
  return(ifelse(is.na(x), imputeFn(x, na.rm = TRUE), x))
}
train.median <- train
nas.idx <- which(is.na(train.median$Age))  # remember where the NAs were
train.median$Age <- simpleFix(train.median$Age, median)
head(train.median$Age[nas.idx])
[1] 28 28 28 28 28 28
All missing Age values have now been replaced by the median. Sometimes it makes sense to compute the mean or median conditionally. For example, when filling in the Age of a male passenger travelling first class, the average should be computed over first-class men only:
# impute Age from passengers of the same class and sex
fixAge <- function(tdf, imputeFn = mean) {
  tdf$Age[is.na(tdf$Age)] <- sapply(which(is.na(tdf$Age)), function(i)
    imputeFn(tdf$Age[tdf$Pclass == tdf$Pclass[i] & tdf$Sex == tdf$Sex[i]],
             na.rm = TRUE))
  return(tdf)
}
nas.idx <- which(is.na(train$Age))
train.cond <- fixAge(train, median)
head(train.cond$Age[nas.idx])
[1] 25.0 30.0 21.5 25.0 21.5 25.0
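To see where these numbers come from, look at the median age within each class-by-sex group; the imputed values above are drawn from this table:

with(train, tapply(Age, list(Pclass, Sex), median, na.rm = TRUE))
# a 3x2 table of medians: rows are Pclass, columns are Sex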
It is worth noting that replacing missing values with the mean or median reduces the variance of the data, which is easy to check:
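sd(train$Age, na.rm = TRUE)  # spread of the observed ages
sd(train.median$Age)         # noticeably smaller: 177 identical values were inserted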
This method can be slightly improved (and complicated) by using singular value decomposition (SVD) and approximation by a matrix of lower rank. The missing values are first replaced by the mean; the matrix is then approximated by a matrix of lower rank, and the stand-in values are overwritten with the corresponding entries of the approximation. The approximation step is repeated several times, so the rank and the number of iterations have to be chosen in advance. Consider a toy example: a 10x10 matrix of random numbers with 15 missing values.
k <- 6          # approximation rank
n.iters <- 10   # number of approximation steps
nrows <- 10
set.seed(100500)
train.mat <- runif(nrows * nrows)
train.mat[sample(1:length(train.mat), 15)] <- NA  # knock out 15 values
train.mat <- matrix(train.mat, nrows)
nas.idx <- which(is.na(train.mat))
train.svd <- train.mat
train.svd <- apply(train.svd, 2, simpleFix)       # initial fill: NA -> column mean
for (i in 1:n.iters) {
  s <- svd(train.svd, k, k)
  train.svd[nas.idx] <- (s$u %*% diag(s$d[1:k], nrow = k, ncol = k) %*% t(s$v))[nas.idx]
}
head(train.svd[nas.idx])
[1] 0.3020229 0.4475467 0.3114711 0.7161445 0.4379184 0.6734933
What to do with non-numeric values
If you examine the Embarked field carefully, you will find two missing values there. One way to handle them is based on sampling:
fixSample <- function(x) {
  # draw replacements at random from the existing values
  # (strictly, sampling from x[!is.na(x)] would guarantee no NA is redrawn)
  x[is.na(x)] <- sample(x, sum(is.na(x)), replace = TRUE)
  return(x)
}
set.seed(111)
nas.idx <- which(is.na(train.cond$Embarked))
train.cond$Embarked <- fixSample(train.cond$Embarked)
train.cond$Embarked[nas.idx]
[1] S C
Levels: C Q S
sum(is.na(train.cond$Embarked))
[1] 0
Sampling is actually a fairly common method, and it is very popular with modern statisticians.
A more universal approach
R has well-developed tools for building all kinds of models, from simple linear regression to the 3B techniques (bagging, boosting, blending). Let's write a function that uses a random forest to fill in the missing values; we will also drop the PassengerId, Name, Ticket, and Cabin variables (frankly, I have not found a worthy use for them, and the Cabin field has so many missing values that imputing it makes no sense). If the randomForest package is not installed, the command
install.packages("randomForest")
will install it from CRAN.
fixNA <- function(y, x) {
  require(randomForest)
  # train a forest on the rows where y is known...
  fixer <- randomForest(x[!is.na(y), ], y[!is.na(y)])
  # ...and predict y where it is missing
  y[is.na(y)] <- predict(fixer, x[is.na(y), ])
  return(y)
}
set.seed(111)
train.rf <- subset(train, select = -c(PassengerId, Name, Ticket, Cabin))
ageNA.idx <- which(is.na(train.rf$Age))
embNA.idx <- which(is.na(train.rf$Embarked))
sum(is.na(train.rf))
train.rf$Age <- fixNA(train.rf$Age,
                      cbind(train.rf$Pclass, train.rf$Sex, train.rf$Parch))
train.rf$Embarked <- fixNA(train.rf$Embarked,
                           cbind(train.rf$Pclass, train.rf$Sex, train.rf$Parch))
head(train.rf$Age[ageNA.idx])
[1] 29.65873 31.67546 26.64918 29.65873 26.64918 29.65873
head(train.rf$Embarked[embNA.idx])
[1] S S
sum(is.na(train.rf))
[1] 0
In this form, the data is almost ready for a more detailed analysis.
Conclusion
There are many approaches and methods for working with incomplete data; the problem is far from trivial. CRAN, in particular, offers specialized packages for handling NA values (for example, Amelia and imputation), as well as packages that, among other things, let you manipulate NAs (the impute functions in the Hmisc package, rfImpute in randomForest). Methods based on the Expectation-Maximization (EM) algorithm deserve separate consideration.
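As a starting point, here is a minimal sketch using Amelia, whose imputation routine is built on an EM-with-bootstrapping algorithm; the column choices below are purely illustrative:

library(Amelia)
# drop the text-heavy columns, leave the identifier alone,
# and treat the categorical fields as nominal
train.em <- subset(train, select = -c(Name, Ticket, Cabin))
a.out <- amelia(train.em, m = 5, idvars = "PassengerId",
                noms = c("Sex", "Embarked"))
head(a.out$imputations[[1]]$Age)  # one of the five completed data sets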
References and literature
1. Working with missing data
2. Data Imputation
3. Nicholas T. Longford. Missing Data & Small-Area Estimation: Modern Analytical Equipment for the Survey Statistician.