Data sets are often incomplete: some variables have no value for some observations. In the R language, such values are marked as “Not Available”, abbreviated as NA. The question is how to deal with these missing values: should they be ignored or corrected in some way?

    download.file("https://bitbucket.org/kailexx/fixnas/raw/ae65f7939974e709f10aa50c96c368120487a7f2/train.csv",
                  destfile = "train.csv", method = "wget")
    # Treat both the string "NA" and empty fields as missing.
    # stringsAsFactors = TRUE reproduces the factor columns shown below
    # (it is no longer the default since R 4.0).
    train <- read.csv("train.csv", na.strings = c("NA", ""), stringsAsFactors = TRUE)
    str(train)
    ## 'data.frame': 891 obs. of 12 variables:
    ##  $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
    ##  $ Survived   : int 0 1 1 1 0 0 0 0 1 1 ...
    ##  $ Pclass     : int 3 1 3 1 3 3 1 3 3 2 ...
    ##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 416 581 ...
    ##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
    ##  $ Age        : num 22 38 26 35 35 NA 54 2 27 14 ...
    ##  $ SibSp      : int 1 1 0 1 0 0 0 3 0 1 ...
    ##  $ Parch      : int 0 0 0 0 0 0 0 1 2 0 ...
    ##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 525 596 662 50 473 276 86 396 345 133 ...
    ##  $ Fare       : num 7.25 71.28 7.92 53.1 8.05 ...
    ##  $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
    ##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

    # Total number of NA values in the whole data frame
    sum(is.na(train))
    ## [1] 866
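The total of 866 does not show which columns the missing values sit in. A minimal sketch of a per-column breakdown follows; it uses a small toy data frame rather than the Titanic file, and the names `toy`, `na.per.col`, `na.share` and `n.complete` are illustrative assumptions, not part of the original article.

```r
# Toy data frame with hypothetical values (not the Titanic data).
toy <- data.frame(
  Age      = c(22, 38, 26, NA, 35),
  Cabin    = c(NA, "C85", NA, NA, NA),
  Embarked = c("S", "C", NA, "S", "S"),
  stringsAsFactors = TRUE
)

# is.na() on a data frame returns a logical matrix,
# so colSums() gives the number of NA values in each column.
na.per.col <- colSums(is.na(toy))
print(na.per.col)

# Share of missing values per column.
na.share <- colMeans(is.na(toy))
print(round(na.share, 2))

# Number of rows without any NA, i.e. the rows na.omit() would keep.
n.complete <- sum(complete.cases(toy))
print(n.complete)
```

The same three calls can be applied to `train` directly. A column that is mostly NA is usually a poor candidate for row-wise deletion, because dropping its incomplete rows discards most of the data set.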
    # Median age over the observed values only
    median(train$Age, na.rm = TRUE)
    ## [1] 28

    # Drop every row that contains at least one NA
    train.nopain <- na.omit(train)
    nrow(train.nopain)
    ## [1] 183
    sum(is.na(train.nopain))
    ## [1] 0
    median(train.nopain$Age)
    ## [1] 36

    # Replace NA with a summary statistic of the observed values (mean by default)
    simpleFix <- function(x, imputeFn = mean) {
      ifelse(is.na(x), imputeFn(x, na.rm = TRUE), x)
    }
    train.median <- train
    nas.idx <- which(is.na(train.median$Age))
    train.median$Age <- simpleFix(train.median$Age, median)
    head(train.median$Age[nas.idx])
    ## [1] 28 28 28 28 28 28

    # Impute Age from the passengers of the same class and sex
    fixAge <- function(tdf, imputeFn = mean) {
      tdf$Age[is.na(tdf$Age)] <- sapply(which(is.na(tdf$Age)), function(i)
        imputeFn(tdf$Age[tdf$Pclass == tdf$Pclass[i] & tdf$Sex == tdf$Sex[i]], na.rm = TRUE))
      tdf
    }
    nas.idx <- which(is.na(train$Age))
    train.cond <- fixAge(train, median)
    head(train.cond$Age[nas.idx])
    ## [1] 25.0 30.0 21.5 25.0 21.5 25.0

    k <- 6           # number of singular values kept
    n.iters <- 10    # number of iterations
    nrows <- 10
    set.seed(100500)
    train.mat <- runif(nrows * nrows)
    train.mat[sample(1:length(train.mat), 15)] <- NA
    train.mat <- matrix(train.mat, nrows)          # 10 x 10 matrix with 15 NA
    nas.idx <- which(is.na(train.mat))
    train.svd <- train.mat
    train.svd <- apply(train.svd, 2, simpleFix)    # initial fill: NA replaced by column means
    for (i in 1:n.iters) {
      s <- svd(train.svd, k, k)
      # The rank-k reconstruction overwrites only the originally missing cells
      train.svd[nas.idx] <- (s$u %*% diag(s$d[1:k], nrow = k, ncol = k) %*% t(s$v))[nas.idx]
    }
    head(train.svd[nas.idx])
    ## [1] 0.3020229 0.4475467 0.3114711 0.7161445 0.4379184 0.6734933

If we look at the Embarked field, there are 2 undefined values. One of the processing options is based on sampling:

    fixSample <- function(x) {
      # Sample only from the observed values, so that an NA can never be drawn
      x[is.na(x)] <- sample(x[!is.na(x)], sum(is.na(x)), replace = TRUE)
      x
    }
    set.seed(111)
    nas.idx <- which(is.na(train.cond$Embarked))
    train.cond$Embarked <- fixSample(train.cond$Embarked)
    train.cond$Embarked[nas.idx]
    ## [1] S C
    ## Levels: C Q S
    sum(is.na(train.cond$Embarked))
    ## [1] 0

The install.packages("randomForest") command installs the randomForest package from CRAN.
    fixNA <- function(y, x) {
      library(randomForest)
      # Fit on the rows where y is known, then predict y for the rows where it is missing
      fixer <- randomForest(x[!is.na(y), , drop = FALSE], y[!is.na(y)])
      y[is.na(y)] <- predict(fixer, x[is.na(y), , drop = FALSE])
      y
    }
    set.seed(111)
    train.rf <- subset(train, select = -c(PassengerId, Name, Ticket, Cabin))
    ageNA.idx <- which(is.na(train.rf$Age))
    embNA.idx <- which(is.na(train.rf$Embarked))
    sum(is.na(train.rf))
    train.rf$Age <- fixNA(train.rf$Age, cbind(train.rf$Pclass, train.rf$Sex, train.rf$Parch))
    train.rf$Embarked <- fixNA(train.rf$Embarked, cbind(train.rf$Pclass, train.rf$Sex, train.rf$Parch))
    head(train.rf$Age[ageNA.idx])
    ## [1] 29.65873 31.67546 26.64918 29.65873 26.64918 29.65873
    head(train.rf$Embarked[embNA.idx])
    ## [1] S S
    ## Levels: C Q S
    sum(is.na(train.rf))
    ## [1] 0

There are also packages dedicated to handling NA (Amelia, imputation), as well as packages that, among other things, allow you to manipulate NA (the impute functions in the Hmisc package, rfImpute in randomForest). The method based on the Expectation Maximization algorithm deserves separate consideration.

Source: https://habr.com/ru/post/207750/