- Could you build us a statistical model?
- With pleasure. Can I look at your historical data?
- We have no data yet. But the model is still needed.
Proper linear models are those in which the weights are chosen so that the resulting linear combination predicts the quantity of interest optimally. An example is ordinary linear regression fitted by the method of least squares. [...]
Improper linear models are those in which the weights are determined non-optimally. For example, they may be assigned on the basis of intuition or previous experience, or simply set equal to one.
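As a minimal sketch of what an improper model looks like in practice, the snippet below scores a few made-up rows with unit weights. The toy data and the helper name `unit_weight_score` are assumptions for illustration only; the signs follow the choices this article makes for `pclass`, `age` and `female`, and each variable is divided by its standard deviation so the unit weights act on comparable scales.

```r
# Toy data, made up for illustration (not taken from the Titanic file)
toy <- data.frame(
  pclass = c(1, 3, 2, 3, 1, 2),
  age    = c(29, 40, 8, 22, 54, 35),
  female = c(1, 0, 1, 0, 0, 1)
)

# Hypothetical helper: the signs (+1 or -1) are chosen by judgment, not fitted.
# Each variable is divided by its standard deviation before summing.
unit_weight_score <- function(df, signs) {
  stopifnot(all(names(signs) %in% names(df)), all(abs(signs) == 1))
  score <- numeric(nrow(df))
  for (v in names(signs)) {
    score <- score + signs[[v]] * df[[v]] / sd(df[[v]])
  }
  score
}

signs <- c(pclass = -1, age = -1, female = 1)
toy$score <- unit_weight_score(toy, signs)

# Rows sorted from the highest score to the lowest
toy[order(-toy$score), ]
```

No model is fitted here: the only data-dependent quantities are the standard deviations used for scaling.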
Four models are compared:

ImproperModel - regression with unit weights (the topic of this article)
ProperModel - logistic regression with the same independent variables
LogisticRegression - logistic regression with all independent variables
RandomForest - one of the most common modern algorithms

The variable female is assigned weight +1, and the variable age weight -1. In addition, passengers with second- and third-class tickets, whose cabins are in the hold, have farther to run to the boats, so the variable pclass is assigned a weight of -1.

    # Install the packages once, then load them
    install.packages("pROC")
    install.packages("randomForest")
    library(pROC)
    library(randomForest)
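    # Load the data from titanic3.csv
    data <- read.csv('~/habr/unit_weights/titanic3.csv')

    # Treat the target variable survived as a factor
    data$survived <- as.factor(data$survived)

    # Drop columns that are not used as predictors
    data$name <- NULL
    data$ticket <- NULL
    data$cabin <- NULL
    data$home.dest <- NULL
    data$embarked <- NULL

    # boat and body are only known after the outcome, so drop them too
    data$boat <- NULL
    data$body <- NULL

    # Replace missing age and fare values with the column means
    data$age[is.na(data$age)] <- mean(data$age, na.rm=TRUE)
    data$fare[is.na(data$fare)] <- mean(data$fare, na.rm=TRUE)

    # Encode sex as a 0/1 indicator named female
    data$female <- as.numeric(data$sex == 'female')
    data$sex <- NULL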
     survived     pclass           age              sibsp
     0:809    Min.   :1.000   Min.   : 0.1667   Min.   :0.0000
     1:500    1st Qu.:2.000   1st Qu.:22.0000   1st Qu.:0.0000
              Median :3.000   Median :29.8811   Median :0.0000
              Mean   :2.295   Mean   :29.8811   Mean   :0.4989
              3rd Qu.:3.000   3rd Qu.:35.0000   3rd Qu.:1.0000
              Max.   :3.000   Max.   :80.0000   Max.   :8.0000
         parch            fare             female
     Min.   :0.000   Min.   :  0.000   Min.   :0.000
     1st Qu.:0.000   1st Qu.:  7.896   1st Qu.:0.000
     Median :0.000   Median : 14.454   Median :0.000
     Mean   :0.385   Mean   : 33.295   Mean   :0.356
     3rd Qu.:0.000   3rd Qu.: 31.275   3rd Qu.:1.000
     Max.   :9.000   Max.   :512.329   Max.   :1.000
    # Gini of each model on each of 1000 random holdout samples
    im.gini <- numeric(1000)
    pm.gini <- numeric(1000)
    lr.gini <- numeric(1000)
    rf.gini <- numeric(1000)

    set.seed(42)
    for (i in 1:1000) {
      # Random split: roughly 70% development, 30% holdout
      data$random_number <- runif(nrow(data), 0, 1)
      development <- data[which(data$random_number > 0.3), ]
      holdout <- data[which(data$random_number <= 0.3), ]
      development$random_number <- NULL
      holdout$random_number <- NULL

      # Improper model: unit weights on variables scaled by their standard deviation
      beta_pclass <- -1/sd(development$pclass)
      beta_age <- -1/sd(development$age)
      beta_female <- 1/sd(development$female)
      im.score <- beta_pclass*holdout$pclass + beta_age*holdout$age + beta_female*holdout$female
      im.roc <- roc(holdout$survived, im.score)
      im.gini[i] <- 2*as.numeric(im.roc$auc) - 1

      # Proper model: logistic regression on the same three variables
      pm.model <- glm(survived~pclass+age+female, family=binomial(logit), data=development)
      pm.score <- predict(pm.model, holdout, type="response")
      pm.roc <- roc(holdout$survived, pm.score)
      pm.gini[i] <- 2*as.numeric(pm.roc$auc) - 1

      # Logistic regression on all variables
      lr.model <- glm(survived~., family=binomial(logit), data=development)
      lr.score <- predict(lr.model, holdout, type="response")
      lr.roc <- roc(holdout$survived, lr.score)
      lr.gini[i] <- 2*as.numeric(lr.roc$auc) - 1

      # Random forest on all variables; score with the probability of class "1" (survived)
      rf.model <- randomForest(survived~., development)
      rf.score <- predict(rf.model, holdout, type="prob")
      rf.roc <- roc(holdout$survived, rf.score[, "1"])
      rf.gini[i] <- 2*as.numeric(rf.roc$auc) - 1
    }
    # Collect the results and draw a boxplot of Gini by model
    bpd <- data.frame(ImproperModel=im.gini, ProperModel=pm.gini, LogisticRegression=lr.gini, RandomForest=rf.gini)
    png('~/habr/unit_weights/auc_comparison.png', height=700, width=400, res=120, units='px')
    par(mar=c(9,5,1,1)+0.1)  # wide bottom margin for the vertical axis labels
    boxplot(bpd, las=2, ylab="Gini", ylim=c(0,1), col=c("red","green","royalblue2","brown"))
    dev.off()

    # Mean Gini of each model
    mean(im.gini)
    mean(pm.gini)
    mean(lr.gini)
    mean(rf.gini)

    # Mean Gini relative to the best model (random forest)
    mean(im.gini)/mean(rf.gini)
    mean(pm.gini)/mean(rf.gini)
    mean(lr.gini)/mean(rf.gini)
    mean(rf.gini)/mean(rf.gini)
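The code above computes Gini as 2*AUC - 1 from a pROC curve. As a rough illustration of what that number measures, here is a base-R sketch that gets the same quantity from the rank (Mann-Whitney) form of AUC. The helper name `gini_from_scores` is hypothetical, and it assumes that a higher score means the positive class, whereas `roc()` by default picks the direction automatically.

```r
# Hypothetical helper: Gini coefficient from 0/1 labels and numeric scores.
# AUC is computed from ranks; tied scores receive average ranks.
gini_from_scores <- function(labels, scores) {
  labels <- as.integer(as.character(labels))  # accepts numeric or factor "0"/"1"
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(scores)
  auc <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  2 * auc - 1
}

labels <- c(0, 0, 1, 1)
gini_partial <- gini_from_scores(labels, c(0.1, 0.4, 0.35, 0.8))  # one misordered pair
gini_perfect <- gini_from_scores(labels, c(0.1, 0.2, 0.8, 0.9))   # perfect separation
c(partial = gini_partial, perfect = gini_perfect)
```

A Gini of 0 corresponds to a score that ranks positives no better than chance, and 1 to a score that ranks every positive above every negative.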
Model | Gini | Percentage of best |
---|---|---|
ImproperModel | 0.639 | 90.4% |
ProperModel | 0.667 | 94.3% |
LogisticRegression | 0.679 | 96.0% |
RandomForest | 0.707 | 100% |
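The last column of the table is plain arithmetic on the second one. The sketch below recomputes it from the rounded Gini values shown in the table. The variable names are illustrative, and the article's own figures come from unrounded means, so small differences in the last digit would not be surprising in general.

```r
# Mean Gini values as printed in the table (rounded to three decimals)
gini <- c(ImproperModel = 0.639, ProperModel = 0.667,
          LogisticRegression = 0.679, RandomForest = 0.707)

# Each model's Gini as a percentage of the best model's Gini
pct_of_best <- round(100 * gini / max(gini), 1)
pct_of_best
```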
If k linearly independent variables x_i (i = 1, ..., k) with zero mean and unit variance are used to predict a variable y, which is also scaled to zero mean and unit variance, and the standardized least-squares regression coefficients β_i (i = 1, ..., k) are uniformly distributed over the interval [0.25, 0.75], then switching to equal weights (0.5) reduces the proportion of variance of the dependent variable explained by the model by less than k/96 in expectation. The loss is even smaller if the x_i are correlated with each other.
Source: https://habr.com/ru/post/272201/