
The robust beauty of improper models

No Titanic in the header picture, it sank.
- Could you build us a statistical model?
- With pleasure. Can I look at your historical data?
- We have no data yet. But the model is still needed.

A familiar dialogue, isn't it? From here, there are two possible scenarios:

A. “Then come back when you have the data.” We will not consider this option, as it is trivial.
B. “Then tell us which factors you think are most important.” The rest of the article is about this option.

Under the cut is a story about what an improper model is, why its beauty is robust, and what that robustness costs. All on the example of the long-suffering data on the survival of the Titanic's passengers.

Where does this strange name come from


From the article The Robust Beauty of Improper Linear Models in Decision Making by Robyn M. Dawes, who in turn refers to the work of his predecessors.
Proper linear models are those in which the weights are assigned in such a way that the resulting linear combination predicts the quantity of interest optimally. An example is ordinary linear regression, fitted by the method of least squares. [...]

Improper linear models are those in which the weights are chosen non-optimally: for example, assigned on the basis of intuition or previous experience, or simply set equal to one.

A typical linear model looks like this:

ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₘxₘ

Here ŷ is the estimate of the dependent variable, β₀ is the intercept, β₁ … βₘ are the regression coefficients, and x₁ … xₘ are the independent variables.

Due to the lack of data for estimating β₁ … βₘ, we will act by fiat: assign each of them the value +1 (positive effect), -1 (negative effect), or 0 (no effect).
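For illustration, such a model can be written down directly, with no fitting step at all. A schematic sketch in R (the variable names are made up):

 # an improper linear model is written down, not estimated:
 # +1 for a believed positive effect, -1 for negative, 0 for none
 improper_score <- function(x1, x2, x3) {
   (+1)*x1 + (-1)*x2 + (0)*x3
 }
 improper_score(x1 = 2.0, x2 = 1.5, x3 = 7.0)  # returns 0.5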

A contrived example


Typical examples of the statistical models discussed at the beginning of the article are scorecards for consumer lending (predicting the likelihood of a client's serious delinquency or default), response models (predicting the likelihood that a client will respond to an offer to buy or subscribe to a product), churn models (predicting the likelihood that a customer will leave for a competitor), and so on.

This article will use a simpler data set, most likely already familiar to every reader interested in machine learning: the passenger list of the Titanic. It has been examined on Habr before.


Competitors

- ImproperModel: a regression with unit weights, as described above
- ProperModel: logistic regression with the same independent variables
- LogisticRegression: logistic regression with all available variables
- RandomForest: a random forest over all available variables

How we will compare


The passenger list will be divided into two parts: on one we will fit the models, on the other we will check the quality of their work.

There are no miracles: a model cannot be built with no knowledge at all. In this case, that knowledge is gallantry: women and children were let through first when boarding the boats. Based on this prior knowledge, in the unit-weights regression we assign the variable female a weight of +1 and the variable age a weight of -1. In addition, passengers with second- and third-class tickets, whose cabins were down toward the hold, had farther to run to the boats, so the variable pclass also gets a weight of -1.

So as not to get up twice, we will normalize the weights right away: divide each by the standard deviation of the corresponding variable in the training set. This is needed to bring the independent variables to a comparable scale. Dividing by the range instead would give a similar result.
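To see what this scaling buys us, here is a toy sketch (the numbers are invented, not the Titanic data): without it, a variable measured in years would dominate a 0/1 indicator simply because of its units.

 age    <- c(22, 38, 26, 35, 54)  # toy values, in years
 female <- c(0, 1, 1, 1, 0)       # 0/1 indicator
 # dividing each sign weight by the variable's sd removes the units
 w_age    <- -1 / sd(age)
 w_female <- +1 / sd(female)
 score <- w_age*age + w_female*female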

Note that the unit-weights model uses no information about who survived and who drowned, only the standard deviations of the independent variables for scaling.

We repeat the experiment a thousand times and look at the results. Model quality will be measured with the Gini coefficient (G1).
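For reference: the Gini coefficient used here is a simple rescaling of the ROC AUC, G1 = 2·AUC − 1, so a random model scores 0 and a perfect one scores 1. A one-line helper (the function name is mine) matching the 2*roc$auc - 1 expressions in the code below:

 # Gini (accuracy ratio) from ROC AUC: random model -> 0, perfect -> 1
 gini_from_auc <- function(auc) 2*auc - 1
 gini_from_auc(0.85)  # AUC 0.85 corresponds to Gini 0.70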

R code with step-by-step comments
 install.packages("pROC")
 install.packages("randomForest")
 library(pROC)
 library(randomForest)

Read the source data (the file is also available here: titanic3.csv):
 data <- read.csv('~/habr/unit_weights/titanic3.csv') 

The dependent variable is survived:
 data$survived <- as.factor(data$survived) 

Remove variables that have too many distinct values for a sensible analysis:
 data$name <- NULL
 data$ticket <- NULL
 data$cabin <- NULL
 data$home.dest <- NULL
 data$embarked <- NULL

We do not use variables from the future (boat and body are known only after the disaster):
 data$boat <- NULL
 data$body <- NULL

Replace missing values with the mean:
 data$age[is.na(data$age)] <- mean(data$age, na.rm=TRUE)
 data$fare[is.na(data$fare)] <- mean(data$fare, na.rm=TRUE)

Convert sex into a 0/1 indicator variable:
 data$female <- 0
 data$female[which(data$sex == 'female')] <- 1
 data$sex <- NULL

We look at what is left:
 summary(data)
  survived     pclass           age              sibsp
  0:809    Min.   :1.000   Min.   : 0.1667   Min.   :0.0000
  1:500    1st Qu.:2.000   1st Qu.:22.0000   1st Qu.:0.0000
           Median :3.000   Median :29.8811   Median :0.0000
           Mean   :2.295   Mean   :29.8811   Mean   :0.4989
           3rd Qu.:3.000   3rd Qu.:35.0000   3rd Qu.:1.0000
           Max.   :3.000   Max.   :80.0000   Max.   :8.0000
      parch            fare            female
  Min.   :0.000   Min.   :  0.000   Min.   :0.000
  1st Qu.:0.000   1st Qu.:  7.896   1st Qu.:0.000
  Median :0.000   Median : 14.454   Median :0.000
  Mean   :0.385   Mean   : 33.295   Mean   :0.356
  3rd Qu.:0.000   3rd Qu.: 31.275   3rd Qu.:1.000
  Max.   :9.000   Max.   :512.329   Max.   :1.000

Repeat the experiment a thousand times.
 im.gini = NULL
 pm.gini = NULL
 lr.gini = NULL
 rf.gini = NULL
 set.seed(42)
 for (i in 1:1000) {

Split the passengers into two samples: 70% go into the training sample, and on the remaining 30% we will measure model quality.
   data$random_number <- runif(nrow(data), 0, 1)
   development <- data[which(data$random_number > 0.3), ]
   holdout <- data[which(data$random_number <= 0.3), ]
   development$random_number <- NULL
   holdout$random_number <- NULL

The unit-weights model:
   beta_pclass <- -1/sd(development$pclass)
   beta_age <- -1/sd(development$age)
   beta_female <- 1/sd(development$female)
   im.score <- beta_pclass*holdout$pclass + beta_age*holdout$age + beta_female*holdout$female
   im.roc <- roc(holdout$survived, im.score)
   im.gini[i] <- 2*im.roc$auc - 1

The proper model: logistic regression with the same independent variables.
   pm.model <- glm(survived ~ pclass + age + female, family=binomial(logit), data=development)
   pm.score <- predict(pm.model, holdout, type="response")
   pm.roc <- roc(holdout$survived, pm.score)
   pm.gini[i] <- 2*pm.roc$auc - 1

Logistic regression with all variables
   lr.model <- glm(survived ~ ., family=binomial(logit), data=development)
   lr.score <- predict(lr.model, holdout, type="response")
   lr.roc <- roc(holdout$survived, lr.score)
   lr.gini[i] <- 2*lr.roc$auc - 1

Everyone's favorite (and not without reason) RandomForest
   rf.model <- randomForest(survived ~ ., development)
   rf.score <- predict(rf.model, holdout, type = "prob")
   # column 1 of rf.score is P(survived = 0); pROC's default
   # direction="auto" orients the curve, so the Gini stays positive
   rf.roc <- roc(holdout$survived, rf.score[, 1])
   rf.gini[i] <- 2*rf.roc$auc - 1
 }

Display the results
 bpd <- data.frame(ImproperModel=im.gini,
                   ProperModel=pm.gini,
                   LogisticRegression=lr.gini,
                   RandomForest=rf.gini)
 png('~/habr/unit_weights/auc_comparison.png', height=700, width=400, res=120, units='px')
 boxplot(bpd, las=2, ylab="Gini", ylim=c(0, 1),
         par(mar=c(9, 5, 1, 1) + 0.1),
         col=c("red", "green", "royalblue2", "brown"))
 dev.off()
 mean(im.gini)
 mean(pm.gini)
 mean(lr.gini)
 mean(rf.gini)
 mean(im.gini)/mean(rf.gini)
 mean(pm.gini)/mean(rf.gini)
 mean(lr.gini)/mean(rf.gini)
 mean(rf.gini)/mean(rf.gini)


Results


Model               Gini    Percentage of best
ImproperModel       0.639   90.4%
ProperModel         0.667   94.3%
LogisticRegression  0.679   96.0%
RandomForest        0.707   100%


Far-reaching conclusions


A pessimist will say that the model with unit weights came last in this comparison, and that is true. But it is also true that it achieved 90% of the best result without using any historical data, and lagged behind the ordinary logistic regression with the same independent variables by only 4%.

Why this happens


In an article with the unusual title Estimating Coefficients in Linear Models: It Don't Make No Nevermind, its author Howard Wainer gives the following equal-weights theorem:

If k linearly independent variables xᵢ (i = 1, …, k), standardized to zero mean and unit variance, are used to predict a variable y, also scaled to zero mean and unit variance, and the standardized least-squares regression coefficients βᵢ (i = 1, …, k) are uniformly distributed on the interval [0.25, 0.75], then the expected reduction in the proportion of variance of the dependent variable explained by the model when switching to equal weights (0.5) is less than k/96. The loss is even smaller if the xᵢ are correlated with each other.

In the example above the regression is logistic rather than linear, so the theorem does not apply directly, but the effect is clearly of the same nature (for reference, with k = 3 variables the bound would be 3/96 ≈ 3.1% of explained variance).

Further in the same article, the author notes that models with unit weights are robust in particular because they are unaffected by outliers in the training set and cannot be overfit.
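This is easy to see directly. A toy sketch in R (the data is invented): corrupting a single training point visibly moves the least-squares coefficients, while the unit weights by construction do not depend on the training outcomes at all.

 set.seed(7)
 n  <- 100
 x1 <- rnorm(n); x2 <- rnorm(n)
 y  <- x1 - x2 + rnorm(n)
 coef(lm(y ~ x1 + x2))   # fitted weights on clean data: roughly (+1, -1)

 y[1] <- 100              # a single gross outlier
 coef(lm(y ~ x1 + x2))   # fitted weights (and intercept) shift noticeably
 # the improper model keeps weights (+1, -1): there is nothing to refit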

Source: https://habr.com/ru/post/272201/

