
Statistical tests in R. Part 1: Binary classification

Good day! I want to share my experience of working with statistics in R.
Many of us have to deal with various data at work and in everyday life. Processing and analyzing it correctly is not that difficult. In this series of articles, I will show how to apply some common statistical tests.

Interested? Then read on.

Part 2: Quality Data Tests
Part 3: Quantitative Tests

I apologize in advance for frequently using English terms, and for their possibly imperfect translation.

Binary classification, qualitative data


The first article is devoted to an interesting test: binary classification. This is testing that consists of checking objects for the presence of some quality. Examples include diagnostic tests (almost everyone has had the Mantoux test) and signal detection in radar.

We will work through an example. All sample files can be downloaded at the end of the article. Imagine that you have come up with an algorithm that determines whether a person is present in a photo. Everything seems to work, and you are delighted, but it is too early to celebrate: you still need to evaluate the quality of your algorithm. This is where our test comes in. We will not worry here about the sample size required for testing. Say you took 30 photos, recorded by hand in an Excel file whether people were in them or not, and then ran them through your algorithm. As a result, we get the following table:



We save it right away as CSV, so that we do not have to bother with reading xls files (possible in R, but not out of the box).

Now a little theory. Based on the test results, the following contingency table is compiled.



Important parameters

A priori characteristics:
Sensitivity. P(T+|H+), the probability that a person who is present will be detected.
Se = 14/16
Specificity, in other tests often referred to as power. P(T-|H-), the probability that, in the absence of a person, the test result is negative.
Sp = 10/14

Likelihood ratio. An important characteristic for evaluating a test. It consists of two values:

LR+ = Se / (1 - Sp)
LR- = Sp / (1 - Se)

In the literature, a test is considered good if both LR+ and LR- are greater than 3 (this refers to medical tests).

A posteriori characteristics: positive and negative predictive value. The probability that the test result (positive or negative) is correct.
PV+ = 14/18
PV- = 10/12

There are also the concepts of type I error (1 - Se) and type II error (1 - Sp); they are essentially equivalent to sensitivity and specificity, just viewed from the other side.
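As a sanity check, all of these quantities can be computed directly from the 2x2 counts implied by the fractions above (TP=14, FN=2, FP=4, TN=10); this is just a sketch recomputing the article's numbers:

```r
# counts from the 2x2 table: rows = test result, columns = ground truth
TP <- 14; FN <- 2   # people present: detected / missed
FP <- 4;  TN <- 10  # people absent: false alarms / correct rejections

Se  <- TP / (TP + FN)   # sensitivity, P(T+|H+)
Sp  <- TN / (TN + FP)   # specificity, P(T-|H-)
LRp <- Se / (1 - Sp)    # positive likelihood ratio
LRm <- Sp / (1 - Se)    # negative likelihood ratio (as defined above;
                        # many texts instead define LR- as (1-Se)/Sp)
PVp <- TP / (TP + FP)   # positive predictive value
PVm <- TN / (TN + FN)   # negative predictive value

round(c(Se = Se, Sp = Sp, LRp = LRp, LRm = LRm, PVp = PVp, PVm = PVm), 3)
```

Here LR+ comes out slightly above 3, which is exactly the borderline "good test" value mentioned above.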

Now in R

First, we load the data.

 tab <- read.csv(file="data1.csv", header=TRUE, sep=",", dec=".")
 attach(tab)
 Test  <- factor(Test,  levels=c("0","1"), labels=c("T-","T+"), ordered=TRUE)
 Human <- factor(Human, levels=c("0","1"), labels=c("H-","H+"), ordered=TRUE)

In the last two lines, we assigned labels instead of 0 and 1. This is necessary because otherwise R would treat our data as numbers.

The table can be displayed as follows:

 addmargins(table(Test, Human)) 



This table is not bad, but the prettyR package will do almost everything for us. To install a package in the default R GUI, choose Packages → Install packages and type the name of the package.

Now we use the library. For variety, we will output the result as HTML, because my RStudio displays tables slightly incorrectly (if you know how to fix this, let me know).

 library(prettyR)
 test <- calculate.xtab(Test, Human, varnames=c("Test","Human","T+","T-","H+","H-"))
 print(test, html=TRUE)




Let us examine what is written there.


Thus we obtain quantitative characteristics of our algorithm's performance. Note that LR+, which the table labels as the odds ratio, is greater than 3. Also pay attention to the parameters described above. As a rule, PV+ and Se are of main interest, since a false alarm is an extra cost, while a missed detection can have fatal consequences.

Binary classification, quantitative data


What if our data are quantitative? This may be, for example, the parameter on which the previous algorithm bases its decision (say, the number of skin-colored pixels). For fun, let's look at the work of an algorithm that blocks spammers.

You are the creator of a new social network and are trying to fight spammers. Spammers send a large number of messages, so the simplest approach is to block them once they exceed a certain message threshold. But how to choose it? We again take a sample of 30 users, find out whether they are bots, record their number of messages, and get:



A bit more theory. After selecting a threshold, we split the sample into 2 parts and get a table like the one in the first example. Naturally, our task is to choose the best threshold. There is no single algorithm for this, since in each real example sensitivity and specificity play different roles. However, there is a method that helps make the decision, and also evaluate the test as a whole. It is called the ROC curve ("receiver operating characteristic" curve), originally used in radar. Let's build it in R.
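To make the threshold idea concrete: each candidate threshold turns the quantitative data back into the 2x2 setting of the first example. A minimal sketch, with made-up message counts rather than the article's data2.csv (the variable names Messages and Bot follow the data file):

```r
# illustrative data: message counts and true bot labels (invented, not data2.csv)
Messages <- c(5, 12, 35, 47, 60, 81, 90, 120)
Bot      <- c(0,  0,  0,  1,  0,  1,  1,   1)

threshold <- 50
Predicted <- as.integer(Messages > threshold)  # block users above the threshold

# resulting 2x2 table, as in the first example
table(Predicted, Bot)
```

Changing the threshold changes the counts in this table, and with them Se and Sp; the ROC curve simply traces all these tables at once.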

First, install the ROCR package (the gtools, gplots and gdata packages will be installed with it if you do not have them).

Again loading data.

 # loading data
 # don't forget to set your working directory
 tab <- read.csv(file="data2.csv", header=TRUE, sep=",", dec=".")
 attach(tab)

Now build the curve.

 # ROC curve
 library(ROCR)
 pred <- prediction(Messages, Bot)
 # area under the curve calculation (pred must be created first)
 auc <- slot(performance(pred, "auc"), "y.values")[[1]]
 plot(performance(pred, "tpr", "fpr"), lwd=2)
 lines(c(0,1), c(0,1))
 text(0.6, 0.2, paste("AUC=", round(auc, 4), sep=""), cex=1.4)
 title("ROC Curve")



On this graph, sensitivity is on the y-axis, and 1 - specificity is on the x-axis. Obviously, a good test should maximize both sensitivity and specificity; the question is in what proportion. If both parameters are equally important, you can look for the point furthest from the bisector. By the way, in R it is possible to make this graph more informative by adding cut points.
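For equal weights, the point furthest from the bisector is the one maximizing TPR - FPR, known as the Youden index. A base-R sketch of this search, again on illustrative data rather than data2.csv:

```r
# Youden-style threshold search (illustrative data, not data2.csv)
Messages <- c(5, 12, 35, 47, 60, 81, 90, 120)
Bot      <- c(0,  0,  0,  1,  0,  1,  1,   1)

cutoffs <- sort(unique(Messages))
youden <- sapply(cutoffs, function(t) {
  pred <- Messages > t
  tpr  <- sum(pred & Bot == 1) / sum(Bot == 1)  # sensitivity
  fpr  <- sum(pred & Bot == 0) / sum(Bot == 0)  # 1 - specificity
  tpr - fpr                                     # vertical distance from the bisector
})
best <- cutoffs[which.max(youden)]
best
```

With the article's real data, the same loop over cut points would reproduce the candidates visible on the annotated plot.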

 # ROC curve with cut points marked
 plot(performance(pred, "tpr", "fpr"),
      print.cutoffs.at=c(30,40,60,81), text.adj=c(1.1,-0.5), lwd=2)
 lines(c(0,1), c(0,1))
 text(0.6, 0.2, paste("AUC=", round(auc, 4), sep=""), cex=1.4)
 title("ROC Curve")



That's much better. We see that the points furthest from the bisector are 40 and 60. By the way, a word about the bisector and the area under the curve that we calculated. The bisector corresponds to random guessing, i.e. 50/50. A good test should have an area under the curve greater than 0.5, i.e. greater than the area under the bisector. Preferably much greater, but never less: in that case it would be better to guess at random than to use our method.
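The AUC also has a direct probabilistic reading: it is the probability that a randomly chosen positive case outranks a randomly chosen negative one (the Wilcoxon/Mann-Whitney interpretation). A sketch checking this by brute force on the same illustrative data:

```r
# AUC as P(random positive > random negative), illustrative data
Messages <- c(5, 12, 35, 47, 60, 81, 90, 120)
Bot      <- c(0,  0,  0,  1,  0,  1,  1,   1)

pos <- Messages[Bot == 1]
neg <- Messages[Bot == 0]

# enumerate all positive-negative pairs; ties count as half
pairs <- expand.grid(p = pos, n = neg)
auc <- mean(pairs$p > pairs$n) + 0.5 * mean(pairs$p == pairs$n)
auc
```

This is why 0.5 is the random-guessing baseline: a test that ranks positives no better than a coin flip wins exactly half of these pairwise comparisons.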

Results


In this article, I described how to work with binary classification in R. As you can see, situations where it applies can be found in everyday life. The main characteristics of such tests are sensitivity, specificity, likelihood ratio and predictive value. They are interconnected and show the effectiveness of the test from different angles. In the case of quantitative data, they can be tuned by choosing a cut-off point, and the ROC curve helps with this. The choice is made separately in each case, taking into account the requirements for the test, but as a rule sensitivity is more important.

The following articles will deal with the analysis of qualitative and quantitative data, t-test, chi-square test and much more.

Thank you for your attention. I hope you enjoyed it!

Sample files

Source: https://habr.com/ru/post/167341/

