This article is a continuation of the first part. In this series of articles, I consider the use of the increasingly popular programming language R for solving common statistical problems.
In this and the next article, I show how to choose the right tests for processing qualitative and quantitative data and how to implement them in R. These methods give you a realistic picture of an object, process, or phenomenon with respect to some parameter, i.e. they let you say "good" or "bad." They do not require deep knowledge of programming or statistics, and they will be useful to people in many fields.
Interested? Welcome under the cut!
Part 1: Binary Classification
Part 3: Quantitative Tests
Before we begin, one important remark. The hardest part is choosing how to analyze the data. R is a tool: it will do everything you tell it to. But if you apply the wrong method, your results will be meaningless.
Some theory
Testing is used to compare two random variables, i.e. to prove the presence or absence of a difference between them. The basic idea is to compare the difference between the means against the standard deviation.
In statistics, one operates on so-called statistical hypotheses. There are only two of them. The main one, H0, states that there is no difference:

$H_0\colon \bar{x}_1 = \bar{x}_2$
The alternative, H1, states that the values differ:

$H_1\colon \bar{x}_1 \neq \bar{x}_2$
The hypothesis may also be of the form $\bar{x}_1 < \bar{x}_2$, in which case one-sided tests have to be used, but in practice this is done extremely rarely.
It is also necessary to set the error of the first kind, α. I mentioned it in the previous article as (1 − sensitivity) in binary classification. This is the error with which we predict H1. The standard α is 5%.
Each test outputs two values: the test statistic and the p-value, which are essentially two forms of the same thing. Let me briefly explain. As we know, the confidence interval (that is, the interval that contains the random variable with a given probability), which we construct for H0, is given by the formula:

$\bar{x} \pm \zeta_{1-\alpha/2}\,\sigma$
where $\bar{x}$ is the mean and $\sigma$ is the standard deviation. $\zeta_{1-\alpha/2}$ is the quantile of the distribution used, i.e. the multiplier corresponding to an interval of the given probability, and α is the error of the first kind, i.e. the probability that x falls outside the interval.
The test takes the value x from the alternative hypothesis and characterizes it: the test statistic does so in units of quantiles, the p-value in units of probability. If the p-value is smaller than α, or equivalently the test statistic is greater than the quantile, then the alternative value lies outside this interval. In that case, with error probability α, we accept H1, since we believe such a value cannot belong to H0. The test statistic and the p-value are the results of the test.
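As a small illustration with the standard normal distribution (the numbers are made up, only to show the relationship between quantile, statistic, and p-value):
alpha <- 0.05
qnorm(1 - alpha / 2)   # quantile for a two-sided 95% interval, about 1.96
stat <- 2.5            # suppose a test returned this statistic
2 * (1 - pnorm(stat))  # its two-sided p-value, about 0.012 < 0.05, so we accept H1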
If we built a confidence interval instead, we would get essentially the same thing, and to many this may feel more natural. Nevertheless, tests are organized around the p-value principle, so let us use them as they are and not argue with the convention (especially since statisticians have worked this way all along; apparently this is how it should be).
The choice of methods

Data are paired if both values come from the same object: for example, a comparison of "before and after" results, or two algorithms run on the same data. Clearly, a paired test is more sensitive, i.e. the effect can be proved with a smaller sample.
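A small made-up illustration (quantitative tests themselves are the subject of the next part, but the idea of pairing is the same): on these numbers the paired test detects the shift, while the independent one does not.
before <- c(10.1, 9.8, 11.2, 10.5, 9.9)    # measurements "before" on five objects
after  <- c(10.6, 10.2, 11.9, 10.9, 10.4)  # measurements "after" on the same objects
t.test(after, before, paired = TRUE)       # paired comparison
t.test(after, before)                      # same data treated as independent samples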
Stages of working with data
You have identified what you are dealing with. What next? Here is a general algorithm of work (a minimal sketch in R follows the list).
- Description: obtain the means and standard deviations, plot the distribution graphs
- Decision-making
- The appropriate test
- Obtaining a confidence interval (in R this is included in the tests)
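The sketch below assumes a made-up binary sample x; binom.test is base R, and its output includes a confidence interval:
x <- rbinom(25, 1, 0.6)        # hypothetical binary sample
mean(x); sd(x)                 # description: mean and standard deviation
hist(x)                        # distribution graph
binom.test(sum(x), length(x))  # a suitable test for a proportion; reports a confidence interval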
Processing qualitative data
Qualitative data are "yes" and "no" answers. Examples of application: comparing the performance of two methods / algorithms / company departments, finding the influence of the presence of an antivirus on computer infection, of smoking on lung cancer, of an advertisement's text on the fact of a sale. The binomial distribution describes such data.
As an example, I took some abstract binary data: two groups of 25 elements, each element being 0 or 1. We will work with them as outlined above: first we present the data clearly (perhaps after that no tests will be needed at all), then we carry out the analysis.
For example, let us construct the binomial distribution for the first sample, taking its observed proportion as the true probability. This shows how the random variable should behave given the available data.
tab <- read.csv(file = "data1.csv", header = TRUE, sep = ",", dec = ".")
attach(tab)
tab
x1 <- X[Group == 1]
x2 <- X[Group == 2]
library(graphics)
n <- length(x1)
p1 <- mean(x1)  # observed proportion in group 1
k <- seq(0, n, by = 1)
plot(k, dbinom(k, n, p1), type = "l", ylab = "probability", main = "Binomial distribution")

If we are interested in the probability scale instead of counts:
plot(k / n, dbinom(k, n, p1), type = "l", ylab = "probability", main = "Binomial distribution")

To describe the data, we should present the mean probability with a confidence interval for each group. The probability mass function of the binomial distribution is:

$P(k) = \binom{n}{k}\, p^k (1-p)^{n-k}$
In R, the binom package will help us. Having installed it, we use the code:
library(binom)
binom.confint(sum(x1), n, conf.level = 0.95, methods = "exact")
binom.confint(sum(x2), n, conf.level = 0.95, methods = "exact")

Already we can conclude that although the difference in the means is noticeable, the confidence intervals overlap, which was to be expected for such small samples. Also note that if you computed the interval with the general formula from the theory section above, you would get different results: binom.confint with methods = "exact" uses the exact (Clopper-Pearson) method, while the formula is approximate. The difference decreases as the sample grows.
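To see the difference, the approximate interval for the first group can be computed by hand (a sketch reusing x1 and n from the code above; for a proportion, the standard deviation is $\sqrt{p(1-p)/n}$):
p_hat <- mean(x1)                    # observed proportion in group 1
se <- sqrt(p_hat * (1 - p_hat) / n)  # standard deviation of the proportion
z <- qnorm(1 - 0.05 / 2)             # normal quantile for the 95% level
c(p_hat - z * se, p_hat + z * se)    # approximate interval; compare with the exact one above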
The chi-square test is used to analyze several binomial samples. It is widespread because it is easy to carry out by hand.
chisq.test(x1, x2)

We got a warning. The point is that the chi-square test is approximate (although even here R applies an additional correction, which can be turned off if necessary), and on small samples it gives an error.
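The correction in question is controlled by the correct argument of chisq.test (Yates' continuity correction, enabled by default for 2×2 tables):
chisq.test(x1, x2, correct = FALSE)  # the same test without the continuity correction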
There is a more complex but exact test that accounts for the uncertainty of some parameters left unaccounted for by the chi-square test: Fisher's exact test.
fisher.test(x1, x2)

The p-value is 9.915%, which is greater than 5%. Therefore we keep the hypothesis H0: there is no difference between the quantities.
These tests can also take several groups; in that case, H1 reflects a difference of at least one group from the others.
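For instance, with several groups the data can be summarized in a contingency table of successes and failures and passed to the same functions (the counts below are made up):
# Hypothetical counts of successes ("yes") and failures ("no") in three groups
m <- matrix(c(14, 11,
               9, 16,
              18,  7),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("yes", "no")))
chisq.test(m)   # H1: at least one group differs from the others
fisher.test(m)  # the exact analogue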
Results
This concludes the article. Today I talked about how tests work and how to choose a suitable one, and I walked through the analysis of qualitative data with the chi-square and Fisher tests in detail. There was more text than R this time, but there was no way around it. In the next part, I will move on to the analysis of quantitative samples.
Sample files
P.S. Corrections, additions, and alternative points of view are welcome in the comments.