Statistics knows everything. I. Ilf and E. Petrov, "The Twelve Chairs"
Imagine that you are building a large shopping center and want to estimate the traffic flow entering the parking area. No, let's take another example... you will never do that anyway. Say you need to gauge the taste preferences of visitors to your portal, and for that you need to survey them. How do you balance the amount of data against the possible error? Nothing complicated: the larger your sample, the smaller the error. However, there are nuances.
It is worth refreshing a few terms first; they will come in handy below.
Even the definitions of type I and type II errors leave room for debate and interpretation. How do you define them, and what do you choose as the null hypothesis? If you are investigating the contamination level of soil or water, how do you formulate the null hypothesis: that there is pollution, or that there is none? Yet the size of the sample drawn from the population of objects depends on this choice.
The original population, like the sample, can have any distribution; the sample mean, however, is approximately normally (Gaussian) distributed thanks to the Central Limit Theorem.
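A small simulation makes the Central Limit Theorem tangible. The sketch below is purely illustrative: the exponential population, the sample size of 50, and the 10,000 repetitions are assumptions chosen for the demo, not values from this article.

```r
# Illustrative sketch: the population (exponential, rate 1) and the sizes are assumptions
set.seed(42)   # make the run reproducible
n <- 50        # size of each sample
reps <- 10000  # number of samples drawn

# Draw `reps` samples from a strongly skewed population and keep each sample mean
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# The population has mean 1 and sd 1, so the CLT predicts sample means
# centered near 1 with a standard deviation near 1 / sqrt(n)
mean(sample_means)
sd(sample_means)
1 / sqrt(n)
```

A call to `hist(sample_means)` should show a roughly bell-shaped histogram even though the underlying population is far from normal.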
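Several kinds of conclusions can be drawn about the distribution parameters, and about the mean in particular. The first is the confidence interval: the range of plausible values of a parameter at a specified confidence level. For example, a 100(1-α)% confidence interval for μ is (Eq. 1)

$$\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}} \qquad (1)$$

where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, $n$ is the sample size, and $t_{\alpha/2,\,n-1}$ is the upper α/2 quantile of the t-distribution with n-1 degrees of freedom.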
The second kind of conclusion is a hypothesis test. It may look something like this:

- H0: μ = h
- H1: μ > h
- H2: μ < h

With a 100(1-α)% confidence interval for μ, you can choose between H1 and H2:

- If the whole 100(1-α)% interval lies below h, we reject H0 in favor of H2.
- If the whole 100(1-α)% interval lies above h, we reject H0 in favor of H1.
- If the 100(1-α)% interval includes h, we cannot reject H0 and the result is considered inconclusive.

If we need to test the value of μ using a single sample from the population, the test statistic takes the form

$$t = \frac{\bar{x} - h}{s/\sqrt{n}}$$

where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, and $n$ is the sample size.
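The interval-based decision rule above can be sketched as a small function. This is an illustration only: `decide` is a hypothetical helper name, and it assumes the alternatives are H1: μ > h and H2: μ < h.

```r
# Hypothetical helper: choose between H0, H1 (mu > h) and H2 (mu < h)
# from the bounds of a confidence interval for mu
decide <- function(lower, upper, h) {
  if (upper < h) {
    "reject H0 in favor of H2"   # whole interval lies below h
  } else if (lower > h) {
    "reject H0 in favor of H1"   # whole interval lies above h
  } else {
    "cannot reject H0"           # interval contains h
  }
}

decide(1.05, 2.22, h = 1)  # "reject H0 in favor of H1"
decide(0.80, 2.22, h = 1)  # "cannot reject H0"
```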
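Take the very first equation and express the width of the confidence interval from it (Eq. 2).

$$w = 2\,t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}} \qquad (2)$$

In some cases we can replace the t quantile with the standard normal quantile z. As a further simplification, we replace half of w with the measurement error E. The equation then takes the form (Eq. 3).

$$E = z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \qquad (3)$$

As you can see, the error really does decrease as the amount of input data grows. From here it is easy to derive what we need (Eq. 4).

$$n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2 \qquad (4)$$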
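The inverse-square-root relationship E = z·σ/√n is easy to tabulate. In this sketch σ = 1 and the list of sample sizes are arbitrary assumptions, and `margin_of_error` is a hypothetical helper name.

```r
# Hypothetical helper: margin of error E = z_{alpha/2} * sigma / sqrt(n)
margin_of_error <- function(n, sigma, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)  # two-sided standard normal quantile
  z * sigma / sqrt(n)
}

n_values <- c(25, 100, 400, 1600)                 # assumed sample sizes
E_values <- margin_of_error(n_values, sigma = 1)  # sigma = 1 is an assumption
data.frame(n = n_values, E = round(E_values, 3))
```

Each fourfold increase in n only halves the error, which is why tight error targets get expensive quickly.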
Let us test the hypothesis that the mean number of insects per trap is 1, using the following sample.
Insects | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
Traps | 10 | 9 | 5 | 5 | 1 | 2 | 1 |
```r
> x <- read.table("/tmp/tcounts.txt")
> z <- unlist(x, use.names = FALSE)  # flatten the data frame into a plain numeric vector
> mean(z); sd(z)
[1] 1.636364
[1] 1.654883
```
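If the file `/tmp/tcounts.txt` is not at hand, the same vector can be rebuilt directly from the frequency table. This sketch assumes the file held one observation per trap, which is consistent with the reported n = 33, mean, and standard deviation.

```r
# Rebuild the per-trap counts from the frequency table
# (assumed to be equivalent to the contents of the original file)
insects <- 0:6
traps <- c(10, 9, 5, 5, 1, 2, 1)
z <- rep(insects, times = traps)  # one entry per trap

length(z)  # 33
mean(z)    # 1.636364
sd(z)      # 1.654883
```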
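Note that the mean and standard deviation are almost equal. Counts like these are typically modeled with a Poisson distribution, for which it is the variance, not the standard deviation, that equals the mean; here the sample variance is about 2.74. Next, the critical value of the t-distribution for a 95% confidence interval with df = 32: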
```r
> qt(.975, 32)
[1] 2.036933
```
And finally we get the confidence interval for the mean: 1.05 to 2.22.
```r
> μ = mean(z)
> st = qt(.975, 32)
> μ + st * sd(z)/sqrt(33)
[1] 2.223159
> μ - st * sd(z)/sqrt(33)
[1] 1.049568
```
As a result, we reject H0 in favor of H1: the entire 95% confidence interval lies above 1, so at the 95% confidence level μ > 1.
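The same interval can be cross-checked with R's built-in `t.test`. The sketch rebuilds `z` from the frequency table (an assumption about the file contents) so that it runs on its own. Note that `t.test` with its default two-sided alternative tests H0: μ = 1 against μ ≠ 1.

```r
z <- rep(0:6, times = c(10, 9, 5, 5, 1, 2, 1))  # counts rebuilt from the table

res <- t.test(z, mu = 1, conf.level = 0.95)  # one-sample t-test of H0: mu = 1
res$conf.int   # 95% confidence interval, about 1.05 to 2.22
res$statistic  # t = (mean(z) - 1) / (sd(z) / sqrt(33))
res$p.value    # below 0.05, in line with the interval excluding 1
```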
In the same example, if we assume that we know the true standard deviation σ rather than its estimate from a random sample, we can calculate the n required for a given error. Let us calculate it for E = 0.5.
```r
> za2 = qnorm(.975)
> (za2*sd(z)/.5)^2
[1] 42.08144
```
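Since n must be a whole number, the result is rounded up in practice. A minimal sketch, again rebuilding `z` from the frequency table so that it runs standalone; `required_n` is a hypothetical helper name.

```r
z <- rep(0:6, times = c(10, 9, 5, 5, 1, 2, 1))  # counts rebuilt from the table

# Hypothetical helper: n = (z_{alpha/2} * sigma / E)^2, rounded up to a whole number
required_n <- function(sigma, E, conf = 0.95) {
  za2 <- qnorm(1 - (1 - conf) / 2)  # two-sided standard normal quantile
  ceiling((za2 * sigma / E)^2)
}

required_n(sd(z), E = 0.5)   # 43
required_n(sd(z), E = 0.25)  # halving E roughly quadruples n
```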
In practice there is no reason to expect that σ (the standard deviation) will be known while μ (the mean) is still to be estimated. Because of this, equation 4 is of little practical use, apart from particularly contrived examples from combinatorics, and a realistic equation for n with an unknown σ is somewhat more complicated (Eq. 5).
Note that σ in the last equation carries a tilde ($\tilde{\sigma}$) rather than a hat ($\hat{\sigma}$). This is because at the very beginning we do not even have the estimated standard deviation of a random sample, $\hat{\sigma}$, so we use a planning value, $\tilde{\sigma}$, instead. Where does the latter come from? Out of thin air, so to speak: expert judgment, rough estimates, past experience, and so on.
And what about the second term on the right-hand side of equation 5, where does it come from? It is there because Guenther's correction is needed.
In addition to equations 4 and 5, there are a few approximate formulas, but those deserve a separate post.
Source: https://habr.com/ru/post/339798/