
How to determine the sample size?

Statistics knows everything. I. Ilf and E. Petrov, "The Twelve Chairs"

Imagine that you are building a large shopping center and want to estimate the traffic flow entering the parking area. No, let's take a different example; nobody will ever do that anyway. Suppose you need to assess the taste preferences of your portal's visitors, so you have to run a survey among them. How do you balance the amount of data against the acceptable error? Nothing complicated: the larger the sample, the smaller the error. However, there are nuances.


Charts


Theoretical minimum


A quick refresher will not hurt; these terms will be useful later on.



Comparison of two alternative means


Even the definitions of type I and type II errors leave room for debate and interpretation. How do you define them, and what do you choose as the null hypothesis? If you are investigating the contamination level of soil or water, how do you formulate the null hypothesis: that there is pollution, or that there is none? Yet the size of the sample drawn from the population depends on that choice.


Distribution characteristics and standard deviation


The original population, like the sample, can have any distribution; the sample mean, however, is approximately normally (Gaussian) distributed by the Central Limit Theorem.
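The Central Limit Theorem claim can be checked with a small simulation. This is only a sketch: the Poisson rate, the sample size and the number of repetitions are assumptions chosen for illustration, not values from the text.

```r
set.seed(42)       # arbitrary seed, only for reproducibility

lambda <- 1.6      # assumed Poisson rate (a skewed, discrete population)
n      <- 33       # assumed size of each sample
reps   <- 10000    # assumed number of simulated samples

# Mean of each simulated sample
means <- replicate(reps, mean(rpois(n, lambda)))

# Theory: the sample means are centred on lambda with sd sqrt(lambda / n)
c(mean = mean(means), sd = sd(means), theory_sd = sqrt(lambda / n))

# Empirical 2.5% and 97.5% quantiles against the normal approximation
rbind(empirical = quantile(means, c(.025, .975)),
      normal    = qnorm(c(.025, .975), mean = lambda, sd = sqrt(lambda / n)))
```

The two rows of quantiles should come out close to each other even though the individual counts are far from normal.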


Regarding the distribution parameters, and the mean in particular, several kinds of inference are possible. The first is the confidence interval. It gives the range of plausible values of the parameter at a specified confidence level. For example, a 100(1-α)% confidence interval for μ would be (Eq. 1).
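In the usual textbook notation, which is assumed here (sample mean x̄, sample standard deviation s, sample size n, and the Student quantile with n − 1 degrees of freedom), the interval in question would read:

```latex
\bar{x} - t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}
\;\le\; \mu \;\le\;
\bar{x} + t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}
\tag{1}
```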




The second is a hypothesis test. It may look something like this.



Given a 100(1-α)% confidence interval for μ, you can choose between H1 and H2:



If we need to test the value of μ using a single sample from the population, the test statistic takes the following form.



Where .
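For reference, the standard one-sample t statistic for testing H0: μ = μ0 can be written as below. The symbols μ0 (hypothesized mean), s (sample standard deviation) and n (sample size) are notation assumed for this sketch rather than taken from the original.

```latex
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}},
\qquad
\text{reject } H_0 \text{ at level } \alpha \text{ if } |t| > t_{\alpha/2,\,n-1}
```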


Confidence interval, inaccuracy and sample size


Take the very first equation and solve it for the width w of the confidence interval (Eq. 2).
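With s as the sample standard deviation and n as the sample size (assumed notation), the full width is the distance between the upper and lower bounds of the t-based interval:

```latex
w = 2\, t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}
\tag{2}
```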



In some cases we can replace the t quantile with the standard normal quantile z. As a further simplification, we replace half of w with the measurement error E. Then our equation takes the form (Eq. 3).
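Under those two simplifications, and writing σ for the standard deviation now treated as known (an assumption of this sketch), the half-width becomes:

```latex
E = \frac{w}{2} = z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}
\tag{3}
```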



As you can see, the error really does decrease as the amount of input data grows. From this it is easy to derive the required sample size n (Eq. 4).
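Solving the relation E = z σ / √n for n gives the sample size. The notation is assumed as before, and the result is rounded up to the next whole number in practice:

```latex
n = \left( \frac{z_{\alpha/2}\,\sigma}{E} \right)^{2}
\tag{4}
```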



Practice: calculating with R


Let us test the hypothesis that the mean number of insects per trap is 1, using the following sample.



Insects   0   1   2   3   4   5   6
Traps    10   9   5   5   1   2   1

 > x <- read.table("/tmp/tcounts.txt")   # raw counts, one value per trap
 > z <- unlist(x, use.names = FALSE)     # flatten the data frame into a plain vector
 > mean(z); sd(z)
 [1] 1.636364
 [1] 1.654883
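The file /tmp/tcounts.txt is not reproduced in the post, but the sample can be rebuilt from the frequency table, assuming it lists 0 to 6 insects found in 10, 9, 5, 5, 1, 2 and 1 traps respectively. The names insects and traps are introduced here for illustration only.

```r
# Frequency table: number of insects -> number of traps with that count
insects <- 0:6
traps   <- c(10, 9, 5, 5, 1, 2, 1)

# Expand to one observation per trap (33 values in total)
z <- rep(insects, times = traps)

length(z)  # 33
mean(z)    # 1.636364
sd(z)      # 1.654883
```

These values match the console output above, which suggests the reconstruction of the table is the right one.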

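Note that the mean and the standard deviation are almost equal here. (For a Poisson distribution, a natural model for such counts, it is the variance that equals the mean; the sample variance here is about 2.74.) The critical value of the t-distribution for a 95% confidence interval with df = 32: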


 > qt(.975, 32)
 [1] 2.036933

and finally we get the confidence interval for the mean: 1.05 to 2.22.


 > μ <- mean(z)
 > st <- qt(.975, 32)             # critical t value, df = n - 1 = 32
 > μ + st * sd(z)/sqrt(33)        # upper bound
 [1] 2.223159
 > μ - st * sd(z)/sqrt(33)        # lower bound
 [1] 1.049568

As a result, we reject H0 in favor of H1 at the 5% significance level, since the 95% confidence interval lies entirely above 1.
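The same conclusion can be cross-checked with R's built-in t.test. This is a sketch that rebuilds the sample from the frequency table (0 to 6 insects in 10, 9, 5, 5, 1, 2 and 1 traps, as assumed from the table) instead of reading the file.

```r
# Sample rebuilt from the frequency table (assumption: counts 0..6)
z <- rep(0:6, times = c(10, 9, 5, 5, 1, 2, 1))

# One-sample t-test of H0: mu = 1 (two-sided by default)
res <- t.test(z, mu = 1, conf.level = 0.95)

res$conf.int    # lower and upper bounds, about 1.05 and 2.22
res$statistic   # t = (mean(z) - 1) / (sd(z) / sqrt(33))
res$p.value     # below 0.05 exactly when 1 lies outside the interval
```

A two-sided test at level α and a 100(1-α)% confidence interval always agree, so the interval is enough to make the decision.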


In the same example, if we assume that we know the true standard deviation σ, rather than its estimate from a random sample, we can calculate the n required for a given error. Let us calculate it for E = 0.5.


 > za2 <- qnorm(.975)
 > (za2*sd(z)/.5)^2
 [1] 42.08144
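Since a sample size has to be a whole number, the result is rounded up in practice. A minimal sketch of the calculation as a reusable function follows; the function name and its arguments are hypothetical, and σ is taken to be the sample value 1.654883 from the example.

```r
# Sample size for a known sigma and a target error E (normal approximation)
sample_size_known_sigma <- function(sigma, E, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)  # two-sided normal quantile
  ceiling((z * sigma / E)^2)            # round up to a whole observation
}

n_half    <- sample_size_known_sigma(sigma = 1.654883, E = 0.5)   # 43
n_quarter <- sample_size_known_sigma(sigma = 1.654883, E = 0.25)  # 169
c(n_half, n_quarter)
```

Halving the error roughly quadruples the required sample, because n grows with the square of 1/E.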

Wind correction


In fact, there is no reason to believe that σ (the standard deviation) will be known to us when we have yet to estimate μ (the mean). Because of this, equation 4 is of little practical use, except for particularly refined examples from the field of combinatorics, and a realistic equation for n with an unknown σ is somewhat more complicated (Eq. 5).



Note that σ in the last equation carries not a hat (^) but a tilde (~). This is because at the very beginning we do not even have an estimated standard deviation from a random sample, σ̂; instead we use a planned value, σ̃. Where do we get the latter? Out of thin air, one might say: expert judgment, rough estimates, past experience, and so on.


And what about the second term on the right-hand side of equation 5, where does it come from? It is Guenther's correction.
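As a rough illustration, one commonly cited form of this corrected formula adds half the squared normal quantile to the known-σ result: n = (z σ̃ / E)² + z² / 2. That this is exactly the equation meant here is an assumption, and the function name and arguments below are hypothetical. The planned σ̃ is taken to be the sample value 1.654883 from the example.

```r
# Sample size with a planned (not yet estimated) sigma.
# ASSUMPTION: the correction term is z^2 / 2 added to the known-sigma formula.
sample_size_unknown_sigma <- function(sigma_planned, E, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)       # two-sided normal quantile
  base       <- (z * sigma_planned / E)^2    # known-sigma term
  correction <- z^2 / 2                      # extra term for the unknown sigma
  ceiling(base + correction)
}

n_corrected <- sample_size_unknown_sigma(sigma_planned = 1.654883, E = 0.5)
n_corrected  # 45
```

At the 95% level the correction adds about two observations, which matters mostly for small samples.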


In addition to equations 4 and 5, there are a few more approximate formulas, but those deserve a separate post.


Used materials


  1. Sample sizes
  2. Hypothesis testing


Source: https://habr.com/ru/post/339798/

