
Benford's law and distributions falling under it


In probability theory and statistics, the first-digit rule, or Benford's law, describes a curious pattern in the frequencies of the first digit of real-life data. For schoolchildren and housewives the law can be loosely formulated as follows: there are data sets in which the first digit is a one roughly 6.5 times more often than a nine, and this ratio does not change when the original set is rescaled. More strictly, it can be formulated as follows: a set of numbers satisfies Benford's law if the first digit d appears with probability

P(d) = log_N(1 + 1/d)     (1)

Here N is the base of the number system; it must be greater than 2, and from here on we take N = 10.
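Formula (1) is easy to tabulate directly; here is a minimal sketch in JavaScript (the language used for the examples below):

```javascript
// First-digit probability under Benford's law, formula (1):
// P(d) = log_N(1 + 1/d), here with the default base N = 10.
function benford(d, N = 10) {
  return Math.log(1 + 1 / d) / Math.log(N);
}

// Expected frequency of each leading digit 1..9.
for (let d = 1; d <= 9; d++) {
  console.log(d, benford(d).toFixed(3));
}
```

Note that the nine probabilities sum to 1, since log(1 + 1/d) telescopes over d = 1..9 into log(10).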
For strict mathematicians the rule is formulated like this: there exist random variables for which the distribution of the fractional part of the logarithm (to any base greater than 1) converges to the uniform distribution on the interval [0; 1]. Below I will try to write as accessibly and in as much detail as possible: I will give examples, limitations, applications, and the random variables for which the law holds.

Let's start with a classic example. Consider the list of countries by population, for example, from Wikipedia. We can build the table of first-digit frequencies right there, by opening the browser console on the Wikipedia page:
var populationList = $('table.sortable tr').map(function () {
    // the third column holds the population value
    return $('td:nth-child(3)', this).text().trim();
}).toArray().filter(function (s) {
    return /^\d/.test(s); // keep only rows that actually start with a digit
});

// frequency[d] — share of values whose first digit is d
var frequency = [undefined, 0, 0, 0, 0, 0, 0, 0, 0, 0];
var countriesCount = populationList.length;
$(populationList).each(function () {
    // this[0] is the first character, i.e. the leading digit
    frequency[this[0]] += 1 / countriesCount;
});

Now we can compare the resulting frequencies with those predicted by Benford's law, computed from formula (1):
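A simple way to quantify the comparison is the maximum absolute deviation between the observed and predicted frequencies. The sketch below uses a hypothetical frequency table as a stand-in; in practice you would plug in the `frequency` array built above:

```javascript
// Expected frequency of leading digit d under Benford's law.
function benford(d) {
  return Math.log10(1 + 1 / d);
}

// Largest gap between an observed frequency table (indices 1..9)
// and the Benford prediction.
function maxDeviation(observed) {
  let max = 0;
  for (let d = 1; d <= 9; d++) {
    max = Math.max(max, Math.abs(observed[d] - benford(d)));
  }
  return max;
}

// Hypothetical observed frequencies, just for illustration.
const observed = [undefined, 0.29, 0.18, 0.13, 0.10, 0.08, 0.06, 0.06, 0.05, 0.05];
console.log(maxDeviation(observed).toFixed(3));
```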


You can view it and play around, substituting your own data, here: jsfiddle.net/Vr3F9 .
Next, let's consider some more data from real life: the population of Russian cities, the list of countries by number of prisoners, and the user rating from dev.by (dev.by/users).
Here I will give only the results; you already know how to obtain them. Please open the diagram at jsfiddle.net/rWbBs and keep it open while I explain. The picture shows that all the distributions have a lot in common with Benford's law; only the list of countries by number of prisoners deviates noticeably, yet it still resembles the law more than, say, a uniform distribution. This is exactly what happens with real-life data: besides distributions that fit the law well, there are those that merely remain "similar" to it, even with a sufficiently large increase in the number of observations. The mathematical side of this question will be discussed below; for now, just take it as a fact.

Now let us ask ourselves which observable quantities from real life satisfy Benford's law. Here is a list of some of them.
1. Population of countries and cities, and consequently: results of demographic measurements, election results, regional indicators proportional to population.
2. The areas of river basins, the areas of countries and territories, the sizes of islands.
3. Circulation of newspapers and books.
4. Daily expenses. Just look at all your purchases over a period of time.
5. Indicators of changes in financial markets (a separate big topic, I am sure that one of you knows it better than me).

Now is the time to talk about distributions that do not obey Benford's law. Among them are the highly popular uniform and normal distributions. There are also distributions in which the digit one dominates and which at first glance resemble Benford's law, for example the atomic masses of the elements and the basic physical constants. The atomic masses deserve a special mention. Benford himself, in the original article (from which I took the title image of this post), listed atomic mass as an example satisfying the "first digit rule"; however, this distribution does not follow formula (1), even though the clear predominance of leading ones is visible. And here is how easy it is to convince yourself of that: just multiply all the values by the same number and see what happens.



Benford's law is scale-invariant: when all observed values are multiplied by the same number, the law continues to hold; with atomic masses one can see that this is not the case.
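Scale invariance is easy to check numerically. A sketch: take the powers of two (which, as noted below, satisfy the law), multiply them all by 3, and observe that the first-digit frequencies barely move:

```javascript
// Leading digit of a positive number, read off its exponential notation.
function firstDigit(x) {
  return Number(x.toExponential()[0]);
}

// First-digit frequency table (indices 1..9) for an array of numbers.
function digitFrequencies(values) {
  const freq = new Array(10).fill(0);
  for (const v of values) freq[firstDigit(v)] += 1 / values.length;
  return freq;
}

// Powers of two satisfy Benford's law (2^k is exactly representable
// as a double up to k = 1000, so the leading digit is exact)...
const powers = Array.from({ length: 1000 }, (_, k) => 2 ** (k + 1));
// ...and keep satisfying it after rescaling by an arbitrary constant:
const scaled = powers.map(x => 3 * x);

console.log(digitFrequencies(powers)[1].toFixed(3));  // close to log10(2) ≈ 0.301
console.log(digitFrequencies(scaled)[1].toFixed(3));  // still close to 0.301
```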

We turn to mathematical objects that satisfy the law. Here is a small list:
1. A sequence of powers of two, and any other exponential sequence.
2. Fibonacci numbers.
3. Factorials.
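The Fibonacci case is easy to verify with exact arithmetic via BigInt (a sketch; 500 terms is enough to see the pattern):

```javascript
// Leading digit of a number given as a BigInt (exact, no overflow).
function firstDigitBig(n) {
  return Number(String(n)[0]);
}

// First-digit frequencies over the first 500 Fibonacci numbers.
const freq = new Array(10).fill(0);
const count = 500;
let a = 1n, b = 1n;
for (let i = 0; i < count; i++) {
  freq[firstDigitBig(a)] += 1 / count;
  [a, b] = [b, a + b];
}

// Observed frequency vs. Benford's prediction for each digit.
for (let d = 1; d <= 9; d++) {
  console.log(d, freq[d].toFixed(3), Math.log10(1 + 1 / d).toFixed(3));
}
```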
There are also two distributions from a probability theory course that obey this law. The first is the gamma distribution: as the shape parameter k → 0 it converges to Benford's law (see the article at mi.mathnet.ru/tvp113). I have not seen a practical case where one had to work with this distribution at such a small parameter value.
But the second class of distributions is more interesting. It turns out that one-sided stable distributions converge to Benford's law (link to the article: mi.mathnet.ru/tvp244). I wrote about stable distributions in my previous article; let me briefly recall what they are. (This part may be a little difficult; you can skip straight to the next paragraph without losing the main idea.) There is a class of distributions satisfying the generalized central limit theorem: up to scale and shift, they preserve their form under summation. This property of preserving the form of the distribution function is called stability. The class is described by four parameters: two of them are shift and scale, as in the normal distribution; the third is a skewness parameter; and the fourth and most important, alpha, takes values from zero to two and characterizes how heavy the tails of the distribution are. With alpha equal to two the stable distribution is normal; this is a special, one might even say degenerate, case, since only then does the distribution have moments of every order. In all other cases only the moments of order less than alpha exist. That is, the variance is infinite: a sample variance can of course be computed and will yield some value, but it will vary wildly between different samples from the same distribution. And with alpha less than or equal to one there is no expected value either, so our reasoning "on average" is simply meaningless in that case. It is precisely for very small alpha, and in the one-sided case (values of only one sign, which can be arranged by the right choice of the skewness parameter), that these distributions satisfy Benford's law.
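Exact simulation of a one-sided stable law is involved, but the heavy-tail mechanism can be illustrated with a Pareto-like proxy whose tail index plays the role of alpha. This is my simplification for illustration, not the stable law itself:

```javascript
// One-sided heavy-tailed sample with tail index `alpha`:
// X = U^(-1/alpha) has the Pareto tail P(X > x) = x^(-alpha).
// For very small alpha the fractional part of log10(X) is nearly
// uniform, so the leading digits approach Benford's law.
function heavyTailedSample(alpha, n) {
  const out = new Array(n);
  for (let i = 0; i < n; i++) {
    const u = 1 - Math.random(); // u in (0, 1], avoids u = 0
    out[i] = u ** (-1 / alpha);
  }
  return out;
}

function firstDigit(x) {
  return Number(x.toExponential()[0]);
}

const sample = heavyTailedSample(0.05, 100000);
const freq = new Array(10).fill(0);
for (const x of sample) freq[firstDigit(x)] += 1 / sample.length;

console.log('digit 1:', freq[1].toFixed(3), 'digit 9:', freq[9].toFixed(3));
```

With alpha = 0.05 the digit-1 frequency lands near log10(2); with a larger alpha (try 0.5) the fit is visibly rougher, which mirrors the prisoners example above.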

What can be said in general about one-sided stable distributions with a small tail parameter that satisfy Benford's law: none of them has a mean, and they are unimodal. Statistically this manifests as follows: most of the values fall inside a range that is small compared to the spread between the minimum and the maximum, and the more observations we make, the more likely we are to encounter an even more extreme value. The same goes for the example with the list of countries by number of prisoners: there the alpha parameter is simply not that close to zero, so the data merely resembles Benford's law rather than strictly conforming to it, even with a very significant increase in the number of observations. You may now close that example; more interesting things are coming.

Now I will try to show that distributions without a mean are much more common in real life than it might seem. Imagine you are walking through a forest with a small child, perhaps even your own. The child points to a pine tree and asks: "Is this a tall tree?" The answer is simple: if it is taller than average, it is tall; if not, it is not. The average height of a pine can be found. But then the same child points at a stone and asks: "Is this a big stone?" Now, you will agree, everyone's idea of an "average" stone is different, and the probability of stumbling upon a stone orders of magnitude larger than a typical one is not that small.

And now an example of such a distribution closer to our field: file sizes on a disk. Try to answer right away: what is the average file size on your system drive? Hard to guess? Go to a folder on the system drive with a good number of files, preferably not one where the contents are all of one kind, like photos or music. Now list all the files and see what share of the sizes starts with a one, a two, and so on. I am ready to bet that nines will occur much less often than twos (I have had plenty of time to check this, and have even become convinced of the stability of this distribution). The same applies to table sizes and record counts in a database. Stability, and the resulting "heavy tail", means in this case that the two or three largest tables in the database take up more space than all the others combined.

The main application of Benford's law is detecting possible falsification of reported values in cases where the values ought to obey the law: in data transmission networks, data storage systems, sociological polls and elections, some scientific experiments, and so on. Benford's law is also somewhat similar, and even related, to the Pareto principle and Zipf's law, but those are separate topics, so there will be a continuation...
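A common way to operationalize such a falsification check is Pearson's chi-square statistic against the Benford expectation; the example counts below are made up to show an obviously suspicious (too-uniform) data set:

```javascript
// Pearson chi-square statistic of observed first-digit counts
// against Benford's law. counts[1..9] are raw counts, not frequencies.
function benfordChiSquare(counts) {
  const n = counts.slice(1).reduce((a, b) => a + b, 0);
  let chi2 = 0;
  for (let d = 1; d <= 9; d++) {
    const expected = n * Math.log10(1 + 1 / d);
    chi2 += (counts[d] - expected) ** 2 / expected;
  }
  return chi2;
}

// 8 degrees of freedom; the critical value at the 5% level is about 15.51.
// A statistic far above that is a reason to look at the data more closely.
const suspicious = [undefined, 110, 115, 110, 110, 111, 112, 110, 111, 111];
console.log(benfordChiSquare(suspicious).toFixed(1));
```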

Source: https://habr.com/ru/post/240853/

