Today on Habr I read a translation of the article "Distribution of characters in passwords" and wanted to run a small analysis of my own. I am interested in password lengths, the first characters of passwords, and the bigrams (pairs of adjacent characters) used in passwords. The article also considers an algorithm for an improved brute-force password search.
I downloaded the password archive from here:
http://thepiratebay.org/torrent/6443601
Only one file was used for the analysis:
Sony_Pictures_International_BEAUTY_USERS.txt
Number of passwords:
20 921
What are the most popular password lengths?
The passwords vary widely in length. Surprisingly, some users have passwords shorter than 6 characters, and it is strange that the registration system allows such passwords at all. They make up less than 2.5% of all passwords. There are 2 passwords of length 35; most likely they were produced by a password generator.
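A length distribution like the one described above could be computed roughly as follows. This is a minimal sketch, assuming the passwords are already loaded into a list of strings; the function names and the sample list are illustrative and are not taken from the original analysis.

```python
from collections import Counter

def length_distribution(passwords):
    """Return (length, count) pairs, most common length first."""
    counts = Counter(len(p) for p in passwords)
    # Sort by descending count; ties are broken by the shorter length.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def share_shorter_than(passwords, limit):
    """Fraction of passwords with fewer than `limit` characters."""
    if not passwords:
        return 0.0
    return sum(1 for p in passwords if len(p) < limit) / len(passwords)

# Hypothetical sample data, not taken from the analyzed file.
sample = ["secret", "monkey", "password", "abc", "letmein1"]
print(length_distribution(sample))
print(share_shorter_than(sample, 6))
```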
What characters prevail in passwords?
As expected, vowels are the most popular characters, followed by consonants and digits. Uppercase characters are used much less often: they account for less than 2.8% of all characters.
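Character frequencies and the share of uppercase characters can be counted in a few lines. The sketch below assumes a list of password strings; the helper names are hypothetical.

```python
from collections import Counter

def char_frequencies(passwords):
    """Count how often each character occurs across all passwords."""
    counts = Counter()
    for p in passwords:
        counts.update(p)  # updating with a string counts its characters
    return counts

def uppercase_share(passwords):
    """Fraction of all characters that are uppercase letters."""
    counts = char_frequencies(passwords)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    upper = sum(n for ch, n in counts.items() if ch.isupper())
    return upper / total

print(char_frequencies(["Ab", "ab"]).most_common())
print(uppercase_share(["Ab", "ab"]))
```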
What characters do passwords most often start with?
Passwords most often begin with the characters s, m, b, c. Next in popularity come p, t, d, a, j, l, r. The group g, k, 1, h, f, w, n, e is less popular still, and all remaining characters are even rarer as first characters.
Which bigrams (pairs of adjacent characters) are most common in passwords?
The distribution of bigrams is far from uniform. The five most common bigrams, in order of decreasing popularity:
ar (1367), le (1315), on (1239), ie (1136), es (1134).
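Bigram counts of this kind can be obtained by sliding over each password one character at a time. This is a minimal sketch, assuming a list of password strings; `bigram_counts` is an illustrative name.

```python
from collections import Counter

def bigram_counts(passwords):
    """Count every pair of adjacent characters across all passwords."""
    counts = Counter()
    for p in passwords:
        # zip(p, p[1:]) yields each character together with its successor.
        counts.update(a + b for a, b in zip(p, p[1:]))
    return counts

# Hypothetical sample data.
print(bigram_counts(["area", "bar"]).most_common(3))
```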
A linearization algorithm for enumerating words of different lengths whose characters belong to a countable alphabet
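Now consider a linearization algorithm for enumerating words of two characters. The characters belong to a countable (not necessarily finite) alphabet. If each character is identified by its position in the alphabet, the two-character words fill a square grid, and linearization means visiting every cell of that grid in a single sequence.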
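One common way to linearize pairs over a countable alphabet is to walk the grid of index pairs one anti-diagonal at a time. The sketch below assumes that traversal, which may differ from the one used in the article; characters are represented by their integer positions in the alphabet, and both function names are illustrative.

```python
from itertools import count, islice

def pairs_by_diagonal():
    """Yield every pair (i, j) of non-negative integers exactly once,
    walking the anti-diagonals i + j = 0, 1, 2, ... in turn."""
    for s in count():
        for i in range(s + 1):
            yield i, s - i

def pair_index(i, j):
    """Position of (i, j) in the sequence produced by pairs_by_diagonal."""
    s = i + j
    # s * (s + 1) // 2 pairs lie on the earlier diagonals.
    return s * (s + 1) // 2 + i

print(list(islice(pairs_by_diagonal(), 6)))
```

Because every pair sits on a finite diagonal, each two-character word is reached after finitely many steps even when the alphabet is infinite.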
What if you need to enumerate words of three characters over a countable alphabet? I would rather not invent formulas for traversing a three-dimensional cube. Instead, you can traverse the square again and interpret one of the two coordinates as an index into the traversal sequence of two-character words. In this way words of any given length can be enumerated.
Enumerating words of different lengths needs special attention. The word length and the word itself must additionally be linearized together. This way, words of all lengths whose characters belong to a countable set will be enumerated.
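The nesting idea can be sketched as follows: a pair traversal is reused recursively for longer words, and applied once more to combine the word length with the index of the word among words of that length. This is an illustration under assumptions, not the article's exact scheme: it uses a diagonal pair numbering, represents characters by integer positions in the alphabet, and all names are hypothetical.

```python
from math import isqrt

def unpair(n):
    """Map index n to a pair (i, j), numbering pairs along anti-diagonals."""
    s = (isqrt(8 * n + 1) - 1) // 2   # diagonal that contains index n
    i = n - s * (s + 1) // 2          # offset within that diagonal
    return i, s - i

def untuple(n, length):
    """Map index n to a tuple of `length` non-negative integers.
    One coordinate of the pair is read as the index of a shorter tuple."""
    if length == 1:
        return (n,)
    head, rest = unpair(n)
    return (head,) + untuple(rest, length - 1)

def word_ranks(n):
    """Map index n to a tuple of any length >= 1: the pair traversal is
    applied to (length - 1, index among tuples of that length)."""
    k, m = unpair(n)
    return untuple(m, k + 1)

print([word_ranks(n) for n in range(6)])
```

With a finite alphabet, indices that point past the end of the alphabet can simply be skipped during the search.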
In effect, we now face the task of enumerating words of all lengths over an infinite alphabet. The alphabet is actually finite, but the problem is easier to solve in the general case.
What is all this information good for?
Now we build the following linear lists:
- Password lengths
- First password characters
- Second characters of bigrams
The list of password lengths, ordered by popularity:
6, 8, 7, 9, 10, 11, 12, 13, 14, 16, 15
The list of first characters of passwords, ordered by popularity:
s, m, b, c, p, t, d, a, j, l, r, g, k, 1, h, f, w, n, e, 0, o, i, 2, y, v, S, M, 4, B, 3, C, P, 5, T, D, z, ...
The lists of second characters of bigrams, ordered by popularity, for a given first character of the bigram:
e: s, l, n, e, y, t, 1, b, r, c, w, 2, v, x, f, 3, 4, a, 7, 5, k, 6, 9, j, h, u, d, m, 8, z, p, o, ...
a: r, t, s, l, m, c, d, b, i, y, g, p, k, u, h, v, w, x, 2, f, 0, j, 4, 9, 3, 7, 1, 6, n, z, e, o, ...
o: n, o, r, l, m, u, s, g, v, k, b, d, c, 1, 2, x, i, w, p, t, 0, 3, 6, j, z, 5, 9, 7, e, y, a, 8, ...
r: a, i, o, l, d, t, 1, r, k, n, e, g, b, c, 4, 5, 7, 8, s, v, 3, 6, u, 9, w, y, h, j, 2, p, m, z, ...
...
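Lists of this kind could be derived from the raw passwords roughly as shown below. The sketch assumes a list of password strings; the names `ranked` and `build_rankings` are hypothetical, and the alphabetical tie-breaking is my own choice rather than something stated in the article.

```python
from collections import Counter, defaultdict

def ranked(counter):
    """Keys of a Counter, most frequent first; ties are broken by key order."""
    return [k for k, _ in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))]

def build_rankings(passwords):
    """Return (lengths, first characters, followers), each ordered by popularity.
    `followers` maps a character to the ranked list of characters that follow it."""
    lengths = Counter(len(p) for p in passwords)
    firsts = Counter(p[0] for p in passwords if p)
    followers = defaultdict(Counter)
    for p in passwords:
        for a, b in zip(p, p[1:]):
            followers[a][b] += 1
    return ranked(lengths), ranked(firsts), {a: ranked(c) for a, c in followers.items()}

# Hypothetical sample data.
lengths, firsts, followers = build_rankings(["area", "arts", "bar"])
print(lengths, firsts, followers["r"])
```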
Now a linear search over passwords can be organized, starting with the most popular candidates. The inverse problem can also be solved: given a specific password, find its index in the search sequence. This is somewhat harder, because the ordering of the alphabet at each position depends on the previous character.
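To make the idea concrete, here is one possible sketch for passwords of a fixed length. Each password is described by a tuple of ranks: the rank of its first character, then the rank of every following character in the list that belongs to the preceding character. Candidates are tried in order of increasing rank sum. That ordering, the function names, and the tiny ranking tables are all my assumptions for illustration; the article does not specify them. The sketch converts between passwords and rank tuples but does not compute the full index of a password in the search sequence.

```python
def compositions(total, parts):
    """All tuples of `parts` non-negative integers that sum to `total`."""
    if parts == 1:
        yield (total,)
        return
    for head in range(total + 1):
        for rest in compositions(total - head, parts - 1):
            yield (head,) + rest

def from_ranks(ranks, firsts, followers):
    """Turn a rank tuple into a password, or None if a rank is out of range."""
    if ranks[0] >= len(firsts):
        return None
    word = firsts[ranks[0]]
    for r in ranks[1:]:
        options = followers.get(word[-1], [])
        if r >= len(options):
            return None
        word += options[r]
    return word

def to_ranks(password, firsts, followers):
    """Inverse direction: the rank tuple that describes a given password."""
    ranks = [firsts.index(password[0])]
    for a, b in zip(password, password[1:]):
        ranks.append(followers[a].index(b))
    return tuple(ranks)

def candidates(length, firsts, followers, max_total):
    """Yield passwords of `length`, lowest rank sums (most popular) first."""
    for total in range(max_total + 1):
        for ranks in compositions(total, length):
            word = from_ranks(ranks, firsts, followers)
            if word is not None:
                yield word

# Tiny hypothetical ranking tables.
firsts = ["s", "m"]
followers = {"s": ["a", "m"], "m": ["a", "s"], "a": ["s", "m"]}
print(list(candidates(2, firsts, followers, max_total=2)))
print(to_ranks("ms", firsts, followers))
```

The dependence on the previous character shows up in `to_ranks`: the list consulted for each character is chosen by the character before it.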
If the community finds the topic interesting, I will try to compute the index of a password in the search sequence in one of my next articles.