Some topics get very little coverage in data science articles, yet are of real interest to security professionals. One of them is the statistical study of usernames and passwords, the data dug up by the “black archaeologists” of data mining.

I wanted to look for patterns in such data, so I took a database of about 6 million login-password records leaked in 2014 from Yandex, Google and Mail.ru.
Data processing
The leaked data consisted of three text files containing logins and passwords from various mail services in the standard form login@domain.ru:passwd. In total there were about 6 million records.
Processing such a large data set turned out to be a non-trivial task. For example, every character I considered as a field separator for the text file also turned up somewhere among the logins or passwords: all kinds of quotes, special characters that are not on a keyboard (such as §), and even tab characters.
More generally, since both @ and “:” occur inside logins and passwords, I could not come up with an exact way to split each line into the “login”, “domain” and “password” fields. For example, how would you automatically parse lines like these:
user@mail.ru@gmail.com:123123
(here the first @ belongs to the login, the second precedes the domain)
user:123@mail.ru:@123123
(and here the first @ precedes the domain, the second belongs to the password)
The data was then grouped by domain. There were 7423 domains in total; gmail.com, mail.ru and yandex.ru each occur more than a million times.
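The parsing problem can be made concrete with a small sketch. It is not the method used for this study, only an illustration: the function name candidate_splits and the rule "a domain contains a dot and no @ or colon" are my own assumptions, and the lines are written without spaces around the separators. The function lists every reading of a line that satisfies that rule.

```python
def candidate_splits(line):
    """Return every (login, domain, password) reading of a line.

    Assumption (not from the original data description): a domain
    contains at least one dot and never contains '@' or ':'.
    """
    results = []
    for at_pos, ch in enumerate(line):
        if ch != "@":
            continue
        login, rest = line[:at_pos], line[at_pos + 1:]
        for colon_pos, ch2 in enumerate(rest):
            if ch2 != ":":
                continue
            domain, password = rest[:colon_pos], rest[colon_pos + 1:]
            looks_like_domain = (
                "." in domain and "@" not in domain and ":" not in domain
            )
            if login and password and looks_like_domain:
                results.append((login, domain, password))
    return results


print(candidate_splits("user@mail.ru@gmail.com:123123"))
print(candidate_splits("user:123@mail.ru:@123123"))
# A made-up line that stays ambiguous even under the domain rule:
print(candidate_splits("a@b.ru:c@d.ru:e"))
```

Under this rule the two example lines each have a single reading, but the made-up third line still has two, which is why a fully exact parser is out of reach without extra knowledge about valid domains.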
An interesting detail: the domain gmail.com777 occurs 295 times, and other domains of the form gmail.com followed by several sevens also show up many times. The reason remains a mystery to me, as does the choice of the digit 7.
Next, all domains were combined into four groups: GMAIL, MAILRU, YANDEX and OTHER. Domains that belong, or used to belong, to the same mail service went into one group (for example, mail.ru, bk.ru, list.ru, inbox.ru, etc. went into the MAILRU group). The distribution of entries was as follows:
Domain group | Count |
GMAIL | 2308234 |
MAILRU | 1978822 |
YANDEX | 1640733 |
OTHER | 158896 |
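The grouping step can be sketched as a simple lookup table. Only the MAILRU domains below are named in the text; the GMAIL and YANDEX entries, the names GROUPS, group_of and count_groups, and the choice to send unknown domains (including oddities like gmail.com777) to OTHER are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical mapping. mail.ru, bk.ru, list.ru and inbox.ru come from the
# text; the other entries are assumed and the real list was likely longer.
GROUPS = {
    "gmail.com": "GMAIL",
    "mail.ru": "MAILRU",
    "bk.ru": "MAILRU",
    "list.ru": "MAILRU",
    "inbox.ru": "MAILRU",
    "yandex.ru": "YANDEX",
    "ya.ru": "YANDEX",
}


def group_of(domain):
    """Map a domain to its group; anything unknown goes to OTHER."""
    return GROUPS.get(domain.strip().lower(), "OTHER")


def count_groups(domains):
    """Count how many records fall into each group."""
    return Counter(group_of(d) for d in domains)


print(count_groups(["gmail.com", "BK.ru", "ya.ru", "gmail.com777"]))
```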
After that, I decided that the data is ready for analysis.
Data analysis: bots hypothesis
Let's test the hypothesis that most of the leaked accounts were created by bots.
The first criterion I came up with is a random sequence of characters in the login. To check it, I took a random sample of 6000 logins and simply looked through it by eye. This was the fastest option, since writing a script would have taken longer. The criterion was not confirmed: there are very few random-looking logins, no more than twenty in the sample of 6000.
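Drawing the sample itself takes only a few lines. This is a minimal sketch, assuming the logins are already in a Python list; the function name sample_logins and the fixed seed are my own choices, made so that the same sample can be reviewed again later.

```python
import random


def sample_logins(logins, k=6000, seed=0):
    """Draw up to k distinct logins at random, reproducibly."""
    rng = random.Random(seed)  # fixed seed: the same sample on every run
    return rng.sample(logins, min(k, len(logins)))


# Toy stand-in for the real login list.
logins = [f"user{i}" for i in range(10000)]
sample = sample_logins(logins)
print(len(sample), sample[:5])
```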
The next criterion is the distribution of password lengths. For comparison, look first at the distribution of login lengths: it is smooth and even, which cannot be said about passwords.


There are clearly visible outliers at lengths of 6, 8 and 10 characters. These are probably the automatically generated passwords that may belong to bots.
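A length distribution like this can be computed with a counter. The sketch below assumes the logins or passwords are held in a list of strings; length_histogram is a name chosen here for illustration, and the sample values are toy data.

```python
from collections import Counter


def length_histogram(strings):
    """Return {length: number of strings with that length}, sorted by length."""
    counts = Counter(len(s) for s in strings)
    return dict(sorted(counts.items()))


# Toy data, not taken from the leak.
hist = length_histogram(["abc", "qwerty", "123456", "a"])
for length, count in hist.items():
    print(length, "#" * count)
```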
Now let's estimate how many passwords make up these “outliers” relative to a smooth distribution. To do this, I estimated the expected heights of the bars at lengths 6, 8 and 10 by simply fitting a smoothed curve, and then took the difference from the actual values.
Result:
Length 6: 1010907
Length 8: 763313
Length 10: 246115
Total: 2020335
The result: approximately 2 million passwords (that is, about one third) appear to have been generated artificially.
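The excess over a smooth curve can be estimated in several ways, and the text does not say which smoothing was used. As a stand-in, the sketch below takes the average of the two neighbouring bars as the expected value. That choice, the function name excess_over_baseline and the toy histogram are all assumptions.

```python
def excess_over_baseline(hist, lengths=(6, 8, 10)):
    """Estimate how far each bar rises above its expected height.

    Assumption: the expected height at length n is the mean of the bars
    at n - 1 and n + 1 (a crude substitute for a smoothed curve).
    """
    excess = {}
    for n in lengths:
        baseline = (hist.get(n - 1, 0) + hist.get(n + 1, 0)) / 2
        excess[n] = max(0, hist.get(n, 0) - baseline)
    return excess


# Toy histogram with spikes at 6, 8 and 10 (not the real data).
toy = {5: 100, 6: 300, 7: 120, 8: 250, 9: 100, 10: 150, 11: 60}
result = excess_over_baseline(toy)
print(result, sum(result.values()))
```

Note that with spikes at both 6 and 8 the bar at 7 is an ordinary neighbour, but if adjacent lengths were both inflated this simple baseline would underestimate the excess.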
Checking the login-password pair for different domains
Now for what this research was really about: I wanted to check how often people set the same password on different mail services, and how secure those passwords are.
We divide the data into four subgroups by domains, and look for intersections: by login, and by login-password pair. Results:
Domain pairs | Matching logins | Matching login-password pairs |
GMAIL - MAILRU | 2362 | 121 |
GMAIL - YANDEX | 2421 | 215 |
MAILRU - YANDEX | 42005 | 33313 |
GMAIL - OTHER | 924 | 63 |
YANDEX - OTHER | 7075 | 6732 |
MAILRU - OTHER | 4085 | 3339 |
We see that there are far fewer intersections between gmail and the Russian-language domains than between mail.ru and Yandex. You can also see that while for gmail more than 90 percent of people come up with a new password, for the Mail.ru - Yandex pair it is the opposite: about 80% of passwords are the same!
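The intersection counts can be sketched with dictionary views and sets. This assumes each group is stored as a dict from login (the part before the domain) to a single password; the function name intersections and the toy records are illustrative only.

```python
def intersections(group_a, group_b):
    """Count logins present in both groups, and how many of them
    also have the same password in both.

    Assumption: each argument maps login -> password, one password per login.
    """
    shared_logins = group_a.keys() & group_b.keys()
    same_pair = {login for login in shared_logins
                 if group_a[login] == group_b[login]}
    return len(shared_logins), len(same_pair)


# Toy records, not taken from the leak.
mailru = {"ivan": "123456", "olga": "natasha", "petr": "Xk9q"}
yandex = {"ivan": "123456", "olga": "other", "anna": "qwerty"}
logins, pairs = intersections(mailru, yandex)
print(logins, pairs, pairs / logins)
```

For the MAILRU - YANDEX row of the table, the same ratio is 33313 / 42005, or roughly 0.79.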
Cross Password Security Check
Now let's see what these passwords are and how secure they are. To do this, we first build a ranking of the most frequent passwords. Of the 6 million passwords, 3.2 million are unique; the rest are duplicated at least once. Next we choose a cutoff: how many mailboxes can one person have? Hardly more than 40. So we take the top 5000 passwords, where the lowest frequency is exactly 40. This means that if your password made it into the top 5000, it appears in the leaked data at least 40 times and is most likely used by someone else as well. Now let's see how many passwords from the intersections fall into this top.
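Building such a ranking and reading off its lowest frequency is straightforward with collections.Counter. In this minimal sketch the function name top_passwords and the toy list are my own; on the real data, n=5000 is the cutoff described above.

```python
from collections import Counter


def top_passwords(passwords, n=5000):
    """Return the n most frequent passwords and the frequency of the last one."""
    ranked = Counter(passwords).most_common(n)
    cutoff = ranked[-1][1] if ranked else 0
    return [password for password, _ in ranked], cutoff


# Toy data: frequencies 5, 3, 2 and 1.
toy = ["123456"] * 5 + ["qwerty"] * 3 + ["natasha"] * 2 + ["Xk9q"]
top, cutoff = top_passwords(toy, n=3)
print(top, cutoff)
```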
For mailru-yandex intersections (total 33313 intersections):
Top 5000: 2485 passwords (7.4%)
Top 100: 575 passwords (1.7%)
An interesting detail: first place in this sample goes to the password 123456, and second place, oddly enough, to natasha. In general, female names are quite common in the top.
For gmail-mailru intersections (total 121 intersections):
Top 5000: 12 passwords
Top 100: 7 passwords
Conclusion: although 80 percent of passwords match for identical logins, 93 percent of these matching passwords are outside the top 5000 and are therefore reasonably secure.
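The check behind these figures amounts to counting how many of the matching passwords appear in the top list. A minimal sketch, assuming the matching passwords and the top list are already available as Python lists; share_in_top is a hypothetical helper and the data below is a toy example.

```python
def share_in_top(matched_passwords, top_list):
    """Return (hits, share): how many matched passwords are in the top list."""
    top = set(top_list)  # set lookup keeps the check fast on millions of rows
    hits = sum(1 for password in matched_passwords if password in top)
    if not matched_passwords:
        return 0, 0.0
    return hits, hits / len(matched_passwords)


# Toy data: two of the four reused passwords are common ones.
hits, share = share_in_top(["123456", "Xk9q", "qwerty", "zz-top-77"],
                           ["123456", "qwerty"])
print(hits, share, 1 - share)
```

For the mailru-yandex intersections the same calculation is 2485 hits out of 33313, which is where the figure of roughly 93 percent outside the top 5000 comes from.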
And for dessert, here are the passwords from the gmail-mailru intersections that are in the top 5000:
passwords
123456
262626
12345
lopata
prodigy
qwerty
qwe123
udacha
1234
svetlana
1q2w3e4r
azsxdcfv
Coming in the next post on the black archaeology of data mining: checking matches between logins and passwords (full and partial), along with other facts and myths about passwords.