📜 ⬆️ ⬇️

Analysis of lost passwords Gmail, Yandex and Mail.Ru

Most recently, public password databases of popular mail services [ 1 , 2 , 3 ] have come into public access and today we will analyze them and answer a number of questions about the quality of passwords and the possible source (or sources). We will also discuss the quality metrics of individual passwords and the entire sample.

No less interesting are some anomalies and patterns of the password databases, perhaps they can shed light on what could serve as a source of data and how this sample is dangerous from the point of view of the average user.

Formally, we will consider the following questions: how strong are the passwords in the database and could they be collected by dictionary attack? Are there signs of phishing attacks? Could data “leakage” be the only data source? Could this database be accumulated over a long period or was the data exclusively “fresh”?
')
The structure of the article:

  1. Data description
  2. Invalid passwords and non-passwords
  3. Password length distribution
  4. Password strength distribution
  5. Dictionary attack
  6. Top passwords
  7. Gmail sample
  8. Rambler sample
  9. Open source analysis
  10. Conclusion


Data description


Data from all three databases is a set of address-password pairs, separated by a colon. No other "meta-data" is available. However, the data is quite noisy i. they contain strings that are not mail addresses or valid passwords.



If we examine the features of the data, we can put forward (or refute) the hypothesis of what process the passwords could be obtained.

Invalid passwords and non-passwords


The simplest criterion for invalid password is the discrepancy between the length of the password and the requirements of mail services.



The data obtained suggests that the passwords from the sample could not be obtained as a result of an “internal” leak, since several thousand passwords are not valid passwords in principle due to restrictions on the password length of six characters (and for modern passwords gmail eight characters) .

Consider these abnormally long (over 60) and short passwords (less than 6) in detail.

Examples

Long passwords are pieces of HTML code, one of the representative examples:



Such examples indicate that phishing could be one of the sources of passwords. The record in the database was clearly not verified by a person and obtained automatically, phishing is also indicated by the fact that the html markup is present in the password, which is rather unusual for stealing a password through infection.

Brief selection of passwords that are too short:



Another indicator that phishing could be one of the sources is the absence of a login and password in the records. Particularly interesting is the apostrophe without a password. Perhaps a potential victim guessed the phishing form and tried to check for SQL injection.

What can be unequivocally asserted by verified data? Automatic database validation did not occur. The most likely hypotheses are phishing and virus infection.

In order to assess the quality of the entire sample, we will remove from it deliberately incorrect passwords of length less than 6 and more than 60 and consider the distribution as a whole according to several parameters.

Password length distribution


As you can see from the graph below, most passwords are 8 characters or less long. This may indicate that a significant layer of passwords is potentially unstable to various types of attacks of over-the-counter attacks.



Password strength distribution


In order to test this hypothesis, consider a simple password strength metric based on
PCI standard .
Let the password receive a conditional score for satisfying one of the following conditions:

If the password receives 4/5, then we call it reliable (very reliable for 5/5), respectively, 3/5 we will call average, and 2/5 weak (0 or 1 point we will call very weak). The R code is shown below.

Reliability function
library("Hmisc") strength <- function(password){ # must contain at least 7 characters score = 0 if (nchar(password) >= 7){ inc(score) <- 1 } # at least one digit if(grepl("[[:digit:]]", password)){ inc(score) <- 1 } # at least one lowercase letter if(grepl("[[:lower:]]", password)){ inc(score) <- 1 } # at least one uppercase letter if(grepl("[[:upper:]]", password)){ inc(score) <- 1 } # at least one special symbol if(grepl("[#!?^@*+&%]", password)){ inc(score) <- 1 } # 0-1 very weak # 2 - weak # 3 - medium # 4 - strong # 5 - very strong return(score) } 


Then the distribution of reliability is:



As you can see from the graph, most passwords fall into the non-secure category. As an example, consider zero security passwords, as this is most likely another representative example of invalid passwords.

Zero Passwords



As can be seen from the examples above, these passwords are not valid (and from a person’s point of view they look more like an input error than a valid password), since postal services do not allow registering the box if they consider the password too simple, for example, repeating the same character six times. This means that perhaps even a larger layer of passwords is not valid according to modern requirements.

Is it possible that a substantial part of the database was collected over a long period of time, when the requirements for passwords were softer? Otherwise, it is quite difficult to explain such a large group of passwords that do not meet the requirements of modern mail systems.

Dictionary attack


As an additional argument, we will conduct the following experiment: take a sample of relevant password dictionaries from general access, conduct an attack on the available passwords on these dictionaries and estimate what percentage of passwords is contained in this selection of dictionaries (the author literally did not go beyond the first three Google links on request [password dictionary ]).



The table above shows that a significant proportion of passwords is contained in dictionaries, which also indicates that some passwords could have been obtained as a result of a dictionary attack (or some sort of brute force modification).

Top passwords


We present a selection of the most popular passwords and note that most of them are not valid passwords now.



Gmail sample


The actions and data described and obtained in this and the next part were produced and transmitted by a friend of my friend who wished to remain anonymous.

Task: check the validity (i.e. that the password is really appropriate) of the password. Action: on a small sample of ~ 150-200 try to get access to the boxes. Of the entire sample, in principle, ~ 2-3% are valid (after a few hours of publicly available data), and virtually all are deactivated at the time of the check. Less than 1% of the boxes were actually operating and they were abandoned by the owners for at least a year.

Rambler sample


It is easy to find in the network lists of “truly valid” addresses compiled by a wide range of interested parties (aka kulkhackers).



Interestingly, among them is a rather large percentage of the rambler addresses.

Rambler was warned a few days before publication and a reply was received that the necessary security measures would be taken soon.

<humor> </ humor>

Interestingly, the percentage of valid passwords is significantly higher and, until recently, rambler was outside the media field of events and did not activate additional security systems.

This allowed an unknown leakage anthropologist to assess the last moments of mailbox life. Despite the validity of passwords, all the boxes were abandoned for a long time (~ 1-1.5 years) and ended with one of these letters:



Which is another confirmation of the phishing hypothesis and the cumulative nature of the base.

Open source analysis


Let us return to the consideration of open sources. Active search for username passwords led us to a series of distributions from gaming forums:



It turns out that part of the list was already walking around the network in some form.

Thus, the data allows to reject the hypothesis of a single source of data such as “internal leakage”.

The main part of the code used:
Data analysis and visualization
 source("multiplot.R") source("password_strength.R") library("ggplot2") print("loading yandex data") yandex <- read.csv("yandex.txt", header = FALSE, sep = ":", quote = "", stringsAsFactors = FALSE) print("loading mailru data") mailru <- read.csv("mail.txt", header = FALSE, sep = ":", quote = "", stringsAsFactors = FALSE) print("loading gmail data") gmail <- read.csv("gmail.txt", header = FALSE, sep = ":", quote = "", stringsAsFactors = FALSE) ##testing if data loaded correctly print("testing, if loaded correctly") print(head(yandex)) print(head(mailru)) print(head(gmail)) ##changing names names(yandex) <- c("email", "password") names(mailru) <- c("email", "password") names(gmail) <- c("email", "password") print("computing lengths of passwords and adding to the datasets") yandex$pass_length <- sapply(yandex$password, nchar) mailru$pass_length <- sapply(mailru$password, nchar) gmail$pass_length <- sapply(gmail$password, nchar) print("number of invalid passwords by length") print(nrow(yandex[yandex$pass_length < 6,])) print(nrow(yandex[yandex$pass_length > 60,])) print(nrow(mailru[mailru$pass_length < 6,])) print(nrow(mailru[mailru$pass_length > 60,])) print(nrow(gmail[gmail$pass_length < 6,])) print(nrow(gmail[gmail$pass_length > 60,])) print("removing invalid passwords by length") yandex <- subset(yandex, pass_length >= 6 & pass_length <= 60) mailru <- subset(mailru, pass_length >= 6 & pass_length <= 60) gmail <- subset(gmail , pass_length >= 6 & pass_length <= 60) #print("checking that they are removed") print(nrow(yandex[yandex$pass_length < 6,])) print(nrow(yandex[yandex$pass_length > 60,])) print(nrow(mailru[mailru$pass_length < 6,])) print(nrow(mailru[mailru$pass_length > 60,])) print(nrow(gmail[gmail$pass_length < 6,])) print(nrow(gmail[gmail$pass_length > 60,])) print("visualizing distribution of password lenghts by provider") gmailcolor <- "deepskyblue" yandexcolor <- "orangered1" mailrucolor <- "limegreen" pgmail <- ggplot(data=gmail, aes(x=pass_length)) + scale_x_discrete(limits=seq(6, 20, 1), breaks=seq(6, 20, 1), drop=TRUE) + geom_histogram(colour="black", fill=gmailcolor, aes(y=..density..)) + coord_cartesian(xlim=c(5,21.5)) + xlab(expression(" "))+ ylab(expression(""))+ggtitle("Gmail") pyandex <- ggplot(data=yandex, aes(x=pass_length)) + scale_x_discrete(limits=seq(6, 21, 1), breaks=seq(6, 21, 1), drop=TRUE) + geom_histogram(colour="black", fill=yandexcolor, aes(y=..density..)) + coord_cartesian(xlim=c(5,21.5)) + xlab(expression(" "))+ ylab(expression(""))+ggtitle("Yandex") pmailru <- ggplot(data=mailru, aes(x=pass_length)) + scale_x_discrete(limits=seq(6, 20, 1), breaks=seq(6, 20, 1), drop=TRUE) + geom_histogram(colour="black", fill=mailrucolor, aes(y=..density..)) + coord_cartesian(xlim=c(5,20.5)) + xlab(expression(" "))+ ylab(expression(""))+ggtitle("Mail.ru") multiplot(pgmail, pyandex, pmailru, cols=3) print("computing strength of the passwords") yandex$strength <- sapply(yandex$password, strength) mailru$strength <- sapply(mailru$password, strength) gmail$strength <- sapply(gmail$password, strength) print(head(yandex)) print(head(mailru)) print(head(gmail)) scale <- scale_x_discrete(limits=c(1,2,3,4,5), breaks=c(1,2,3,4,5), drop=TRUE, labels=c("\n", "", "", "", "\n")) pgmail <- ggplot(data=gmail , aes(factor(strength))) + geom_bar(colour="black", fill=gmailcolor) + xlab(expression(""))+ coord + ylab(expression(""))+ggtitle("Gmail") + scale pyandex <- ggplot(data=yandex, aes(factor(strength))) + geom_bar(colour="black", fill=yandexcolor, binwidth=0.5) + xlab(expression(""))+ coord + ylab(expression(""))+ggtitle("Yandex") + scale pmailru <- ggplot(data=mailru, aes(factor(strength))) + geom_bar(colour="black", fill=mailrucolor, binwidth=0.5) + xlab(expression(""))+ coord + ylab(expression(""))+ggtitle("Mail.ru") + scale multiplot(pgmail, pyandex, pmailru, cols=3) print("Zero strength passwords") print("GMAIL") print(head(gmail[gmail$strength == 0,])) print("YANDEX") print(head(yandex[yandex$strength == 0,])) print("MAILRU") print(head(mailru[mailru$strength == 0,])) table_gmail <- sort(table(gmail$password) , TRUE) table_yandex <- sort(table(yandex$password), TRUE) table_mailru <- sort(table(mailru$password), TRUE) print("gmail most frequent") print(head(table_gmail, 100)) print("yandex most frequent") print(head(table_yandex,100)) print("mailru most frequent") print(head(table_mailru,100)) only_pass_gmail <- gmail[ ,2] write.csv(only_pass_gmail, "only_pass_gmail", row.names = FALSE) only_pass_yandex <- yandex[,2] write.csv(only_pass_yandex, "only_pass_yandex", row.names = FALSE) only_pass_mailru <- mailru[,2] write.csv(only_pass_mailru, "only_pass_mailru", row.names = FALSE) 



Experiment code 'dictionary attack'
 #!/bin/bash data=sample_mailru dict=saved_dict_mailru > $dict j=0 while read p; do ((j++)) echo -n $j if grep -q "^$p$" dictionary/*; then echo " in " echo $p >> $dict else echo " out " fi if (("$j" > 10000)); then break fi done <$data 



Conclusion


Thus, it seems most likely that this sample is a compilation of various sources (phishing, infection, word-digging attacks, a collection of popular collections) for a long period of time. A sufficient part of the data is in principle not valid passwords according to formal syntactic criteria, which was also confirmed by experimental verification.

From the user's point of view, this event does not pose a significant danger and rather looks like an attempt to create an inflower.

UPD. Another evidence that merged data is a compilation of various sources is the presence of a collection of gmail accounts with "+" features in the database, when the address looks like name + domain \ word at gmail.com (for reminding me about this feature, thanks to geka )
Top 10 domains from the sample (the entire list here )
176 xtube
132 daz
88 filedropper
66 daz3d
64 eharmony
63 friendster
62 savage
57 spam
54 bioware
52 savage2
...
11 paygr
11 comicbookdb
...
About paygr: user gkond wrote that
Found your e-mail in dump. The password that is listed next to it was automatically generated by the paygr.com service, back in May 2011.

at the same time, "paygr" occurs 11 times in the "+" list of gmail. Perhaps their base has also been compromised.

But the most important thing is that comicbookdb recognized that their base was really stolen along with the passwords (thanks to EnterSandman for the link): www.comicbookdb.com/hacked.php

Source: https://habr.com/ru/post/236759/


All Articles