Black archeology of data mining: how dangerous are big data “leaks”

In 2014, a large database of passwords for various mail services, more than 6 million records, was leaked online. Let's see how relevant these passwords still are now, in 2015.

To do this, we compare the mail passwords with another large database whose leak was no less ambitious, but went almost unnoticed by the IT community. In May 2015, a database with all personal data (logins, passwords, emails, profile information) from the website Ask.ru was leaked. Apparently, the passwords were stored in the database in clear text. All of them were current at the time of the leak.

Some statistics:

Size of the mail database	6034544
Size of the site database	3432650
Matching logins	132093
Matching login-password pairs	77387

We see that on a single site, roughly one account in 26 has a login that also appears in the leaked mail database. For such a login, there is a probability of about 60 percent (77387 of 132093) that the leaked mail password is also the correct password on the site.
That is, the password on the site and the password of the mailbox used to register on that site are the same in about 60 percent of cases. A good result for a hacker!
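The two ratios quoted above follow directly from the statistics table. A minimal sketch of the arithmetic in R (the variable names are illustrative and do not come from the original script):

```r
# Counts taken from the statistics table above
site_size     <- 3432650  # accounts in the site database
login_matches <- 132093   # logins present in both databases
pair_matches  <- 77387    # login-password pairs present in both databases

# Number of site accounts per login that is also found in the mail leak
accounts_per_match <- site_size / login_matches

# Share of matched logins whose password is identical in both databases
password_reuse <- pair_matches / login_matches

cat(sprintf("one login match per %.1f site accounts\n", accounts_per_match))
cat(sprintf("password reuse among matched logins: %.1f%%\n", 100 * password_reuse))
```

This gives about 26 accounts per match and a reuse rate of about 58.6 percent, which the text rounds to 60 percent.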

Now let's check how unique these passwords are. We know the most popular passwords in the mail database. Let's count how many of the matched site passwords fall into those top lists. Of the 77 thousand matched passwords, the following numbers appear in the top lists (that is, are obviously weak):

Top 10	9652
Top 100	10535
Top 1000	11704

That is, only about one seventh of the matched passwords are obviously weak. For the rest, users are presumably confident that their passwords are secure. And this, let me remind you, is May 2015: 9 months had passed since the mail database was leaked.
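The shares behind the "one seventh" estimate, and the way a top-N list can be checked against a set of passwords, can be sketched in R as follows. The counts come from the table above; the toy password vectors and the names `pass_sum`, `top2` and `top_n_hits` are assumptions made up for illustration, not data from the leaks.

```r
# Shares of matched passwords that fall into each top list
matched_pairs <- 77387
top_hits <- c(top10 = 9652, top100 = 10535, top1000 = 11704)
share <- top_hits / matched_pairs
print(round(100 * share, 1))  # percentages for top 10, top 100, top 1000

# Toy illustration: build a frequency-ranked password list (hypothetical data)
mail_passwords <- c("123456", "qwerty", "123456", "hunter2", "qwerty", "123456")
freq <- sort(table(mail_passwords), decreasing = TRUE)
pass_sum <- data.frame(
  passwd = names(freq),
  count  = as.integer(freq),
  stringsAsFactors = FALSE
)

# Take the top 2 passwords and count how many site passwords are among them
top2 <- pass_sum[1:2, ]
site_passwords <- c("123456", "s3cret", "qwerty", "zzz")
top_n_hits <- sum(site_passwords %in% top2$passwd)
print(top_n_hits)
```

With the real counts, the top 1000 list covers about 15 percent of the matched passwords, slightly more than one seventh.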

Conclusions: apparently, more than half of users reuse their e-mail password when registering on other sites, and when that password is compromised they do not particularly bother to change it. The probability of finding a leaked login on a given site is approximately 1 in 25, and in more than half of those cases the password will be the same.

And to answer the most frequently asked question: sorry, no, I cannot share the passwords.
Firstly, it would be unethical on my part. Secondly, if you cannot find these databases in open access within half an hour, maybe you just don’t need them?

R script for the matching

## Load the data
DATA_1 <- readRDS( file = "DATA_MAIL.rds" )
DATA_2 <- readRDS( file = "DATA_SITE.rds" )
################################################
# 6034544
nrow(DATA_1)
# 3432650
nrow(DATA_2)
################################################
# Matching logins: 132093
length( intersect(DATA_1[,1],DATA_2[,1]) )
# Matching login-password pairs: 77387
length( intersect( paste( DATA_1[,1], DATA_1[,3], sep = "|" ), paste( DATA_2[,1], DATA_2[,3], sep = "|" ) ) )
#################################################
# Build a table of the matching pairs
VECTOR_I <- intersect( paste( DATA_1[,1], DATA_1[,3], sep = "||" ), paste( DATA_2[,1], DATA_2[,3], sep = "||" ) )
VECTOR_I <- strsplit(VECTOR_I, "||", fixed=TRUE)
DATA_I <- matrix(unlist(VECTOR_I), ncol=2, byrow=TRUE)
DATA_I <- as.data.frame(DATA_I)
colnames(DATA_I) <- c("login","passwd")
#################################################
# Count how many matched passwords fall into the top-N
PASS_SUM <- readRDS( file = "PassSum.rds" )
PASS_10 <- PASS_SUM[1:10,]
PASS_100 <- PASS_SUM[1:100,]
PASS_1000 <- PASS_SUM[1:1000,]
# 9652
length( which( DATA_I$passwd %in% PASS_10$passwd ) )
# 10535
length( which( DATA_I$passwd %in% PASS_100$passwd ) )
# 11704
length( which( DATA_I$passwd %in% PASS_1000$passwd ) )
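The script matches pairs by pasting login and password together with a separator. This can produce a false match if a login or password itself contains the separator. As an alternative, and not the method used in the original script, the same comparison can be sketched with `merge()` on both columns. The toy tables and column names below are assumptions made up for illustration.

```r
# Toy stand-ins for the two leaked tables (hypothetical data and column names)
mail <- data.frame(
  login  = c("alice", "bob", "carol", "dave|x"),
  passwd = c("123456", "qwerty", "p|w", "y"),
  stringsAsFactors = FALSE
)
site <- data.frame(
  login  = c("alice", "bob", "erin", "dave"),
  passwd = c("123456", "letmein", "zzz", "x|y"),
  stringsAsFactors = FALSE
)

# Logins present in both tables
common_logins <- intersect(mail$login, site$login)

# Rows where both login and password agree; merge() compares the two
# columns separately, so no separator character is involved
common_pairs <- merge(unique(mail), unique(site), by = c("login", "passwd"))

# The paste-based comparison, for contrast: "dave|x" + "y" and
# "dave" + "x|y" both become "dave|x|y" and are counted as a match
pasted <- length(intersect(
  paste(mail$login, mail$passwd, sep = "|"),
  paste(site$login, site$passwd, sep = "|")
))

print(length(common_logins))  # logins in both tables
print(nrow(common_pairs))     # true login-password matches
print(pasted)                 # matches reported by the paste approach
```

On this toy data the paste approach reports one more match than `merge()` does. Whether such collisions occur in the real databases is not known from the article.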

Previous installments of Black archeology of data mining:
Black archeology of data mining: data analysis
What could be more effective than a dictionary attack?

In the next installment: we look for bots, identify "random" passwords, and examine statistical distributions. Stay tuned!

Source: https://habr.com/ru/post/262305/
