In this installment of Black Archeology Datamining, we will play spies for a bit and see what an ordinary data specialist can learn from open data on the web.
It all started with an article on Habr reporting that an anonymous hacker had shared data on FBI agents, which was then leaked online. I got hold of this data and started looking at what could be done with it. It contains only a last name, a first name, an official e-mail address and a telephone number: just a little information.

Looking at the data, I saw that it ends at the letter J, so it is incomplete. What would its full size be? To find out, we need statistics on how frequently surnames occur.
For this I went looking for datasets of American surnames, and here a discovery awaited me: in America you can find open data on, say, a state's voters, and as far as I understand, quite legally. For example, I got the data on all voters in the state of Utah in half an hour without any problems.
This is much more interesting! The first dataset gave us only the last name, first name and one letter of the middle name (I treat the middle name as a patronymic here, although that is not quite accurate). With the voter list we can find much more about an FBI agent: for example, the postal address, full name, age and political preferences. So let's get started.
To begin with, let us estimate the completeness of the agent dataset (which is where my research began). We count how often surnames start with each letter in the state of Utah, sum the counts, and look at what share of surnames falls before the point in the letter J where the dump stops. It turns out that we have a little under half of all the data, more precisely 43%. The agent file holds about 22 thousand records, so a full list of agents would be about 50 thousand entries (22 thousand divided by 0.43). And in case anyone needs it, here is the distribution of American surnames by first letter:
Letter | Total records | Frequency |
A | 128934 | 0.030 |
B | 401048 | 0.093 |
C | 298668 | 0.069 |
D | 197078 | 0.046 |
E | 80467 | 0.019 |
F | 152500 | 0.035 |
G | 200349 | 0.046 |
H | 325591 | 0.075 |
I | 17765 | 0.004 |
J | 121452 | 0.028 |
K | 184007 | 0.043 |
L | 183266 | 0.042 |
M | 399768 | 0.093 |
N | 73607 | 0.017 |
O | 53166 | 0.012 |
P | 199195 | 0.046 |
Q | 5802 | 0.001 |
R | 224124 | 0.052 |
S | 456642 | 0.106 |
T | 147229 | 0.034 |
U | 10559 | 0.002 |
V | 52085 | 0.012 |
W | 272087 | 0.063 |
X | 371 | 0.000 |
Y | 28468 | 0.007 |
Z | 27642 | 0.006 |
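The completeness estimate can be reproduced from the table alone. The following is a minimal sketch, not the code actually used for the article: the function names are made up for illustration, and since the text does not say how far into J the dump goes, the sketch computes both bounds (nothing from J versus all of J). They come out at roughly 42% and 45%, which brackets the 43% quoted above.

```python
# Counts of surnames by first letter, copied from the table above.
LETTER_COUNTS = {
    "A": 128934, "B": 401048, "C": 298668, "D": 197078, "E": 80467,
    "F": 152500, "G": 200349, "H": 325591, "I": 17765, "J": 121452,
    "K": 184007, "L": 183266, "M": 399768, "N": 73607, "O": 53166,
    "P": 199195, "Q": 5802, "R": 224124, "S": 456642, "T": 147229,
    "U": 10559, "V": 52085, "W": 272087, "X": 371, "Y": 28468, "Z": 27642,
}


def covered_share(counts, last_letter, inclusive):
    """Share of records whose initial sorts before (or up to) last_letter."""
    total = sum(counts.values())
    covered = sum(
        n for letter, n in counts.items()
        if letter < last_letter or (inclusive and letter == last_letter)
    )
    return covered / total


def estimate_full_size(observed_records, share):
    """Extrapolate the complete list size from a partial alphabetical dump."""
    return observed_records / share


low = covered_share(LETTER_COUNTS, "J", inclusive=False)   # dump has nothing from J
high = covered_share(LETTER_COUNTS, "J", inclusive=True)   # dump has all of J
print(f"coverage between {low:.1%} and {high:.1%}")

# 22_000 is the approximate size of the agent file mentioned in the text.
print(f"full list between {estimate_full_size(22_000, high):,.0f} "
      f"and {estimate_full_size(22_000, low):,.0f} records")
```

Both bounds for the full list land near 50 thousand records, so the conclusion does not depend much on where exactly inside J the dump was cut.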
Next, we look for the agents in the voter list. First we find matches on last name, first name and the first letter of the middle name (that is all the information we have on agents). The voter dataset is very large, and this step shrinks it considerably, so that it at least fits in the memory of my very ancient computer.
I compute the intersection, and the first surprise awaits me: there are a lot of matches, almost 15 thousand out of the 22 thousand in the agent file. It is unlikely that the whole FBI lives in one state; America simply has very popular surnames, and too many people share the same combination of last name, first name and middle initial. Well, let's filter further.
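The matching step can be sketched as a streaming filter: only the small set of agent keys is held in memory, and voter rows are checked one at a time, which is what makes this workable on an old machine. This is an illustration under assumptions; the field names `last`, `first` and `middle` and the sample rows are invented, and the real files will have their own column names.

```python
def person_key(last, first, middle):
    """Normalize a person to (LAST, FIRST, middle initial)."""
    return (
        last.strip().upper(),
        first.strip().upper(),
        middle.strip()[:1].upper(),
    )


def match_voters(agent_rows, voter_rows):
    """Yield voter rows whose key also appears among the agents."""
    agent_keys = {
        person_key(a["last"], a["first"], a["middle"]) for a in agent_rows
    }
    # voter_rows can be any iterable, e.g. a csv.DictReader over a large file,
    # so the voter list itself never has to be loaded whole.
    for voter in voter_rows:
        if person_key(voter["last"], voter["first"], voter["middle"]) in agent_keys:
            yield voter


# Made-up sample data, for illustration only.
agents = [{"last": "Doe", "first": "John", "middle": "A"}]
voters = [
    {"last": "DOE", "first": "john", "middle": "Allen", "address": "1 Main St"},
    {"last": "Doe", "first": "John", "middle": "Brian", "address": "2 Oak Ave"},
    {"last": "Roe", "first": "Jane", "middle": "", "address": "3 Elm Rd"},
]
matched = list(match_voters(agents, voters))
print(len(matched), matched[0]["address"])
```

A match on this key only says the names agree, which is why so many false positives survive this step.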
We keep the names that occur only once. These are rare surnames, and a match on last name and first name is most likely enough to identify the person. It is unlikely that we will meet another Serine Hovhannisyan. After filtering, we get 193 unique records. Got them!
Most likely these are our agents, now with full data: mailing address, full name, date of birth and political preferences (we have the voter list, and it records how each person has voted since 2002). Just in case, I will not publish the result; what if the Bureau really does have long arms :)
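The uniqueness filter can be sketched with a counter over (last name, first name) pairs. This is one reading of the step above, which does not spell out over which set the names are counted; here they are counted over whatever voter records are passed in. The field names and sample rows are invented for illustration.

```python
from collections import Counter


def name_pair(row):
    """Case-insensitive (last name, first name) pair for a record."""
    return (row["last"].strip().upper(), row["first"].strip().upper())


def unique_matches(rows):
    """Keep only records whose (last, first) pair occurs exactly once."""
    rows = list(rows)
    counts = Counter(name_pair(r) for r in rows)
    return [r for r in rows if counts[name_pair(r)] == 1]


# Made-up sample data, for illustration only.
candidates = [
    {"last": "Smith", "first": "John", "address": "1 Main St"},
    {"last": "SMITH", "first": "John", "address": "9 Pine St"},
    {"last": "Zyxwvut", "first": "Anna", "address": "5 Lake Dr"},
]
rare = unique_matches(candidates)
print([r["last"] for r in rare])
```

Common names drop out because they appear more than once, and only the rare combinations remain.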
Instead, let's compute some statistics on this data. For example, a histogram of age:

Minimum age: 21 years (the US voting age is 18)
Maximum: 90 years
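Age statistics like these can be derived from the date of birth in the voter records. Below is a minimal sketch, assuming the birth dates have already been parsed into `datetime.date` objects; the ten-year bin width, the reference date and the sample dates are arbitrary choices for illustration, not values taken from the article.

```python
from collections import Counter
from datetime import date


def age_on(birth, today):
    """Full years elapsed between birth and today."""
    years = today.year - birth.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1
    return years


def age_histogram(ages, bin_width=10):
    """Map the lower edge of each age bin to the number of people in it."""
    bins = Counter((age // bin_width) * bin_width for age in ages)
    return dict(sorted(bins.items()))


# Made-up birth dates, for illustration only.
reference_day = date(2016, 3, 1)
births = [date(1994, 5, 20), date(1986, 1, 2), date(1985, 12, 31), date(1926, 2, 1)]
ages = [age_on(b, reference_day) for b in births]
print(min(ages), max(ages))
print(age_histogram(ages))
```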
Political preferences. I determined party affiliation either from the declared affiliation (the dataset includes this information) or from a person consistently voting for one of the parties.
Of the 193 people, 43 are Republicans and 32 are Democrats.
Interesting: I thought there would be far more Republicans.
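The affiliation rule can be sketched as follows. The text does not define "consistently", so this sketch assumes it means that every recorded vote went to the same party; the record layout, the party codes and the sample data are also invented for illustration.

```python
from collections import Counter


def infer_party(declared, vote_history):
    """Return the declared party, else the only party ever voted for, else None."""
    if declared:
        return declared
    parties = set(vote_history)
    # Assumption: "consistently" means a single party across all recorded votes.
    if len(parties) == 1:
        return next(iter(parties))
    return None


# Made-up records, for illustration only.
people = [
    {"declared": "REP", "votes": ["DEM", "REP"]},
    {"declared": "", "votes": ["DEM", "DEM", "DEM"]},
    {"declared": "", "votes": ["DEM", "REP"]},
    {"declared": "", "votes": []},
]
tally = Counter(infer_party(p["declared"], p["votes"]) for p in people)
print(tally)
```

Anyone with no declared party and a mixed or empty history stays unclassified under this rule, which is consistent with only 75 of the 193 people being assigned to the two parties above.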
How accurate is this data? The comments at the reddit link above contain links to datasets for most states. Information could also be gathered from social networks, and... no, thanks. I don't want to spend the rest of my life in the Ecuadorian embassy.
Oh, someone is ringing the doorbell. One second, I'll see who is there. And then I will write about how to save