
FBI Detected: How I Discovered FBI Agents

In this issue of Black Archeology Datamining, we will play spies for a bit and see what an ordinary data specialist can learn from open data on the web.

It all started with an article on Habr reporting that an anonymous hacker had shared data on FBI agents that had been leaked online. I got hold of this data and started wondering what I could do with it. It contains only a last name, a first name, an official e-mail address and a phone number, which is not much to go on.


Looking at the data, I saw that the surnames stop at the letter J. In other words, the list is incomplete. So what is its full size? To find out, we need statistics on how often surnames occur.

So I started looking for datasets of American surnames, and a discovery awaited me: in America you can find open data on, say, a state's voters and, as far as I understand, quite legally. For example, I obtained the data on all voters in the state of Utah in half an hour without any trouble.



This is much more interesting! The first dataset had only the last name, the first name and one letter of the middle name (I treat the middle name as a patronymic here, although that is not quite accurate). With the voter list we can find much more about an FBI agent: for example, the postal address, full name, age and political preferences. So let's get started.

To begin with, let us estimate the completeness of the agents dataset (this is where my research started). We build statistics on how often surnames occur in the state of Utah, sum them up, and look at what share of all surnames falls before the point in the letter J where the list breaks off. It turns out that we have a little under half of all the data, more precisely 43%. A full list of agents would therefore hold about 50 thousand entries. By the way, in case anyone needs it, here is the frequency distribution of American surnames by first letter:
Letter  Total records  Frequency
A       128934         0.030
B       401048         0.093
C       298668         0.069
D       197078         0.046
E       80467          0.019
F       152500         0.035
G       200349         0.046
H       325591         0.075
I       17765          0.004
J       121452         0.028
K       184007         0.043
L       183266         0.042
M       399768         0.093
N       73607          0.017
O       53166          0.012
P       199195         0.046
Q       5802           0.001
R       224124         0.052
S       456642         0.106
T       147229         0.034
U       10559          0.002
V       52085          0.012
W       272087         0.063
X       371            0.000
Y       28468          0.007
Z       27642          0.006
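
The completeness estimate can be sketched roughly as follows. This is a minimal illustration, not the code I actually ran: the function names and the toy surname list are made up for the example. Note that `covered_share` counts the last letter in full, whereas the agents file breaks off partway through J, so the real covered share (43%) is a bit lower than the sum of the A to J frequencies in the table.

```python
from collections import Counter


def first_letter_shares(surnames):
    """Share of records per first letter of the surname."""
    counts = Counter(
        name[0].upper() for name in surnames if name and name[0].isalpha()
    )
    total = sum(counts.values())
    return {letter: count / total for letter, count in counts.items()}


def covered_share(shares, last_letter):
    """Cumulative share of surnames whose first letter is <= last_letter.

    Assumption: the last letter is counted in full.
    """
    return sum(share for letter, share in shares.items() if letter <= last_letter)


def estimate_full_size(partial_size, share_covered):
    """Extrapolate the size of the full list from the part we have."""
    return round(partial_size / share_covered)


# Toy surname sample (hypothetical), just to show the mechanics.
sample = ["Adams", "Baker", "Clark", "Jones", "Miller",
          "Smith", "Young", "Brown", "Hill", "Wood"]
shares = first_letter_shares(sample)
print(covered_share(shares, "J"))        # about 0.6 for this toy sample

# Numbers from the article: about 22 thousand records covering 43%.
print(estimate_full_size(22_000, 0.43))  # 51163
```

With the article's figures this gives roughly 51 thousand, which matches the "about 50 thousand" estimate above.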



Next, we look for the agents in the voter list. First, we try to find matches by last name, first name and the first letter of the middle name (that is all the information we have on the agents). The voter dataset is very large, and this step shrinks it considerably, so that it at least fits into the memory of my very ancient computer.

I compute the intersection, and the first surprise awaits me. There are a lot of matches: almost 15 thousand out of the 22 thousand in the agents file. It is unlikely that the whole FBI lives in one state. It is just that some surnames are very popular in America, so too many people share the same last name, first name and middle initial. Well, let's filter further.
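
One way to sketch this matching step in plain Python is shown below. The field names (`last`, `first`, `middle`) are hypothetical, since the real files have their own column layouts. The idea is to keep only the small agents list in memory as a set of keys and stream the large voter list past it, so the voter file never has to be loaded whole.

```python
def match_key(last, first, middle=""):
    """Normalized (last name, first name, middle initial) key."""
    middle_initial = (middle or "").strip()[:1].upper()
    return (last.strip().upper(), first.strip().upper(), middle_initial)


def find_candidates(agents, voters):
    """Return voter records whose key matches some agent.

    `agents` is a small list of dicts; `voters` may be any iterable,
    for example a csv.DictReader over the big voter file.
    """
    agent_keys = {
        match_key(a["last"], a["first"], a.get("middle", "")) for a in agents
    }
    return [
        v for v in voters
        if match_key(v["last"], v["first"], v.get("middle", "")) in agent_keys
    ]


# Made-up records for illustration only.
agents = [{"last": "Smith", "first": "John", "middle": "A"}]
voters = [
    {"last": "SMITH", "first": "John", "middle": "Allen", "city": "Provo"},
    {"last": "Smith", "first": "John", "middle": "Brian", "city": "Ogden"},
    {"last": "Jones", "first": "Mary", "middle": "", "city": "Logan"},
]
print(find_candidates(agents, voters))
```

Only the first voter record survives here: the second has a different middle initial and the third a different name.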

We pick out the names that occur only once. These are rare surnames, and most likely a match on last name and first name alone is enough to identify the person. It is unlikely that we will meet a second Serine Hovhannisyan. After filtering, we get 193 unique records. Got them!
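
The uniqueness filter might look like the sketch below. It assumes that "occurs only once" means the (last name, first name) pair appears exactly once among the records passed in, and the field names are again hypothetical.

```python
from collections import Counter


def name_pair(record):
    """Case-insensitive (last name, first name) pair."""
    return (record["last"].strip().upper(), record["first"].strip().upper())


def keep_unique_names(records):
    """Keep only records whose name pair occurs exactly once."""
    records = list(records)
    counts = Counter(name_pair(r) for r in records)
    return [r for r in records if counts[name_pair(r)] == 1]


# Made-up records: two John Smiths are ambiguous, the rare name is kept.
records = [
    {"last": "Smith", "first": "John", "city": "Provo"},
    {"last": "Smith", "first": "John", "city": "Ogden"},
    {"last": "Hovhannisyan", "first": "Serine", "city": "Logan"},
]
print(keep_unique_names(records))
```

Common names drop out because they cannot be told apart, which is exactly why the candidate set shrinks so sharply at this step.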

Most likely these are our agents, now with full data: mailing address, full name, date of birth and political preferences (the voter list records how each person has voted since 2002). Just in case, I will not publish the result: what if the Agency really does have a long reach :)

Let's compute some statistics on this data instead. For example, a histogram of age:



Minimum age: 21 years (for reference, the voting age in the US is 18)
Maximum: 90 years
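
For reference, ages and a coarse histogram can be derived from dates of birth roughly as follows. This is a stdlib-only sketch with made-up dates. The reference date and the ten-year bin width are my assumptions for the example, not values taken from the original analysis.

```python
from collections import Counter
from datetime import date


def age_on(birth_date, today):
    """Full years between birth_date and today."""
    years = today.year - birth_date.year
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1  # birthday has not happened yet this year
    return years


def age_histogram(ages, bin_width=10):
    """Count ages per bin; keys are the lower edge of each bin."""
    bins = Counter((age // bin_width) * bin_width for age in ages)
    return dict(sorted(bins.items()))


# Made-up birth dates and an assumed reference date.
today = date(2016, 3, 1)
births = [date(1994, 5, 20), date(1990, 6, 15), date(1981, 1, 2), date(1925, 12, 31)]
ages = [age_on(b, today) for b in births]

for lower, count in age_histogram(ages).items():
    print(f"{lower:>3}-{lower + 9:<3} {'#' * count}")
print("min:", min(ages), "max:", max(ages))
```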

Political preferences. I determined party affiliation either from the declared affiliation (this field is in the dataset) or from the voting history, if a person consistently votes for one of the parties.
Of the 193 people, 43 are Republicans and 32 are Democrats.
Interesting: I expected far more Republicans.
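
The affiliation rule can be expressed as a small function. This sketch reads "consistently votes for one of the parties" as "every recorded vote went to the same party", which is my assumption. The argument names and party labels are illustrative, not the dataset's real columns.

```python
from collections import Counter


def infer_party(declared, vote_history):
    """Declared affiliation if present, else the single party always voted for."""
    if declared:
        return declared
    parties = {party for party in vote_history if party}
    if len(parties) == 1:
        return next(iter(parties))
    return None  # no declaration and no consistent pattern


# Made-up people: (declared affiliation, list of parties voted for).
people = [
    ("Republican", []),
    ("", ["Democrat", "Democrat", "Democrat"]),
    (None, ["Democrat", "Republican"]),
    ("", []),
]
tally = Counter(infer_party(declared, votes) for declared, votes in people)
print(tally)
```

People who fall into the `None` bucket are simply left unclassified, which is why the two party counts in the text add up to far fewer than 193.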

How accurate is this data? The comments under the reddit link above point to datasets for most states. Information could also be collected from social networks, and ... no thanks. I don't want to spend the rest of my life in the Ecuadorian embassy.
Oh, someone is ringing the doorbell. One second, I'll see who is there. And then I will write about how to save

Source: https://habr.com/ru/post/280065/

