
DaData.ru is a service for automatic verification, correction and deduplication of contact information (name, addresses, phone numbers, email, passports).I have 453 contacts on my phone. There are duplicates among them: one and the same person is recorded as “Lech”, then as “Alexey Megaphone”, or even as “Zinoviev, Alexey Ivanovich”. Lehi has Skype and his birthday, Alexey Nikolaevich has his email and main mobile number, and Megafon has a spare number from a clear operator.
In telephone contacts, duplicates are unpleasant, but not very annoying. Worse, when such a leapfrog begins with the company's client base.
Problem
When customer contact information is “spread out” across several Excel files or databases, they complicate life:
')
- It is not clear how much the client costs . In 2003, Viktor Petrovich turned to MoiTvoyStrakh and insured life, in 2007 a car, and in 2012 a house. As a result, he was brought into accounting systems three times. It is impossible to automatically calculate profits and losses according to Viktor Petrovich, how much insurance should cost him - it is not clear.
- Newsletter turns into a nightmare . Vika’s marketer manually copies and merges phones and emails from a dozen Excel files. Hurries, swears, wrong. In a hurry, losing two thousand contacts. When it turns out, the director is very unhappy with Vika.
- Furious customers . The provider Svyazinterkom customers are duplicated in three different bases (the company has been operating for 20 years, it has accumulated a notable IT zoo). As a result, the client Fedor repeatedly receives repeated advertising letters, calls and SMS. When Fedor’s patience ends, he goes to Guttelecom.
Decision
Find and
destroy combine identical customers. This is exactly what DaData.ru does: it
finds duplicates among clients, addresses and telephones. Combines them and builds a "reference" client base for marketing, CRM and analytics.

Who is useful:
- Marketer . Build a single list of clients from a dozen excel-files - for distribution or download to CRM.
- Sales Department . Create a register of name and phone numbers from several bases - for telemarketing.
- Trading company . Find the same outlets from different dealers, by comparing to the address - to correctly calculate the profit of the outlet.
- And, ultimately, to the developer . Solve business problems without earning gray hair and not killing the best years of life.
Easier than writing your bike
Pfff, find duplicates, think. Here, do not thank:
address1 == address2
Oh yeah, there may still be typos. Then so:
similarity (address1, address2) > 0.95
Come on:
> similarity ( " 11/-89", ", , 11 , 89") > 0.95 False
It turns out that the data must first be normalized, brought to a “canonical” form (“Moscow time Sukhonsk 11 / -89” → “127642, Moscow, Sukhonskaya ulitsa, d 11, ap 89”). And compared with caution, but it will turn out like this:
> similarity ( ", - 1-, 20", ", - 3-, 20") > 0.95 True
And do not forget when searching for duplicates:
- check in several scenarios: full name + date of birth + phone number, full name + address, address + phone + email - so as not to miss duplicates whose parts of the fields are not filled in;
- come up with an efficient algorithm, otherwise the complexity of O (n 2 ) per 100 thousand clients will give ~ ½10 10 client comparisons with each other;
- to distinguish between "guaranteed" (you can automatically merge) and not guaranteed (first check manually) duplicates - otherwise you merge the excess.
Not the easiest thing. And in Dadat everything is ready.
More precisely than checking manually
People often make mistakes in addresses and phones, or they write the same thing differently:
Novosibirsk, st. Pearl, 2
Zhmuzhna nsk 2, entrance 4
Sovetsky district, Novosibirsk region,
Zhemchuzhnaya street, 2, apartment 98
Therefore, it is hard to compare customers manually: a person does not perceive this data as identical. Of course, you can hire 200 operators to go through the entire base. Work will be long, costly, and as a result, still a lot of duplicates will be missed.
Dadata will process 100 thousand records in half an hour and divide the data into three groups:
- unique: customers that are only in one instance;
- similar: people with similarities in attributes, not strong enough to automatically merge;
- the same: exactly the same people.

Equal Dadata unite itself. And it is better to look at similar ones manually
“Ovchinnikov Fedor, 10/12/1990, Samara Kirova 12” and “Fedor ovchinnikov, Samara, fedor@thefedor.ru” - the same person? You can raise the history of his orders and understand, Dadata will not help here.
How it works and how much it costs
Dadata uses ready-made comparison algorithms for full names, addresses, and telephones, taking into account errors and typos. For eight years we have debugged them on projects with large corporate customers and now we give access to everyone.
When Dadata unites similar clients, from each takes the best: name, address, phone. If there are several addresses or phone numbers, take everything. Same - merges into one.
If customers are not similar enough to combine, reports this:
We will unite such clients Elena Baeva, was born on 10.11.1990 Moscow, Norilskaya Street, 17, apt. 25
Elena Baeva Narilskaya MSK, house 17 apt. 25, floor 4 | And these - no (father and son) Alexey Efremov, 06/18/1951 g Novoshakhtinsk, ul Krasnyh Zor, d 7
Alexey Efremov, 03/12/1976 g Novoshakhtinsk, ul Krasnyh Zor, d 7 |
Works with files, API yet. Write in the comments if needed (and how you would use it).
It costs 25 kopecks per record in the file (10,000 records = 2,500 rubles). Statistics on the file and view 100 entries - for free.
Try it yourself .