📜 ⬆️ ⬇️

DaData.ru finds and destroys the same people.



DaData.ru is a service for automatic verification, correction and deduplication of contact information (name, addresses, phone numbers, email, passports).

I have 453 contacts on my phone. There are duplicates among them: one and the same person is recorded as “Lech”, then as “Alexey Megaphone”, or even as “Zinoviev, Alexey Ivanovich”. Lehi has Skype and his birthday, Alexey Nikolaevich has his email and main mobile number, and Megafon has a spare number from a clear operator.

In telephone contacts, duplicates are unpleasant, but not very annoying. Worse, when such a leapfrog begins with the company's client base.

Problem


When customer contact information is “spread out” across several Excel files or databases, they complicate life:
')

Decision


Find and destroy combine identical customers. This is exactly what DaData.ru does: it finds duplicates among clients, addresses and telephones. Combines them and builds a "reference" client base for marketing, CRM and analytics.



Who is useful:


Easier than writing your bike


Pfff, find duplicates, think. Here, do not thank:

address1 == address2 

Oh yeah, there may still be typos. Then so:

 similarity (address1, address2) > 0.95 

Come on:

 > similarity ( "  11/-89", ", , 11 , 89") > 0.95 False 

It turns out that the data must first be normalized, brought to a “canonical” form (“Moscow time Sukhonsk 11 / -89” → “127642, Moscow, Sukhonskaya ulitsa, d 11, ap 89”). And compared with caution, but it will turn out like this:

 > similarity ( ", - 1-,  20", ", - 3-,  20") > 0.95 True #  

And do not forget when searching for duplicates:


Not the easiest thing. And in Dadat everything is ready.

More precisely than checking manually


People often make mistakes in addresses and phones, or they write the same thing differently:
Novosibirsk, st. Pearl, 2
Zhmuzhna nsk 2, entrance 4
Sovetsky district, Novosibirsk region,
Zhemchuzhnaya street, 2, apartment 98

Therefore, it is hard to compare customers manually: a person does not perceive this data as identical. Of course, you can hire 200 operators to go through the entire base. Work will be long, costly, and as a result, still a lot of duplicates will be missed.

Dadata will process 100 thousand records in half an hour and divide the data into three groups:




Equal Dadata unite itself. And it is better to look at similar ones manually
“Ovchinnikov Fedor, 10/12/1990, Samara Kirova 12” and “Fedor ovchinnikov, Samara, fedor@thefedor.ru” - the same person? You can raise the history of his orders and understand, Dadata will not help here.

How it works and how much it costs


Dadata uses ready-made comparison algorithms for full names, addresses, and telephones, taking into account errors and typos. For eight years we have debugged them on projects with large corporate customers and now we give access to everyone.

When Dadata unites similar clients, from each takes the best: name, address, phone. If there are several addresses or phone numbers, take everything. Same - merges into one.

If customers are not similar enough to combine, reports this:
We will unite such clients
Elena Baeva, was born on 10.11.1990
Moscow, Norilskaya Street, 17, apt. 25

Elena Baeva
Narilskaya MSK, house 17 apt. 25, floor 4
And these - no (father and son)
Alexey Efremov, 06/18/1951
g Novoshakhtinsk, ul Krasnyh Zor, d 7

Alexey Efremov, 03/12/1976
g Novoshakhtinsk, ul Krasnyh Zor, d 7

Works with files, API yet. Write in the comments if needed (and how you would use it).

It costs 25 kopecks per record in the file (10,000 records = 2,500 rubles). Statistics on the file and view 100 entries - for free. Try it yourself .

Source: https://habr.com/ru/post/273251/


All Articles