How to choose an algorithm for the address filter

Quite often articles appear on Habré with new algorithms for automatically parsing addresses written in one line. In addition, address processing services are provided by various IT companies. In the article we will explain how to use your address base to select the algorithm for automatic address resolution, and what you should pay attention to when testing and developing address filter algorithms.

This article is for everyone who stores customer data and wants to solve one of the following tasks:

make sure that the address exists, so as not to send the parcel or letter to nowhere;
split the address into components to understand where sales are best;
to add the address with the missing information in order to optimize the work plan of the couriers;
standardize addresses to find duplicate records of the same customer;
update and bring the addresses to the directory format in order to pass regulator checks.

The task of automatically parsing mailing addresses seems quite simple at first glance - take and match the address directory ( for example, FIAS ) words from the input string. But all those who take it, are buried in a large number of address features ...

What we know about addresses

To begin, introduce ourselves. We are engaged in the task of automated address analysis for more than 9 years. During this time we have worked with both large companies and small firms. We have accumulated a large sample of addresses describing the format of customer data in order to understand well how our ideas influence the quality of address processing in real systems.

Over the past year, we have developed a new version of the algorithm (we call it the address filter) in order to put an end to the address resolution algorithm.
')

We define the task

We know three ways to get the current correct addresses:

get a good address from the client immediately (for example, with the help of prompts by addresses );
hire operators to manually parse addresses;
automatically parse the data.

The first option is the best, but not suitable for those who already have a large base of addresses of dubious quality.
The second option has a large percentage of well-parsed addresses, but, as our practice shows, is expensive and time consuming.
The third option alone will never provide such a percentage of well-parsed addresses, as in the second option, but it is cheaper and much faster.

We recommend using the synthesized version of the 2nd and 3rd paragraphs:

parse addresses automatically with an indication of the quality of parsing the address;
Addresses with a good quality indicator - send to business processes, and with bad ones - send to operators for analysis.

So you will get a large percentage of parsed addresses for a reasonable amount.

If you decide to use this option or parse the addresses only automatically, you will need to choose the right algorithm for automatic data parsing. How to do this, we will continue.

We prepare addresses

To select an algorithm, it is necessary to analyze the results of processing a certain volume of addresses by different algorithms. It seems logical to take a part of addresses from real data and add them with cosmetic corrections to check what percentage of addresses with errors and typos will be recognized correctly.

Fallacy first: automatically correct any typos - good

Most of our customers, who first encountered the problem of automatic address analysis, and we ourselves at first thought that typos correction was the main thing that any self-respecting algorithm should be able to do.

Subsequently, we realized that the correction of typos looks beautiful only at the stage of demonstrations, when instead of checking the algorithm at their addresses, customers invent unprecedented cases, admiring the transformations of the “ihonravov, Maskva, Yubileynaya, MK” type in “Moscow Region, Yubileiny, Tikhonravov St. In combat conditions, this functionality is not only not used, but also hurts the work with the main address database.

Our research shows that in the source addresses of corporate systems, rarely more than 2% of addresses are typographical errors - among all of our customers the percentage of such systems is less than 5%. The majority of typos (about 95% of all typos) are systemic in nature, that is, it is either a common typo, for example, Maskva , or a correction of the type of street. 3rd Mytishchinskaya >>> st. 3rd Mytishchinskaya or st. Tolstoy >>> st. Tolstoy . These typos can be described by a finite set of rules that will allow them to be corrected.

What is bad typo correction in the general case? Making a correction of all misprints on n-grams, Levenshtein distance, etc., the algorithm tries to attract the address to the directory with a great chance to get something completely different from what was meant in the original address. In addition, the source address may contain additional information that is not in the address directory: company name, business center, how to get from the metro, etc. In the algorithm for correcting typos, these additions are more likely to be perceived as a normal component of the address.

For 9 years of work, we came to the conclusion that it is necessary to make a correction of typos only according to the rules, which guarantee that this typo can be brought only to the correct analyzed options.

Thus, we advise checking the algorithms only on real data without artificial distortions. For example, if you have a Moscow address of Pushkin 13 in the database, then use it, and not Mask Pushikino 13 .

Algorithms with typo correction should be treated with caution. The worst thing to which the use of the algorithm for correcting typos, described above, can lead to is getting incorrectly parsed addresses with a good quality code.

Fallacy second: the percentage of well-parsed addresses is the main criterion for choosing a filter (except for the cost, of course)

Any algorithm for automatically parsing addresses at the input accepts the address, and at the output - it gives it in a standardized form. Usually he is able to return a sign indicating whether the algorithm is sure in address parsing or not. Such a feature is usually called a quality code.

Addresses of our customers with a good parsing quality code automatically go into business processes, and with a bad quality code they are sent for manual parsing. The greater the percentage of addresses with a good quality code, the more the customer saves on the process of manual processing of addresses.

Thus, the main criterion for choosing an algorithm becomes the percentage of addresses with a good quality code.

One important point is often forgotten: it is much cheaper to give an address with a bad quality code to good manually than to correct the consequences in the system, which will result in incorrectly recognized addresses with a good quality code.

For example, we are currently developing a system for estimating the value of real estate, where the cost per square meter is known for each house, which is used to assess the client’s solvency when granting a loan. The system automatically analyzes the new ads for the sale of apartments in the network, standardizes the address and adjusts the average cost in the directory. If among the standardized addresses there will be many addresses with incorrect parsing and a good quality code, we will have many errors in the directory where instead of the real average apartment price it will be several times higher or lower. Such addresses are difficult to find, and they have a strong negative impact on business processes.

It is precisely this that badly corrects all the misprints: the algorithm tries to attract a deliberately bad address to a directory with a good quality code, which increases the percentage of the reverse error, that is, the percentage of addresses with a good quality code, but incorrectly standardized.

What addresses to pay attention to

When comparing address filtering algorithms, look not only at the percentage of addresses with a good quality code, but also at the percentage of incorrectly parsed addresses with a good quality code. It is best to prepare a sample of your addresses, including cases of writing addresses of increased danger, namely:

Addresses with typos or incorrect indication of the address component (for example, the 3rd Mytishchi instead of the 3rd Mytishchi ).
Ambiguous addresses , for which only the source data cannot be unambiguously determine what is at stake, including when analyzed by the operator. For example, missing or incorrectly specified address components: Moscow, Tverskaya can mean both Tverskaya Square and the street.
Error in specifying the type of the address component. According to our data, about 5% of customers' addresses contain certain errors indicating the type of address component: instead of “urban-type settlement” they write “village”, instead of “deadlock” they write “lane” and so on.
Error in specifying the component itself. Most often incorrectly indicate:
- The area in which the settlement is located, if it is located on the border of two regions. For example, in the address of the Moscow region, Dmitrovsky district, town of Zaprudnya , the area is incorrectly specified, correctly - Taldomsky.
- The region in which the object is located. Especially often it meets with the addresses of Moscow and St. Petersburg, for example:
  - Leningrad region, St. Petersburg, Fontanka
  - Moscow region, Moscow, st. Rastorgueva
  - Moscow region, Zelenograd, to 3113
A mistake in a name, rank, or other common misconception. For example, Lev Tolstoy is written instead of Alexei Tolstoy Street, or General Zhukov is written instead of Marshal Zhukov . Sometimes they also give some outdated or local, known only to local residents, name.
Duplicate words in the original string. Sometimes after many transformations and transitions from system to system, addresses with duplicate components are formed. For example, one conference had the address: Moscow, Moscow, Moscow, Moscow, Moscow, Leningradsky Avenue, 39, p. 79 . Obviously, the word Moscow here is written several times by mistake and the algorithm may not take into account duplicates from the original address. But is it always possible to delete duplicates? Another example: Sakhalin, Yuzhno-Sakhalinsk, Sakhalin is the address of Sakhalin Oblast, Yuzhno-Sakhalinsk, Sakhalin Street . A good algorithm should find duplicates only if they are really duplicates and do not divert the address to the wrong parsing.
Garbage or additional information in the source data. This is usually either the name or additional information about the building itself and how to get to it. For example: Ivanov Ivan, Pirogov 20 to 1 total 8/1 room 313, Novosibirsk, NSO or Moscow, Turchaninov, BC Krymsky Bridge, building 6, page 2, 2 minutes from the metro and a good salary . In such cases, all components of the input line that are not address components or frequently encountered additional address information (for example, metro or city districts) should be submitted to the operator for analysis, as they can affect the quality of address parsing.
Outdated addresses. These are the addresses that others now have: sometimes, the streets are renamed, it happens that they move to another settlement, merge, etc. When there are two addresses: Samara 13 passage and Samara Georgy Ratner , then it would be nice to understand that this is the same address. The algorithm should be able to update the address and set a good quality code for it only if it is updated.

We compare algorithms

When we have prepared a test sample, then everything is simple. We process addresses with different algorithms and compare them according to the criteria:

The percentage of parsing good addresses (that is, addresses without garbage, ambiguities, and typos). The algorithm must be able to correctly parse good addresses with a good quality code.
Percentage parse bad addresses. The algorithm should be able to parse bad addresses as well as possible, that is, if the address is bad, but can be well parsed with a good quality code, the algorithm should be able to do it.
The percentage of addresses with an inverse error. The algorithm must contain a minimal reverse error, that is, do not put a good quality code into the addresses with incorrect parsing. We think this is the most important point of all.
The presence of additional properties of the standardized address. The algorithm should provide convenient levers for analyzing and working with addresses with bad quality codes. At the same time, working with tools should be simple and straightforward.

findings

The task of automatic address analysis is not as simple as it seems at first glance. If you decide to choose an algorithm for parsing addresses or write your own, then you need to approach this process correctly: analyze existing addresses, make a representative sample for tests. We hope that this article will help you in this work and all your addresses will be disassembled automatically and correctly.

PS: Within a month we will install a new version of the address filter, which was discussed at the beginning of the article, on dadata.ru . Register to be in the know and be among the first researchers of the new algorithm.

Thank you chipQA for help in preparing the article.

Source: https://habr.com/ru/post/240633/

All Articles