Parsing postal addresses from a string in C#

Not long ago I had to export data for one of my customers into yet another quasi-governmental format. Among other things, the export had to include structured postal addresses of individual clients: postal code, region, district, and so on, down to the apartment number.

Everything would have been fine, but the catch was that the clients' original addresses were stored as a plain string like "Kitezhgrad, Volshebnaya St. 22 Building Apt. 15". That is, on the one hand, nobody had ever heard of postal codes; on the other hand, a free-text input field leaves plenty of room for self-expression and folk creativity.

Without a second thought, I went looking for a ready-made solution online, reasoning that such a situation must be very common and had surely been solved by someone already. And so it was, except that instead of source code or at least a compiled library, I found myself staring at online services that offer postal address parsing through their API for very real money (the lowest price I managed to find was 10 kopecks per address).
Since I had no wish to voluntarily hand income over to a third-party organization, and by then I had caught a certain excitement, I decided to cobble together a solution of my own with as little effort as possible. The task was made easier by the fact that the customer did not require high parsing accuracy: errors of any kind would not lead to fatal problems.

To begin with, I looked at Tomita-parser, but after reading the multi-page configuration of the example that determines from a text who lives in which city ( http://api.yandex.ru/tomita/doc/dg/concept/example.xml ), my optimism faded somewhat, while the urge to reinvent the wheel grew stronger.

Naturally, this works only under fairly strict restrictions on the input data, which are assumed from here on:

1. The address is always written without typing errors: "Prospect Harpography" remains on the conscience of whoever entered it.
2. The address is written from the most general element (region) to the most specific (apartment number).
3. Given point 2, we ignore hint words like "region", "street", "avenue", "house". So if a city has both a Telepuzik Avenue and a street of the same name, we will not be able to tell them apart. Given how rare this situation is and that we are allowed to make mistakes, this is a perfectly workable option.

Next I needed a data source for postal codes. As it turned out, KLADR is already a thing of the past, and FIAS ( http://fias.nalog.ru ) is apparently what is in use now. After downloading an offline copy of this database, I began exploring what it offers.

I was particularly interested in two tables there: ADDROBJ, which stores all address objects as a tree, starting with the subject of the Russian Federation and ending with the street, and HOUSE<region number> (HOUSEXX below), which stores house numbers together with their postal codes and a reference to the records in ADDROBJ. The information in these two tables is enough to achieve both goals: checking that an address was parsed correctly (if the address is found in the database, we recognized it correctly) and determining the postal code.

An algorithm began to appear in my head:

1. We split the postal address string into address elements. By an address element I mean something for which a row can be found in the FIAS tables (district, city, street, house), as well as the apartment number.
   1. Unconditional delimiters of address elements are periods, commas, semicolons and slashes.
   2. Conditional delimiters are a hyphen or dash, if the address element after the hyphen is a number. For example, in "Depression Alley, 38a-117" the hyphen is a separator, and in "Ust-Zazhopinsk" it is not.
   3. A space may or may not be a separator. In "Eighth of March d.15", the space between "Eighth" and "March" obviously should not split the elements, but the one between "March" and "d." should. The simplest brute-force option is to generate every possible way of splitting the address elements on spaces and run the rest of the algorithm on each variant separately.
2. Hint words such as "street" ("ul"), "region" ("regional"), and so on are stripped entirely.
3. Starting from the very first element, all elements are looked up one by one in the FIAS database.
   1. If an element is found in the database, its GUID and LEVEL (level in the hierarchy) are remembered, and the next element is searched for with a larger LEVEL value and a PARENTGUID equal to the GUID of the previously found element.
   2. If no element is found for the given PARENTGUID, we try to build a chain that includes intermediate elements.
   3. The search starts in the ADDROBJ table; as soon as we are looking for the element that follows the street (the LEVEL of a street is 7), we switch to the HOUSEXX table of houses.
   4. If an address element is not found, it is simply ignored.
4. The winning variant (there may be several after step 1.3) is the one with the longest recognized chain.
5. For completeness, the chain is extended up to the very top using the ADDROBJ table. This is needed because, for example, the original address string may start directly with the city, without the region and district.
6. Then a little cheating. The apartment number is taken to be the last address element (if it was not recognized as a house number), and the building, structure, letter and everything else are the address elements between the recognized house number and the apartment number. A more detailed analysis would be possible, since the HOUSEXX table allows for it, but it seemed superfluous to me, if only because the postal codes are unlikely to differ for houses "113" and "113 Art. 1 building 4 lit. Zh."
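The chain lookup from step 3 and the completion to the root from step 5 can be sketched roughly as follows. This is a minimal in-memory illustration and not the code from the downloadable sources: the `AddrObj` record, its property names and the `ChainLookup` class are hypothetical simplifications of an ADDROBJ row, the table is a plain list instead of MS SQL, and the switch to the HOUSEXX table is left out. It assumes C# 9 or later for `record`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified ADDROBJ row: only the fields the algorithm needs.
public record AddrObj(Guid Guid, Guid? ParentGuid, int Level, string Name);

public static class ChainLookup
{
    // Step 3: walk the elements in order and collect the rows that were recognized.
    public static List<AddrObj> Resolve(IReadOnlyList<AddrObj> table, IEnumerable<string> elements)
    {
        var byGuid = table.ToDictionary(r => r.Guid);
        var chain = new List<AddrObj>();
        foreach (string element in elements)
        {
            AddrObj last = chain.LastOrDefault();
            // Candidates have the same name and sit deeper in the hierarchy than the last match.
            var candidates = table
                .Where(r => string.Equals(r.Name, element.Trim(), StringComparison.OrdinalIgnoreCase))
                .Where(r => last is null || r.Level > last.Level)
                .ToList();
            // Step 3.1: prefer a direct child of the last match.
            // Step 3.2: otherwise accept a row reachable through intermediate parents.
            AddrObj match =
                candidates.FirstOrDefault(r => last is null || r.ParentGuid == last.Guid)
                ?? candidates.FirstOrDefault(r => HasAncestor(byGuid, r, last));
            // Step 3.4: elements that are not found are simply ignored.
            if (match != null) chain.Add(match);
        }
        return chain;
    }

    // Step 5: prepend the parents of the first recognized row up to the root.
    // Gaps inside the chain (from step 3.2) are not filled by this sketch.
    public static List<AddrObj> CompleteToRoot(IReadOnlyList<AddrObj> table, List<AddrObj> chain)
    {
        var result = new List<AddrObj>(chain);
        if (result.Count == 0) return result;
        var byGuid = table.ToDictionary(r => r.Guid);
        Guid? parent = result[0].ParentGuid;
        while (parent != null && byGuid.TryGetValue(parent.Value, out AddrObj row))
        {
            result.Insert(0, row);
            parent = row.ParentGuid;
        }
        return result;
    }

    // Follows PARENTGUID links upwards; assumes the table is a tree without cycles.
    private static bool HasAncestor(Dictionary<Guid, AddrObj> byGuid, AddrObj row, AddrObj ancestor)
    {
        Guid? parent = row.ParentGuid;
        while (parent != null && byGuid.TryGetValue(parent.Value, out AddrObj p))
        {
            if (p.Guid == ancestor.Guid) return true;
            parent = p.ParentGuid;
        }
        return false;
    }
}
```

Running `Resolve` once per splitting variant and keeping the longest returned chain corresponds to step 4.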

The algorithm turned out to be empirical and naive, and it does not cover every possible situation. But given the limited implementation time and the generous allowance for errors, it looked quite satisfactory. Designing and implementing it took about one evening.

Out of habit and for convenience, I moved the ADDROBJ and HOUSEXX tables from DBF to MS SQL (how to convert them easily is described here: http://blogs.technet.com/b/isv_team/archive/2012/2012/14/3497825.aspx ).

The result is the AddressParser class, which takes an address string as input and returns an instance of the Address class. You can pass your own implementation of IKnwonAddressComparator to the AddressParser constructor if the current implementation, which is tailored to MS SQL, does not suit you.

In terms of speed, parsing runs at about 2-5 addresses per second. Slow, but better than doing it by hand. The main problem is the large number of variants to check that step 1.3 generates. Ideally, this step should be rewritten entirely so that the address database is consulted already at this stage to verify that address elements exist. As an intermediate option, the number of variants can be capped at a certain value.
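To show where the variants come from and how a cap might look, here is a minimal sketch of the brute-force space splitting. The `SpaceSplitter` class and its `Variants` method are hypothetical names for illustration and are not taken from the downloadable sources. A fragment of n words yields 2^(n-1) groupings, which is why long addresses become expensive.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SpaceSplitter
{
    // Lazily yields every way to group consecutive words into address elements.
    public static IEnumerable<IReadOnlyList<string>> Variants(string text)
    {
        string[] words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return words.Length == 0
            ? Enumerable.Empty<IReadOnlyList<string>>()
            : Group(words, 0);
    }

    private static IEnumerable<IReadOnlyList<string>> Group(string[] words, int start)
    {
        // Take words[start..end) as one element, then group the remaining words recursively.
        for (int end = start + 1; end <= words.Length; end++)
        {
            string head = string.Join(" ", words, start, end - start);
            if (end == words.Length)
            {
                yield return new[] { head };
                continue;
            }
            foreach (IReadOnlyList<string> tail in Group(words, end))
                yield return new[] { head }.Concat(tail).ToList();
        }
    }
}
```

Because the enumeration is lazy, a cap such as `SpaceSplitter.Variants(fragment).Take(50)` stops generating variants once the limit is reached; the value 50 is an arbitrary example, not a recommendation from the article.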

On a random sample of real data, 62% of addresses were parsed successfully. For this evaluation, an address counted either as fully recognized (down to the apartment) or as not recognized at all. No half measures.

The error distribution is as follows:

37% - typos. As a rule, simple omitted or extra letters: "Kiirova", "Moscow" ...
21% - the use of abbreviations: "K. Marx", "R. Luxemburg" ...
42% - houses missing from the FIAS database and, as a result, no way to determine the postal code or verify the whole chain. A very unexpected reason for me, although many people write that FIAS is still too raw for production use.

What conclusions can be drawn?

If, like me, you can live with low parsing quality and low speed, feel free to use it.

If you want to significantly improve recognition accuracy, you can try implementing fuzzy search. In addition, adding a list of popular abbreviations can also raise the percentage of successfully recognized addresses.
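As a rough idea of what both improvements could look like, the sketch below expands abbreviations from a small dictionary and then picks the closest known name by Levenshtein edit distance. Everything here is an assumption for illustration: the `FuzzyNames` class, the two dictionary entries and the default threshold of one edit are not part of the original implementation, and Levenshtein distance is only one of several possible fuzzy-matching techniques.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FuzzyNames
{
    // Hypothetical abbreviation list; a real one would be much longer.
    private static readonly Dictionary<string, string> Abbreviations =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["K. Marx"] = "Karl Marx",
            ["R. Luxemburg"] = "Rosa Luxemburg",
        };

    // Replaces a known abbreviation with its full form; other names are only trimmed.
    public static string Expand(string name) =>
        Abbreviations.TryGetValue(name.Trim(), out string full) ? full : name.Trim();

    // Classic Levenshtein distance computed with two rolling rows.
    public static int Distance(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            (prev, curr) = (curr, prev);
        }
        return prev[b.Length];
    }

    // Returns the closest known name within maxDistance edits, or null if there is none.
    public static string FindClosest(string input, IEnumerable<string> knownNames, int maxDistance = 1)
    {
        string query = Expand(input).ToLowerInvariant();
        return knownNames
            .Select(n => (Name: n, D: Distance(query, n.ToLowerInvariant())))
            .Where(x => x.D <= maxDistance)
            .OrderBy(x => x.D)
            .Select(x => x.Name)
            .FirstOrDefault();
    }
}
```

With this sketch, "Kiirova" would resolve to a known "Kirova", since the two differ by one extra letter. Comparing every input against every known name is slow, so in practice such a search would have to be limited to the children of the already recognized parent element.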

Performance is a separate topic. Given the elementary, non-optimal implementation, it can certainly be improved as well. The first candidate here is smarter splitting on spaces.

But all this is a completely different story.

The source code can be downloaded here: https://yadi.sk/d/muzi9b6qZ8DWh
A test MS SQL database with houses for regions 38 and 78 can be found here: https://yadi.sk/d/ERXyDXv7Z8Dab

Source: https://habr.com/ru/post/232347/
