Parsing addresses with “fuzzy regular expressions”

Summary: about the library written by me for comparison with a given dictionary of expressions in natural language - in particular, city addresses.

To the village grandpa

How many ways are there to write an address - in a geographical sense?

Even if you simplify the task and discard all sorts of little things like house number, building, plot, apartment, etc. - and work within the same city - how many options can you think of for writing the names of streets, squares, passages, alleys, etc.?
')
Take a simple example:
Flower Street - may be abbreviated as “st. Flower "and" Flower St. "- besides," St. "can be skipped (if there is no Flower Square in the city), and" Flower "you can write with errors" Tsvyatochnaya "or" Tsvitochnaya ", as well as" Tsvetoshnaya "- everything it will look not bad!

The example is more complicated:
2nd Horse Army Street - here the soul of the poet is where to roam. The number can be expressed as “2nd” or simply “2”, or even in words “Second” or with the abbreviation “Deut.” - the hyphen between the “horse” and “army” will occur in about 50% of respondents. It will be clear to the heap that the “horse-army” to someone seemed long and reduced to “2 Connoarm. st. "

Other interesting examples are related to the names “st. Sailor Zheleznyaka "(or simply" Zheleznyaka "?)," Etc. Maurice Thorez "(or" Marisa Theresa "?) And also very epic cases" st. 3rd line of the second half "," the road to the village of Rybatskoye "," st. Left Bank of Izhora ”- I ask you to forgive if I myself did not master them to write correctly from memory.

Well this is happiness for business.

Now imagine. that you are developing an application that, among other things, should “comb” (that is, lead to a unified look) all this geography. I ran into this in 2011 - the program had to process ads containing addresses before they were published.

From a business point of view, the necessity of “combing” was relevant for several reasons - not to issue several ads with the same (but differently written) addresses - or not to release ads with addresses from the black list for which you can get a penalty (for example, real estate in respect of which there is a judicial investigation, etc.)

In addition, for example, I counted six flowers in St. Petersburg, in different districts. Therefore, it was necessary to bring the streets in order in order to check whether it is in the specified area, and if not, to search in the nearby areas - if they had not missed.

Before this task fell into my hands, the old application used for the same purpose the services of a special person - the operator - who in a few hours passed about 30,000 lines through himself and corrected them at his own discretion. It was not very convenient for several reasons. This is also because the operator had a funny habit of sending 90% of unfamiliar streets to Kolpino. As a result, checking the stored dictionary files in the Kolpinsky district, I found more streets than in all other areas combined.

... and happiness to the programmer too

Why did the previous version of the application use the services of an operator?

It is not difficult to guess that the software processing of addresses requires some unobvious tricks. Among the amusing suggestions that I heard from bosses and colleagues were:

ANTLR is needed - it was necessary to explain that it is for machine languages and just does not allow excessive liberties, funny typos, etc .;
Use regexps - a proposal to write a regexp for all sorts of typos in the toponym “Flower Street” showed the author of this venture all its inconsistency;
Neural networks - the author of such a bright idea could not, unfortunately, explain in what way it is proposed to apply a neural network. In my estimation, if you make all possible place names of the city as outputs, and each character (or each bit) of the input line is the input, then probably the idea would have been successful - it would take some unimaginable number of options (and time) .

Experiments and approaches

Desperate to find suitable advice, I began to experiment. I read in Wikipedia about the algorithms of fuzzy string comparison. I tried. Indeed, to compare a word with a word is quite not bad.

But what about “st. Palace "and" pl. Palace ”- here the replacement of“ y ”by“ p ”is not a typo, although from the point of view of the algorithm its weight is the same as if“ Palace ”by“ Door ”is replaced.

Well, we divide the string into words (by spaces for example) and calculate misprints for them separately, and then we define something like an average, or we take the maximum error in each word as an error estimate.

This idea worked more interesting. However, the problem of abbreviations, permutations of words, the omission of optional words (as in “Alexander Matrosov Street” - they can easily miss “Alexander”) - she still hasn’t solved.

Without thinking, I undertook to create a "meta-language" to describe these rules. What I got is what I called the "Fuzzy Regular Expressions Library for Java".

In addition to the “reconciliation” functional with the specified expression, it turned out to be convenient to be able to write down the rules for replacing any parts directly in the expression.

Short description

The syntax of the language shows my great love for brackets. Any expression and subexpression is enclosed in brackets, the type of expression is determined by the character following the opening bracket. Round, square and curly brackets for readability are allowed - the main thing is that the opening and closing should match in pairs.

(= street, flower)

The symbol "=" means that the words must be present either in the specified order or in the reverse - i.e. This expression corresponds to either "Flower Street" or "Flower Street".

And yes, the allowable number of misprints is set by the threshold value when creating an object for comparison, by default it is for example 75%, which will allow you to skip both “Tsvyatochnaya” and even “Ulitsu Zvitoshnaya”.

(Barack, (? Hussein), Obama, (? Second))

Comma-separated words in brackets without any special character mean that a string of words is expected in that order. But the expression in brackets with "?" mean optional parts, i.e. if they are missing, they are skipped and errors are not affected by the calculation. In this case, “Barack Hussein Obama”, “Barack Obama II” and even “Barrack Abama” will be perfectly recognized.

(^ (Vladimir, Lenin), (Nikita, Khrushchev), (Karl, Marx))

The symbol "^" means the choice of one of the listed subexpressions - in this case, the expression can be compared with any of the three listed figures, while "Vladimir Marks" and "Nikita Lenin" will not pass.

There are other expressions - for checking with ranges of numbers (it would be inconvenient to apply fuzzy comparison to them), for checking with precise regexp (unexpectedly? But it is useful) - there is also an opportunity to check with abbreviations (using the "*" symbol) grab the named groups to further verify with them or use them in replacements - I described all this in the documentation.

The replacement functionality becomes relevant for large expressions, a brief actual example is difficult to give but we will try:

[^ ((^ Vlad *, Vladimir, V) | Vova) ~ A | Dear_ $ A, ((^ Al *, Alexander, Sasha) | Sasha) ~ B | Protivnyi $ B] ~ C | Hello $ C

It looks cumbersome, but the essence is simple - the name "Vladimir" in different styles will be replaced with "Hello Dear Vova", the different versions of Alexander will be welcomed differently. Fragments for which a replacement is not specified must be replaced with the pattern itself (therefore, for example, “Vlodimer” would be replaced with “Vladimir” if not for the replacement for “Vova”).

Conclusion

The expressions I compiled for the streets of 18 districts of the city (18 expressions in total) were put into a file of 130 kilobytes (however, including indents and other formatting elements), the work on creating a dictionary took me two weeks to complete. Later something similar had to be created for the list of settlements in the region. I hope you can excuse me for not trying to post this spam here - instead of it I will add the expression used below to parse the names of countries.

At the library things are shaky-shaky. It solves a rather narrow task (for many other tasks from the field of natural language processing, it is not suitable). For a couple of years, I see about 1000 downloads. With letters and questions asked about five people. Here it is useful to note that for the English language fuzzy comparison in the same form as for Russian is less effective (it is effective for typos, but not for errors related to pronunciation, because in English words are often written too differently than they are pronounced)

It was planned to rewrite the primitive algorithm - it has an unpleasant flaw with optional end elements due to which it is impossible to glue the expressions programmatically now - but I left for a new job and this task (release 2.0) went into deep “todo”, like the desire to mavenize and can refactor the project (the project was born at the very beginning of my java-career). If there are willing to continue this business in accordance with their own tastes - I will be only too happy!

Links

Fuzzy RegExps for Java is actually my library.

Javadocs - documentation and detailed description of the language of "fuzzy regexp"

TRE - another regexp library (on C) with the possibility of fuzzy comparison - I didn’t figure out how to apply it, it doesn’t have a version for Java, and I’d not have the option of replacing it; Comparison errors in it are also evaluated differently; nevertheless, it may be useful in other problems.

Damero-Levenshtein - a popular simple fuzzy comparison algorithm that I used inside to compare single words included in expressions.

Promised example

An example of the expression, "combing" the country - it was used for real estate ads outside of Russia. I chose it because it is somewhat simpler than expressions for toponyms of a city and at the same time can be used “as is” for similar purposes. (it is shortened so as not to look too disgusting, but the full version is available here ).

 (^
     Abkhazia, Australia, Austria, Azerbaijan, Albania, Algeria, Anguilla,
     Angola, Andorra, Argentina, Armenia, Aruba, Afghanistan, Bahamas,
     Bangladesh, Barbados, Bahrain, Belize, Belgium, Benin, Bermuda,
 // ...
     Fiji, Philippines, Finland, France, Croatia, Chad, Montenegro,
     Czech Republic, Chile, Switzerland, Sweden, Svalbard, Ecuador, Eritrea, Estonia,
     Ethiopia, Jamaica, Japan,
     [^ (Comp. *, Shta *, (? Amer *)), America, USA, USA] | United_States_America,
 // ...
     [^ Comoros, (Comorus *, (@ ISLAND))] | Comoros,
     [^ Congo, Brazzaville, ((@ REPUB), Congo)] | Republic of Congo,
     [^ Zaire, DRC, ({^ DR, (Democr *, (@ REPUB))}, Congo)] | DR_Kongo,
 // ...
     [^ Burma, Myanmar] | Myanmar,
     [(@ISLAND), Maine] | Isle of Man,
     [Nagore *, (? -), Kara *, (? @ REPUB)] | Nagorno-Karabakh_Republic,
     [^ Netherlands, Holland] | Netherlands,
 // ...
     [Yu *, Ossetia] | South_Osetia,
     [^ South Africa, {(^ South Africa *, {S *, (? -), Afr *}), (@ REPUB)}] | South Africa_Republic,
     [Yu *, Sudan] | South Sudan
 )

 :: ISLAND (^ island *, {o, -, v *})
 :: REPUB (^ p, rep, repub *)

Source: https://habr.com/ru/post/192518/

All Articles