Recognition of email addresses

The case began with the fact that one small English company decided to send out flyers to its existing and potential customers.
There was a problem: there is a separate internal database of customers who placed orders over the phone; separate database of web clients who placed orders on the site; and several bases of “potential clients” from various informants.
Thousands of clients hit several bases at once, or even several times into one base.
If a client who has “lit up” five times, receives five identical leaflets with a slightly different spelling of a name or address, the effect of such a campaign will be the opposite - not to mention the senseless costs of extra leaflets.
How to weed out replays in the mailing list?

Among all the data about the client, the most unambiguous thing that determines it is the zip code (postcode). This is not enough, but it is a good starting point.

Zip Codes

The indices of the British Royal Mail (Royal Mail) differ significantly from the numerical postal codes adopted in most countries:

alphanumeric indices are used, ranging in length from 5 to 7 characters, with a space in the middle, for example, NW1 6XE for the famous address “221B Baker Street, London”;
one index corresponds to an average of 15 recipients (delivery points) - this is approximately one side of the street from the intersection to the intersection;
Zip codes are widely used for geolocation, for example on Google Maps: maps.google.com/maps?q=NW1+6XE positions the map immediately to the desired quarter; in large cities, for convenience of orientation, the postal district is signed on the houses along with the street name;
The index has a hierarchical structure: region (area; part of the county), district (district; city or part of the city), sector, unit. In our example, the NW area corresponds to northwest London, the NW1 district is part of Westminster and part of Camden . An example of a less densely populated area: the EX ( Exeter ) area corresponds to the northern half of Devonshire, and the EX20 district corresponds to a pair of North Tawton and Okehampton cities and their environs.
As you can see, the division of postal recipients into regions is in no way consistent with the administrative division: many postal areas cross the borders of the counties, and some - the borders of England. The only condition for splitting was approximately equal number of recipients in each mail area.

Each recipient is assigned a two-character DPS (delivery point suffix) - together with the index, it uniquely identifies the mailbox to which the mail is delivered. For example, the house of Sherlock Holmes received a mailbox NW1 6XE 1F.
Such an “extension” of the index is similar to the American ZIP + 4; the difference is that the British DPS is not advertised to the general public, and most residents have no idea what DPS is, and even more so - what DPS is in their mailbox.
The only advertised use of DPS is reduced rates for bulk mailings, with a bar code printed on each envelope, including an index and DPS.

We decided that the pair “Indexes + DPS” suits us as an unambiguous address identifier: although theoretically several clients can have a common mailbox (suppose they rent half a flat), and large customers can have several boxes and a separate mailbox in each case, but the principle of “one mailbox - one flyer” is captivating with its transparency.
')
The action plan was defined: for each address, we define the DPS, and if two entries match the index and DPS, then this is the same address.

Email Addresses

If one is accustomed to the rigidly structured Soviet postal addresses of the “city-street-house-apartment” it can be unclear: why bother with DPS? Why not look for replays directly in the addresses?

The fact is that British postal addresses have a very loose structure, making it impossible to compare them head-on:

a house may have several numbers (for example, “15-17 Railway Road”); there may not be a number at all (for example, Safari House, Hospital Lane);
the street in the address may not be indicated - only the building and settlement; for example, "Former St Mary's Church, Tremadog";
there may be a street and a “sub street”, for example “Second Drive, Dawlish Road, Teignmouth” - unlike it, there is also “Second Drive, Landscore Road, Teignmouth”;
may indicate the area of the city; in different areas there may be streets with the same name. Since there are no clear boundaries between the districts, everyone chooses by his own understanding which district should be indicated in his address and whether it should be indicated at all.
Mail does not recommend to specify the county in the address, but most still indicate. Since large cities are not subject to counties, some residents specify the name of the city twice: once as a settlement, the second time as a county.
special confusion with London: it may be referred to as “Inner London, London”; and as "London, South London"; and in many other ways - despite the fact that according to the postal requirements, most residents of the capital simply write London, and indicate the street - as in the example of the Holmes house.

Agree that automatically determining the identity of “221B Baker Street, London NW1 6XE” and “Sherlock Holmes Museum, Baker St, Westminster, Greater London NW1 6XE” is an impossible task. But the same house could also be indicated as part of a “multi-home” address, for example, 219-221B

Mail base

Royal Mail officially (and cheaply) offers a database of all existing email addresses - Postcode Address File (PAF). For each address, the “standard” spelling, index, and DPS are given. (Purchase of PAF is the only legal way to determine DPS by address.) In the form of archived playtext, the entire database occupies 300MB.

The “standard” spelling of the address is divided into logical fields: the name of the organization, house, street, settlement.
Each of these fields is further divided: the name of the organization may include the name of the department; house name - the number or name of the apartment; and so on.
For example, here are three PAF entries related to addresses on the same street:

Department Organization			Microwhite
Part of the house House number House name	Flat 3 Westwood farm house	Flat sixteen	76A
The name of the town Type of street Street name Street type	Main Street	Main Street	Main Street
Subdistrict District / Suburb City	Keevil Trowbridge	Keevil Trowbridge	Keevil Trowbridge
Index DPS	BA14 6LU 2P	BA14 6LU 2W	BA14 6ND 1J

This example further illustrates the hierarchy of postal codes: BA postal area includes the city of Bath and part of Somerset and Wiltshire counties; BA14 district - Trowbridge town and surrounding villages; sector BA14 6 - a dozen villages east of Trowbridge, and among them Kivil.

Address Recognition

So, we have a client address with an index, and we need to determine its DPS. The frontal approach - to list all the addresses corresponding to the given index, and choose the "most similar" one - was rejected because of the inevitability of the "false positives". Addresses corresponding to one index are most likely distinguished by the number or name of the house; and this is precisely the part of the address in which alternative forms may have little in common with one another. In addition, there is no guarantee that the index in the database is written correctly.

I decided instead to analyze the address “bottom up”, repeating the logic of reading the address by the postman: first recognize the city, then the street, finally the house, and finally compare the index with the recorded one. If the index matched - this is pretty convincing evidence that the address is recognized correctly.

The initial data in the customer bases were filled in four fields: address, city, county and index. Since the full address was divided into these four fields manually, then the content correspondence with the purpose of the fields was rather inaccurate; long addresses are often filled in the first two fields, displacing the name of the city in the field "county"; short addresses - on the contrary, were indicated together with the city in the first field, while the county fell into the "city" field, and the "county" field remained empty. All this caused difficulties in recognizing addresses: it is not clear which part of the address to compare with the list of cities, which part with the list of streets, etc.

Other complexity consisted in numerous misprints and discrepancies in addresses. (The British toponymy is terrifying in many people: “Liverpool is being written, but Manchester is being read”; it is hardly even possible to record an address by ear on the phone without a single error.) To take this into account, an algorithm for fuzzy string comparison was required. Google brought me to the library SimMetrics for MS SQL, released by the University of Sheffield. Having tried several metrics on real examples of addresses, I chose the Jaro-Winkler metric. It is similar in its idea to the Levenshtein distance , but gives more weight to the characters at the beginning of the lines than at the end. It turned out that most typos were actually allowed in the second half of the name; in the first two letters there were practically no misprints.

The overall architecture of my recognizer is a cascade of tabular UDF-functions, each of which “bites off” the piece from the end of the address, tries to recognize it, and filters the input strings or adds new versions of their parsing. The first function considers all eligible counties, the second one considers all eligible cities in these counties, and so on. It turns out that in the course of recognition all (thousands) possible interpretations of a partially read address are considered. Perhaps it would be more effective to use something like the “method of branches and boundaries”: to consider each time only the most likely partial interpretation, and to stop the analysis as soon as a suitable complete interpretation has been found. I decided that such an implementation would not have responded to the “spirit” of T-SQL, focused on uniform processing of arrays of data; in addition, the implementation in the form of a cascade of independent functions facilitates the testing and reuse of code.

County Recognition

So, we start from the end of the address, from the name of the county. By mail rules, specifying the county in the address is optional, and it can be ignored when recognizing the address. We are looking for the name of the county in the field "county"; if we find, we bite off and return the remainder. If we don’t find it, there are two possible options: either the county is not indicated, and the whole line must be returned as a remainder in order for the next function to find the name of the city in it; or the name of a non-existent county is indicated (I met South Buckinghamshire, Central London and even England), and then the whole line needs to be bitten off - we will look for the city in the next one.
A couple of additional subtleties - what can be encountered both the full and abbreviated name of the county; and that if the county coincides in name with the city, then it does not need to be bitten off - we need this city at the second stage.

CREATE FUNCTION Address_GetCounty(@address varchar(30)) RETURNS TABLE AS RETURN select distinct county, isnull(data.leftover, deflt.leftover) leftover, sim from ( select max(county) county, max(leftover) leftover, max(sim) sim from ( select top 1 county, leftover, sim from ( select county, leftover, length, sim, row_number() over (partition by county order by sim desc) as rn from ( select county.name as county, left(@address,len(@address)-lv) as leftover, lv as length, dbo.maxf(dbo.JaroWinkler(county.name,upper(right(@address,lv))), dbo.JaroWinkler(upper(county.abbrev),upper(right(@address,lv)))) as sim from PAF.dbo.Counties_ExceptPosttowns county, dbo.Generate(len(@address)) l where left(right(@address,lv),1) like '[A-Za-z]' --word boundary and (left(right(@address,l.v+1),1) not like '[A-Za-z]' or lv=len(@address)) ) data ) data where rn=1 and sim > 0.92 order by length desc, sim desc union select null, null, null ) data ) data, (select @address leftover union select '') deflt

The function Generate (n) returns the natural numbers from 1 to N; Perhaps, in modern versions of MS SQL, there is already a built-in way to generate such a table.
The threshold is 0.92 for similarity of the line to the sample chosen by experience. It is important that not the most “exactly suitable” option is chosen, but the longest of all “approximately suitable” ones. For example, the tail of the line “Souuth Gloucestershire” fits exactly the pattern of “Gloucestershire”, but we need to bite off not his, but a longer and less accurate match to the pattern - “South Gloucestershire”.

Recognition of a locality

The second stage is the recognition of the city in the “city” field and the unrecognized remnant of the “county” field. Since there are thousands of cities in Britain, the fuzzy comparison of each tail line with each city name takes an unreasonably long time. To optimize the search, we first try to find an exact match (using an index), and only if there is no exact match, do we perform a fuzzy search. There is a drawback to this approach - in cases like “Souuth Gloucestershire”, an exact match will be found, and not the desired longer inaccurate.
Another couple of subtleties: it could turn out that the county was specified twice (for example, in the form of “West Yorkshire, Yorkshire”) - this happened when the address was short, and the person who entered it wanted to fill in all the lines. Therefore, address recognition begins with the fact that we are trying to bite off another name of the county, if it coincides quite accurately with the sample. Another subtlety - apart from the official list of “postal cities”, “pseudonyms” are allowed in the addresses; Royal Mail publishes them in a separate database of Alias, which can be purchased in addition to the PAF. (For example, instead of the regulatory form “Baker Street, London,” it is permissible to write “Baker Street, Westminster”.) Finally, when comparing the names of cities, we consider a single space and a hyphen: an Englishman will write “Bradford on Avon” or “Bradford- on-avon ”, but only one spelling is considered normative.

 CREATE FUNCTION Address_GetPosttownId(@address varchar(60)) RETURNS @res TABLE(local_id int, posttown varchar(30), leftover varchar(60), sim float, byalias bit) AS BEGIN --chip away double county, on a very good match SELECT @address=leftover FROM Address_GetCounty(@address) WHERE sim>0.95; INSERT INTO @res select distinct isnull(Localities.local_id,alias.local_id) as local_id, name.posttown, leftover, sim, byalias from ( select top 1 posttown, left(@address,len(@address)-lv) as leftover, 1 as sim, byalias from (select 0 as byalias, posttown from PAF.dbo.Localities union select 1 as byalias, name from PAF.dbo.LocalityIndex) names, dbo.Generate(len(@address)) l where left(right(@address,lv),1) like '[A-Za-z]' --word boundary and (left(right(@address,l.v+1),1) not like '[A-Za-z]' or lv=len(@address)) and posttown=upper(right(@address,lv)) order by lv /*length*/ desc ) name left join PAF.dbo.Localities on name.posttown=Localities.posttown left join PAF.dbo.LocalityAliases alias on name.posttown=alias.name IF @@rowcount!=0 RETURN; INSERT INTO @res select distinct isnull(Localities.local_id,alias.local_id) as local_id, name.posttown, leftover, sim, byalias from ( select top 1 posttown, leftover, sim, byalias from ( select posttown, leftover, length, sim, byalias, row_number() over (partition by posttown order by sim desc) as rn from ( select posttown, left(@address,len(@address)-lv) as leftover, lv as length, 0 as byalias, dbo.JaroWinkler(replace(posttown,'-',' '),replace(upper(right(@address,lv)),'-',' ')) as sim from (select 0 as byalias, posttown from PAF.dbo.Localities union select 1 as byalias, name from PAF.dbo.LocalityIndex) names, dbo.Generate(len(@address)) l where left(right(@address,lv),1) like '[A-Za-z]' --word boundary and (left(right(@address,l.v+1),1) not like '[A-Za-z]' or lv=len(@address)) ) data ) data where rn=1 and sim > 0.92 order by length desc, sim desc ) name left join PAF.dbo.Localities on name.posttown=Localities.posttown left join PAF.dbo.LocalityAliases alias on name.posttown=alias.name RETURN END

The next problem I encountered was that in some addresses two postal cities were indicated at once: the smaller and the larger; - although both had their own postal district: for example, “Hessle, Hull”. (Hessl belongs to the district of HU13, and to Hull - the districts from HU1 to HU12.) In these cases, the second city should simply be ignored.
Sometimes the postal city was not indicated at all in the address: only the name of the district. Then we return the entire address as a remainder.

 CREATE FUNCTION dbo.Address_GetPosttownId_2(@address varchar(160), @town varchar(60)) RETURNS @res TABLE(local_id int, posttown varchar(30), leftover varchar(160), sim float, byalias bit) AS BEGIN insert into @res select local_id, posttown, @address+' '+leftover, sim, byalias from Address_GetPosttownId(@town); -- multiple posttowns in address? insert into @res select distinct local_id, posttown, new.leftover, sim*.95, byalias from (select distinct leftover from @res) lo cross apply Address_GetPosttownId(dbo.rtrimna(leftover)) new -- also try without a posttown (rare) INSERT INTO @res VALUES(null, null, @address+' '+@town, .8, 0) RETURN END

The locality recognition is completed by the search for the name of the district / suburb in the “Address” field and the remainder of the “City” field. The code there turned out to be quite long, so I will only retell its logic. If the postal city is still unknown, try to look for a suitable area in the entire base; if known, only among the districts of this city. If the city is found by a pseudonym, or it has unnamed areas, then the name of the area may be empty: we return all the received address as a remainder. As before, it was necessary to provide for an inaccuracy that is often found in addresses — an indication of the name of a district where it is not necessary to indicate it by postal rules. Therefore, if a city with nameless areas is transmitted, and the tail of the address is similar to one of its named areas - we will bite off the name of the area from the address, but we will return the nameless identifier. I admit that my code completely ignores the names of the subareas: in real addresses they are quite rare.

Street Recognition

The next stage is street recognition. We are looking for among all the streets of the found settlement firstly coinciding exactly with the tail of the rest of the address, then approximately the same. If it was not possible to find even approximately the same one, and the search was carried out in an unnamed area, we will try to search in all areas of the same city: it is possible that the address forgot to indicate the area that is required by postal rules. If there is no similar street in any district, we’ll check one more opportunity before we finally give up: perhaps there are addresses without a street in the selected district. Such addresses are rare, so we check them last. We do not check the aliases of streets from the Alias database at all: they turn out to be even rarer.

 CREATE FUNCTION Address_GetThoroughfare (@local_id int, @address varchar(160)) RETURNS @res TABLE(local_id int, thfare varchar(30), thfare_id int, thdesc_id int, leftover varchar(160), sim float) AS BEGIN INSERT INTO @res select top 1 @local_id, thfare, thfare_id, thdesc_id, leftover, 1 from ( select distinct thfare, thfare_id, thdesc_id, left(@address,len(@address)-lv) as leftover, lv as length from PAF.dbo.ThoroughfareIndex, dbo.Generate(len(@address)) l where left(right(@address,lv),1) like '[0-9A-Za-z]' --word boundary and (left(right(@address,l.v+1),1) not like '[0-9A-Za-z]' or lv=len(@address)) and local_id=@local_id and thfare=upper(right(@address,lv)) ) data order by length desc; IF @@rowcount!=0 RETURN; INSERT INTO @res select top 1 @local_id, thfare, thfare_id, thdesc_id, leftover, sim from ( select thfare, thfare_id, thdesc_id, leftover, length, sim, row_number() over (partition by thfare order by sim desc) as rn from ( select distinct thfare, thfare_id, thdesc_id, left(@address,len(@address)-lv) as leftover, lv as length, dbo.JaroWinkler(thfare,upper(right(@address,lv))) as sim from PAF.dbo.ThoroughfareIndex, dbo.Generate(len(@address)) l where left(right(@address,lv),1) like '[0-9A-Za-z]' --word boundary and (left(right(@address,l.v+1),1) not like '[0-9A-Za-z]' or lv=len(@address)) and local_id=@local_id ) data ) data where rn=1 and sim > 0.92 order by length desc, sim desc; IF @@rowcount!=0 RETURN; --try dependents INSERT INTO @res select dep.local_id, thfare, thfare_id, thdesc_id, leftover, sim from PAF.dbo.Localities parent join PAF.dbo.Localities dep on parent.posttown=dep.posttown and dep.dependent1 is not null cross apply Address_GetThoroughfare(dep.local_id, @address) where parent.local_id=@local_id and parent.dependent1 is null IF @@rowcount!=0 RETURN; --still no luck? if exists (select 1 from PAF.dbo.Addresses_ByThoroughfare(@local_id, null, null)) INSERT INTO @res VALUES (@local_id, null, null, null, @address, .8); RETURN END

Then, in the same way, only without searching in other areas, we are trying to recognize the street side. Most of the addresses are without a side street, so now the nameless version is always considered, and not least. The function code Address_GetThoroughfare_dependent I do not cite.

An important added bonus of using the Jaro-Winkler metric is that it did not need to make a list of abbreviations for street type names (Rd = Road, Ln = Lane, Ct = Court, etc.): the selected metric falls very weakly when the letters are missed. at the very end of the name.

Recognition of the addressee

The final stage is the recognition of the recipient of the mail, including the desired DPS, in the remainder of the “address” field. For most companies, the company name is a necessary element of the mailing address, so now we will add it to the address. Recipient recognition logic is the most convoluted part in the whole code. First of all, we erase the words “the” from the resulting address: the difference in their placement is the most common discrepancy between the names of houses and companies in PAF and in real addresses. We try to find the tail of the address among the names of the houses, as well as among their parts: if the address “15-17 Railway Road” is defined in PAF, then we will compare the tail with the lines “15-17”, “15”, “17”. When comparing with the names of houses, we also reject the initial “Flat”: in PAF it is usually spelled before the apartment number, in real addresses it is often not.
If the tail is not similar to any of the house names, try to find it among the company names: quite often the company name is the “implicit” house name. For comparing the names of companies, the threshold metrics had to be reduced to 0.85 - the discrepancies between the spelling of the name in PAF and the address turned out to be so great.
The last comparison, in case there is no coincidence with the names of houses or with the names of companies - with the names of parts of the house: sometimes, when both the house and the extension have their own names, only the extension name was indicated in the address.
And again, I confess that the pseudonyms of houses from the Alias base were not considered: it turned out that the addresses almost always use the “main” names of the houses.

We combine the written functions into one big query:

 CREATE FUNCTION Address_MatchDPS (@CUNAME varchar(40), @AD_ADDRESS varchar(160), @AD_ADDRESS_USER1 varchar(30), @AD_ADDRESS_USER2 varchar(30), @AD_POSTCODE varchar(12)) RETURNS TABLE AS RETURN select distinct postcode, dps, oname from Address_GetCounty(dbo.rtrimna(@AD_ADDRESS_USER2)) cross apply Address_GetPosttownId_2(@AD_ADDRESS, dbo.rtrimna(@AD_ADDRESS_USER1+' '+leftover)) address cross apply Address_GetPosttownId_dependent(local_id, byalias, dbo.rtrimna(address.leftover)) dep cross apply Address_GetThoroughfare(dep.local_id, dbo.rtrimna(dep.leftover)) tf cross apply Address_GetThoroughfare_dependent(tf.local_id, thfare_id, thdesc_id, dbo.rtrimna(tf.leftover)) dtf cross apply Address_GetBuilding(tf.local_id, thfare_id, thdesc_id, dbo.rtrimna(@CUNAME+' '+dtf.leftover)) bu where postcode=isnull(@AD_POSTCODE,postcode) --AD_POSTCODE=NULL disables check

According to the client, he finds suitable addresses in PAF, and returns the found indices, DPS, and company names. If there are multiple entries for a single customer, the caller will try to filter them by company name. (After all, if the address was both the name of the house and the name of the company, then Address_GetBuilding compared only the name of the house; and other companies could be located in the same house.)
Select distinct in Address_MatchDPS is necessary because the same mailbox could be found in various ways — for example, by discarding different elements of the source address.

The resulting function worked at a speed of about 500 addresses per hour, with most of the work, apparently, related to the function calls JaroWinkler; on a four-processor server, running four instances of the script allowed to quadruple performance. Successfully recognized about 60% of the addresses; among the remaining ones there were both “too non-standard” forms, and simply incorrect or incomplete, or with an erroneous index. Now employees of a small English company should look through all unrecognized addresses manually, and correct inaccuracies: the database of exact addresses of all customers is worth the effort.

Source: https://habr.com/ru/post/106207/

All Articles