
Excel Lemmatization, or "Robot Recognizer 3.0"

Anyone who has worked with online advertising no longer laughs at the circus: they know that search engines sometimes give unexpected answers to queries, or serve up completely different ads than the ones that might actually be interesting. In the latter case, the root of the problem often lies in the set of keywords the advertiser uses in his campaigns. Thoughtless automation of keyword selection leads to sad consequences, the most depressing of which are wasted impressions and clicks. Dmitry Tumaykin, Excel inventor and rationalizer at Realweb, puzzled over this problem and created another robot file, which he is happy to share with the world and with Habr. Once again, we hand the floor over to the author.



“In my previous article I talked about clustering large semantic cores using macros and formulas in MS Excel. This time the subject is even more interesting: word forms, lemmatization, Yandex, Google, Zaliznyak's dictionary, and once again Excel - its limitations, ways around them, and the incredible speed of binary search. The article, like the previous one, will be of interest to contextual advertising and SEO specialists.

So where did it all start?


As you know, the key difference between Yandex's search algorithms and Google's is support for Russian morphology. What this means in practice: one of the biggest advantages is that in Yandex.Direct it is enough to add any single word form as a negative keyword, and the ad will not be shown for any of its other word forms. I put the word "бесплатно" ("free") into the negative keywords, and there are no impressions for "бесплатный", "бесплатного", "бесплатные", or any other form of it. Convenient? Of course!
However, it is not all that simple. More than one article has been written about the oddities of Yandex's morphology, including posts here on Habr, and I have repeatedly run into them in my own work. The controversy continues to this day, but I still think this algorithm, despite everything, can be considered an advantage over Google's logic.

The oddity is this: if the set of word forms of, say, a verb or an adjective contains a homonym of a form from a noun's paradigm, Yandex effectively "glues" them together into a single set of word forms, and your ad will be shown for all of them.
Here is a vivid example (forgive me, NashLos):



As everyone knows, the short form of the adjective "великий" ("great") is "велик", which is a homonym of the colloquial word "велик", which, in turn, is a synonym for "велосипед" ("bicycle"). Naturally, the word "великий" itself does not have such a word form, so it is obviously wrong to show the ad for this query. In the language of linguists, Yandex confused the paradigms.

In organic search results the situation is much better; perhaps the algorithms there are more complex and better optimized. Or maybe it is a matter of higher competition: SEO is conditionally free (if you have your own webmaster on staff), while there are an order of magnitude fewer people willing to pay for every click in Direct, despite the positive trends in contextual advertising. Maybe the VCG auction will fix everything? We shall wait and see.

However, Google has quirks of its own. Google speaks Russian poorly and barely recognizes word forms. Unlike Yandex, in AdWords you have to exclude every word form in your negative keywords (which by itself is confusing enough). At the same time, due to the system's internal restrictions, the number of negative keywords at the level of a single campaign cannot exceed 5,000, and the total across all campaigns in an account cannot exceed 1 million. It would seem that this should be enough, but I assure you, it only seems that way.

In general, the conclusion I drew for myself while working with Direct and AdWords is that to achieve maximum results you will have to dig into word forms, whatever tool you use. Therefore, I needed a complete database of word forms, preferably one as close as possible to Yandex's algorithms. I was incredibly happy when I learned about Andrey Anatolyevich Zaliznyak (still living, and may God grant him good health and a long life), who created exactly such a dictionary. The dictionary contains about 100,000 paradigms, the most "multifaceted" of which contains 182 word forms. In total, the entire dictionary holds just over 2.5 million words. It formed the basis of a variety of morphology recognition systems. It was this dictionary that I found in electronic form on the Internet and successfully integrated into Excel for business needs.
Curious readers may wonder: why would anyone need to put 2.5 million words into Excel?

And I have five reasons for this, I answer:

  1. First, I was simply curious what kind of dictionary it was. The fact is that Yandex started supporting word forms by taking it as a basis and using it as a database. Since then, of course, Yandex's programmers have made significant progress, as can be seen from the latest version of Mystem, which includes an algorithm for resolving the homonymy described above (as I understand it, the algorithm recognizes the parts of speech of neighboring words and, based on this information, makes assumptions about the part of speech of the original "ambiguous" word). Nevertheless, the main competitive advantage of our Internet giant is its support for the morphology of the "great and mighty" Russian language, in part the result of the labor of the now 80-year-old professor.

  2. The free lemmatizers with a simple interface that I know of and that are publicly available (for example, from K50 or Andrey Kashin) do not meet my requirements, because their output does not match Yandex's algorithms. And since I am not their developer, I cannot fix this situation.

  3. Since most of the text processing takes place in Excel spreadsheets, and web interfaces may not always be accessible or “slow down” on large amounts of data, it’s more convenient for me to have all the tools “at hand” locally.

  4. “Robot Recognizer 1.0” without built-in normalization was not good for much, and I realized this myself. What is the point for a contextual advertising specialist of clustering an unnormalized core? You would still have to go to a web interface, normalize the queries, copy them back, and only then process them in Excel.

  5. After I discovered binary search in Excel, I wanted to try it out in action on really large amounts of data. And what is 2.5 million cells, if not large by MS Excel standards?

The birth of the lemmatizer, and binary search


That is why I decided to create my own lemmatizer, with blackjack and the notorious macros and formulas. Along the way, a clarification: lemmatization is the process of reducing a word form to its lemma, the initial dictionary form (the infinitive for a verb, the nominative singular for nouns and adjectives).

The result of the effort can be downloaded from the link: Robot Recognizer - 3

Visually, the file is almost the same as the previous version. The only difference is that it adds two extra sheets (the dictionary) and a macro that searches them and returns the initial form. Since Excel is limited to 2 to the 20th power rows per sheet minus one row (a little over a million), the dictionary had to be split across two sheets and the macro built around this feature. Initially it was assumed that the data would take up three sheets, but fortunately the dictionary turned out to contain a decent number of duplicates. They are duplicates only for the computer; to a person they can be different word forms of different paradigms.

At the heart of the file lies a data array that is huge by Excel standards. Processing such an array requires a lot of resources and can be quite slow. This problem was solved by the binary search in Excel that I mentioned at the beginning. A linear search algorithm would have to run through all 2.5+ million records line by line, which would take a very long time. Binary search processes the data very quickly, because it performs four basic steps:

  1. The data array is divided in half and the reading position moves to the middle.
  2. The value found there (call it n) is compared with the value we are looking for (call it m).
  3. If m > n, the second half of the array is taken; if m < n, the first half.
  4. Steps 1-3 are then repeated on the selected half of the array.

Simply put, the binary search algorithm works the way we look up a word in a paper dictionary. We open the dictionary in the middle and see in which of the two halves the word we need must be. Suppose it is in the first. We open that half in the middle and keep halving until we find the right word. In contrast to a linear search, which needs up to 2 to the 20th power operations (on a fully filled column in Excel), a binary search needs only about 20. Impressive, you must agree. You can verify the speed of binary search by working with the file: it contains almost 3 million cells, yet each word of your queries is found in a matter of seconds.
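For readers who want to see the mechanics rather than the analogy, here is a minimal VBA sketch of such a search over one sorted dictionary column. The function name, the sheet layout and the assumption that column A is sorted ascending are mine for illustration, not code taken from the Robot Recognizer file.

' Minimal sketch of a binary search over a sorted dictionary column.
' Assumption: the sheet holds one word form per row in column A, sorted ascending;
' the function name and layout are illustrative, not taken from the actual file.
Function BinaryFind(ws As Worksheet, key As String, lastRow As Long) As Long
    Dim lo As Long, hi As Long, m As Long
    lo = 1: hi = lastRow
    Do While lo <= hi
        m = (lo + hi) \ 2                        ' step 1: jump to the middle
        If ws.Cells(m, 1).Value = key Then       ' step 2: compare with the target
            BinaryFind = m                       ' exact match: return the row number
            Exit Function
        ElseIf ws.Cells(m, 1).Value < key Then
            lo = m + 1                           ' step 3: keep the upper half
        Else
            hi = m - 1                           ' step 3: keep the lower half
        End If
    Loop                                         ' step 4: repeat on the chosen half
    BinaryFind = 0                               ' 0 means "not found"
End Function

On a column of 2 to the 20th power rows this loop performs at most 20 comparisons, which is exactly the difference between scanning millions of rows and getting an answer almost instantly.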

All formulas and macros work only in the original file and will not work in other files. And one more thing: if you add entries to the dictionary in the file, then before processing you must re-sort the dictionary alphabetically - as you have already understood, the logic of binary search requires this.
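If the dictionary is extended by a macro rather than by hand, the sorting step can be automated as well. A sketch, assuming the dictionary occupies columns A:B and using the illustrative sheet name "Dict1" (not the name in the real file, where each dictionary sheet would need the same treatment):

' Re-sort a dictionary sheet ascending by the word-form column (A) after new rows are added.
' "Dict1" is an illustrative sheet name; apply the same sort to every dictionary sheet.
Sub SortDictionary()
    With Worksheets("Dict1")
        .Range("A1").CurrentRegion.Sort _
            Key1:=.Range("A1"), Order1:=xlAscending, Header:=xlNo
    End With
End Sub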

Of course, this solution cannot be called the most elegant, if only because it uses a huge formula of 3,215 characters. Those who want to see it with their own eyes and try to understand its logic can open the file and take a look.

View the formula

=TRIM(
IF(substring(A1;" ";1)<"м";
IFERROR(IF(VLOOKUP(substring(A1;" ";1);'А-Л'!$A:$B;1;1)<substring(A1;" ";1); substring(A1;" ";1); VLOOKUP(substring(A1;" ";1);'А-Л'!$A:$B;2;1)); "");
IFERROR(IF(VLOOKUP(substring(A1;" ";1);'М-Я'!$A:$B;1;1)<substring(A1;" ";1); substring(A1;" ";1); VLOOKUP(substring(A1;" ";1);'М-Я'!$A:$B;2;1)); ""))
& " " &
... the same IF/IFERROR/VLOOKUP block is repeated for substring(A1;" ";2) through substring(A1;" ";10) and the results are joined with spaces, 3,215 characters in total ...
)

Here substring(A1;" ";N) extracts the N-th space-separated word of the query in A1, 'А-Л' and 'М-Я' are the two dictionary sheets, and the final argument 1 (TRUE) of VLOOKUP requests an approximate match over the sorted dictionary, i.e. the binary search described above.
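The logic buried in that formula is easier to read as a small VBA function: pick the dictionary sheet by the word's first letter, run an approximate-match VLOOKUP (which Excel performs as a binary search over sorted data), and return the lemma from the second column only if the word form was actually found. This is a sketch of the idea under those assumptions, not the code shipped in the file.

' Sketch: return the lemma of a single word using the two sorted dictionary sheets.
' Assumption: column A holds word forms, column B their lemmas; the 'А-Л' / 'М-Я'
' sheet names follow the alphabetical split used in the formula above.
Function LemmaOf(word As String) As String
    Dim ws As Worksheet, found As Variant
    If word < "м" Then
        Set ws = Worksheets("А-Л")
    Else
        Set ws = Worksheets("М-Я")
    End If
    ' Approximate match (last argument True) relies on the sorted column
    ' and works like a binary search, so it is fast even on a million rows.
    found = Application.VLookup(word, ws.Range("A:B"), 1, True)
    If IsError(found) Then
        LemmaOf = word                  ' word precedes the whole dictionary: keep it as is
    ElseIf found = word Then
        LemmaOf = Application.VLookup(word, ws.Range("A:B"), 2, True)
    Else
        LemmaOf = word                  ' no exact match in the dictionary: keep it as is
    End If
End Function

The giant worksheet formula simply inlines this per-word logic ten times and glues the results back together with spaces.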

However, the huge formula is not the only problem I had to deal with while working on the lemmatizer.


The problem of missing modern words is being solved by adding words collected from various open sources. In particular, by the time this post was published, a base of 300,000 commercial queries had already been collected, which will be checked against the dictionary. Words that are missing from it will be added to the dictionary in the necessary word forms. It may seem that 300 thousand words is not much, but believe me, it is enough to significantly expand Zaliznyak's dictionary.

In addition, the “Robot Recognizer 2” will not contain the errors of the other lemmatizers mentioned above, in which, for example, “Avito” is treated as a word form, the verb “avit” is returned as its lemma, and numerous word forms of this non-existent verb are generated.

P.S. Wishes and bug reports are welcome.”

Now Dmitry is working on another tool that will perform the inverse operation: generate word forms for given words rather than return the lemma. We are awaiting another stream of macros and giant formulas. Along with contextual advertising automation systems, at RealWeb we actively use the Excel recognizer robots: they are a serious help in working with the semantic core, which is necessary for working with the web in general and with online advertising in particular. We are sure that these tools will be useful to you too!

Source: https://habr.com/ru/post/265375/

