
Hola's word-classification contest, or "where do I get one more percent?"

I saw the post about the contest only two weeks after it had started, but the task looked extremely interesting, and I was not mistaken: I dove into the solution head first. In this post I want to share my 80+% solution and my impressions.

My whole participation revolved around the question "where do I get one more percent?", and the answer was usually hundredths of a percent, or nothing at all. So, first things first.

I wrote the solution in C++, because I don't know JS at all, and then translated the finished solution to JS with the help of a colleague.

The first thing I tried was a Bloom filter. I shrank the dictionary: converted all words to lower case, replaced every dictionary word ending in 's with its prefix, and set aside 60 KB for the Bloom filter. As the hash function I chose the linear generator h[i] = 115249 * (h[i-1] + w[i]) + 33391. The resulting accuracy was not great, around 60%, but it was something to start from.
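A minimal sketch of this first version in C++, built around the hash recurrence quoted above. The filter size (60 KB = 491,520 bits), the number of hash functions, and the way each hash is seeded are my assumptions for illustration, not taken from the contest code:

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

class BloomFilter {
public:
    explicit BloomFilter(std::size_t bits, int hashes = 3)
        : bits_(bits), hashes_(hashes), data_((bits + 7) / 8, 0) {}

    void add(const std::string& word) {
        for (int k = 0; k < hashes_; ++k) {
            std::size_t pos = hash(word, k) % bits_;
            data_[pos / 8] |= static_cast<uint8_t>(1u << (pos % 8));
        }
    }

    bool mayContain(const std::string& word) const {
        for (int k = 0; k < hashes_; ++k) {
            std::size_t pos = hash(word, k) % bits_;
            if (!(data_[pos / 8] & (1u << (pos % 8)))) return false;
        }
        return true;  // "possibly a word": false positives are allowed
    }

private:
    // The recurrence from the post: h[i] = 115249 * (h[i-1] + w[i]) + 33391.
    // Deriving several hash functions by varying the initial value is my choice.
    static uint64_t hash(const std::string& w, int seed) {
        uint64_t h = static_cast<uint64_t>(seed) * 0x9E3779B9u;
        for (unsigned char c : w)
            h = 115249u * (h + c) + 33391u;
        return h;
    }

    std::size_t bits_;
    int hashes_;
    std::vector<uint8_t> data_;
};

The dictionary words (lower-cased, with 's stripped) go in via add(), and a test string is answered "word" only if mayContain() returns true.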
Second. I added filtering by rare character combinations: no six vowels in a row, no six consonants in a row, and, if the word is longer than three characters, no three identical letters in a row.
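A sketch of this check, assuming the plain Latin alphabet; whether "y" counts as a vowel is my guess, the post does not say:

#include <cctype>
#include <cstring>
#include <string>

static bool isVowel(char c) {
    return std::strchr("aeiouy", std::tolower(static_cast<unsigned char>(c))) != nullptr;
}

// True if the string trips one of the "rare combination" rules from step two.
bool hasRareCombination(const std::string& w) {
    int vowelRun = 0, consonantRun = 0;
    for (std::size_t i = 0; i < w.size(); ++i) {
        char c = w[i];
        if (std::isalpha(static_cast<unsigned char>(c))) {
            if (isVowel(c)) { ++vowelRun; consonantRun = 0; }
            else            { ++consonantRun; vowelRun = 0; }
        } else {
            vowelRun = consonantRun = 0;
        }
        if (vowelRun >= 6 || consonantRun >= 6) return true;   // 6 vowels/consonants in a row
        if (w.size() > 3 && i >= 2 && w[i] == w[i - 1] && w[i] == w[i - 2])
            return true;                                        // three identical letters in a row
    }
    return false;
}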

Third. I looked at the word/garbage ratio as a function of length and started returning "garbage" for every test string longer than 13 characters.

Fourth. Garbage contains more strings with apostrophes. New rule: anything containing an apostrophe (other than a trailing 's) gets the answer "not a word". After that the accuracy was around ~74%.
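Both of these rules are one-liners; a sketch with the thresholds exactly as quoted above (longer than 13 characters, apostrophe tolerated only as a trailing 's):

#include <string>

bool rejectByLength(const std::string& w) {
    return w.size() > 13;                        // step three: overly long strings are garbage
}

bool rejectByApostrophe(const std::string& w) {
    std::size_t pos = w.find('\'');
    if (pos == std::string::npos) return false;  // no apostrophe at all
    // The only tolerated pattern is a trailing 's, e.g. "dog's".
    bool trailingS = (pos == w.size() - 2 && w.back() == 's');
    return !trailingS;
}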

Fifth. I computed the probabilities of all bigrams and started rejecting words that contain improbable bigrams. I counted bigram frequencies over all dictionary words and used them to estimate whether a test string is a word or a random byte sequence. I used three statistics based on the bigram probabilities: their sum, the logarithm of their product, and the sum of their square roots. I plotted the distributions in gnuplot and got some beautiful pictures.
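A sketch of this bigram model; the table layout (26x26 over 'a'-'z') and the add-one smoothing are my choices, the three statistics are the ones named above:

#include <cmath>
#include <string>
#include <vector>

struct BigramModel {
    double p[26][26] = {};   // bigram probabilities estimated from the dictionary

    void train(const std::vector<std::string>& dictionary) {
        double counts[26][26] = {};
        double total = 0;
        for (const std::string& w : dictionary)
            for (std::size_t i = 0; i + 1 < w.size(); ++i) {
                int a = w[i] - 'a', b = w[i + 1] - 'a';
                if (a < 0 || a >= 26 || b < 0 || b >= 26) continue;
                counts[a][b] += 1;
                total += 1;
            }
        for (int a = 0; a < 26; ++a)
            for (int b = 0; b < 26; ++b)
                p[a][b] = (counts[a][b] + 1) / (total + 26 * 26);  // add-one smoothing (my choice)
    }

    // The three statistics from step five: sum of probabilities,
    // log of their product, and sum of their square roots.
    void score(const std::string& w, double& sum, double& logProd, double& sumRoots) const {
        sum = logProd = sumRoots = 0;
        for (std::size_t i = 0; i + 1 < w.size(); ++i) {
            int a = w[i] - 'a', b = w[i + 1] - 'a';
            if (a < 0 || a >= 26 || b < 0 || b >= 26) continue;
            sum      += p[a][b];
            logProd  += std::log(p[a][b]);
            sumRoots += std::sqrt(p[a][b]);
        }
    }
};

Each statistic is then compared against a hand-tuned threshold to decide "word" or "garbage".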


I was very pleased with the graphs, but in the end this gave only 76.4% after manual tuning of the coefficients. The downside: the bigram tables cost 1400 bytes that could otherwise have gone to the Bloom filter.

Sixth. I built the same kind of graphs for single-letter frequencies and took the same sum / log-product statistics. This did not improve quality: the bigrams already provided good filtering.

Seventh. I noticed that the garbage strings are generated unevenly and repeat much more often than words do. I added a stop list of the top repeats (20 entries). The memory came out of the Bloom filter's budget, but total accuracy grew by 0.1-0.2%. At that point I realized the Bloom filter's potential was not yet fully used.
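A sketch of building that stop list, assuming the repetition counts come from locally logged test traffic (the post does not say exactly how they were gathered):

#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

std::unordered_set<std::string> buildStopList(
        const std::unordered_map<std::string, int>& garbageCounts,  // garbage string -> times seen
        std::size_t topN = 20) {
    std::vector<std::pair<std::string, int>> items(garbageCounts.begin(), garbageCounts.end());
    std::size_t keep = std::min(topN, items.size());
    // Keep only the most frequently repeated garbage strings.
    std::partial_sort(items.begin(), items.begin() + keep, items.end(),
                      [](const std::pair<std::string, int>& a,
                         const std::pair<std::string, int>& b) { return a.second > b.second; });
    std::unordered_set<std::string> stop;
    for (std::size_t i = 0; i < keep; ++i)
        stop.insert(items[i].first);
    return stop;
}

A string found in the stop list can then be answered "not a word" immediately.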

Eighth. I decided to store in the Bloom filter not whole words but only prefixes of length 3-6. There are far fewer of them because of repetitions, so the Bloom filter's accuracy improved a lot: 77.8%. Then I decided I did not need a separate stop list for garbage and could fold it directly into the filter. I took an integer array, added 1 for each dictionary word, subtracted the repetition count for every repeated garbage string, and then set only those bits in the filter where the number stayed positive. I also stopped setting bits for words that fail the bigram or other frequency checks: that frees space in the Bloom filter. At the same time I raised the length threshold from step three. In total this gave 79%+.
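A sketch of that construction: a signed counter per filter slot, +1 for every dictionary prefix, minus the repetition count for garbage prefixes, and a bit is set only where the counter stays positive. The prefix length, the single-hash mapping to a slot, and the shape of the pre-filter hook are simplifications of mine:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

std::vector<uint8_t> buildPrefixFilter(
        const std::vector<std::string>& dictionary,
        const std::unordered_map<std::string, int>& repeatedGarbage,   // garbage string -> repeats
        std::size_t bits, std::size_t prefixLen,
        const std::function<bool(const std::string&)>& alreadyRejected) {
    std::vector<int> counter(bits, 0);
    auto slot = [&](const std::string& w) {
        uint64_t h = 0;
        for (unsigned char c : w.substr(0, prefixLen))
            h = 115249u * (h + c) + 33391u;          // same recurrence as in step one
        return static_cast<std::size_t>(h % bits);
    };
    for (const std::string& w : dictionary) {
        // Words that the bigram/length/apostrophe checks reject anyway would be
        // misclassified regardless, so storing them only wastes filter space.
        if (alreadyRejected(w)) continue;
        counter[slot(w)] += 1;
    }
    for (const auto& entry : repeatedGarbage)
        counter[slot(entry.first)] -= entry.second;  // push heavily repeated garbage below zero
    std::vector<uint8_t> filter((bits + 7) / 8, 0);
    for (std::size_t i = 0; i < bits; ++i)
        if (counter[i] > 0)
            filter[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    return filter;
}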

Ninth. I tried counting trigrams. To save memory I took into account only the first and third characters of each trigram and built the frequencies/graphs the same way as in step five. Accuracy increased by 0.05%, but shrinking the Bloom filter by another 1500 bytes to make room caused a consistent drop in accuracy. In the end I dropped this step.

Tenth. I noticed that bigram probabilities depend on the position in the word. This is most obvious for the last two and the first two characters, so I added the corresponding heuristics, and with them the rule "a word does not end in q". But ideas borrowed from linguistics and word-formation rules (the -ing form, plurals, etc.) only made things worse.
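A sketch of the positional check: separate probability tables for the first and the last bigram (filled the same way as the step-five model, but counted only over the first/last bigrams of dictionary words), plus the hard "does not end in q" rule. The cut-off values here are placeholders, the post does not give the actual thresholds:

#include <string>

bool rejectByPositionalBigrams(const std::string& w,
                               const double firstBigramP[26][26],
                               const double lastBigramP[26][26]) {
    if (!w.empty() && w.back() == 'q') return true;    // "a word does not end in q"
    if (w.size() < 2) return false;
    int a0 = w[0] - 'a', b0 = w[1] - 'a';
    int a1 = w[w.size() - 2] - 'a', b1 = w[w.size() - 1] - 'a';
    if (a0 < 0 || a0 >= 26 || b0 < 0 || b0 >= 26) return false;
    if (a1 < 0 || a1 >= 26 || b1 < 0 || b1 >= 26) return false;
    const double kMinFirst = 1e-4, kMinLast = 1e-4;    // placeholder thresholds
    return firstBigramP[a0][b0] < kMinFirst || lastBigramP[a1][b1] < kMinLast;
}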

Eleventh. I spent a lot of time tuning all the constants in the code. It would probably have been more correct to write some automated tuning tool, but there was not much time left. In the end the Bloom filter size was 502122 bits, it stored prefixes 8 characters long, and the length filter rejected anything with word length > 18. That gave 80.5+%.

Twelfth. Garbage that is not purely random looks like misspelled words. So an idea appeared: introduce one error into the test string and check whether the result is garbage or something resembling a word. I collected a pile of statistics and ran a pile of experiments, but got a zero or negative result. Has anyone managed to make such a check work?

Thirteenth. I paid attention to the unevenness of non-words: they appear either too often or only once (on a sample of reasonable size). So I started saving all requests, and after 1.5 million of them began checking whether the current string had been seen before. If it had never been seen before, it is garbage; if it had already been seen more than 5 times, it is also garbage. And every 10 million words I thinned the array by deleting unique entries so that memory would not run out. In total, each additional million queries gave roughly +1% accuracy: for example, on a sample of 3,500,000 words I got 84%, and by 5,000,000 more than 85%.
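A sketch of this frequency memory; the 1.5 million warm-up, the "more than 5 repeats" rule and the thinning every 10 million queries are the numbers quoted above, the hash-map representation is my choice:

#include <cstdint>
#include <string>
#include <unordered_map>

class QueryHistory {
public:
    // True if the repetition statistics say "garbage"; false means no verdict.
    bool looksLikeGarbage(const std::string& w) {
        ++totalQueries_;
        int prior = 0;
        auto it = counts_.find(w);
        if (it != counts_.end()) prior = it->second;
        counts_[w] = prior + 1;

        bool verdict = false;
        if (totalQueries_ > 1500000) {       // only trust the statistics after 1.5M queries
            if (prior == 0) verdict = true;  // never seen before -> garbage
            if (prior > 5)  verdict = true;  // already repeated more than 5 times -> garbage
        }
        if (totalQueries_ % 10000000 == 0) thin();
        return verdict;
    }

private:
    void thin() {                            // drop strings seen only once to bound memory
        for (auto it = counts_.begin(); it != counts_.end(); ) {
            if (it->second == 1) it = counts_.erase(it);
            else ++it;
        }
    }

    uint64_t totalQueries_ = 0;
    std::unordered_map<std::string, int> counts_;
};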

By then the contest was coming to an end, my eye started twitching every time accuracy dropped by 1-5% instead of growing, and I had no really good ideas left. I wonder: is it possible to score 83-85% without the thirteenth step? With every tenth and hundredth of a percent squeezed out, those numbers seem more and more real, yet still far away. As far away as 80% seemed on the second day.

The final solution is here: github.com/knst/word-classifier

For the JS version you need to prepare the data; after running the C++ program, do:

$ cat bigrams.bin bloom.bin > data && gzip data 

Source: https://habr.com/ru/post/302256/

