General greetings.
Recently, for polishing a morphological dictionary capable of (supposedly) generating all possible forms of a word from an infinitive - I needed a rather voluminous frequency dictionary of the Russian language. The frequency dictionary is a very simple thing, the words in it are ordered by the frequency with which they are found in the analyzed text.
The problem seems to be very simple and certainly solved when studying programming in the first rows. But for the analysis of such a cumbersome library, and the library I use extends to 157 hectares, the means of an average home computer are lacking. More precisely, the library is stored in fifty zip archives of 0.5 - 2 GB, each of which contains 20-30 thousand works in the fb2 format. In compressed form, it weighs 60 GB.
Solved the problem in the c # language. The result should be obtained in adequate time, for which I chose no more than 8 hours, so that you can start the performance in the evening and get the result in the morning.
')
Finding a solution
Array
The most obvious method of solving what is called “head-on” is two arrays, in the first - words, in the second - a number. Having met a new word, we add it to the first array, if it is missing there or we add one to the index in the second array, if we have found the word in the first. After testing this option, I immediately became disappointed in it, after a few hours the program creaked crept through the first half of the first archive. Any professional programmer probably already laughs at this approach. However, I confess - I did not even imagine that a trick was waiting for me.
Then I began to look for - where is the bottleneck that does not allow the program to breathe. A bottleneck appeared at the moment of adding a new word. To keep an array ordered, the word has to be inserted somewhere in the middle, and sometimes at the very beginning. To do this, you have to move each element of the array located to the right of the selected position to the right. When it comes to millions of words, this activity becomes very tedious for the processor and it takes revenge by postponing the completion of the program for the weeks ahead.
Ordered list
I had to look for a data structure that would allow each element to be added simply to the end of this structure, but at the same time would allow them to be streamlined. Then I stumbled upon ordered lists. A piece of data and a pointer to another cell are stored in the cell of the ordered list. Everything is excellent, we just add a new word and change the 2 indexes, the index of the previous word, indicating the given one and the index of this word, indicating the following. But, here's a bad luck, the search in such a list is very computationally demanding. If in an ordered array we could start the search from the middle and divide it in half for one iteration, then in the ordered list we have to climb the pointers from one to another, like a thread, until we find the right place in the whole coil. I did not try to try this option. After learning from a previous failure, I immediately caught on an ambush.
Binary search tree
I found the following data structure only after a few hours. I came across binary search trees. Having read a bit about various options, I stopped at the self-balancing AVL tree created by the way Soviet scientists Adelson-Velsky Georgy Maksimovich and Landis Evgeny Mikhailovich, and inherited their name from their names.
The structure of a binary tree is similar to an ordered list, but each element, except for a few terminal (so-called leaves) refers to two elements. The root element is the one that would be in the middle of an ordered array. He refers to the element smaller (left) and larger (right), in addition - guaranteed that all elements of the left branch will be less than this, and all elements of the right branch - more. Having considered the left or right element to which we arrived - we will see the same hierarchy, it also refers to two elements, the left branch is also smaller, the right one - more. To understand the method of balancing such a tree, it is best to read the relevant article, for example, here:
AVL-trees . It is important to note that the binary search tree fully satisfied my requirements. Quick search and quick add a new item.
Result
In the end, after spending a few more hours on optimization, I received a multi-threaded application that unpacks the archive into RAM, counts the frequency of various words and processes the data using the AVL tree.
Here are a few lines on the results of the program, on the left - the word, on the right - the frequency. These are the most common words of various lengths:
- I 34361402
- a 36192092
- from 52479822
- at 126422246
- and 158458227
...
- on 23978868
- he is 32602346
- then 42163693
- at 83001625
- not 97183097
...
- all 19576264
- that's 21318968
- his 27719894
- like 30228770
- that 63106338
...
- even 6789652
- was 6832494
- if 7750404
- I am 12381969
- was 15450767
...
- maybe 5455561
- very 5477013
- time 6317279
- when 9788599
- to 9987225
...
- misanthropy 296
- highly skilled 350
- Excellency 384
- Excellencies 489
- Excellency 3739
...
- misanthropic 46
- misanthropic 52
- private business 60
- Excellency 91
- national socialist 96
In total, 9.5 million words were received, the analysis lasted 8482 seconds, the average processing speed of the unpacked text was 18.57 mb / s.
Now you can use the data to refine my morphological dictionary, and having acquired a dictionary you can improve the “frequency analyzer”, since It will be possible to group the words with the same root. In addition, the work requires the work of a “frequency analyzer” with various encodings, etc. Next - parse sentences. In the end, I want to get some adequate semantic network.
Despite the fact that my “report” affects both the topic of programming and linguistics - please do not blame me for inaccuracies in writing (especially punctuation) or not the smoothest solution to the problem, but please indicate these errors or suggest more elegant solutions, I will be glad.
Thanks to all.