In modern computational linguistics, bigrams, or more generally n-grams, are an important statistical tool. In this article we will describe the difficulties you can run into when computing bigrams on a large corpus of texts and present an algorithm that can be used on any home computer.
A bigram is a pair of words that are adjacent in a text or, in our case, in a corpus of texts. Take, for example, the sentence:
It was a hot summer .
with tokens numbered It(1) was(2) a(3) hot(4) summer(5) .(6) (the space before the period is not a typo or error).
It contains the following bigrams:
- It was
- was a
- a hot
- hot summer
- summer .
Strictly speaking, bigrams consist not of words but of tokens. A token, i.e. an indivisible unit within our task, is either a word or a punctuation mark. Tokenization is not a simple topic in itself, so we will assume that our corpus is already tokenized and split into sentences, and that splitting a sentence by spaces is enough to turn it into a list of tokens.
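Under this split-by-space assumption, extracting the bigrams of our example sentence takes just a few lines of Java (a minimal sketch; the class name is our own):

public class TokenizeExample {
    public static void main(String[] args) {
        // The corpus is assumed to be pre-tokenized, so splitting by a single
        // space is enough to turn a sentence into tokens.
        String sentence = "It was a hot summer .";
        String[] tokens = sentence.split(" ");
        // Each pair of adjacent tokens is a bigram.
        for (int i = 0; i + 1 < tokens.length; i++) {
            System.out.println(tokens[i] + " " + tokens[i + 1]);
        }
    }
}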
Our task is to produce a list like this:
- It was 190360
- was a 1017
- hot summer 2621
- summer . 42146
- ...
where each number shows how many times the bigram occurs in the whole corpus.
Now and then in this article we allow ourselves to use the term word pair as a synonym for bigram.
In this article we intentionally omit most implementation details: the approach is interesting in itself and is easy to implement in your favorite programming language. Besides, the implementation contains enough interesting details to deserve a separate article. The few explanatory code remarks will be given in Java.
Naive approach
- iterate over the corpus
- extract the bigrams from each sentence
- count them using a multiset data structure (in Java, Guava's Multiset<String>, or ConcurrentHashMultiset<String> in the multithreaded version)
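A minimal sketch of this naive counter in Java might look as follows (Guava's ConcurrentHashMultiset is assumed; the class and method names are our own illustration):

import com.google.common.collect.ConcurrentHashMultiset;
import com.google.common.collect.Multiset;

import java.util.List;

// Naive in-memory counter: every bigram of every sentence goes into one multiset.
public class NaiveBigramCounter {

    private final Multiset<String> counts = ConcurrentHashMultiset.create();

    // May be called from several threads, one tokenized sentence at a time.
    public void addSentence(List<String> tokens) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            counts.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
    }

    public int count(String bigram) {
        return counts.count(bigram);
    }
}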
On a relatively small and clean corpus this may work, but in the general case the naive approach will run out of memory before you manage to process the entire collection of texts.
Adding intermediate cut-offs
The naive approach is easy to turn into a working one with a small modification:
- iterate over the corpus
- extract the bigrams from each sentence
- count them using a multiset
- as soon as memory starts running out, clean up the counter by deleting the bigrams that have so far occurred fewer than a threshold number of times
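The cut-off step itself can be sketched like this (again with a Guava multiset; the threshold value and all names are our own):

import com.google.common.collect.ImmutableList;
import com.google.common.collect.Multiset;

// Drop every bigram whose count is still below the threshold.
// Iterating over a copy of the element set avoids modifying the multiset
// while it is being traversed.
public class BigramPruner {

    public static void prune(Multiset<String> counts, int threshold) {
        for (String bigram : ImmutableList.copyOf(counts.elementSet())) {
            if (counts.count(bigram) < threshold) {
                counts.setCount(bigram, 0);   // removes all occurrences of this bigram
            }
        }
    }
}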
This modification yields a fully working algorithm, but the cut-offs bring one problem: along with useless noise such as occasional typos and errors, we also delete a lot of useful information about rare words. For example, if a lexeme (a set of word forms) occurs 1000 times in the corpus, each of its individual word forms may occur, say, fewer than 200 times, to say nothing of the word pairs that contain them.
Our task is to count the bigrams as honestly as possible, without intermediate cut-offs.
Using the disk as temporary memory
RAM is relatively inexpensive these days, but still not free. Moreover, in many laptop models you simply cannot install more than 16 gigabytes, however much you may want to: it is a platform restriction. Disk space is not a problem: it costs far less, and you can always plug in an external drive.
The mention of the tags #hard_disk and #algorithm brings to mind merge sort and the merging of sorted lists, which many of us once wrote in Pascal at school. The ideas behind these algorithms are well suited to our problem.
The basic scheme of the solution
Before getting into the details, here is the overall scheme of the solution:
- Split the corpus into blocks of roughly equal size, say 1 gigabyte each.
- Count the bigrams separately for each block, sort them in lexicographic order and write them to disk.
- Using the merge algorithm for sorted lists, merge the per-block bigram files into one, summing the counts of matching bigrams.
The block size can be chosen to taste (read: according to the amount of installed RAM), but for most tasks a size of about a gigabyte is more than convenient. You can also work with a monolithic corpus by having the program itself cut blocks based on the amount of text processed, flushing the result to disk and clearing the data structures.
Having looked at the algorithm from a bird's-eye view, we can move on to the details.
Counting bigrams for each block
To build a good architecture for the bigram counter, let us take two important requirements into account:
- We want to count in several threads.
- The output must be a list of bigrams sorted in lexicographic order.
It turns out that these two requirements can be addressed together effectively. Instead of using a single-level map (a multiset is essentially a key-to-counter map)
ConcurrentHashMultiset<String>
it is proposed to count bigrams with a two-level map:
ConcurrentMap<String, ConcurrentHashMultiset<String>>
Experiments show that multithreaded counting of bigrams takes roughly the same time with either data structure, while sorting is much faster with the two-level map, because the keys of the outer and the inner maps can be sorted independently.
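A minimal sketch of such a per-block counter might look like this (Guava's ConcurrentHashMultiset again; the tab-separated output format and all names are our own assumptions, not the original implementation):

import com.google.common.collect.ConcurrentHashMultiset;

import java.io.IOException;
import java.io.Writer;
import java.util.List;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Two-level map: first token -> multiset of the tokens that follow it.
public class BlockBigramCounter {

    private final ConcurrentMap<String, ConcurrentHashMultiset<String>> counts =
            new ConcurrentHashMap<>();

    // Called from several threads, one tokenized sentence at a time.
    public void addSentence(List<String> tokens) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            counts.computeIfAbsent(tokens.get(i), k -> ConcurrentHashMultiset.create())
                  .add(tokens.get(i + 1));
        }
    }

    // Write the block in lexicographic order: the outer and the inner keys
    // are sorted independently, which is what makes this layout fast to sort.
    public void dump(Writer out) throws IOException {
        for (String first : new TreeSet<>(counts.keySet())) {
            ConcurrentHashMultiset<String> seconds = counts.get(first);
            for (String second : new TreeSet<>(seconds.elementSet())) {
                out.write(first + " " + second + "\t" + seconds.count(second) + "\n");
            }
        }
    }
}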
Another major advantage of the two-level map is that it lets you apply additional filtering to the word pairs very quickly: for example, check that the words are present in a dictionary, or even normalize them (reduce inflected word forms to their dictionary form). Performing these operations before identical bigrams have been grouped together is very costly, because the same operation would be executed many times.
Combining results for all blocks
So, the output of the previous step is a set of files with records of the form:
bigram1 count1
bigram2 count2
...
bigramN countN
where the keys are sorted in lexicographic order. Our task is to merge these files into one, summing the counts of matching keys. We will consider the task of merging two files trivial and leave it without further explanation.
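For reference, a sketch of one possible pairwise merge is shown below; it assumes each line is a tab-separated bigram and count, and all names are our own:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Merge two bigram files sorted by key, summing the counts of matching keys.
public class TwoFileMerger {

    public static void merge(Path left, Path right, Path out) throws IOException {
        try (BufferedReader a = Files.newBufferedReader(left, StandardCharsets.UTF_8);
             BufferedReader b = Files.newBufferedReader(right, StandardCharsets.UTF_8);
             BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {

            String la = a.readLine();
            String lb = b.readLine();
            while (la != null || lb != null) {
                String ka = la == null ? null : la.substring(0, la.indexOf('\t'));
                String kb = lb == null ? null : lb.substring(0, lb.indexOf('\t'));
                int cmp = ka == null ? 1 : kb == null ? -1 : ka.compareTo(kb);
                if (cmp < 0) {                        // key present only in the left file
                    w.write(la);
                    w.newLine();
                    la = a.readLine();
                } else if (cmp > 0) {                 // key present only in the right file
                    w.write(lb);
                    w.newLine();
                    lb = b.readLine();
                } else {                              // matching keys: sum the counts
                    long sum = count(la) + count(lb);
                    w.write(ka + "\t" + sum);
                    w.newLine();
                    la = a.readLine();
                    lb = b.readLine();
                }
            }
        }
    }

    private static long count(String line) {
        return Long.parseLong(line.substring(line.indexOf('\t') + 1));
    }
}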
The general problem of merging all the files could be solved with an accumulator file, by adding the individual block files to it one after another:
((((((N1 + N2) + N3) + N4) + N5) + N6) + N7)...
However, this approach is inefficient: after a certain number of iterations we will be adding relatively small block files to a relatively large accumulator and spending most of the time reading data from and writing it back to disk. It is much more profitable to build a strategy in which blocks of approximately equal size are merged at each iteration, because matching keys collapse into a single record and the resulting file is smaller than the sum of the two originals.
The skeleton of merge sort, implemented with recursion, is an excellent fit here. Schematically, for 15 files it looks like this (for the merge function, the first index is inclusive, the second exclusive):
|_ merge(0, 15) = merge(0, 7) + merge(7, 15)
   |_ merge(0, 7) = merge(0, 3) + merge(3, 7)
      |_ merge(0, 3) = 0 + merge(1, 3)
         |_ merge(1, 3) = 1 + 2
      |_ merge(3, 7) = merge(3, 5) + merge(5, 7)
         |_ merge(3, 5) = 3 + 4
         |_ merge(5, 7) = 5 + 6
   |_ merge(7, 15) = merge(7, 11) + merge(11, 15)
      |_ merge(7, 11) = merge(7, 9) + merge(9, 11)
         |_ merge(7, 9) = 7 + 8
         |_ merge(9, 11) = 9 + 10
      |_ merge(11, 15) = merge(11, 13) + merge(13, 15)
         |_ merge(11, 13) = 11 + 12
         |_ merge(13, 15) = 13 + 14
As a result, the algorithm performs the same 14 merges, but it does so much more efficiently than the accumulator variant. The merge algorithm theoretically requires O(1) memory; in practice, memory is allocated only for the read and write buffers.
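A recursive implementation of this scheme might look like the following sketch, which delegates the actual file merging to a helper such as the TwoFileMerger above (all names are our own):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Recursive pairwise merging of the block files. Indices follow the scheme
// above: the first index is inclusive, the second exclusive. Assumes at
// least one block file.
public class RecursiveMerger {

    public static Path merge(List<Path> blocks, int from, int to) throws IOException {
        if (to - from == 1) {
            return blocks.get(from);               // a single block: nothing to merge
        }
        int mid = (from + to) / 2;                 // e.g. merge(0, 15) -> merge(0, 7) + merge(7, 15)
        Path left = merge(blocks, from, mid);
        Path right = merge(blocks, mid, to);
        Path out = Files.createTempFile("bigrams-", ".merged");
        TwoFileMerger.merge(left, right, out);     // sum the counts of matching bigrams
        return out;
    }
}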
Finally
Using the approach described above, you can collect not only bigrams over the corpus but n-grams for any n, although you may have to use smaller blocks and flush intermediate results to disk more often.
As we said at the beginning, the implementation details deserve a separate discussion, and we will describe them in the next article.