This article describes how to generate pseudo-text based on a trigram model. The resulting text is hardly usable for anything practical; nevertheless, it is a good illustration of how statistical methods can be applied to natural language processing. A working example of the generator can be found here.
Dry theory
So, our task is to generate text. This means we need to take words and arrange them in a certain order. How do we determine this order? One approach is to build the phrases that are most probable for the Russian language. But what does the probability of a phrase even mean? From a common-sense point of view it sounds like nonsense. Nevertheless, this probability can be formally defined as the probability of a given sequence of words occurring in a certain corpus (a collection of texts).
For example, the probability of the phrase “happiness is pleasure without repentance” can be calculated as the product of the probabilities of each of its words, each conditioned on the words that precede it:
P = P(happiness) P(is | happiness) P(pleasure | happiness is) P(without | happiness is pleasure) P(repentance | happiness is pleasure without)
Calculating the probability P(happiness) is simple: we only need to count how many times this word occurs in the text and divide that count by the total number of words. But calculating the probability P(repentance | happiness is pleasure without) is no longer so easy. Fortunately, the task can be simplified. Let us assume that the probability of a word in the text depends only on the previous word. Then the formula for the phrase takes the following form:
P = P(happiness) P(is | happiness) P(pleasure | is) P(without | pleasure) P(repentance | without)
That is already easier. Computing the conditional probability P(is | happiness) is straightforward: count how many times the pair 'happiness is' occurs in the text and divide by the number of occurrences of the word 'happiness':
P(is | happiness) = C(happiness is) / C(happiness)
As a result, by counting all pairs of words in a text, we can calculate the probability of an arbitrary phrase. And if we can calculate the probability of a phrase, we can choose the most “plausible” word combinations in the text for automatic generation. By the way, such pairs of words are called bigrams, and the set of computed probabilities is called a bigram model.
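To make the counting concrete, here is a minimal sketch of a bigram probability computed with collections.Counter (the toy token list and variable names are illustrative only and not part of the article's code):

from collections import Counter

# toy token sequence; in the article it would come from gen_tokens()
tokens = ['happiness', 'is', 'pleasure', 'without', 'repentance', '.']

unigrams = Counter(tokens)                  # C(word)
bigrams = Counter(zip(tokens, tokens[1:]))  # C(word1 word2)

# P(is | happiness) = C(happiness is) / C(happiness)
p = bigrams[('happiness', 'is')] / float(unigrams['happiness'])
print(p)  # 1.0 for this tiny example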
I want to note that our algorithm uses not bigrams but trigrams. The difference is that the conditional probability of a word is determined not by one, but by the two previous words. That is, instead of P(pleasure | is) we will calculate P(pleasure | happiness is). The calculation formula is similar to the one for bigrams:
P(pleasure | happiness is) = C(happiness is pleasure) / C(happiness is)
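For example, with purely made-up counts: if the pair 'happiness is' occurred 40 times in the corpus and the triple 'happiness is pleasure' occurred 10 times, then
P(pleasure | happiness is) = 10 / 40 = 0.25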
So, text generation can be carried out as follows. We form each sentence separately by performing these steps:
* choose the most probable word to start the sentence;
* choose the most probable continuation word given the two previous words;
* repeat the previous step until we reach an end-of-sentence symbol.
Practice
To begin with, we need to prepare the corpus on which we will train our model. I, for example, took a selection of Leo Tolstoy's works from lib.ru and combined them into a single text file (you can download it here).
From this text we extract the sequence of words we need (note that the code in this article is written for Python 2).
import re

r_alphabet = re.compile(u'[а-яА-Я0-9-]+|[.,:;?!]+')

def gen_lines(corpus):
    data = open(corpus)
    for line in data:
        yield line.decode('utf-8').lower()

def gen_tokens(lines):
    for line in lines:
        for token in r_alphabet.findall(line):
            yield token

lines = gen_lines('tolstoy.txt')
tokens = gen_tokens(lines)
The resulting tokens generator produces a “cleaned” sequence of words and punctuation marks. However, a plain sequence by itself is not what we need. We are interested in triples of tokens (here a token means a word or a punctuation mark, i.e. an atomic element of the text). To get them, we add another generator that yields three successive tokens at a time.
import re

r_alphabet = re.compile(u'[а-яА-Я0-9-]+|[.,:;?!]+')

def gen_lines(corpus):
    data = open(corpus)
    for line in data:
        yield line.decode('utf-8').lower()

def gen_tokens(lines):
    for line in lines:
        for token in r_alphabet.findall(line):
            yield token

def gen_trigrams(tokens):
    t0, t1 = '$', '$'
    for t2 in tokens:
        yield t0, t1, t2
        if t2 in '.!?':
            yield t1, t2, '$'
            yield t2, '$', '$'
            t0, t1 = '$', '$'
        else:
            t0, t1 = t1, t2

lines = gen_lines('tolstoy.txt')
tokens = gen_tokens(lines)
trigrams = gen_trigrams(tokens)
The gen_trigrams function deserves some explanation. The '$' characters mark the beginning of a sentence; later on, this makes it easier to pick the first word of a generated phrase. The function works as follows: it yields three consecutive tokens at a time, shifting by one token at each iteration:
Input:
'Happiness is pleasure without repentance'
Output:
iteration    tokens
0:  $ $ happiness
1:  $ happiness is
2:  happiness is pleasure
3:  is pleasure without
...
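If you want to see this sliding window on your own corpus, a quick check might look like this (a usage sketch; itertools.islice is used here only to limit the output and is not part of the article's code):

from itertools import islice

lines = gen_lines('tolstoy.txt')
tokens = gen_tokens(lines)
trigrams = gen_trigrams(tokens)

# print the first few trigrams to see the window shift one token at a time
for t0, t1, t2 in islice(trigrams, 5):
    print t0, t1, t2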
Next, we calculate the trigram model:
import re
from collections import defaultdict

r_alphabet = re.compile(u'[а-яА-Я0-9-]+|[.,:;?!]+')

...

def train(corpus):
    lines = gen_lines(corpus)
    tokens = gen_tokens(lines)
    trigrams = gen_trigrams(tokens)

    bi, tri = defaultdict(lambda: 0.0), defaultdict(lambda: 0.0)

    for t0, t1, t2 in trigrams:
        bi[t0, t1] += 1
        tri[t0, t1, t2] += 1

    model = {}
    for (t0, t1, t2), freq in tri.iteritems():
        if (t0, t1) in model:
            model[t0, t1].append((t2, freq / bi[t0, t1]))
        else:
            model[t0, t1] = [(t2, freq / bi[t0, t1])]
    return model

model = train('tolstoy.txt')
In the first part of this function we set up the generators. Next, we count the bigrams and trigrams (that is, the number of identical pairs and triples of tokens in the text). Then, for each pair of preceding tokens, we compute the probability of the following word and store that word together with its probability in the dictionary. I should say this is not the most efficient approach, since it consumes a significant amount of memory, but for small corpora it is quite sufficient.
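To make the structure of the resulting dictionary concrete, here is roughly what a single entry might look like after training (the words and probabilities below are invented purely for illustration):

# the key is the pair of preceding tokens, the value is a list of
# (next_token, probability) pairs
model['$', '$']
# -> [(u'he', 0.021), (u'she', 0.014), (u'happiness', 0.0003), ...]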
Now we are ready to generate the text. The following function returns a sentence.
...

model = train('tolstoy.txt')

def generate_sentence(model):
    phrase = ''
    t0, t1 = '$', '$'
    while 1:
        t0, t1 = t1, unirand(model[t0, t1])
        if t1 == '$':
            break
        if t1 in ('.!?,;:') or t0 == '$':
            phrase += t1
        else:
            phrase += ' ' + t1
    return phrase.capitalize()
The essence of the function is that we successively pick probable words or punctuation marks until we hit the marker for the beginning of the next phrase (the '$' symbol). The first word is drawn from the set model['$', '$'], i.e. from the words likely to start a sentence.
An important point needs to be noted here. For each pair of words, the model dictionary contains a list of (word, probability) pairs, and we need to choose just one word from that list. The head-on option is to always pick the word with the maximum probability, but then all generated phrases would look alike. A better way is to choose words with a degree of randomness that depends on their probability (we also do not want our phrases to consist mostly of rare combinations). This is what the unirand function does: it returns a random word, chosen with a probability equal to that word's probability given the two previous words.
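To get a feel for what such a weighted choice does, here is a small self-contained sketch (the function name weighted_choice and the toy distribution are made up for illustration; the article's actual unirand appears in the full listing below):

from random import uniform
from collections import Counter

def weighted_choice(pairs):
    # pick an item with probability proportional to its weight,
    # the same idea as unirand in the listing below
    total = sum(freq for item, freq in pairs)
    rnd = uniform(0, total)
    acc = 0.0
    for item, freq in pairs:
        acc += freq
        if rnd < acc:
            return item
    return pairs[-1][0]  # guard against floating-point edge cases

counts = Counter(weighted_choice([('a', 0.7), ('b', 0.2), ('c', 0.1)])
                 for i in range(10000))
print(counts)  # roughly 7000 'a', 2000 'b', 1000 'c'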
Putting it all together, the full code of our generator looks like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
from random import uniform
from collections import defaultdict

r_alphabet = re.compile(u'[а-яА-Я0-9-]+|[.,:;?!]+')

def gen_lines(corpus):
    data = open(corpus)
    for line in data:
        yield line.decode('utf-8').lower()

def gen_tokens(lines):
    for line in lines:
        for token in r_alphabet.findall(line):
            yield token

def gen_trigrams(tokens):
    t0, t1 = '$', '$'
    for t2 in tokens:
        yield t0, t1, t2
        if t2 in '.!?':
            yield t1, t2, '$'
            yield t2, '$', '$'
            t0, t1 = '$', '$'
        else:
            t0, t1 = t1, t2

def train(corpus):
    lines = gen_lines(corpus)
    tokens = gen_tokens(lines)
    trigrams = gen_trigrams(tokens)

    bi, tri = defaultdict(lambda: 0.0), defaultdict(lambda: 0.0)

    for t0, t1, t2 in trigrams:
        bi[t0, t1] += 1
        tri[t0, t1, t2] += 1

    model = {}
    for (t0, t1, t2), freq in tri.iteritems():
        if (t0, t1) in model:
            model[t0, t1].append((t2, freq / bi[t0, t1]))
        else:
            model[t0, t1] = [(t2, freq / bi[t0, t1])]
    return model

def generate_sentence(model):
    phrase = ''
    t0, t1 = '$', '$'
    while 1:
        t0, t1 = t1, unirand(model[t0, t1])
        if t1 == '$':
            break
        if t1 in ('.!?,;:') or t0 == '$':
            phrase += t1
        else:
            phrase += ' ' + t1
    return phrase.capitalize()

def unirand(seq):
    sum_, freq_ = 0, 0
    for item, freq in seq:
        sum_ += freq
    rnd = uniform(0, sum_)
    for token, freq in seq:
        freq_ += freq
        if rnd < freq_:
            return token

if __name__ == '__main__':
    model = train('tolstoy.txt')
    for i in range(10):
        print generate_sentence(model),
Great, you made it to the end. I envy your patience :).
Why trigrams
The trigram model was chosen for simplicity and clarity. Bigrams would give rather poor results, while 4-grams would require significantly more resources. In any case, it is fairly easy to extend this algorithm to the general case of N-grams (see the sketch below). Keep in mind, however, that the larger N is, the more your generated text will resemble the original corpus.
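As one possible way to make that extension, the trigram generator above could be rewritten to keep a sliding context of n - 1 tokens instead of two. This is only a sketch under that assumption: the name gen_ngrams and the parameter n are not part of the original code, and train and generate_sentence would also have to key the model on tuples of n - 1 tokens instead of pairs.

def gen_ngrams(tokens, n=3):
    # keep a sliding context of the n - 1 previous tokens, padded with '$'
    context = ['$'] * (n - 1)
    for token in tokens:
        yield tuple(context) + (token,)
        if token in '.!?':
            # emit the sentence-end padding, as gen_trigrams does for n = 3
            tail = context[1:] + [token]
            while tail:
                yield tuple(tail) + ('$',) * (n - len(tail))
                tail = tail[1:]
            context = ['$'] * (n - 1)
        else:
            context = context[1:] + [token]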