We compare two approaches to text generation with neural networks, Char-RNN vs. word embeddings, with some funny examples at the end. When there is absolutely nothing left to read, I don't feel like opening a book, all the articles on Habr have been read, all the notifications on my phone have been dealt with, and even the spam folders have been checked, I open Lenta.ru. My wife, a professional journalist, becomes allergic at such moments, and it's clear why: after the old team left Lenta in 2014, the tabloid level of the publication went up, while the quality of the writing and editing went down. Over time, still reading Lenta out of inertia, I began to notice that the headline templates repeat themselves: "Found [insert pseudo-sensation]", "Putin [did something]", "Unemployed Muscovite [description of his adventures]", and so on. That was the first piece of background.
The second piece of background: I recently stumbled upon a funny domestic counterpart of @DeepDrumph (a Twitter account that posts phrases generated by a neural network trained on Trump's official Twitter): @neuromzan. Unfortunately, the author has stopped posting new tweets and gone quiet, but a description of the idea is preserved here.
And so the idea came: why not do the same thing, but based on Lenta.ru headlines? It could turn out to be no less amusing, given the level of absurdity of some of the real headlines on that resource.
Note: the task considered here is generating a continuation of some seed text. This is not text summarization, where, for example, a headline is generated from the body of a news item; the news text itself is not used here at all.
Data
The prospect of downloading and parsing all of Lenta's content did not appeal to me at all, so I started looking for someone who had already done it. And I got lucky: just a few days before I had this idea, Ildar Gabdrakhmanov (ildarchegg) published a post describing how he scraped Lenta's content, and shared the full archive. At the end he adds: "I hope someone finds this data interesting and can put it to use." Of course! I already have a use for it! Thank you, Ildar, your efforts saved me several days!
So, we take this archive and pull out the articles for the period we need: from April 1, 2014 (when the old team left) to the present. I won't dwell on the preprocessing; for those who care, the repository on GitHub contains a separate notebook with a step-by-step description of the process.
The first training attempt - Char-RNN
Judging by the article mentioned above, the author of neuromzan used the Char-RNN architecture described by Andrej Karpathy in his now-legendary article "The Unreasonable Effectiveness of Recurrent Neural Networks". Let's try to solve our problem with this approach. Char-RNN implementations for various languages and frameworks are easy to find online. I took this one, a fairly compact Keras implementation. Having tweaked it a bit to fit my needs (changed the hyperparameters, added saving of results and more informative output), I started training.
And the results did not please me. The idea of Char-RNN is that the network learns to predict the next character from the N previous ones: a seed text is fed to the network, and the continuation is generated character by character (a minimal sketch of such a model follows the list below). Therefore, in order to learn to write relatively readable text, the network must:
- Learn to separate sequences of characters (pseudo-words) with spaces and occasionally end a phrase with a period;
- Learn to generate character sequences that resemble real words;
- Learn to generate words that fit with the preceding almost-real words.
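For reference, here is a minimal sketch of such a character-level model in Keras, roughly matching the setup used next (two LSTM layers of 64 cells). The file name headlines.txt, the window length and the other hyperparameters are illustrative assumptions, not the exact code from the repository:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Assumed input: all headlines concatenated into one text file (hypothetical name)
text = open("headlines.txt", encoding="utf-8").read()
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}

SEQ_LEN = 40  # N previous characters the network sees
STEP = 3      # stride between training windows

# Overlapping windows of SEQ_LEN characters, each labelled with the character that follows
windows, next_chars = [], []
for i in range(0, len(text) - SEQ_LEN, STEP):
    windows.append(text[i:i + SEQ_LEN])
    next_chars.append(text[i + SEQ_LEN])

# One-hot encode the characters
X = np.zeros((len(windows), SEQ_LEN, len(chars)), dtype=np.float32)
y = np.zeros((len(windows), len(chars)), dtype=np.float32)
for i, window in enumerate(windows):
    for t, c in enumerate(window):
        X[i, t, char_to_idx[c]] = 1.0
    y[i, char_to_idx[next_chars[i]]] = 1.0

# Two LSTM layers of 64 cells each, softmax over the character vocabulary
model = Sequential()
model.add(LSTM(64, return_sequences=True, input_shape=(SEQ_LEN, len(chars))))
model.add(LSTM(64))
model.add(Dense(len(chars), activation="softmax"))
model.compile(loss="categorical_crossentropy", optimizer="adam")

model.fit(X, y, batch_size=128, epochs=50)
```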
Here is what a two-layer network with 64 LSTM cells per layer produced, trained on a limited set of headlines (<800). In brackets is the seed text given to the network, which then generates the next 150 characters:
(Cyrillic samples omitted: strings of pseudo-words separated by spaces and periods, shown for epochs 1 through 5, 25 and 50; only fragments such as "7,2" and "2018" are recognizable.)
I didn't train it any further. The output is gibberish. Yes, it resembles phrases: there are words separated by spaces and ending in periods, and some of the "words" are even real Russian words. But it is unreadable, and it is nothing like the beautiful, absurd lines that inspired me in the first place. Changing the hyperparameters had virtually no effect on quality, and training on the full dataset did not improve the results much either. Perhaps those folks trained their networks much longer and made them deeper and wider, but I decided to try a different approach.
Alternative method - Word Embeddings
I remembered that a little less than a year ago, in the Udacity Deep Learning Foundation course, I had an assignment to write a network that generates a script for The Simpsons (more specifically, a scene in Moe's Tavern), using original scripts of similar scenes from 25 seasons of the show as training data. That project used a different approach, word embeddings: the text is generated not character by character but word by word, again based on the probability distribution of the next word given the previous [1..N] words.
This approach has several significant advantages over Char-RNN (a sketch of such a word-level model follows the list):
- The network does not need to learn to generate words: words are themselves the atomic elements, so the generated phrases consist only of words from our corpus from the start, and there are no made-up words;
- The words within a phrase agree with each other better, because the network learns which sequences actually occur: for the word set {"went", "in", "on", "car"}, the sequence ["went", "on", "car"] is more likely than ["went", "in", "car"];
- Training is faster: on the same corpus the tokens for Char-RNN are characters, while for word embeddings they are words. There are obviously fewer words in a text than characters, and to the machine a token is just an index into a vocabulary in either case, so one pass over the data processes far fewer tokens and takes roughly proportionally less time.
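Here is a minimal sketch of such a word-level model in Keras, again with two LSTM layers of 64 cells. The file name headlines_tokenized.txt, the embedding size and the sequence length are illustrative assumptions, not the exact code from the repository:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# Assumed input: headlines already tokenized into whitespace-separated words (hypothetical file name)
tokens = open("headlines_tokenized.txt", encoding="utf-8").read().split()
vocab = sorted(set(tokens))
word_to_id = {w: i for i, w in enumerate(vocab)}
ids = [word_to_id[w] for w in tokens]

SEQ_LEN = 10  # how many previous words the network sees

# Windows of SEQ_LEN word ids, each labelled with the id of the next word
X = np.array([ids[i:i + SEQ_LEN] for i in range(len(ids) - SEQ_LEN)])
y = np.array([ids[i + SEQ_LEN] for i in range(len(ids) - SEQ_LEN)])

model = Sequential()
# The Embedding layer maps every word index to a dense vector,
# so the atomic token is a word, not a character
model.add(Embedding(input_dim=len(vocab), output_dim=128, input_length=SEQ_LEN))
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(64))
model.add(Dense(len(vocab), activation="softmax"))
# Integer targets, hence sparse categorical cross-entropy
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

model.fit(X, y, batch_size=128, epochs=20)
```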
A quick test of the hypothesis on the limited dataset (the same two layers of 64 LSTM cells, generating the next 100 tokens):
(Cyrillic samples omitted: epoch 1 is almost entirely periods; by epochs 4 through 20 words appear between the periods, including Latin-script tokens such as "facebook", "iphone8" and "android".)
Here is what happens. In the first epoch the network sees that one of the most frequent tokens is the period. "Let me put periods everywhere, since they are so frequent," it thinks, and gets a huge error. It hesitates for a couple of iterations, and by the 4th epoch it decides that some words should be inserted between the periods. "Great, the error got smaller," the network rejoices, "let's keep it up: I'll insert different words, but I still really like periods, so I'll keep putting them in." And it keeps putting in periods, diluting them with words. Gradually it figures out that words should appear more often than periods and that some words tend to follow one another, and it memorizes fragments of the dataset: "extend the government", "fulfill agreements", "twin eater", and so on. By the 20th epoch it remembers rather long phrases, such as "published footage of military equipment being dropped from the air". This shows that the approach works in principle, but on such a small dataset the network quickly overfits (even with dropout), and instead of original phrases it reproduces memorized ones.
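For completeness: the samples above are produced by feeding the network a seed and repeatedly drawing the next token from its predicted distribution. Below is a rough sketch of such a loop, reusing model, word_to_id and SEQ_LEN from the word-level sketch above; the temperature-based sampling is an illustrative choice, not necessarily what the repository does:

```python
def sample(preds, temperature=1.0):
    # Re-weight the predicted distribution and draw one word id from it:
    # low temperature favours memorized phrases, high temperature gives wilder mixes
    preds = np.log(np.asarray(preds, dtype=np.float64) + 1e-10) / temperature
    probs = np.exp(preds) / np.sum(np.exp(preds))
    return np.random.choice(len(probs), p=probs)

id_to_word = {i: w for w, i in word_to_id.items()}

def generate(seed_words, n_words=100, temperature=0.8):
    # seed_words: a list of SEQ_LEN words used as the seed text (shown in brackets above)
    window = [word_to_id[w] for w in seed_words]
    result = list(seed_words)
    for _ in range(n_words):
        preds = model.predict(np.array([window]))[0]
        next_id = sample(preds, temperature)
        result.append(id_to_word[next_id])
        window = window[1:] + [next_id]  # slide the window forward by one word
    return " ".join(result)
```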
Now let's see what happens if we train it on the full set of headlines:
(Cyrillic samples omitted: epochs 1, 5 and 10; words interleaved with occasional numbers and Latin-script tokens such as "forbes" and "tor".)
Ten epochs were enough for the loss to stop decreasing. As you can see, here too the network memorized some fairly long chunks, but there are also many relatively original phrases. You can often see the network assembling long phrases out of other long ones, like this: "called Durov cattle and publicly deleted Telegram" + "drunk Muscovite stole a tractor and got into an accident with a taxi" = "drunk Muscovite stole a tractor and publicly deleted Telegram".
Still, in most cases the phrases are not as nice as those of DeepDrumph and neuromzan. What is missing? Do I need to train longer, deeper and wider? And then it dawned on me: no, these guys have not found some magical architecture that produces beautiful text. They simply generate long texts, pick out the potentially funny pieces and edit them by hand. A human gets the final word, that's the secret!
After some manual editing you can get quite acceptable variants:
- "Responded. Peskov sale possibility of deputy mayor" >>> "Peskov responded to the possibility of selling the post of deputy mayor"
- "Vice speaker found man" >>> "Vice speaker found a man"
- "Deputy Lebedev recognized as inappropriate the proposal of the offer of murder" >>> "Deputy Lebedev recognized the offer of murder as inappropriate"
… and so on.
There is another important point related to the Russian language. Generating phrases in Russian is much harder, because the words in a sentence have to agree with each other grammatically. This is true in English too, of course, but to a much lesser extent. Here is an example:
"Car" >>> "car"
"By car" >>> "by car"
"I see the car" >>> "see the car"
That is, from the network's point of view the English word is the same token in every case, while in Russian each case is a different token because of the different endings. As a result, the model's output in English looks more plausible than in Russian. You could, of course, lemmatize the words in the Russian corpus, but then the generated text would consist only of lemmas, and manual post-editing becomes unavoidable. I did try this using the pymorphy2 module, but the result, in my personal opinion, was in places even worse, even though the number of unique tokens (words) after normalization dropped by more than half. After 20 epochs the result was:
(Cyrillic sample omitted: mostly periods with an occasional word, including the token "facebook".)
You may also notice that the original meaning of a word is often lost. For example, pymorphy normalized the surname "Пескова" (Peskov) to "песок" ("sand"), a serious loss for the whole corpus.
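For reference, a minimal example of what this normalization looks like with pymorphy2. The sample words are illustrative; the behaviour on the surname is what was observed in this corpus, not something pymorphy2 guarantees:

```python
import pymorphy2

morph = pymorphy2.MorphAnalyzer()

# Different case forms collapse into a single lemma (one token)
print(morph.parse("машине")[0].normal_form)   # "машина"
print(morph.parse("машину")[0].normal_form)   # "машина"

# ...but meaning can be lost: in the corpus above the surname
# "Пескова" (Peskov) came out normalized as "песок" ("sand")
print(morph.parse("Пескова")[0].normal_form)
```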
Conclusions
- Word embeddings cope with text generation much better than Char-RNN;
- Other things being equal, the quality of generated English text is generally higher than that of Russian text, due to the nature of the languages;
- When you hear that an AI has written a book (as happened the other day, when it was announced that an algorithm had written a Harry Potter book), divide that claim by a thousand: most likely the algorithm could only 1) generate names and descriptions of characters, or 2) generate a general outline of the story, or 3) generate semi-absurd text that human editors then long and stubbornly corrected and reconciled. Or all of the above. But the AI certainly did not write the book from the first word to the last; our technology is simply not at that level yet.
Dessert
And finally, some gems after manual editing, quite in the spirit of Lenta.ru, it seems to me:
- Ministry of Finance refused to party in the Ryazan region
- Australian rejected iPhone8 on Android
- Irkutsk left without air
- Police of St. Petersburg against the world of St. Petersburg
- China began sales of debtors
- Ministry of Economic Development reported a gay partnership against sanctions
- Greece made a teddy bear
- Putin made peace with the museum and with Solzhenitsyn
- Naked tourist died in Budyonnovsk
- Doctors dissuaded Gorbachev from parting with problems in the slave trade.
- In Britain, the semi-finals of the Confederations Cup
- MP Lebedev found inappropriate to fine taxi
- In the Federation Council offered to choose the honorary Leopoldov
- Deputy Speaker of the Duma justified the proposal to prepare a coup d'état
- Lithuanian oncologists in Moscow started treatment
- Central Bank will allocate money to repair the stone troll penis
- It is more difficult to decently joke about the new Moscow Region leading
- Drunk arrested provoked the State Duma
- Israel answered with acid in the face
- The head of the US Federal Reserve promised the absence of creams
- The problem of the barracks in Izhevsk decided through the bed
- Gyrfalcons on Yamal bred from Eurovision for the first time
- Moscow for the year will spend more Brazil
- Off the coast of Libya taxi from Ivanushek
- Published the death toll in the crash of a tourist vessel in the Rada
- Kim Kardashian accused of preparing another chemical attack
- The government has promised a lack of money for the internal affairs of Russia
- Thousand counselors prepared for work in Russia
- Breivik complained of air attack on his territory
- US demanded to give in the face Zhirkov
- Petersburg schizophrenic warned of a few days before blocking Telegram
The code, with notebooks and comments, is posted on GitHub. There is also a pre-trained network, a text-generation script (lenta_ai.py) and usage instructions. In 99% of cases you will get meaningless strings of words, and only occasionally will something interesting come up. And if you just want to have a look, I've launched a small web application on Heroku where you can generate headlines without running any code on your machine.