- Eh bien, mon prince. Gênes et Lucques ne sont plus que des apanages, des estates, de la famille Buonaparte. Je vous préviens que si vous ne me dites pas que nous avons la guerre, si vous vous permettez encore de pallier toutes les infamies, toutes les atrocités de cet Antichrist, vous n'êtes plus mon ami, vous n'êtes plus my faithful slave, comme vous dites. Well, hello, hello. Je vois que je vous fais peur, sit down and tell.
Recently, I came across this article https://habrahabr.ru/post/342738/ and wanted to write about word embeddings, Python, gensim and word2vec myself. In this part, I will try to cover training a basic w2v model.
So, we proceed.
The texts we will work with should be saved in utf-8 format. The first step is to download the data for nltk.
import nltk
nltk.download()
In the window that opens, select everything and go get a coffee. It takes about half an hour.
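If you do not want to wait for the full download, a quicker option (my shortcut, not from the article) is to fetch only the two nltk packages this tutorial actually uses:

import nltk

# grab only the punkt sentence tokenizer models and the stop-word lists
nltk.download('punkt')
nltk.download('stopwords')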
By default, the library has no Russian language support. But enthusiasts have already done the work for us. Download https://github.com/mhq/train_punkt and extract everything into C:\Users\<username>\AppData\Roaming\nltk_data\tokenizers\punkt and C:\Users\<username>\AppData\Roaming\nltk_data\tokenizers\punkt\PY3.
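To check that the Russian models ended up in the right place, a quick test (my addition, not from the article) is to tokenize a short string straight away:

from nltk.tokenize import sent_tokenize

# should print a list of two sentences if the Russian punkt model is installed correctly
print(sent_tokenize('Это первое предложение. А это второе!', 'russian'))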
We will use nltk to split the text into sentences and the sentences into words. To my surprise, it all works quite quickly. Well, enough of the setup; it is high time to write at least a line of real code.
Create a folder for the scripts and the data, and create an environment.
conda create -n tolstoy-like
Activate.
activate tolstoy-like
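The article does not mention it, but the freshly created environment starts out empty; assuming gensim and nltk are not already available in it, something along these lines should be enough:

pip install gensim nltk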
Put the text of Anna Karenina there, in a file called anna.txt.
If you use PyCharm, you can simply create a project and select Anaconda as the interpreter, without creating a separate environment.
Create a script called train-I.py.
Import the dependencies.
# -*- coding: utf-8 -*-
# imports
import gensim
import string
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
Read the text.
# load text
text = open('./anna.txt', 'r', encoding='utf-8').read()
Now it is the turn of the tokenizer for Russian sentences.
def tokenize_ru(file_text):
    # firstly let's apply nltk tokenization
    tokens = word_tokenize(file_text)

    # let's delete punctuation symbols
    tokens = [i for i in tokens if (i not in string.punctuation)]

    # deleting stop_words
    stop_words = stopwords.words('russian')
    stop_words.extend(['', '', '', '', '', '', '', '—', '–', '', '', '...'])
    tokens = [i for i in tokens if (i not in stop_words)]

    # cleaning words
    tokens = [i.replace("«", "").replace("»", "") for i in tokens]

    return tokens
Let us dwell on this in a bit more detail. The first line splits a sentence (a string) into words (a list of strings). Then we delete the punctuation, which nltk, for some reason, returns as separate tokens. Next come the stop words: words from which our model will gain nothing and which would only distract it from the main text. These include interjections, conjunctions and some pronouns, as well as some favourite filler words. Finally, we strip the guillemets («»), which this novel has in abundance.
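As a small illustration of the punctuation point (my example, not from the article), nltk's word_tokenize really does return punctuation marks as separate tokens, which is why we filter them out:

from nltk.tokenize import word_tokenize
import string

tokens = word_tokenize('Привет, мир!')
print(tokens)                                              # ['Привет', ',', 'мир', '!']
print([t for t in tokens if t not in string.punctuation])  # ['Привет', 'мир']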
Now we split the text into sentences, and each sentence into a list of words.
sentences = [tokenize_ru(sent) for sent in sent_tokenize(text, 'russian')]
Out of curiosity, let us print the number of sentences and a couple of them.
print(len(sentences))  # 20024
print(sentences[200:209])  # [['', '', '', '', '', ''],...]
Now we begin training the model. Do not be afraid, it will not take half an hour: 20,024 sentences are nothing for gensim.
# train model
# size: embedding dimensionality, window: context width, min_count: ignore rarer words, workers: CPU threads
model = gensim.models.Word2Vec(sentences, size=150, window=5, min_count=5, workers=4)
# save model
model.save('./w2v.model')
print('saved')
Save the file. To run it, those who work in PyCharm or Spyder can simply press Run. Those who write in a plain text editor will have to open the Anaconda Prompt (just type it into the search menu), go to the directory with the script, and run the command
python train-I.py
Done. Now you can proudly say that you have trained word2vec.
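If you want to see what the model actually learned, here is a quick sanity check (my addition, not part of the article), assuming the same pre-4.0 gensim API used above:

import gensim

# reload the saved model and peek inside
model = gensim.models.Word2Vec.load('./w2v.model')
print(len(model.wv.vocab))            # how many distinct words survived the min_count filter
some_word = list(model.wv.vocab)[0]   # pick an arbitrary word that is definitely in the vocabulary
print(some_word, model.wv.most_similar(some_word, topn=3))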
No matter how hard we try, Anna Karenina alone is not enough to train the model. Therefore, we will use another work by the same author: War and Peace.
You can download it from here, also in TXT format. Before use, you will have to combine the two files into one. Put the result into the same directory as in the first part and call it war.txt. One of the charms of using gensim is that any loaded model can be updated with new data. That is exactly what we will do.
Create a train-II.py script.
I think this part needs no explanation, since there is nothing new in it.
# -*- coding: utf-8 -*-
# imports
import gensim
import string
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# load text
text = open('./war.txt', 'r', encoding='utf-8').read()

def tokenize_ru(file_text):
    # firstly let's apply nltk tokenization
    tokens = word_tokenize(file_text)

    # let's delete punctuation symbols
    tokens = [i for i in tokens if (i not in string.punctuation)]

    # deleting stop_words
    stop_words = stopwords.words('russian')
    stop_words.extend(['', '', '', '', '', '', '', '—', '–', '', '', '...'])
    tokens = [i for i in tokens if (i not in stop_words)]

    # cleaning words
    tokens = [i.replace("«", "").replace("»", "") for i in tokens]

    return tokens

# tokenize sentences
sentences = [tokenize_ru(sent) for sent in sent_tokenize(text, 'russian')]
print(len(sentences))  # 30938
print(sentences[200:209])  # [['', '', '', '', '', '', '', '', '', '', '', '', '', '', ''],...]
Then we load our model and feed it the new data.
# train model part II
model = gensim.models.Word2Vec.load('./w2v.model')
model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
Let me pause here for a moment. total_examples sets the number of training examples (sentences); in our case that is the entire corpus known to the model (model.corpus_count), including the new sentences. And epochs is the number of training iterations. Honestly, I myself could not figure out from the documentation what model.iter means. If you know, please write in the comments and correct me.
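A side note from me rather than from the original article: as far as I can tell, in the pre-4.0 gensim API model.iter is simply the number of epochs the model was created with (5 by default). Also, train() on its own does not add new words to an existing vocabulary, so the usual pattern for updating a model with a new corpus looks roughly like this:

# hedged sketch of incremental training, assuming the same pre-4.0 gensim API as above
model = gensim.models.Word2Vec.load('./w2v.model')
model.build_vocab(sentences, update=True)       # merge words from the new corpus into the vocabulary
model.train(sentences,
            total_examples=model.corpus_count,  # corpus size after the vocabulary update
            epochs=model.iter)                  # epoch count configured when the model was created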
# save model
model.save('./w2v-II.model')
print('saved')
Do not forget to run the script.
There are no results to show yet, and there will not be for a while. The model is far from perfect; frankly, it is terrible. In the next article I will definitely tell you how to fix that. But here is one last thing for now:
# -*- coding: utf-8 -*-
# imports
import gensim

model = gensim.models.Word2Vec.load('./w2v-II.model')
print(model.most_similar(positive=['', ''], negative=[''], topn=1))
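The specific words to feed into most_similar depend on your corpus; purely to illustrate what such queries look like, here is a sketch with example words of my own choosing, guarded so that it does not fail if a word is missing from the vocabulary:

import gensim

model = gensim.models.Word2Vec.load('./w2v-II.model')

# analogy-style query: positive words attract the answer, negative words push it away
if all(w in model.wv for w in ('Анна', 'муж', 'жена')):
    print(model.wv.most_similar(positive=['Анна', 'муж'], negative=['жена'], topn=1))

# plain cosine similarity between two words
if 'война' in model.wv and 'мир' in model.wv:
    print(model.wv.similarity('война', 'мир'))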
PS
Not so bad, actually. The resulting vocabulary contains about 5 thousand words along with their relationships. In the next article I will present a better model (15,000 words) and talk more about preparing the text. And finally, in the third part, I will publish the final model and explain how to write a program that uses neural networks to generate text in the style of Tolstoy.
I hope you liked my article a little.
Source: https://habr.com/ru/post/343704/