
Teaching a computer to write like Tolstoy, Volume I

- Well, Prince, so Genoa and Lucca are now no more than family estates of the Buonapartes. But I warn you: if you do not tell me that this means war, if you still allow yourself to excuse all the infamies, all the atrocities of that Antichrist, you are no longer my friend, no longer my faithful slave, as you say. Well, hello, hello. I see I have frightened you; sit down and tell me all about it.

VOLUME ONE


PART ONE. Anna Karenina


Recently I came across this article: https://habrahabr.ru/post/342738/. It made me want to write about word embeddings, Python, gensim, and word2vec. In this part I will try to cover training a basic word2vec (w2v) model.


So, let's get started.



The first step is to download the data for NLTK.


 import nltk
 nltk.download()

In the window that opens, select everything, and go get a coffee: the download takes about half an hour.
By default the library has no Russian language support, but craftsmen have already done everything for us. Download https://github.com/mhq/train_punkt and extract everything into
C:\Users\<username>\AppData\Roaming\nltk_data\tokenizers\punkt and
C:\Users\<username>\AppData\Roaming\nltk_data\tokenizers\punkt\PY3.


We will use NLTK to split the text into sentences, and the sentences into words. To my surprise, it all works quite quickly. Well, enough setup; it is high time to write at least a line of real code.
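As a quick check that the Russian tokenizer from the step above is in place, here is a minimal sketch (this snippet is mine, not from the article; the sample text is the opening of Anna Karenina):

 from nltk.tokenize import sent_tokenize, word_tokenize

 # Split a Russian text into sentences, then each sentence into word tokens
 text = 'Все счастливые семьи похожи друг на друга. Каждая несчастливая семья несчастлива по-своему.'
 for sentence in sent_tokenize(text, language='russian'):
     print(word_tokenize(sentence))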
Create a folder for the scripts and the data. Create an environment.


 conda create -n tolstoy-like 

Activate.


 activate tolstoy-like
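The article does not show an install step, but the scripts below need gensim and NLTK inside the freshly created environment; presumably something like this was run (a standard conda install, my assumption):

 conda install gensim nltk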

Put the text of Anna Karenina there as a file called anna.txt.
PyCharm users can simply create a project and select Anaconda as the interpreter, without creating an environment.


Create a script train-I.py.
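The listing itself is not reproduced here, so below is a minimal sketch of what such a training script could look like; it is not the author's exact code, and the model file name w2v-I.model and the word2vec parameters are my assumptions:

 # -*- coding: utf-8 -*-
 # train-I.py: train a basic word2vec model on Anna Karenina
 import gensim
 from nltk.tokenize import sent_tokenize, word_tokenize

 # Read the raw text
 with open('anna.txt', encoding='utf-8') as f:
     raw_text = f.read()

 # Split the text into sentences, then each sentence into lowercase words,
 # dropping punctuation and numbers
 sentences = [
     [word.lower() for word in word_tokenize(sentence) if word.isalpha()]
     for sentence in sent_tokenize(raw_text, language='russian')
 ]

 # Train the model and save it so that part two can update it with new data
 model = gensim.models.Word2Vec(sentences, min_count=5, workers=4)
 model.save('./w2v-I.model')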



Save the file. To run it, those working in PyCharm or Spyder can just press Run. Those writing by hand in Notepad or another editor will have to open the Anaconda Prompt (just type it into the search menu), go to the directory with the script, and run the command


 python train-I.py 

Done. Now you can proudly say that you have trained word2vec.


PART TWO. War and Peace


No matter how hard we try, one Anna Karenina is not enough to train the model. So we will use the author's second work: War and Peace.


You can download it from here, also in TXT format. Before use, you will have to combine the two files into one. Put it into the directory from part one and call it war.txt. One of the charms of gensim is that any loaded model can be updated with new data. This is exactly what we will do.
Create a train-II.py script.
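Again a sketch rather than the author's listing; it assumes the part-one model was saved as w2v-I.model, and it saves the updated model as w2v-II.model, the name the epilogue loads:

 # -*- coding: utf-8 -*-
 # train-II.py: update the saved model with War and Peace
 import gensim
 from nltk.tokenize import sent_tokenize, word_tokenize

 # Load the model trained in part one
 model = gensim.models.Word2Vec.load('./w2v-I.model')

 # Tokenize the new text the same way as before
 with open('war.txt', encoding='utf-8') as f:
     raw_text = f.read()
 sentences = [
     [word.lower() for word in word_tokenize(sentence) if word.isalpha()]
     for sentence in sent_tokenize(raw_text, language='russian')
 ]

 # Add the new words to the vocabulary, continue training, and save
 model.build_vocab(sentences, update=True)
 model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
 model.save('./w2v-II.model')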



Do not forget to run it.


EPILOGUE. And where are the tests?


There are none. And there will not be any, for now. The model is far from perfect; frankly, it is terrible. In the next article I will definitely explain how to fix that. But here is one last thing for you:


 # -*- coding: utf-8 -*-
 # imports
 import gensim

 # Load the updated model and run an analogy query: find the word closest
 # to the 'positive' words and farthest from the 'negative' one
 # (the Russian query words were lost from this copy; fill in your own)
 model = gensim.models.Word2Vec.load('./w2v-II.model')
 print(model.most_similar(positive=['', ''], negative=[''], topn=1))

PS


Not so bad, actually. The resulting vocabulary contains about 5,000 words with their dependencies and relationships. In the next article I will present an improved model (15,000 words) and say more about preparing the text. And finally, in the third part, I will publish the final model and explain how to write a program that uses neural networks to generate text in the style of Tolstoy.




Good luck with machine learning.


I hope you liked my article a little.



Source: https://habr.com/ru/post/343704/

