Can you tell which politician said something just from the quote itself? The Ukrainian NGO Vox Ukraine runs the VoxCheck project, which fact-checks statements by the highest-rated politicians. They recently published their entire database of checked quotes. I am currently taking courses on NLP, so I decided to check how accurately the author of a quote can be determined from its text.
Disclaimer. This article was written out of interest in the topic and a desire to try out the course material in practice; it makes no claim to being the most accurate or detailed analysis.
Python was used for the analysis; the code is available on GitHub.
Data
The database currently contains 1952 quotes, distributed by politician as follows:
[Chart: number of quotes per politician]

For the analysis, I selected politicians with more than 200 quotes. As a result, Yury Boyko, Oleg Tyagnibok, Andrey Sadovoy and Vladimir Zelensky were dropped, leaving 1667 quotes. Of the six remaining speakers, four (all except Groisman and Rabinovich) are registered candidates in the upcoming presidential election.
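The selection step can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: it assumes the quotes are available as (speaker, text) pairs, and the function name and toy data are made up for the example.

```python
from collections import Counter

def filter_by_quote_count(quotes, min_quotes):
    """Keep only quotes whose speaker has more than `min_quotes` quotes."""
    counts = Counter(speaker for speaker, _ in quotes)
    return [(speaker, text) for speaker, text in quotes if counts[speaker] > min_quotes]

# Toy data: (speaker, text) pairs; names and texts are placeholders.
toy_quotes = [("A", "q1"), ("A", "q2"), ("A", "q3"), ("B", "q4"), ("C", "q5"), ("C", "q6")]

# With the real data the threshold would be 200; here it is 1 to fit the toy sample.
kept = filter_by_quote_count(toy_quotes, min_quotes=1)
kept_speakers = sorted({speaker for speaker, _ in kept})
print(kept_speakers, len(kept))
```

In the toy sample speaker B has only one quote and is dropped, just as the four politicians with too few quotes are dropped from the real data.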
The quotes vary in length, from short ones of about 30 characters ("I have already submitted 112 bills.") to long ones of about 1,200 characters. The average quote is about 200 characters long (this one, for example: "Soon our children will lose a cow at the museum’s order with a dinosaur cheeks in nature management - for the results of this policy, and to carry out a ninja for livestock. Less than 2 years old).")
TF-IDF
To start, let's see which words are most characteristic of each speaker. Here are the 10 words with the highest TF-IDF value for each speaker:

TF-IDF in brief: TF-IDF (term frequency - inverse document frequency) is a measure of how important a word is in the context of a document. The TF-IDF value of a word is proportional to how often the word is used in the document and inversely proportional to how often it is used across all documents in the collection. In the context of our data, a high TF-IDF value means that a politician uses the word often while the other politicians use it relatively rarely.
Stemming (reducing each word to its stem) was applied before calculating TF-IDF.
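To make the definition concrete, here is a small sketch of the calculation. It rests on several assumptions that are not from the original analysis: all quotes of one speaker are treated as a single document, the tokens are taken as already stemmed, and the classic formula tf * log(N / df) is used (libraries often apply smoothed variants, so their numbers will differ). The function name and toy data are hypothetical.

```python
import math
from collections import Counter

def top_tfidf_words(docs, k=10):
    """docs: {speaker: list of tokens}. Returns {speaker: [(word, score), ...]}."""
    n_docs = len(docs)
    # Document frequency: in how many speakers' documents each word occurs.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    result = {}
    for speaker, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        # tf = share of the word in the speaker's text, idf = log(N / df).
        scores = {w: (c / total) * math.log(n_docs / df[w]) for w, c in counts.items()}
        # Sort by descending score, breaking ties alphabetically.
        result[speaker] = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return result

# Toy documents: tokens are assumed to be already stemmed.
toy_docs = {
    "speaker_a": "gas price gas debt economy".split(),
    "speaker_b": "army war army economy".split(),
    "speaker_c": "debt economy budget".split(),
}
top = top_tfidf_words(toy_docs, k=2)
for speaker, words in top.items():
    print(speaker, [w for w, _ in words])
```

A word that every speaker uses ("economy" in the toy data) gets a score of zero, which is why such words do not show up among the characteristic ones.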
The words highlighted in green are the ones I would like to comment on for each speaker, to give a little context.
Oleg Lyashko:
- Poland: Lyashko often mentions Poland in connection with the labor migration of Ukrainians there, and also compares incomes in Poland and Ukraine
- Cereals: Lyashko says that Ukraine exports grain and loses out on it, because it could be exporting more flour instead
- Oncology, drugs: Lyashko is an ardent opponent of the current medical reform and often says that the cost of cancer treatment is barely covered by the state
Poroshenko and Gritsenko talk a lot about the military conflict, which is quite logical: Poroshenko is the president and therefore the supreme commander-in-chief, and Gritsenko is a military officer and a former minister of defense.
Groisman is the Prime Minister and mostly talks about the economy, including the public debt.
Vadim Rabinovich's quotes do not show a specific theme, perhaps because he has by far the most quotes (444 out of 1952; everyone else has fewer than 300).
Yulia Tymoshenko talks a lot about Ukraine's gas transmission system, the liquidation of banks, and the country's weak economic indicators.
Quote classification
So, we have 6 classes (speakers). For classification I used a naive Bayes classifier. Russian and Ukrainian stop words were removed from the text (using the stopwords package). N-grams of length up to 2 were included (a variant with length up to 3 was also tested, but it overfitted). The test sample is 20% of the total.
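As an illustration of this setup, below is a self-contained sketch of a multinomial naive Bayes classifier over unigrams and bigrams, written with the standard library only. It is not the code from the repository (which may well rely on an NLP library): the class and function names, the add-one smoothing, the tiny English stop-word set and the toy quotes are all assumptions made for the example.

```python
import math
import random
from collections import Counter, defaultdict

def ngrams(text, stop_words=frozenset(), max_n=2):
    """Lowercase, drop stop words, return all n-grams up to length max_n."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    feats = []
    for n in range(1, max_n + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats

def train_test_split(items, test_share=0.2, seed=0):
    """Shuffle a copy of the items and hold out `test_share` of them."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = len(items) - round(len(items) * test_share)
    return items[:cut], items[cut:]

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels, **ngram_options):
        self.ngram_options = ngram_options
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feat_counts[label].update(ngrams(text, **ngram_options))
        self.vocab = {f for counts in self.feat_counts.values() for f in counts}
        return self

    def predict(self, text):
        n_quotes = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.feat_counts[label].values()) + len(self.vocab)
            score = math.log(count / n_quotes)  # log prior of the speaker
            for f in ngrams(text, **self.ngram_options):
                if f in self.vocab:  # ignore features never seen in training
                    score += math.log((self.feat_counts[label][f] + 1) / total)
            if score > best_score:
                best, best_score = label, score
        return best

# Toy training data; speakers, texts and stop words are placeholders.
data = [
    ("speaker_a", "the gas price is rising"),
    ("speaker_a", "the gas debt is huge"),
    ("speaker_b", "the army needs support"),
    ("speaker_b", "the army is at war"),
]
labels, texts = zip(*data)
model = NaiveBayes().fit(texts, labels, stop_words={"the", "is"}, max_n=2)
print(model.predict("the gas price"), model.predict("army war"))

# Holding out 20% of ten items leaves eight for training and two for testing.
train, test = train_test_split(range(10), test_share=0.2)
print(len(train), len(test))
```

Raising `max_n` to 3 adds trigrams as features; on a data set of this size most of them occur only once, which is consistent with the overfitting mentioned above.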
The overall accuracy of the model (the share of correctly classified quotes) is 74.8% on the training sample and 75.7% on the test sample.
Results by author:

Accuracy is highest for Vadim Rabinovich (97%), most likely because he is the only one of the six who speaks Russian. Accuracy is also high for Groisman and Lyashko (78% and 77%).
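Per-author figures like these can be computed from the true and predicted labels. The sketch below assumes that "accuracy" for an author means the share of that author's quotes attributed to them correctly; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def per_author_accuracy(true_labels, predicted_labels):
    """Share of each author's quotes that the model attributed correctly."""
    totals = Counter(true_labels)
    # confusion[(true, predicted)] = number of quotes with that pair of labels.
    confusion = Counter(zip(true_labels, predicted_labels))
    correct = Counter()
    for (true, pred), n in confusion.items():
        if true == pred:
            correct[true] += n
    accuracy = {author: correct[author] / totals[author] for author in totals}
    return accuracy, confusion

# Toy labels: one of speaker "a"'s four quotes is mistaken for speaker "b".
y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "b", "b", "b"]
accuracy, confusion = per_author_accuracy(y_true, y_pred)
print(accuracy)
print(confusion[("a", "b")])
```

The off-diagonal entries of `confusion` show which speaker a misclassified quote was attributed to.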
For Poroshenko's and Tymoshenko's quotes, accuracy is slightly above 60%. The model often attributes quotes from both of them to Groisman. As prime minister, Groisman often speaks about the economy in the form of a "progress report", and the misclassified quotes by Poroshenko and Tymoshenko are on the same subject (except that Poroshenko, as a government official, speaks positively, while Tymoshenko does the opposite).
For example, here is a quote by Poroshenko that the model attributed to Groisman:
"UAH 5 billion, (tobto) UAH 4 billion of that amount and UAH 1 billion worth of money are stored on Silsk medicine"
And here is a quote by Tymoshenko that was also attributed to Groisman:
"In the offensive budget, on the basis of prisons, they saw two or more pennies, not much for science, but to be in the Academy of Sciences of Ukraine."
The lowest accuracy (57%) is for Anatoly Gritsenko's quotes. The model often attributes them to Poroshenko (which is logical, given the military theme of both speakers' quotes) and also to Lyashko. The quotes misattributed to Lyashko are ones criticizing the authorities, for example on migration:
"I don’t tell about the same member of your family, Volodymyr Borisovich, Mr. Klіmkіn, saying, my son left the marina."
Overall, I think the result is not bad for such short quotes of similar format (oral statements by politicians) and on similar topics (Ukrainian politics). By the way, I also tried to build a model on the same data that predicts the category of a quote (true / false / manipulation), but its accuracy turned out to be very low. That is basically logical: looking at a quote like "So much money was spent on this, but in such a country they spend so much on this", it is hard to judge whether the figures in it are true :)