I once wondered whether British and American literature differ in their choice of words, and if so, whether I could train a classifier to distinguish literary texts by the frequency of the words used. Distinguishing texts written in different languages is easy: the intersection of their vocabularies is small relative to the vocabulary of either sample. Classifying texts into categories like “science”, “Christianity”, “computer graphics”, and “atheism” is the well-known “hello world” of text-frequency tasks. My task was harder: I was comparing two dialects of the same language, and the texts shared no common semantic orientation.
The longest stage of machine learning is data collection. For the training set I used texts from Project Gutenberg, which can be downloaded freely, and lists of American and British authors from Wikipedia. The difficulty lay in matching texts to authors by name. The project site has a good name search, but scraping the site is forbidden; instead, an archive of metadata is provided. That meant solving a non-trivial name-matching problem (Sir Arthur Ignatius Conan Doyle and Doyle, C. are the same person, but Doyle, M. E. is not) and doing it with very high accuracy. Instead, sacrificing sample size for accuracy and my own time, I used the link to the author's Wikipedia page, present in some of the metadata files, as a unique identifier. In the end I had about 1,600 British texts and 2,500 American ones, and I began training the classifier.
For all operations I used the sklearn package. The first step after collecting and examining the data is preprocessing, for which I used CountVectorizer. It takes an array of text data as input and returns a vector of features. Next, the features must be converted into a weighted numeric representation, since the classifier works with numeric data. For this you compute tf-idf (term frequency – inverse document frequency) using TfidfTransformer.
A short example of how this is done and why:
Take the word “the” and count its occurrences in text A. Suppose there are 100 occurrences and the document contains 1,000 words in total:
tf(“the”) = 100/1000 = 0.1
Next, take the word “sepal”, which occurred 50 times:
tf(“sepal”) = 50/1000 = 0.05
To compute the inverse document frequency of a word, take the logarithm of the ratio of the total number of texts to the number of texts containing at least one occurrence of that word. If there are 10,000 texts in all and every one contains the word “the”:
idf(“the”) = log(10000/10000) = 0
tf-idf(“the”) = idf(“the”) * tf(“the”) = 0 * 0.1 = 0
The word “sepal” is much rarer, found in only 5 texts, so
idf(“sepal”) = log(10000/5) = 7.6
tf-idf(“sepal”) = 7.6 * 0.05 = 0.38
Thus frequent words get minimal weight, while specific rare words get large weight; and from the high weight of “sepal” in text A, we can assume the text is somehow related to botany.
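The arithmetic above can be checked in a few lines (using the natural logarithm, as in the example):

```python
import math

n_docs = 10000
tf_the = 100 / 1000    # 0.1
tf_sepal = 50 / 1000   # 0.05

idf_the = math.log(n_docs / 10000)  # word appears in every document -> 0
idf_sepal = math.log(n_docs / 5)    # rare word -> large idf

print(round(idf_sepal, 1))             # 7.6 (natural log of 2000)
print(round(tf_sepal * idf_sepal, 2))  # 0.38
print(tf_the * idf_the)                # 0.0
```

Note that sklearn's TfidfTransformer uses a smoothed variant by default, idf = log((1 + n) / (1 + df)) + 1, so its values differ slightly from this textbook formula, but the ranking of words is the same in spirit.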
Now that the data is represented as a set of features, the classifier must be trained. I am working with text represented as sparse data, so a linear classifier, which handles classification tasks with a large number of features well, is the best option. I trained with CountVectorizer, TfidfTransformer, and SGDClassifier, all with default parameters. You can work with each stage separately, but it is more convenient to use a pipeline:
```python
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
```
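Training and scoring the pipeline then looks like this; the corpus and labels are toy stand-ins for the Gutenberg texts, not the author's data:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Toy stand-ins for the Gutenberg texts and their labels (hypothetical data).
texts = ["the colour of the parlour", "the color of the parlor"] * 50
labels = ["british", "american"] * 50

X_train, X_test, y_train, y_test = train_test_split(texts, labels, random_state=0)

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
pipeline.fit(X_train, y_train)                 # raw strings in, labels out
print(pipeline.score(X_test, y_test))          # mean accuracy on the held-out split
```

The pipeline takes raw strings end to end, which is what makes the later GridSearchCV step able to tune all three stages at once.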
After plotting accuracy against sample size, I noticed strong fluctuations in accuracy even at large sample sizes, which meant the classifier depended heavily on the particular sample, and was therefore not very effective; significant improvements were required. Inspecting the list of classifier weights revealed part of the problem: the algorithm was overfitting on frequent words such as “of” and “he”, which are in fact noise. The problem is easily solved by removing such words from the features, either with the CountVectorizer parameter stop_words = 'english' or with your own word list. Removing the stop words brought accuracy to 0.85.
Next, I tuned the parameters using GridSearchCV. This method finds the best combination of parameters for CountVectorizer, TfidfTransformer, and SGDClassifier, but it is a very slow process; in my case it ran for about a day. In the end I arrived at this pipeline:
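GridSearchCV exhaustively tries every combination in a parameter grid, cross-validating each one; stage parameters are addressed as `stepname__param`. A minimal self-contained sketch, with a toy corpus and an illustrative grid (not the author's actual grid):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

# Illustrative grid; the author's full grid is not shown in the article.
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': [True, False],
    'clf__alpha': [1e-2, 1e-3, 1e-4],
}

# Toy corpus standing in for the Gutenberg texts (hypothetical data).
texts = ["the colour of the parlour", "the color of the parlor"] * 20
labels = ["british", "american"] * 20

search = GridSearchCV(pipeline, parameters, cv=3, n_jobs=-1)
search.fit(texts, labels)
print(search.best_score_)   # best mean cross-validated accuracy
print(search.best_params_)  # winning parameter combination
```

The runtime grows multiplicatively with each parameter added to the grid, which is why the full search over three stages took about a day.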
```python
pipeline = Pipeline([
    ('vect', CountVectorizer(stop_words=modifyStopWords(), ngram_range=(1, 1))),
    ('tfidf', TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)),
    ('clf', SGDClassifier(alpha=0.001, fit_intercept=True, n_iter=10,
                          penalty='l2', loss='epsilon_insensitive')),
])
```
Final accuracy: 0.89.
Now the part most interesting to me: which words indicate the origin of a text. Here is the list of words, sorted by descending absolute weight in the classifier:
American text : dollars, new, york, girl, gray, american, carvel, color, city, ain, long, just, parlor, boston, honor, washington, home, labor, got, finally, maybe, hodder, forever, dorothy dr
British text: round, sir, lady, london, quite, mr, shall, lord, gray, dear, honor, having, philip, poor, pounds, scrooge, soames, things, sea, man, end, come, color, illustration, english, learnt
Playing with the classifier, I extracted the most “British” authors among the Americans and the most “American” authors among the British (an elegant way of showing how badly my classifier can err):
The most "British" Americans:
The most "American" British:
And also the most “British” of the British and the most “American” of the Americans (since the classifier is still good).
Americans:
British:
The idea of building such a classifier was prompted by this tweet from @TragicAllyHere:
I would love to be British. The phone booth, adding letters to wourds.
The code is available here, along with the already trained classifier.
Source: https://habr.com/ru/post/319826/