This article is a translation of a chapter on working with text data from the official scikit-learn documentation. You can read the beginning of the article in part 1.

Training a classifier
Now that we have extracted the features, we can train a classifier to predict the category of a text. Let's start with the Naive Bayes classifier, which is an excellent starting point for this task. scikit-learn includes several variants of this classifier; the one best suited to word counts is the multinomial version:
>>> from sklearn.naive_bayes import MultinomialNB
>>> clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
To try to predict the outcome on a new document, we need to extract the features using almost the same sequence as before. The difference is that we call the transform method on the transformers instead of fit_transform, since they have already been fitted to our training set:
>>> docs_new = ['God is love', 'OpenGL on the GPU is fast']
>>> X_new_counts = count_vect.transform(docs_new)
>>> X_new_tfidf = tfidf_transformer.transform(X_new_counts)
>>> predicted = clf.predict(X_new_tfidf)
>>> for doc, category in zip(docs_new, predicted):
...     print('%r => %s' % (doc, twenty_train.target_names[category]))
...
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
Building a pipeline
To make the vectorizer => transformer => classifier chain easier to work with, scikit-learn provides the Pipeline class, which behaves like a compound (pipeline) classifier:
>>> from sklearn.pipeline import Pipeline
>>> text_clf = Pipeline([('vect', CountVectorizer()),
...                      ('tfidf', TfidfTransformer()),
...                      ('clf', MultinomialNB()),
... ])
The names vect, tfidf, and clf (classifier) are chosen arbitrarily by us. We will see how to use them below, in the section on grid search. Now we can train the model with a single command:
>>> text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
Evaluating performance on the test set
Evaluating the predictive accuracy of the model is quite simple:
>>> import numpy as np
>>> twenty_test = fetch_20newsgroups(subset='test',
...     categories=categories, shuffle=True, random_state=42)
>>> docs_test = twenty_test.data
>>> predicted = text_clf.predict(docs_test)
>>> np.mean(predicted == twenty_test.target)
0.834...
We achieved 83.4% accuracy. Let's see whether we can improve on this result with a linear support vector machine (SVM), which is widely regarded as one of the best text classification algorithms (although it is somewhat slower than naive Bayes). We can switch to a different learning model simply by plugging a different classification object into our pipeline:
>>> from sklearn.linear_model import SGDClassifier
>>> text_clf = Pipeline([('vect', CountVectorizer()),
...                      ('tfidf', TfidfTransformer()),
...                      ('clf', SGDClassifier(loss='hinge', penalty='l2',
...                                            alpha=1e-3, n_iter=5,
...                                            random_state=42)),
... ])
>>> _ = text_clf.fit(twenty_train.data, twenty_train.target)
>>> predicted = text_clf.predict(docs_test)
>>> np.mean(predicted == twenty_test.target)
0.912...
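Note that the n_iter parameter reflects the scikit-learn API current when this tutorial was written; in releases 0.19 and later that parameter of SGDClassifier was renamed to max_iter. Under a newer version the classifier step would look roughly like this (a sketch, assuming scikit-learn >= 0.19):

>>> from sklearn.linear_model import SGDClassifier
>>> clf = SGDClassifier(loss='hinge', penalty='l2',
...                     alpha=1e-3, max_iter=5, random_state=42)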
scikit-learn also provides utilities for more detailed analysis of the results:
>>> from sklearn import metrics
>>> print(metrics.classification_report(twenty_test.target, predicted,
...     target_names=twenty_test.target_names))
...
                        precision    recall  f1-score   support

           alt.atheism       0.95      0.81      0.87       319
         comp.graphics       0.88      0.97      0.92       389
               sci.med       0.94      0.90      0.92       396
soc.religion.christian       0.90      0.95      0.93       398

           avg / total       0.92      0.91      0.91      1502

>>> metrics.confusion_matrix(twenty_test.target, predicted)
array([[258,  11,  15,  35],
       [  4, 379,   3,   3],
       [  5,  33, 355,   3],
       [  5,  10,   4, 379]])
As expected, the confusion matrix shows that posts from the atheism and Christianity newsgroups are confused with each other more often than with posts about computer graphics.
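To make the rows of the matrix easier to read, we can print each one next to its category name (a small illustrative snippet using the variables from the examples above):

>>> for name, row in zip(twenty_test.target_names,
...                      metrics.confusion_matrix(twenty_test.target, predicted)):
...     print('%25s %s' % (name, row))
...
              alt.atheism [258  11  15  35]
            comp.graphics [  4 379   3   3]
                  sci.med [  5  33 355   3]
   soc.religion.christian [  5  10   4 379]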
Parameter tuning using grid search
We have already come across some of the parameters, such as use_idf in the TfidfTransformer. Classifiers typically have many parameters as well; for example, MultinomialNB includes a smoothing parameter alpha, and SGDClassifier has a penalty parameter alpha and configurable loss and penalty terms in the objective function (see the documentation or use Python's help function for more details).
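The <step name>__<parameter> naming convention used below comes from the pipeline step names we chose earlier; every tunable parameter of the pipeline can be listed with get_params (output abbreviated here):

>>> sorted(text_clf.get_params().keys())
['clf', 'clf__alpha', ..., 'tfidf__use_idf', ..., 'vect__ngram_range', ...]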
Instead of tweaking the parameters of the various components of the chain by hand, we can run an exhaustive search for the best parameters over a grid of possible values. Here we try out all classifiers on either words or bigrams, with or without idf, and with a penalty parameter of either 0.01 or 0.001 for the linear SVM:
>>> from sklearn.grid_search import GridSearchCV
>>> parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
...               'tfidf__use_idf': (True, False),
...               'clf__alpha': (1e-2, 1e-3),
... }
Obviously, such an exhaustive search can be resource-intensive. If we have several CPU cores at our disposal, we can tell the grid search to try the parameter combinations in parallel by means of the n_jobs parameter. If we set this parameter to -1, the grid search will detect how many cores are installed and use them all:
>>> gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
A grid search instance behaves like a normal scikit-learn model. Let's run the search on a smaller subset of the training data to speed things up:
>>> gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])
As a result of calling fit on the GridSearchCV object, we get a classifier that we can use to run predict:
>>> twenty_train.target_names[gs_clf.predict(['God is love'])[0]]
'soc.religion.christian'
On the other hand, it is a very large and bulky object. We can obtain the optimal parameters by examining the object's grid_scores_ attribute, which is a list of parameters/score pairs. To get the best score and the parameters that produced it:
>>> best_parameters, score, _ = max(gs_clf.grid_scores_, key=lambda x: x[1])
>>> for param_name in sorted(parameters.keys()):
...     print("%s: %r" % (param_name, best_parameters[param_name]))
...
clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)
>>> score
0.900...
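In scikit-learn 0.18 and later, GridSearchCV lives in sklearn.model_selection, and grid_scores_ was replaced by the cv_results_ attribute together with best_params_ and best_score_, so no max() trick is needed. The equivalent of the snippet above would look roughly like this under a newer version:

>>> from sklearn.model_selection import GridSearchCV  # replaces sklearn.grid_search
>>> gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
>>> gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])
>>> for param_name in sorted(parameters.keys()):
...     print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))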
Exercises
To complete the exercises, copy the contents of the 'skeletons' folder into a new folder called 'workspace':
% cp -r skeletons workspace
You can edit the contents of the workspace folder without fear of losing the original instructions for the exercises.
Then open the ipython shell and run the incomplete script for the exercise:
In [1]: %run workspace/exercise_XX_script.py arg1 arg2 arg3
If an exception is thrown, use %debug to start a post-mortem ipdb session.
Refine the implementation and iterate until the exercise is solved.
In each exercise, the skeleton files contain all the necessary import statements, boilerplate code for loading the data, and sample code for evaluating the predictive accuracy of the model.

Exercise 1: Language identification
- Write a text classification pipeline using a custom preprocessor and CharNGramAnalyzer, with Wikipedia articles as the training set (a minimal sketch of such a pipeline follows the command line below).
- Evaluate the performance on a held-out test set that does not overlap with the training set.
ipython command line:
%run workspace/exercise_01_language_train_model.py data/languages/paragraphs/
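As referenced above, here is one possible shape for such a pipeline (a minimal sketch, not a complete solution; in current scikit-learn releases the analyzer='char_wb' option of TfidfVectorizer plays the role of CharNGramAnalyzer, and the Perceptron classifier is one reasonable choice):

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.linear_model import Perceptron
>>> from sklearn.pipeline import Pipeline
>>> lang_clf = Pipeline([
...     ('vect', TfidfVectorizer(analyzer='char_wb', ngram_range=(1, 3))),
...     ('clf', Perceptron()),
... ])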
Exercise 2: Sentiment analysis on movie reviews
- Write a text classification pipeline to classify movie reviews as positive or negative.
- Find a good set of parameters using grid search.
- Evaluate the performance on a held-out test set.
ipython command line:
%run workspace/exercise_02_sentiment.py data/movie_reviews/txt_sentoken/
Exercise 3: A command-line text classification utility
Using the results of the previous exercises and the cPickle module of the standard library, write a command-line utility that detects the language of text passed on stdin and, if the text is written in English, estimates its polarity (positive or negative).
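A minimal sketch of how such a utility might be structured (the file name model.pkl and the variable text_clf are hypothetical; cPickle is the Python 2 name of the module, called pickle in Python 3):

# Step 1: persist a fitted pipeline to disk
# (text_clf: a pipeline trained as in the exercises above).
import cPickle
with open('model.pkl', 'wb') as f:
    cPickle.dump(text_clf, f)

# Step 2: in the command-line utility, load the model and classify stdin.
import sys
import cPickle
with open('model.pkl', 'rb') as f:
    clf = cPickle.load(f)
text = sys.stdin.read()
print(clf.predict([text])[0])  # prints the predicted class for the input text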
What's next?
This section offers a few tips to help you learn more about scikit-learn once you have completed the exercises: