This article is a translation of a chapter on working with text data from the official scikit-learn documentation. You can read the beginning of the article in part 1.

Training a classifier
Now that we have extracted the features, we can train a classifier to predict the category of a text. Let's start with the Naive Bayes classifier, which is an excellent starting point for this task. scikit-learn includes several variants of this classifier; the one best suited to word counts is the multinomial version:
>>> from sklearn.naive_bayes import MultinomialNB
>>> clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
To try to predict the outcome on a new document, we need to extract the features using almost the same sequence as before. The difference is that we call the transform method on the transformers instead of fit_transform, since they have already been fitted to our training set:
>>> docs_new = ['God is love', 'OpenGL on the GPU is fast']
>>> X_new_counts = count_vect.transform(docs_new)
>>> X_new_tfidf = tfidf_transformer.transform(X_new_counts)
>>> predicted = clf.predict(X_new_tfidf)
>>> for doc, category in zip(docs_new, predicted):
...     print('%r => %s' % (doc, twenty_train.target_names[category]))
...
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
Building a pipeline
To make the vectorizer => transformer => classifier chain easier to work with, scikit-learn provides the Pipeline class, which behaves like a compound (pipeline) classifier:
>>> from sklearn.pipeline import Pipeline
>>> text_clf = Pipeline([('vect', CountVectorizer()),
...                      ('tfidf', TfidfTransformer()),
...                      ('clf', MultinomialNB()),
... ])
The names vect, tfidf, and clf (classifier) are chosen arbitrarily by us. We will see how to use them below, in the section on grid search. Now we can train the model with a single command:
>>> text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
Evaluating performance on the test set
Evaluating the predictive accuracy of the model is quite simple:
>>> import numpy as np
>>> twenty_test = fetch_20newsgroups(subset='test',
...     categories=categories, shuffle=True, random_state=42)
>>> docs_test = twenty_test.data
>>> predicted = text_clf.predict(docs_test)
>>> np.mean(predicted == twenty_test.target)
0.834...
We achieved 83.4% accuracy. Let's see whether we can improve on this result with a linear support vector machine (SVM), which is widely regarded as one of the best text classification algorithms (although it is somewhat slower than naive Bayes). We can switch to a different learning model simply by plugging a different classification object into our pipeline:
>>> from sklearn.linear_model import SGDClassifier
>>> text_clf = Pipeline([('vect', CountVectorizer()),
...                      ('tfidf', TfidfTransformer()),
...                      ('clf', SGDClassifier(loss='hinge', penalty='l2',
...                                            alpha=1e-3, n_iter=5,
...                                            random_state=42)),
... ])
>>> _ = text_clf.fit(twenty_train.data, twenty_train.target)
>>> predicted = text_clf.predict(docs_test)
>>> np.mean(predicted == twenty_test.target)
0.912...
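Note that the n_iter parameter reflects the scikit-learn API current when this tutorial was written; in releases 0.19 and later that parameter of SGDClassifier was renamed to max_iter. Under a newer version the classifier step would look roughly like this (a sketch, assuming scikit-learn >= 0.19):

>>> from sklearn.linear_model import SGDClassifier
>>> clf = SGDClassifier(loss='hinge', penalty='l2',
...                     alpha=1e-3, max_iter=5, random_state=42)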
scikit-learn also provides utilities for more detailed analysis of the results:
>>> from sklearn import metrics
>>> print(metrics.classification_report(twenty_test.target, predicted,
...     target_names=twenty_test.target_names))
...
                        precision    recall  f1-score   support

           alt.atheism       0.95      0.81      0.87       319
         comp.graphics       0.88      0.97      0.92       389
               sci.med       0.94      0.90      0.92       396
soc.religion.christian       0.90      0.95      0.93       398

           avg / total       0.92      0.91      0.91      1502

>>> metrics.confusion_matrix(twenty_test.target, predicted)
array([[258,  11,  15,  35],
       [  4, 379,   3,   3],
       [  5,  33, 355,   3],
       [  5,  10,   4, 379]])
As expected, the confusion matrix shows that posts from the atheism and Christianity newsgroups are confused with each other more often than with posts about computer graphics.
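To make the rows of the matrix easier to read, we can print each one next to its category name (a small illustrative snippet using the variables from the examples above):

>>> for name, row in zip(twenty_test.target_names,
...                      metrics.confusion_matrix(twenty_test.target, predicted)):
...     print('%25s %s' % (name, row))
...
              alt.atheism [258  11  15  35]
            comp.graphics [  4 379   3   3]
                  sci.med [  5  33 355   3]
   soc.religion.christian [  5  10   4 379]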
Parameter tuning using grid search
We have already come across some of the parameters, such as use_idf in the TfidfTransformer. Classifiers typically have many parameters as well; for example, MultinomialNB includes a smoothing parameter alpha, and SGDClassifier has a penalty parameter alpha and configurable loss and penalty terms in the objective function (see the documentation or use Python's help function for more details).
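The <step name>__<parameter> naming convention used below comes from the pipeline step names we chose earlier; every tunable parameter of the pipeline can be listed with get_params (output abbreviated here):

>>> sorted(text_clf.get_params().keys())
['clf', 'clf__alpha', ..., 'tfidf__use_idf', ..., 'vect__ngram_range', ...]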
Instead of tweaking the parameters of the various components of the chain by hand, we can run an exhaustive search for the best parameters over a grid of possible values. Here we try out all classifiers on either words or bigrams, with or without idf, and with a penalty parameter of either 0.01 or 0.001 for the linear SVM:
>>> from sklearn.grid_search import GridSearchCV
>>> parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
...               'tfidf__use_idf': (True, False),
...               'clf__alpha': (1e-2, 1e-3),
... }
Obviously, such an exhaustive search can be resource-intensive. If we have several CPU cores at our disposal, we can tell the grid search to try the parameter combinations in parallel by means of the n_jobs parameter. If we set this parameter to -1, the grid search will detect how many cores are installed and use them all:
>>> gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
A grid search instance behaves like a normal scikit-learn model. Let's run the search on a smaller subset of the training data to speed things up:
>>> gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])
As a result of calling fit on the GridSearchCV object, we get a classifier that we can use to run predict:
>>> twenty_train.target_names[gs_clf.predict(['God is love'])[0]]
'soc.religion.christian'
On the other hand, it is a very large and bulky object. We can obtain the optimal parameters by examining the object's grid_scores_ attribute, which is a list of parameters/score pairs. To get the best score and the parameters that produced it:
>>> best_parameters, score, _ = max(gs_clf.grid_scores_, key=lambda x: x[1])
>>> for param_name in sorted(parameters.keys()):
...     print("%s: %r" % (param_name, best_parameters[param_name]))
...
clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)
>>> score
0.900...
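In scikit-learn 0.18 and later, GridSearchCV lives in sklearn.model_selection, and grid_scores_ was replaced by the cv_results_ attribute together with best_params_ and best_score_, so no max() trick is needed. The equivalent of the snippet above would look roughly like this under a newer version:

>>> from sklearn.model_selection import GridSearchCV  # replaces sklearn.grid_search
>>> gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
>>> gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])
>>> for param_name in sorted(parameters.keys()):
...     print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))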
Exercises
To complete the exercises, copy the contents of the 'skeletons' folder into a new folder called 'workspace':
% cp -r skeletons workspace
You can edit the contents of the workspace folder without fear of losing the original instructions for the exercises.
Then open the ipython shell and run the incomplete script for the exercise:
In [1]: %run workspace/exercise_XX_script.py arg1 arg2 arg3
If an exception is thrown, use %debug to start a post-mortem ipdb session.
Refine the implementation and iterate until the exercise is solved.
In each exercise, the skeleton files contain all the necessary import statements, boilerplate code for loading the data, and sample code for evaluating the predictive accuracy of the model.

Exercise 1: Language identification
- Write a text classification pipeline using a custom preprocessor and CharNGramAnalyzer, with Wikipedia articles as the training set (a minimal sketch of such a pipeline follows the command line below).
- Evaluate the performance on a held-out test set that does not overlap with the training set.
ipython command line:
%run workspace/exercise_01_language_train_model.py data/languages/paragraphs/
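As referenced above, here is one possible shape for such a pipeline (a minimal sketch, not a complete solution; in current scikit-learn releases the analyzer='char_wb' option of TfidfVectorizer plays the role of CharNGramAnalyzer, and the Perceptron classifier is one reasonable choice):

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.linear_model import Perceptron
>>> from sklearn.pipeline import Pipeline
>>> lang_clf = Pipeline([
...     ('vect', TfidfVectorizer(analyzer='char_wb', ngram_range=(1, 3))),
...     ('clf', Perceptron()),
... ])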
Exercise 2: Sentiment analysis on movie reviews
- Write a text classification pipeline to classify movie reviews as positive or negative.
- Find a good set of parameters using grid search.
- Evaluate the performance on a held-out test set.
ipython command line:
%run workspace/exercise_02_sentiment.py data/movie_reviews/txt_sentoken/
Exercise 3: A command-line text classification utility
Using the results of the previous exercises and the cPickle module of the standard library, write a command-line utility that detects the language of text passed on stdin and, if the text is written in English, estimates its polarity (positive or negative).
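A minimal sketch of how such a utility might be structured (the file name model.pkl and the variable text_clf are hypothetical; cPickle is the Python 2 name of the module, called pickle in Python 3):

# Step 1: persist a fitted pipeline to disk
# (text_clf: a pipeline trained as in the exercises above).
import cPickle
with open('model.pkl', 'wb') as f:
    cPickle.dump(text_clf, f)

# Step 2: in the command-line utility, load the model and classify stdin.
import sys
import cPickle
with open('model.pkl', 'rb') as f:
    clf = cPickle.load(f)
text = sys.stdin.read()
print(clf.predict([text])[0])  # prints the predicted class for the input text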
What's next?
This section offers a few tips to help you learn more about scikit-learn once you have completed the exercises: