This article is a translation of the chapter on working with text data from the official scikit-learn documentation.
The purpose of this chapter is to explore some of the most important scikit-learn tools on one particular task: analyzing a collection of text documents (newsgroup posts) on 20 different topics.
In this chapter, we will look at how:
- load the file contents and the categories
- extract feature vectors suitable for machine learning
- train a linear model to perform categorization
- use a grid search strategy to find a good configuration for both the feature extraction and the classifier
Installation Instructions
To begin the practice session described in this chapter, you must have scikit-learn installed together with all of its dependencies (NumPy, SciPy).
For installation instructions and recommendations for different operating systems, refer to the installation instructions page.
You can find a local copy of this tutorial in the folder:
scikit-learn/doc/tutorial/text_analytics/
Note that scikit-learn does not ship with the doc/ folder and its contents; you can download them from GitHub.
The learning examples folder should contain the following files and folders:
- *.rst files - the source of the tutorial documents, processed with sphinx
- data - folder to store the datasets used during the tutorial
- skeletons - samples of incomplete exercise scripts
- solutions - exercise solutions
You can also copy the skeletons into a new folder anywhere on your hard drive, named for example sklearn_tut_workspace, and edit your own exercise files there; this way the original skeletons remain unchanged.
Machine learning algorithms need data. Go to each $TUTORIAL_HOME/data subfolder and run the fetch_data.py script from there (after reading it first).
For example:
% cd $TUTORIAL_HOME/data/languages
% less fetch_data.py
% python fetch_data.py
Loading the 20 newsgroups dataset
The dataset is called “Twenty Newsgroups”. Here is its official description, taken from the website:
The “20 Newsgroups” dataset is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews”, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular dataset for experiments with machine learning techniques on text, such as text classification and text clustering.
Next, we will use the built-in dataset loader in scikit-learn to fetch “The 20 newsgroups”. Alternatively, the dataset can be downloaded manually from the website; in that case, use the sklearn.datasets.load_files function and point it to the “20news-bydate-train” folder of the unpacked archive.
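As a hedged illustration of this alternative (the local folder name and the encoding below are assumptions about how the archive was unpacked), manual loading could look roughly like this:
>>> from sklearn.datasets import load_files
>>> # hypothetical path to the unpacked "20news-bydate-train" folder
>>> twenty_train_manual = load_files('20news-bydate-train',
...                                  encoding='latin-1', shuffle=True,
...                                  random_state=42)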
To make this first example run faster, we will work on a partial dataset with only 4 of the 20 available categories:
>>> categories = ['alt.atheism', 'soc.religion.christian',
...               'comp.graphics', 'sci.med']
We can load the list of files matching the desired categories as follows:
>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty_train = fetch_20newsgroups(subset='train',
...     categories=categories, shuffle=True, random_state=42)
The returned dataset is a scikit-learn “bunch”: a simple holder object with fields that can be accessed both as Python dict keys and as object attributes. For example, target_names holds the list of the requested category names:
>>> twenty_train.target_names
['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
The files themselves are loaded into memory as the data attribute. You can also refer to the file names:
>>> len(twenty_train.data)
2257
>>> len(twenty_train.filenames)
2257
Let's print the first lines of the first file loaded:
>>> print("\n".join(twenty_train.data[0].split("\n")[:3])) From: sd345@city.ac.uk (Michael Collier) Subject: Converting images to HP LaserJet III? Nntp-Posting-Host: hampton >>> print(twenty_train.target_names[twenty_train.target[0]]) comp.graphics
Supervised learning algorithms require a category label for each document in the training set. In our case the category is the name of the newsgroup, which also happens to be the name of the folder holding the individual documents.
For speed and memory efficiency, scikit-learn loads the target attribute as an array of integers that correspond to the indices of the category names in the target_names list. The category index of each sample is stored in the target attribute:
>>> twenty_train.target[:10]
array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])
You can get back the category names as follows:
>>> for t in twenty_train.target[:10]:
...     print(twenty_train.target_names[t])
...
comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med
You may have noticed that the samples were shuffled randomly (with a fixed RNG seed). This is useful if you want to use only the first samples to quickly train a model and get a first impression of the results before re-training on the complete dataset later.
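For example, a quick first pass could restrict itself to the first few hundred shuffled samples (the cutoff of 400 below is arbitrary, chosen only for illustration):
>>> # take a small slice of the shuffled training data for quick experiments
>>> small_data = twenty_train.data[:400]
>>> small_target = twenty_train.target[:400]
>>> len(small_data)
400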
Extracting features from text files
To perform machine learning on text documents, we first need to turn the text content into numerical feature vectors.
"Bag of words" (set of words)
The most intuitive way to do this transformation is to represent the text as a bag of words:
- assign a fixed integer index to each word occurring in any document of the training set (for instance, by building a dictionary from words to integer indices);
- for each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j, where j is the index of word w in the dictionary (a minimal sketch of this idea follows the list).
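To make the two steps above concrete, here is a minimal pure-Python sketch on a tiny made-up corpus (not the tutorial data); scikit-learn's vectorizers, shown later, do the same thing far more efficiently:
>>> docs = ["the cat sat", "the cat sat on the mat"]
>>> vocabulary = {}                              # word -> integer index
>>> for doc in docs:
...     for word in doc.split():
...         vocabulary.setdefault(word, len(vocabulary))
...
>>> X = [[0] * len(vocabulary) for _ in docs]    # one row of counts per document
>>> for i, doc in enumerate(docs):
...     for word in doc.split():
...         X[i][vocabulary[word]] += 1
...
>>> vocabulary
{'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
>>> X
[[1, 1, 1, 0, 0], [2, 1, 1, 1, 1]]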
The “bag of words” representation implies that n_features is the number of distinct words in the corpus; this number typically exceeds 100,000.
If n_samples == 10000, then X, stored as a numpy array of type float32, would require 10000 x 100000 x 4 bytes = 4 GB of RAM, which is barely manageable on today's computers.
Fortunately, most values in X are zeros, since a given document uses only a small fraction of the distinct words in the corpus. For this reason the “bag of words” is typically a high-dimensional sparse dataset. We can save a lot of RAM by storing only the non-zero parts of the feature vectors.
The scipy.sparse matrices are data structures that do exactly this, and scikit-learn has built-in support for them.
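As a small illustration of the idea (a toy matrix, unrelated to the tutorial data), a scipy.sparse matrix stores only the non-zero entries while still behaving like a regular matrix:
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> dense = np.array([[0, 2, 0, 0],
...                   [1, 0, 0, 3]])
>>> sparse = csr_matrix(dense)
>>> sparse.nnz            # only the 3 non-zero values are actually stored
3
>>> sparse.toarray()      # the full matrix can still be reconstructed on demand
array([[0, 2, 0, 0],
       [1, 0, 0, 3]])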
Tokenize text with scikit-learn
Text preprocessing, tokenization and filtering of stop words are all included in a high-level component that builds a dictionary of features and transforms documents into feature vectors:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count_vect = CountVectorizer()
>>> X_train_counts = count_vect.fit_transform(twenty_train.data)
>>> X_train_counts.shape
(2257, 35788)
CountVectorizer supports counts of N-grams of words or of consecutive characters. Once fitted, the vectorizer has built a dictionary of feature indices:
>>> count_vect.vocabulary_.get(u'algorithm')
4690
The index value of a word in the dictionary is linked to its frequency in the whole training corpus.
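As a hedged sketch of the N-gram support mentioned above (using a toy one-sentence corpus, not the tutorial data), a vectorizer configured with ngram_range=(1, 2) indexes both single words and word pairs:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> bigram_vect = CountVectorizer(ngram_range=(1, 2)).fit(["the cat sat on the mat"])
>>> sorted(bigram_vect.vocabulary_)   # unigrams and bigrams share one dictionary
['cat', 'cat sat', 'mat', 'on', 'on the', 'sat', 'sat on', 'the', 'the cat', 'the mat']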
From occurrences to frequencies
Counting word occurrences is a good start, but there is a problem: longer documents will have higher average counts than shorter ones, even if they talk about the same topic.
To avoid these potential discrepancies, it suffices to divide the number of occurrences of each word in a document by the total number of words in that document. This new feature is called tf, for Term Frequency.
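As a hand-rolled illustration of this idea on a toy count matrix (note that the TfidfTransformer used below normalizes somewhat differently, applying an L2 norm by default rather than a plain division by document length):
>>> import numpy as np
>>> counts = np.array([[3, 0, 1],
...                    [2, 2, 0]], dtype=np.float64)
>>> tf = counts / counts.sum(axis=1, keepdims=True)   # divide each row by its total word count
>>> tf
array([[0.75, 0.  , 0.25],
       [0.5 , 0.5 , 0.  ]])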
A further refinement of the tf measure is to downscale the weight of words that appear in many documents of the corpus and are therefore less informative than those used only in a small part of the corpus. Typical examples of such less informative words are function words: articles, prepositions, conjunctions, and so on.
This downscaling is called tf–idf, which stands for “Term Frequency times Inverse Document Frequency”.
Both measures, tf and tf–idf, can be computed as follows:
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
>>> X_train_tf = tf_transformer.transform(X_train_counts)
>>> X_train_tf.shape
(2257, 35788)
In the example code above, we first use the fit(..) method to fit our estimator to the data, and then the transform(..) method to transform the count matrix into a tf-idf representation. These two steps can be combined to produce the same result faster, by skipping redundant processing: use the fit_transform(..) method, as shown below:
>>> tfidf_transformer = TfidfTransformer()
>>> X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
>>> X_train_tfidf.shape
(2257, 35788)
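As a quick sanity check (a sketch reusing the variables defined above), the two-step and one-step paths produce the same matrix:
>>> X_two_step = TfidfTransformer().fit(X_train_counts).transform(X_train_counts)
>>> float(abs(X_train_tfidf - X_two_step).max())   # largest absolute difference between the two results
0.0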
...
The tutorial continues in part 2.