Introduction
Today I will continue the story about applying data analysis and machine learning methods to practical examples. In the previous article we dealt with a credit scoring task; below I will walk through the solution of another problem from the same tournament, namely "Passport Tasks" (Task No. 2).
The solution demonstrates the basics of analyzing textual information and encoding it for model building using Python and the data analysis modules pandas, scikit-learn and pymorphy2.
Formulation of the problem
When working with large amounts of data, it is important to keep them clean. When filling out an application for a banking product, the client must provide full passport details, including the "passport issued by" field, and the number of different spellings of the same issuing department across potential clients can reach several hundred. It is also important to catch mistakes in the related fields: "division code" and "passport number / series". To do that, the "division code" needs to be checked against the "passport issued by" field.
The task is to fill in the division codes for the records of the test sample, based on the training sample.
Preliminary data processing
Load the data and see what we have:
from pandas import read_csv
import pymorphy2
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.decomposition import PCA

train = read_csv('https://static.tcsbank.ru/documents/olymp/passport_training_set.csv',
                 ';', index_col='id', encoding='cp1251')
train.head(5)
| id | passport_div_code | passport_issuer_name | passport_issue_month/year |
|---|---|---|---|
| 1 | 422008 | BELOVSKY ATC KEMEROVSK REGION | 11M2001 |
| 2 | 500112 | TP №2 V GOR. OREKHOVO-ZUYEVO OUFMS RUSSIA IN MO ... | 03M2009 |
| 3 | 642001 | VOLGA ROVD GOR.SARATOV | 04M2002 |
| 4 | 162004 | ATC MOSCOW DISTRICT KAZAN | 12M2002 |
| 5 | 80001 | THE DEPARTMENT OF OF RUSSIAN FEDERATION BY RESP. KALMYKIA IN G ELIST | 08M2009 |
Now let's look at how users fill in the "passport issued by" field, using one department as an example:
example_code = train.passport_div_code[train.passport_div_code.duplicated()].values[0]
for i in train.passport_issuer_name[train.passport_div_code == example_code].drop_duplicates():
    print i
DEPARTMENT OF UFMS OF RUSSIA IN THE REPUBLIC OF KARELIA IN BEAR. -
THE DEPARTMENT OF THE UFMS OF RUSSIA ACCORDING TO R. KARELIA IN MEDVEZHEGORSK REGION
DEPARTMENT OF UFMS OF RUSSIA IN RESP KARELIA IN MEDVEZHEGORSK reg.
DEPARTMENT OF UFMS OF RUSSIA IN THE REPUBLIC OF KARELIA IN MEDVEZHEGORSK DISTRICT
OUFMS RUSSIA IN THE REPUBLIC OF KARELIA IN MEDVEZHEGORSK DISTRICT
UFMS of Russia in Kazakhstan in Medvezhiegorsky district
THE DEPARTMENT OF THE UFMS OF RUSSIA IN THE REPUBLIC OF KARELIA BEARS OF MEDVEZHEGORSK R-ONE
DEPARTMENT OF UFMS OF RUSSIA IN RK IN MEDVEZHEGORSK DISTRICT
THE DEPARTMENT OF THE UFMS OF RUSSIA IN THE REPUBLIC OF KORELIA IN MEDVAJIGOR DISTRICT
UFMS OF RUSSIA ACROSS KARELIA OF MEDVEJEGORSKA REGION
THE DEPARTMENT OF UFMS OF RUSSIA IN THE REPUBLIC OF KARELIA IN MEDWEZHORGIA
UFMS REPUBLIC OF KARELIA BEARSHIP R-ON
BEAR OF LAW
As you can see, the field really is filled in inconsistently. Before encoding, we need to bring it to a more or less uniform (unambiguous) form.
To begin with, let's bring all the entries to the same case, so that all letters become lowercase. This is easily done using the str attribute of a DataFrame column: it lets you work with the column as with strings and perform various kinds of search and replace, including with regular expressions:
train.passport_issuer_name = train.passport_issuer_name.str.lower()
train[train.passport_div_code == example_code].head(5)
| id | passport_div_code | passport_issuer_name | passport_issue_month/year |
|---|---|---|---|
| 19 | 100010 | Department of the UFMS of Russia in the Republic of Karelia in ... | 04M2008 |
| 22 | 100010 | Branch Ufms Russia on the river. Karelia in the bear ... | 10M2009 |
| 5642 | 100010 | Department of the Ufms of Russia in Karelia in Medve ... | 08M2008 |
| 6668 | 100010 | Department of the UFMS of Russia in the Republic of Karelia in ... | 08M2011 |
| 8732 | 100010 | Department of the UFMS of Russia in the Republic of Karelia in ... | 08M2012 |
With the case settled, the next step is to get rid of common abbreviations such as "district", "city" and the like. We do this with regular expressions, which pandas conveniently lets us apply to a whole column. It looks like this:
train.passport_issuer_name = train.passport_issuer_name.str.replace(u'-(||||)*', u'')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' ( |\.|((\.| )))', u' ')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' (\.| )', u' ')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' ([-]*)(\.)?', u' ')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' (\.| |( )?)', u' ')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' ', u' ')
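The Cyrillic regex literals in the snippet above were lost when the article was converted, which is why the patterns look empty. Purely as an illustration of the idea (these patterns are my guesses at typical abbreviations such as "р-он" (district), "гор." (city) and "обл." (region), not the author's original expressions), the replacements might look like this:

# illustrative patterns only; on newer pandas pass regex=True explicitly
for pattern in [u' р-(он|н)\.? ', u' гор\.? ', u' обл\.? ']:
    train.passport_issuer_name = train.passport_issuer_name.str.replace(pattern, u' ')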
Next we get rid of all characters other than Russian letters, hyphens and spaces. The reason is that passports for the same division can be issued by departments with different numbers in their names, and keeping those numbers would hurt the subsequent encoding:
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' - ?', u'-')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u'[^- -]', '')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u'- ', ' ')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' *', ' ')
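Here, too, the Cyrillic character class did not survive the conversion. A minimal sketch of this cleanup, assuming the goal stated above (keep only Russian letters, hyphens and spaces, then collapse repeated whitespace):

# keep Cyrillic letters, hyphens and spaces, drop everything else,
# then collapse runs of spaces (pass regex=True on newer pandas)
train.passport_issuer_name = (
    train.passport_issuer_name
         .str.replace(u'[^а-яё\- ]', u'')
         .str.replace(u' +', u' ')
         .str.strip()
)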
In the next step we need to expand abbreviations such as ATC, UFNS, CAD, HLW and the like. There are not many of them, and expanding them has a positive effect on the quality of the subsequent encoding. For example, if one record contains the abbreviation "ATC" and another the full phrase "management of internal affairs", they will be encoded differently, because for the computer these are different values.
So let's move on to the expansion. To begin with, we build a dictionary of abbreviations, with the help of which we will do the replacement:
sokr = {u'': u' ', u'': u'- ', u'': u' ', u'': u'- ', u'': u' ', u'': u' ', u'': u' ', u'c': u' ', u'': u' ', u'': u'- ', u'': u'- ', u'': u'- ', u'': u'- ', u'': u' ', u'': u' ', u'': u' ', u'': u' ', u'': u' ', u'': u' ', u'': u' ', u'': u' ', u'': u' ', u'': u' ', u'': u' ', u'': u' '}
Now we actually expand the abbreviations and tidy up the resulting records:
for i in sokr.iterkeys():
    train.passport_issuer_name = train.passport_issuer_name.str.replace(
        u'( %s )|(^%s)|(%s$)' % (i, i, i), u' %s ' % (sokr[i]))
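The Cyrillic keys and values of the sokr dictionary were likewise lost in conversion. Purely as an illustration (these entries are my guesses at common department-name abbreviations, not the author's original dictionary), it could look like this:

# illustrative reconstruction of the abbreviation dictionary
sokr = {
    u'увд': u'управление внутренних дел',
    u'овд': u'отдел внутренних дел',
    u'ровд': u'районный отдел внутренних дел',
    u'уфмс': u'управление федеральной миграционной службы',
    u'тп': u'территориальный пункт',
}
for i in sokr.iterkeys():
    train.passport_issuer_name = train.passport_issuer_name.str.replace(
        u'( %s )|(^%s)|(%s$)' % (i, i, i), u' %s ' % sokr[i])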
This finishes the preliminary processing of the "passport issued by" field; now let's move on to the field that stores the issue date.
As you can see, the data in it is stored in the form monthMyear (for example, 11M2001). We could simply strip the letter "M" and convert the field to a numeric type, but on reflection the field can be dropped altogether: in any given month of a year passports are issued by many different divisions, so it adds little and may even hurt the model. Based on this, we remove it from the sample:
train = train.drop(['passport_issue_month/year'], axis=1)
Now we can proceed to data analysis.
Data analysis
So, we have the data for building a model, but it is in text form. To build a model, we need to encode it numerically.
The authors of the scikit-learn package have thoughtfully provided several ways to extract and encode text data. The ones most relevant here are:
- FeatureHasher
- CountVectorizer
- HashingVectorizer
FeatureHasher converts a string into a numeric array of a specified length using a hash function (32-bit MurmurHash3).
CountVectorizer converts the input text into a matrix whose values are the numbers of occurrences of a given key (word) in the text. Unlike FeatureHasher, it has more configurable parameters (for example, you can set a tokenizer), but it works slower.
For a clearer picture of how CountVectorizer works, let's take a simple example. Suppose we have a table with text values:
| Value |
|---|
| one two three |
| three four two two |
| one one one four |
To begin with, CountVectorizer collects the unique keys from all the records; in our example these are:
[one, two, three, four]
The length of this list of unique keys is the length of the encoded vector (in our case, 4), and each element holds the number of times the corresponding key occurs in the given line:
one two three -> [1,1,1,0]
three four two two -> [0,2,1,1]
Accordingly, after applying this method we get:
| Value |
|---|
| 1,1,1,0 |
| 0,2,1,1 |
| 3,0,0,1 |
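This toy example is easy to check with scikit-learn itself (note that the real CountVectorizer orders its vocabulary alphabetically, so the column order differs from the illustration above):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['one two three', 'three four two two', 'one one one four']
cv = CountVectorizer()
counts = cv.fit_transform(docs).toarray()
print cv.get_feature_names()  # vocabulary collected from all records (get_feature_names_out in newer versions)
print counts                  # per-document occurrence counts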
HashingVectorizer is a mix of the two methods described above: you can set the size of the encoded vector (as in FeatureHasher) and configure the tokenizer (as in CountVectorizer), while its performance is closer to FeatureHasher's.
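For comparison, a minimal sketch of the same toy documents passed through HashingVectorizer (the width of 8 columns is an arbitrary choice for illustration):

from sklearn.feature_extraction.text import HashingVectorizer

docs = ['one two three', 'three four two two', 'one one one four']
hv = HashingVectorizer(n_features=8)  # fixed output width, no vocabulary is stored
print hv.transform(docs).toarray()    # rows are l2-normalized by default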
So, back to the analysis. If we take a closer look at our data set, we can see that there are lines which are essentially the same but written in different grammatical forms, for example: "... republics and Karelia ..." versus "... according to the republic, Karelia ...". If we applied one of the encoding methods right now, such very similar records would end up encoded differently. Such cases can be minimized if all the words in a record are reduced to their normal form.
pymorphy2 or nltk works well for this task. I will use the former, since it was originally created for working with the Russian language. So, the function responsible for normalizing and cleaning a line looks like this:
def f_tokenizer(s):
    morph = pymorphy2.MorphAnalyzer()
    if type(s) == unicode:
        t = s.split(' ')
    else:
        t = s
    f = []
    for j in t:
        m = morph.parse(j.replace('.', ''))
        if len(m) != 0:
            wrd = m[0]
            # skip numerals, prepositions, conjunctions, particles and interjections
            if wrd.tag.POS not in ('NUMR', 'PREP', 'CONJ', 'PRCL', 'INTJ'):
                f.append(wrd.normal_form)
    return f
The function does the following:
- First it splits the string into a list of words.
- Then it runs each word through the morphological analyzer.
- If a word is a numeral, preposition, conjunction, particle or interjection, it is not included in the final set.
- Otherwise its normal form is taken and added to the final set.
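As a quick sanity check, the tokenizer can be applied to one of the cleaned records (the exact output depends on the pymorphy2 dictionaries installed):

sample = train.passport_issuer_name.iloc[0]
print f_tokenizer(sample)  # list of normalized words for the first record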
Now that we have a normalization function, we can proceed to the encoding. HashingVectorizer is used here because we can pass our function to it as a tokenizer, so the keys will be built from the values produced by that function:
coder = HashingVectorizer(tokenizer=f_tokenizer, n_features=256)
As you can see, when creating the encoder we set, besides the tokenizer, one more parameter, n_features. It specifies the length of the encoded vector (in our case each record is encoded with 256 columns). In addition, HashingVectorizer has another advantage over CountVectorizer: it can normalize the values right away, which is good for algorithms like SVM.
Now apply our encoder to the training set:
TrainNotDuble = train.drop_duplicates()
trn = coder.fit_transform(TrainNotDuble.passport_issuer_name.tolist()).toarray()
Model building
First we extract the values of the column that will serve as the class labels:
target = TrainNotDuble.passport_div_code.values
The task we are solving today is a multi-class classification problem. Of the algorithms tried, RandomForest turned out to be the best suited; the others showed much weaker results (less than 50%), so I decided not to take up space in the article with them. Anyone interested can check those results themselves.
To assess the quality of the classification we will use the share of documents for which the correct decision was made:

Accuracy = P / N,

where P is the number of documents on which the classifier made the right decision, and N is the size of the training sample.
The scikit-learn package has a ready-made function for this: accuracy_score.
Before building the model itself, let's reduce the dimensionality using principal component analysis, since 256 feature columns is quite a lot:
pca = PCA(n_components=15)
trn = pca.fit_transform(trn)
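If you want to see how much information the 15 components retain (15 is a choice that can be tuned), check the explained variance:

# share of the total variance kept by the 15 principal components
print pca.explained_variance_ratio_.sum()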
The model will look like this:
model = RandomForestClassifier(n_estimators=100, criterion='entropy')
TRNtrain, TRNtest, TARtrain, TARtest = train_test_split(trn, target, test_size=0.4)
model.fit(TRNtrain, TARtrain)
print 'accuracy_score: ', accuracy_score(TARtest, model.predict(TRNtest))
accuracy_score: 0.6523456
Conclusion
In conclusion, it should be noted that the resulting accuracy of 65% is close to guesswork. To improve it, the preliminary processing should also handle grammatical errors and the various kinds of typos. This would also have a positive effect on the dictionary built during encoding: its size would shrink and, accordingly, so would the length of the encoded vectors.
In addition, the step of applying the model to the test sample was deliberately omitted, since there is nothing special about it beyond bringing the test data to the required form (this is easily done by repeating the transformations applied to the training sample).
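For completeness, a minimal sketch of what that step might look like; the file name here is an assumption on my part, and the cleaning steps from the preliminary processing section must be repeated on the test data before encoding:

# assumption: the test sample has the same structure as the training one
test = read_csv('passport_test_set.csv', ';', index_col='id', encoding='cp1251')
# ... repeat the lower-casing, regex cleanup and abbreviation expansion here ...
tst = coder.transform(test.passport_issuer_name.tolist()).toarray()
tst = pca.transform(tst)                        # project with the PCA fitted on the training set
test['passport_div_code'] = model.predict(tst)  # predicted division codes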
In this article I tried to show a minimal list of steps for processing textual information before feeding it to machine learning algorithms. I hope it will be useful to those taking their first steps in data analysis.
UPD: an IPython Notebook with the solution: TKCTask2Answer.ipynb