Introduction
Today I will continue the story about applying data analysis and machine learning methods to practical examples. In the previous article we dealt with a credit scoring task; below I will walk through the solution of another problem from the same tournament, namely "Passport Tasks" (Task No. 2).
The solution demonstrates the basics of analyzing textual information and encoding it for model building using Python and the data analysis modules pandas, scikit-learn and pymorphy2.
Formulation of the problem
When working with large amounts of data, it is important to keep them clean. When filling out an application for a banking product, the client must provide full passport details, including the "passport issued by" field, and the number of different spellings of the same issuing department across potential clients can reach several hundred. It is also important to catch mistakes in the related fields: "division code" and "passport number / series". To do that, the "division code" needs to be checked against the "passport issued by" field.
The task is to fill in the division codes for the records of the test sample, based on the training sample.
Preliminary data processing
Load the data and see what we have:
from pandas import read_csv
import pymorphy2
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.decomposition import PCA

train = read_csv('https://static.tcsbank.ru/documents/olymp/passport_training_set.csv',
                 ';', index_col='id', encoding='cp1251')
train.head(5)
| id | passport_div_code | passport_issuer_name | passport_issue_month/year |
|---|---|---|---|
| 1 | 422008 | BELOVSKY ATC KEMEROVSK REGION | 11M2001 |
| 2 | 500112 | TP №2 V GOR. OREKHOVO-ZUYEVO OUFMS RUSSIA IN MO ... | 03M2009 |
| 3 | 642001 | VOLGA ROVD GOR.SARATOV | 04M2002 |
| 4 | 162004 | ATC MOSCOW DISTRICT KAZAN | 12M2002 |
| 5 | 80001 | THE DEPARTMENT OF OF RUSSIAN FEDERATION BY RESP. KALMYKIA IN G ELIST | 08M2009 |
Now let's look at how users fill in the "passport issued by" field, using one department as an example:
example_code = train.passport_div_code[train.passport_div_code.duplicated()].values[0]
for i in train.passport_issuer_name[train.passport_div_code == example_code].drop_duplicates():
    print i
DEPARTMENT OF UFMS OF RUSSIA IN THE REPUBLIC OF KARELIA IN BEAR. -
THE DEPARTMENT OF THE UFMS OF RUSSIA ACCORDING TO R. KARELIA IN MEDVEZHEGORSK REGION
DEPARTMENT OF UFMS OF RUSSIA IN RESP KARELIA IN MEDVEZHEGORSK reg.
DEPARTMENT OF UFMS OF RUSSIA IN THE REPUBLIC OF KARELIA IN MEDVEZHEGORSK DISTRICT
OUFMS RUSSIA IN THE REPUBLIC OF KARELIA IN MEDVEZHEGORSK DISTRICT
UFMS of Russia in Kazakhstan in Medvezhiegorsky district
THE DEPARTMENT OF THE UFMS OF RUSSIA IN THE REPUBLIC OF KARELIA BEARS OF MEDVEZHEGORSK R-ONE
DEPARTMENT OF UFMS OF RUSSIA IN RK IN MEDVEZHEGORSK DISTRICT
THE DEPARTMENT OF THE UFMS OF RUSSIA IN THE REPUBLIC OF KORELIA IN MEDVAJIGOR DISTRICT
UFMS OF RUSSIA ACROSS KARELIA OF MEDVEJEGORSKA REGION
THE DEPARTMENT OF UFMS OF RUSSIA IN THE REPUBLIC OF KARELIA IN MEDWEZHORGIA
UFMS REPUBLIC OF KARELIA BEARSHIP R-ON
BEAR OF LAW
As you can see, the field really is filled in inconsistently. Before encoding, we need to bring it to a more or less uniform (unambiguous) form.
To begin with, let's bring all the entries to the same case, so that all letters become lowercase. This is easily done using the str attribute of a DataFrame column: it lets you work with the column as with strings and perform various kinds of search and replace, including with regular expressions:
train.passport_issuer_name = train.passport_issuer_name.str.lower()
train[train.passport_div_code == example_code].head(5)
| id | passport_div_code | passport_issuer_name | passport_issue_month/year |
|---|---|---|---|
| 19 | 100010 | Department of the UFMS of Russia in the Republic of Karelia in ... | 04M2008 |
| 22 | 100010 | Branch Ufms Russia on the river. Karelia in the bear ... | 10M2009 |
| 5642 | 100010 | Department of the Ufms of Russia in Karelia in Medve ... | 08M2008 |
| 6668 | 100010 | Department of the UFMS of Russia in the Republic of Karelia in ... | 08M2011 |
| 8732 | 100010 | Department of the UFMS of Russia in the Republic of Karelia in ... | 08M2012 |
With the case settled, the next step is to get rid of common abbreviations such as "district", "city" and the like. We do this with regular expressions, which pandas conveniently lets us apply to a whole column. It looks like this:
train.passport_issuer_name = train.passport_issuer_name.str.replace(u'-(||||)*', u'')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' ( |\.|((\.| )))', u' ')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' (\.| )', u' ')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' ([-]*)(\.)?', u' ')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' (\.| |( )?)', u' ')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' ', u' ')
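The Cyrillic regex literals in the snippet above were lost when the article was converted, which is why the patterns look empty. Purely as an illustration of the idea (these patterns are my guesses at typical abbreviations such as "р-он" (district), "гор." (city) and "обл." (region), not the author's original expressions), the replacements might look like this:

# illustrative patterns only; on newer pandas pass regex=True explicitly
for pattern in [u' р-(он|н)\.? ', u' гор\.? ', u' обл\.? ']:
    train.passport_issuer_name = train.passport_issuer_name.str.replace(pattern, u' ')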
Next we get rid of all characters other than Russian letters, hyphens and spaces. The reason is that passports for the same division can be issued by departments with different numbers in their names, and keeping those numbers would hurt the subsequent encoding:
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' - ?', u'-')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u'[^- -]', '')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u'- ', ' ')
train.passport_issuer_name = train.passport_issuer_name.str.replace(u' *', ' ')
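Here, too, the Cyrillic character class did not survive the conversion. A minimal sketch of this cleanup, assuming the goal stated above (keep only Russian letters, hyphens and spaces, then collapse repeated whitespace):

# keep Cyrillic letters, hyphens and spaces, drop everything else,
# then collapse runs of spaces (pass regex=True on newer pandas)
train.passport_issuer_name = (
    train.passport_issuer_name
         .str.replace(u'[^а-яё\- ]', u'')
         .str.replace(u' +', u' ')
         .str.strip()
)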
In the next step we need to expand abbreviations such as ATC, UFNS, CAD, HLW and the like. There are not many of them, and expanding them has a positive effect on the quality of the subsequent encoding. For example, if one record contains the abbreviation "ATC" and another the full phrase "management of internal affairs", they will be encoded differently, because for the computer these are different values.
So let's move on to the expansion. To begin with, we build a dictionary of abbreviations, with the help of which we will do the replacement:
sokr = {u'': u' ', u'': u'- ', u'': u' ', u'': u'- ', u'': u' ', u'': u' ', u'': u' ', u'c': u' ', u'': u' ', u'': u'- ', u'': u'- ', u'': u'- ', u'': u'- ', u'': u' ', u'': u' ', u'': u' ', u'': u' ', u'': u' ', u'': u' ', u'': u' ', u'': u' ', u'': u' ', u'': u' ', u'': u' ', u'': u' '}
Now we actually expand the abbreviations and tidy up the resulting records:
for i in sokr.iterkeys():
    train.passport_issuer_name = train.passport_issuer_name.str.replace(
        u'( %s )|(^%s)|(%s$)' % (i, i, i), u' %s ' % (sokr[i]))
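The Cyrillic keys and values of the sokr dictionary were likewise lost in conversion. Purely as an illustration (these entries are my guesses at common department-name abbreviations, not the author's original dictionary), it could look like this:

# illustrative reconstruction of the abbreviation dictionary
sokr = {
    u'увд': u'управление внутренних дел',
    u'овд': u'отдел внутренних дел',
    u'ровд': u'районный отдел внутренних дел',
    u'уфмс': u'управление федеральной миграционной службы',
    u'тп': u'территориальный пункт',
}
for i in sokr.iterkeys():
    train.passport_issuer_name = train.passport_issuer_name.str.replace(
        u'( %s )|(^%s)|(%s$)' % (i, i, i), u' %s ' % sokr[i])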
This finishes the preliminary processing of the "passport issued by" field; now let's move on to the field that stores the issue date.
As you can see, the data in it is stored in the form monthMyear (for example, 11M2001). We could simply strip the letter "M" and convert the field to a numeric type, but on reflection the field can be dropped altogether: in any given month of a year passports are issued by many different divisions, so it adds little and may even hurt the model. Based on this, we remove it from the sample:
train = train.drop(['passport_issue_month/year'], axis=1)
Now we can proceed to data analysis.
Data analysis
So, we have the data for building a model, but it is in text form. To build a model, we need to encode it numerically.
The authors of the scikit-learn package have thoughtfully provided several ways to extract and encode text data. The ones most relevant here are:
- FeatureHasher
- CountVectorizer
- HashingVectorizer
FeatureHasher converts a string into a numeric array of a specified length using a hash function (32-bit MurmurHash3).
CountVectorizer converts the input text into a matrix whose values are the numbers of occurrences of a given key (word) in the text. Unlike FeatureHasher, it has more configurable parameters (for example, you can set a tokenizer), but it works slower.
For a clearer picture of how CountVectorizer works, let's take a simple example. Suppose we have a table with text values:
| Value |
|---|
| one two three |
| three four two two |
| one one one four |
To begin with, CountVectorizer collects the unique keys from all the records; in our example these are:
[one, two, three, four]
The length of this list of unique keys is the length of the encoded vector (in our case, 4), and each element holds the number of times the corresponding key occurs in the given line:
one two three -> [1,1,1,0]
three four two two -> [0,2,1,1]
Accordingly, after applying this method we get:
| Value |
|---|
| 1,1,1,0 |
| 0,2,1,1 |
| 3,0,0,1 |
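This toy example is easy to check with scikit-learn itself (note that the real CountVectorizer orders its vocabulary alphabetically, so the column order differs from the illustration above):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['one two three', 'three four two two', 'one one one four']
cv = CountVectorizer()
counts = cv.fit_transform(docs).toarray()
print cv.get_feature_names()  # vocabulary collected from all records (get_feature_names_out in newer versions)
print counts                  # per-document occurrence counts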
HashingVectorizer is a mix of the two methods described above: you can set the size of the encoded vector (as in FeatureHasher) and configure the tokenizer (as in CountVectorizer), while its performance is closer to FeatureHasher's.
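For comparison, a minimal sketch of the same toy documents passed through HashingVectorizer (the width of 8 columns is an arbitrary choice for illustration):

from sklearn.feature_extraction.text import HashingVectorizer

docs = ['one two three', 'three four two two', 'one one one four']
hv = HashingVectorizer(n_features=8)  # fixed output width, no vocabulary is stored
print hv.transform(docs).toarray()    # rows are l2-normalized by default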
So, back to the analysis. If we take a closer look at our data set, we can see that there are lines which are essentially the same but written in different grammatical forms, for example: "... republics and Karelia ..." versus "... according to the republic, Karelia ...". If we applied one of the encoding methods right now, such very similar records would end up encoded differently. Such cases can be minimized if all the words in a record are reduced to their normal form.
pymorphy2 or nltk works well for this task. I will use the former, since it was originally created for working with the Russian language. So, the function responsible for normalizing and cleaning a line looks like this:
def f_tokenizer(s):
    morph = pymorphy2.MorphAnalyzer()
    if type(s) == unicode:
        t = s.split(' ')
    else:
        t = s
    f = []
    for j in t:
        m = morph.parse(j.replace('.', ''))
        if len(m) != 0:
            wrd = m[0]
            # skip numerals, prepositions, conjunctions, particles and interjections
            if wrd.tag.POS not in ('NUMR', 'PREP', 'CONJ', 'PRCL', 'INTJ'):
                f.append(wrd.normal_form)
    return f
The function does the following:
- First it splits the string into a list of words.
- Then it runs each word through the morphological analyzer.
- If a word is a numeral, preposition, conjunction, particle or interjection, it is not included in the final set.
- Otherwise its normal form is taken and added to the final set.
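As a quick sanity check, the tokenizer can be applied to one of the cleaned records (the exact output depends on the pymorphy2 dictionaries installed):

sample = train.passport_issuer_name.iloc[0]
print f_tokenizer(sample)  # list of normalized words for the first record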
Now that we have a normalization function, we can proceed to the encoding. HashingVectorizer is used here because we can pass our function to it as a tokenizer, so the keys will be built from the values produced by that function:
coder = HashingVectorizer(tokenizer=f_tokenizer, n_features=256)
As you can see, when creating the encoder we set, besides the tokenizer, one more parameter, n_features. It specifies the length of the encoded vector (in our case each record is encoded with 256 columns). In addition, HashingVectorizer has another advantage over CountVectorizer: it can normalize the values right away, which is good for algorithms like SVM.
Now apply our encoder to the training set:
TrainNotDuble = train.drop_duplicates()
trn = coder.fit_transform(TrainNotDuble.passport_issuer_name.tolist()).toarray()
Model building
First we extract the values of the column that will serve as the class labels:
target = TrainNotDuble.passport_div_code.values
The task we are solving today is a multi-class classification problem. Of the algorithms tried, RandomForest turned out to be the best suited; the others showed much weaker results (less than 50%), so I decided not to take up space in the article with them. Anyone interested can check those results themselves.
To assess the quality of the classification we will use the share of documents for which the correct decision was made:

Accuracy = P / N,

where P is the number of documents on which the classifier made the right decision, and N is the size of the training sample.
The scikit-learn package has a ready-made function for this: accuracy_score.
Before building the model itself, let's reduce the dimensionality using principal component analysis, since 256 feature columns is quite a lot:
pca = PCA(n_components=15)
trn = pca.fit_transform(trn)
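If you want to see how much information the 15 components retain (15 is a choice that can be tuned), check the explained variance:

# share of the total variance kept by the 15 principal components
print pca.explained_variance_ratio_.sum()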
The model will look like this:
model = RandomForestClassifier(n_estimators=100, criterion='entropy')
TRNtrain, TRNtest, TARtrain, TARtest = train_test_split(trn, target, test_size=0.4)
model.fit(TRNtrain, TARtrain)
print 'accuracy_score: ', accuracy_score(TARtest, model.predict(TRNtest))
accuracy_score: 0.6523456
Conclusion
In conclusion, it should be noted that the resulting accuracy of 65% is close to guesswork. To improve it, the preliminary processing should also handle grammatical errors and the various kinds of typos. This would also have a positive effect on the dictionary built during encoding: its size would shrink and, accordingly, so would the length of the encoded vectors.
In addition, the step of applying the model to the test sample was deliberately omitted, since there is nothing special about it beyond bringing the test data to the required form (this is easily done by repeating the transformations applied to the training sample).
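For completeness, a minimal sketch of what that step might look like; the file name here is an assumption on my part, and the cleaning steps from the preliminary processing section must be repeated on the test data before encoding:

# assumption: the test sample has the same structure as the training one
test = read_csv('passport_test_set.csv', ';', index_col='id', encoding='cp1251')
# ... repeat the lower-casing, regex cleanup and abbreviation expansion here ...
tst = coder.transform(test.passport_issuer_name.tolist()).toarray()
tst = pca.transform(tst)                        # project with the PCA fitted on the training set
test['passport_div_code'] = model.predict(tst)  # predicted division codes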
In this article I tried to show a minimal list of steps for processing textual information before feeding it to machine learning algorithms. I hope it will be useful to those taking their first steps in data analysis.
UPD: an IPython Notebook with the solution: TKCTask2Answer.ipynb