Introduction
Good afternoon, dear readers.
Recently, while wandering the expanses of the web, I stumbled upon a tournament held by TCS at the beginning of this year. After looking over the tasks, I decided to test my data analysis skills on them.
I decided to start with the scoring problem (Task No. 3). To solve it I, as usual, used Python with the analytical modules pandas and scikit-learn.
Data Description and Problem Statement
The bank requests the applicant's credit history from the three largest Russian credit bureaus. A sample of the Bank's clients is provided in the SAMPLE_CUSTOMERS.CSV file. The sample is divided into "train" and "test" parts. For the "train" part we know the value of the target variable bad, i.e. the presence of a "default" (the client is assumed to have defaulted if a delay of 90 days or more occurred during the first year of using the loan). The SAMPLE_ACCOUNTS.CSV file contains the credit bureaus' responses to all inquiries about the corresponding clients.
Data format
SAMPLE_CUSTOMERS - information on whether a particular client defaulted.
Description of the format of the SAMPLE_ACCOUNTS dataset:
| Field | Description |
|---|---|
| TCS_CUSTOMER_ID | Customer ID |
| BUREAU_CD | Code of the bureau from which the account record was received |
| BKI_REQUEST_DATE | Date on which the request was made to the bureau |
| CURRENCY | Contract currency (ISO letter code) |
| RELATIONSHIP | Type of relationship to the contract: 1 - Individual; 2 - Additional card / authorized user; 4 - Joint; 5 - Guarantor; 9 - Legal entity |
| OPEN_DATE | Contract opening date |
| FINAL_PMT_DATE | Final payment date (scheduled) |
| TYPE | Contract type code: 1 - Car loan; 4 - Leasing; 6 - Mortgage; 7 - Credit card; 9 - Consumer credit; 10 - Business development loan; 11 - Working capital loan; 12 - Equipment purchase loan; 13 - Real estate construction loan; 14 - Loan for the purchase of shares (e.g. margin lending); 99 - Other |
| PMT_STRING_84M | Payment discipline (timeliness). The string is made up of account status codes at the moments the bank sent account data to the bureau; the first character is the state as of PMT_STRING_START, followed by earlier states in descending date order: 0 - New, evaluation not yet possible; X - No information; 1 - Payment without delay; A - Delay of 1 to 29 days; 2 - Delay of 30 to 59 days; 3 - Delay of 60 to 89 days; 4 - Delay of 90 to 119 days; 5 - Delay of more than 120 days; 7 - Regular consolidated payments; 8 - Repayment using collateral; 9 - Bad debt / transferred for collection / missed payment |
| STATUS | Contract status: 00 - Active; 12 - Paid by collateral; 13 - Account closed; 14 - Transferred to another bank for servicing; 21 - Dispute; 52 - Overdue; 61 - Repayment problems |
| OUTSTANDING | Remaining outstanding debt, in rubles at the exchange rate of the Central Bank of the Russian Federation |
| NEXT_PMT | Amount of the next payment, in rubles at the Central Bank exchange rate |
| INF_CONFIRM_DATE | Date on which the account information was confirmed |
| FACT_CLOSE_DATE | Account closure date (actual) |
| TTL_DELQ_5 | Number of delinquencies of up to 5 days |
| TTL_DELQ_5_29 | Number of delinquencies of 5 to 29 days |
| TTL_DELQ_30_59 | Number of delinquencies of 30 to 59 days |
| TTL_DELQ_60_89 | Number of delinquencies of 60 to 89 days |
| TTL_DELQ_30 | Number of delinquencies of up to 30 days |
| TTL_DELQ_90_PLUS | Number of delinquencies of 90 or more days |
| PMT_FREQ | Payment frequency code: 1 - Weekly; 2 - Every other week; 3 - Monthly; A - Once every 2 months; 4 - Quarterly; B - Once every 4 months; 5 - Every six months; 6 - Annually; 7 - Other |
| CREDIT_LIMIT | Credit limit, in rubles at the Central Bank exchange rate |
| DELQ_BALANCE | Current arrears, in rubles at the Central Bank exchange rate |
| MAX_DELQ_BALANCE | Maximum amount of arrears, in rubles at the Central Bank exchange rate |
| CURRENT_DELQ | Current number of days overdue |
| PMT_STRING_START | Start date of the PMT_STRING_84M string |
| INTEREST_RATE | Loan interest rate |
| CURR_BALANCE_AMT | Total amount paid, including principal, interest, penalties and fines, in rubles at the Central Bank exchange rate |
The task is to build a model on the train sample that estimates the probability of "default" and to compute these probabilities for the clients of the test sample. The Area Under ROC Curve metric will be used to evaluate the model (as also stated in the problem conditions).
Preliminary data processing
First, download the source files and look at them:
```python
from pandas import read_csv, DataFrame
from sklearn.metrics import roc_curve
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
import ml_metrics, string, re, pylab as pl

SampleCustomers = read_csv("https://static.tcsbank.ru/documents/olymp/SAMPLE_CUSTOMERS.csv", ';')
SampleAccounts = read_csv("https://static.tcsbank.ru/documents/olymp/SAMPLE_ACCOUNTS.csv", ';', decimal=',')
print SampleAccounts
```

```python
SampleCustomers.head()
```
| | tcs_customer_id | bad | sample_type |
|---|---|---|---|
| 0 | 1 | NaN | test |
| 1 | 2 | 0 | train |
| 2 | 3 | 1 | train |
| 3 | 4 | 0 | train |
| 4 | 5 | 0 | train |
From the problem statement we can assume that the SampleAccounts set contains several records per borrower; let's check this:
```python
SampleAccounts.tcs_customer_id.drop_duplicates().count(), SampleAccounts.tcs_customer_id.count()
```
Our assumption turned out to be true: there are 50,000 unique borrowers among 280,942 records. This is because one borrower can have several loans, and each of them may be reported differently by the different bureaus. Therefore we need to transform SampleAccounts so that one row corresponds to one borrower.
Now let's get a list of all unique loans for each borrower:
```python
SampleAccounts[['tcs_customer_id','open_date','final_pmt_date','credit_limit','currency']].drop_duplicates()
```
Now that we have a list of loans, we can aggregate any general information for each element of this list. In other words, we could take the bundle of fields above and turn it into an index for further manipulations, but unfortunately one unpleasant detail lies in wait for us: the 'final_pmt_date' field contains empty values. Let's try to get rid of them.
The set also contains the actual loan closure date, so if it is present while 'final_pmt_date' is empty, we can copy that value into it. For the rest we simply write 0.
```python
SampleAccounts.final_pmt_date[SampleAccounts.final_pmt_date.isnull()] = \
    SampleAccounts.fact_close_date[SampleAccounts.final_pmt_date.isnull()].astype(float)
SampleAccounts.final_pmt_date.fillna(0, inplace=True)
```
Now that we have gotten rid of the empty values, let's get the most recent date of contact with any of the bureaus for each loan. It will be useful for determining the loan's attributes, such as contract status, type, etc.
```python
sumtbl = SampleAccounts.pivot_table(['inf_confirm_date'],
                                    ['tcs_customer_id','open_date','final_pmt_date','credit_limit','currency'],
                                    aggfunc='max')
sumtbl.head(15)
```
| tcs_customer_id | open_date | final_pmt_date | credit_limit | currency | inf_confirm_date |
|---|---|---|---|---|---|
| 1 | 39261 | 39629 | 19421 | RUB | 39924 |
| | 39505 | 39870 | 30000 | RUB | 39862 |
| | 39644 | 40042 | 11858 | RUB | 40043 |
| | 39876 | 41701 | 300000 | RUB | 40766 |
| | 39942 | 40308 | 19691 | RUB | 40435 |
| | 40421 | 42247 | 169000 | RUB | 40756 |
| | 40428 | 51386 | 10000 | RUB | 40758 |
| | 40676 | 41040 | 28967 | RUB | 40764 |
| 2 | 40472 | 40618 | 7551 | RUB | 40661 |
| | 40652 | 40958 | 21186 | RUB | 40661 |
| 3 | 39647 | 40068 | 22694 | RUB | 40069 |
| | 40604 | 0 | 20000 | RUB | 40624 |
| 4 | 38552 | 40378 | 75000 | RUB | 40479 |
| | 39493 | 39797 | 5000 | RUB | 39823 |
| | 39759 | 40123 | 6023 | RUB | 40125 |
Now add the dates we received to the main set:
```python
SampleAccounts = SampleAccounts.merge(sumtbl, 'left',
                                      left_on=['tcs_customer_id','open_date','final_pmt_date','credit_limit','currency'],
                                      right_index=True, suffixes=('', '_max'))
```
Next, we split out the columns whose values come from a fixed set of codes, so that each value of these fields gets its own column. According to the problem statement, these columns are:
- pmt_string_84m
- pmt_freq
- type
- status
- relationship
- bureau_cd
The code for converting them is shown below:
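Below is a minimal sketch of one possible conversion (an assumption, not necessarily the exact code used): for pmt_string_84m we count how many times each payment-discipline code occurs in the string, and for the remaining categorical fields we create one indicator column per value, producing names like status_13 or pmt_freq_3 that are referenced later:

```python
from pandas import get_dummies

# count the occurrences of each payment-discipline code in pmt_string_84m
for c in ['0', 'X', '1', 'A', '2', '3', '4', '5', '7', '8', '9']:
    SampleAccounts['pmt_string_84m_' + c] = \
        SampleAccounts.pmt_string_84m.fillna('').str.count(c)

# one indicator column per value for the remaining categorical fields
for col in ['pmt_freq', 'type', 'status', 'relationship', 'bureau_cd']:
    SampleAccounts = SampleAccounts.join(get_dummies(SampleAccounts[col], prefix=col))

# the original text/code columns are no longer needed
SampleAccounts = SampleAccounts.drop(['pmt_string_84m', 'pmt_freq', 'type',
                                      'status', 'relationship', 'bureau_cd'], axis=1)
```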
The next step is to transform the 'fact_close_date' field, which contains the actual closure date of the account, so that it takes only two values:
- 0 - the account was not closed (no closure date)
- 1 - the account was closed
I made this replacement because the field was only about half filled to begin with.
```python
SampleAccounts.fact_close_date[SampleAccounts.fact_close_date.notnull()] = 1
SampleAccounts.fact_close_date.fillna(0, inplace=True)
```
Now we need to pull the most recent data on each loan out of our data set. The 'inf_confirm_date_max' field obtained above will help us here: it holds the latest date on which the loan information was confirmed in any of the bureaus:
```python
PreFinalDS = SampleAccounts[SampleAccounts.inf_confirm_date == SampleAccounts.inf_confirm_date_max].drop_duplicates()
```
After the above actions our sample has shrunk considerably, but we still need to combine all the information obtained earlier on each loan and borrower. To do this, we group the data set:
```python
PreFinalDS = PreFinalDS.groupby(['tcs_customer_id','open_date','final_pmt_date','credit_limit','currency']).max().reset_index()
```
Our data is almost ready to start the analysis. It remains to perform a few more actions:
- Remove unwanted columns
- Convert all credit limits to rubles
- Calculate how many loans each borrower has according to the bureau information
Let's start by clearing the table of unnecessary columns:
```python
PreFinalDS = PreFinalDS.drop(['bki_request_date', 'inf_confirm_date', 'pmt_string_start', 'interest_rate',
                              'open_date', 'final_pmt_date', 'inf_confirm_date_max'], axis=1)
```
Next we convert all credit limits to rubles. For simplicity I took the current exchange rates, although it would probably be more correct to take the rate as of the account opening date. Another caveat is that for the analysis we need to get rid of the 'currency' text field, so after converting the amounts to rubles we perform the same manipulation with this field as with the categorical fields above (see the sketch after the code below):
```python
curs = DataFrame([33.13, 44.99, 36.49, 1], index=['USD','EUR','GHF','RUB'], columns=['crs'])
PreFinalDS = PreFinalDS.merge(curs, 'left', left_on='currency', right_index=True)
PreFinalDS.credit_limit = PreFinalDS.credit_limit * PreFinalDS.crs
```
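A minimal sketch of this 'currency' manipulation, assuming the same dummy-column approach as before and dropping the helper columns afterwards:

```python
from pandas import get_dummies

# one indicator column per currency, e.g. currency_RUB, currency_USD
PreFinalDS = PreFinalDS.join(get_dummies(PreFinalDS.currency, prefix='currency'))

# the text field and the helper exchange-rate column are no longer needed
PreFinalDS = PreFinalDS.drop(['currency', 'crs'], axis=1)
```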
So, before the final grouping, let's add a field filled with ones to our set: when we perform the last grouping, its sum will give the number of loans per borrower:
```python
PreFinalDS['count_credit'] = 1
```
Now that only quantitative data remain in the set, we can fill the gaps with 0 and perform the final grouping by customer:
```python
PreFinalDS.fillna(0, inplace=True)
FinalDF = PreFinalDS.groupby('tcs_customer_id').sum()
FinalDF
```
Preliminary analysis
Well, the primary data processing is complete and we can begin analyzing the data. To begin with, we split it into training and test samples. The 'sample_type' column from SampleCustomers defines exactly this split. To split our processed DataFrame, it is enough to join it with SampleCustomers and filter:
```python
SampleCustomers.set_index('tcs_customer_id', inplace=True)
UnionDF = FinalDF.join(SampleCustomers)
trainDF = UnionDF[UnionDF.sample_type == 'train'].drop(['sample_type'], axis=1)
testDF = UnionDF[UnionDF.sample_type == 'test'].drop(['sample_type'], axis=1)
```
Next, let's see how the features correlate with each other; for this we build a matrix of correlation coefficients. With pandas this can be done with one command:
```python
CorrKoef = trainDF.corr()
```
After this, CorrKoef contains a 61x61 matrix. The rows and columns are the corresponding field names, and at their intersection is the value of the correlation coefficient. For example:
| | fact_close_date |
|---|---|
| status_13 | 0.997362 |
Some coefficients may be missing altogether. This means that the corresponding field is most likely filled with a single constant value and can be omitted from the analysis. Let's check:
```python
FieldDrop = [i for i in CorrKoef if CorrKoef[i].isnull().drop_duplicates().values[0]]
```
As a result we get a list of fields that can be deleted:
- pmt_string_84m_6
- pmt_string_84m_8
- pmt_freq_5
- pmt_freq_A
- pmt_freq_B
- status_12
The next step is to find the fields that correlate with each other (whose correlation coefficient is more than 90%) using our matrix:
```python
CorField = []
for i in CorrKoef:
    for j in CorrKoef.index[CorrKoef[i] > 0.9]:
        if i != j and j not in CorField and i not in CorField:
            CorField.append(j)
            print "%s-->%s: r^2=%f" % (i, j, CorrKoef[i][CorrKoef.index == j].values[0])
```
At the output we get the following:
fact_close_date-->status_13: r^2=0.997362
ttl_delq_5_29-->ttl_delq_30: r^2=0.954740
ttl_delq_5_29-->pmt_string_84m_A: r^2=0.925870
ttl_delq_30_59-->pmt_string_84m_2: r^2=0.903337
ttl_delq_90_plus-->pmt_string_84m_5: r^2=0.978239
delq_balance-->max_delq_balance: r^2=0.986967
pmt_freq_3-->relationship_1: r^2=0.909820
pmt_freq_3-->currency_RUB: r^2=0.910620
pmt_freq_3-->count_credit: r^2=0.911109
So, based on the correlations found in the previous step, we can add the following fields to the list for deletion:
```python
FieldDrop = FieldDrop + ['fact_close_date', 'ttl_delq_30', 'pmt_string_84m_5',
                         'pmt_string_84m_A', 'pmt_string_84m_2', 'max_delq_balance',
                         'relationship_1', 'currency_RUB', 'count_credit']
newtr = trainDF.drop(FieldDrop, axis=1)
```
Building and selecting models
Well, the data is prepared and we can now proceed to building the model.
Let's separate the class attribute from the training sample:
```python
target = newtr.bad.values
train = newtr.drop('bad', axis=1).values
```
Now let's reduce the dimensionality of our sample in order to keep only the significant parameters. To do this we use principal component analysis and its PCA() implementation from the sklearn module. As a parameter we pass the number of components we want to keep (I chose 20, since with them the results of the models were practically the same as on the original data):
```python
coder = PCA(n_components=20)
train = coder.fit_transform(train)
```
The time has come to choose the classification models. Let's take several different algorithms and compare their results using the Area Under ROC Curve (AUC) metric. The following algorithms will be considered:
```python
models = []
models.append(RandomForestClassifier(n_estimators=165, max_depth=4, criterion='entropy'))
models.append(GradientBoostingClassifier(max_depth=4))
models.append(KNeighborsClassifier(n_neighbors=20))
models.append(GaussianNB())
```
So, the models are chosen. Now let's split our training sample into two subsamples, training and test, so that we can compute the AUC metric for our models. The split can be performed with the train_test_split() function from the sklearn module:
```python
TRNtrain, TRNtest, TARtrain, TARtest = train_test_split(train, target, test_size=0.3, random_state=0)
```
It remains to train our models and evaluate the result. There are two ways to compute the AUC metric:
- with the standard tools of the sklearn module, using the roc_auc_score or auc functions
- with the third-party ml_metrics package and its auc() function
I will use the second method, since the first was shown in a previous article. The ml_metrics package is a very useful addition to sklearn, as it contains several metrics that sklearn lacks.
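For reference, a minimal sketch of the first (sklearn) way on toy data:

```python
from sklearn.metrics import roc_auc_score

# true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print roc_auc_score(y_true, y_score)  # 0.75
```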
So, let's construct the ROC curves and compute the areas under them:
```python
pl.figure(figsize=(10, 10))
for model in models:
    model.fit(TRNtrain, TARtrain)
    pred_scr = model.predict_proba(TRNtest)[:, 1]
    fpr, tpr, thresholds = roc_curve(TARtest, pred_scr)
    roc_auc = ml_metrics.auc(TARtest, pred_scr)
    md = str(model)
    md = md[:md.find('(')]
    pl.plot(fpr, tpr, label='ROC fold %s (auc = %0.2f)' % (md, roc_auc))

pl.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6))
pl.xlim([0, 1])
pl.ylim([0, 1])
pl.xlabel('False Positive Rate')
pl.ylabel('True Positive Rate')
pl.title('Receiver operating characteristic example')
pl.legend(loc="lower right")
pl.show()
```
So, judging by the results of our models, gradient boosting performed best: its AUC is about 0.69. Accordingly, we will choose it to score the test sample. Let's fill in the predictions for the test sample, having first processed it into the required format:
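A minimal sketch of this step, assuming the test part is processed the same way as the training part (the same dropped columns and the already fitted PCA transform) and scored with the trained gradient boosting model, which is models[1] in the list above:

```python
# drop the same uninformative/correlated columns from the test part,
# together with the (empty) target column
newtst = testDF.drop(FieldDrop + ['bad'], axis=1)

# apply the PCA transform fitted on the training data
test = coder.transform(newtst.values)

# probability of "default" from the trained gradient boosting model
testDF['bad'] = models[1].predict_proba(test)[:, 1]
```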
Conclusion
In conclusion, I would like to note that the resulting model quality of about 0.69 AUC is not particularly good, but I was unable to achieve a better score. It is also worth noting that a model built on the full feature set, i.e. without removing the correlated columns and without reducing the dimensionality, gave roughly the same 0.69 (this is easy to check by training the model on the trainDF set).
In this article I tried to show all the main stages of data analysis, from the primary processing of raw data to building a classifier model. In addition, I would like to note that the support vector machine was not included in the models considered: after normalizing the data its score dropped to 51%, and the best result I managed to get with it was around 60%, at a significant computational cost.
Unfortunately, the result could not be verified on the test sample, since I did not make it within the tournament deadline.