
Machine learning in microfinance: building a scoring model for customers with a blank credit history

No credit history means no loans, and no loans means no credit history. A vicious circle. What can be done? Let's figure it out.


Hello! My name is Mark, and I am a data scientist at Devim. We recently launched a model for scoring borrowers of the microfinance company “Before Salary” who have no credit history. I want to share our experience of searching for data, engineering features, and interpreting them.



The topic is split into two publications. In the first, I talk about how the features were found and constructed. The second covers comparing model architectures, analyzing the results, and interpreting scoring decisions.


Part one. Feature engineering


Machine learning models are built on data, and its quality and completeness largely determine whether a model succeeds or fails. But what if there is little data? Or if the data is not informative or accurate enough? Where can additional information be found, and how can it be used when building a model? Here is how I solved this problem.


Factors for assessing credit risk


Credit scoring is based on an analysis of the borrower's characteristics associated with the risk of loan default. They can be divided into general economic and individual.


General economic factors


The economic situation has a serious impact on the financial and psychological state of the borrower. To assess this influence more accurately, we can select the economic factors that relate to the borrower directly. Conventionally, they are divided into two levels: macroeconomic and microeconomic.



Individual factors


Individual factors contain the most valuable information for the scoring model. They can also be divided into categories: demographic, financial, psychological, and contact information.



Dataset description


The training set consists of 9,500 borrowers who received their first loan between May and December 2018. The test set consists of 1,500 borrowers from January to March 2019.


The borrowers are split by time for several reasons. First, such a split makes it unlikely that information from the future leaks into training. Second, it lets us estimate how stable the model is over time. In PDL (payday loan) microloans, the amounts and terms are small compared to other types of loans, so the target is defined as a payment overdue by more than 15 days.
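The split and the target can be sketched in a few lines. This is only an illustration: the record layout, the field order, and the toy loans are my assumptions here, not the real data schema.

```python
from datetime import date

# Hypothetical loan records: (borrower_id, first_loan_date, days_overdue).
loans = [
    (1, date(2018, 5, 14), 0),
    (2, date(2018, 11, 30), 16),
    (3, date(2019, 1, 9), 15),
    (4, date(2019, 3, 2), 40),
]

OVERDUE_THRESHOLD_DAYS = 15      # target: overdue by more than 15 days
TRAIN_END = date(2018, 12, 31)   # last day of the training period


def is_default(days_overdue: int) -> int:
    """Return 1 if the payment is overdue by more than 15 days, else 0."""
    return int(days_overdue > OVERDUE_THRESHOLD_DAYS)


def split_by_time(records, train_end):
    """Loans issued on or before train_end go to train, later ones to test."""
    train = [r for r in records if r[1] <= train_end]
    test = [r for r in records if r[1] > train_end]
    return train, test


train, test = split_by_time(loans, TRAIN_END)
y_train = [is_default(r[2]) for r in train]
y_test = [is_default(r[2]) for r in test]
```

Because the cut is made on the issue date, every test loan is strictly later than every training loan, which is what protects against leakage from the future.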


Feature engineering


Feature construction starts with the more general economic factors and then moves on to the individual ones.


Among the macroeconomic factors, only one turned out to be consistently available and regularly updated: the ruble exchange rate. It is available on the Central Bank website for a long period of time (the data can be downloaded in a convenient format) and, most importantly, it is updated daily. The ruble exchange rate has a stable downward trend, so it is better not to use this factor in raw form: after some time, the values of the feature will go beyond the range seen in the training sample and will be interpreted incorrectly by the model.


To avoid this, we replace the raw rate with the ratio of the current rate (at the time the application is considered) to the median rate over the previous 35 days. The feature now characterizes not the absolute value of the ruble exchange rate but its trend (growth, decline, or stability) over the period. Chart 1 shows the resulting data. Chart 2 shows the percentage of defaulted customers by category (decline, stability, growth).



Chart 1. Change in the ruble exchange rate relative to the median value for the last 35 days.



Chart 2. The number of defaulted clients depending on the change in the exchange rate.
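A minimal sketch of the exchange rate feature follows. The function names, the dictionary of daily rates, and the 1% tolerance used to bucket the ratio are assumptions for illustration; the article does not state the thresholds behind the three categories. Days without a published rate inside the window are simply skipped.

```python
from datetime import date, timedelta
from statistics import median


def rate_change(rates: dict, application_date: date, window_days: int = 35) -> float:
    """Ratio of the rate on the application date to the median rate over
    the previous `window_days` days (the application date itself excluded)."""
    previous = [
        rates[application_date - timedelta(days=k)]
        for k in range(1, window_days + 1)
        if application_date - timedelta(days=k) in rates
    ]
    if not previous:
        raise ValueError("no rates available in the look-back window")
    return rates[application_date] / median(previous)


def trend(ratio: float, tolerance: float = 0.01) -> str:
    """Bucket the ratio; the 1% tolerance is an illustrative assumption."""
    if ratio > 1 + tolerance:
        return "growth"
    if ratio < 1 - tolerance:
        return "decline"
    return "stability"


# Toy data: a flat rate of 65.0 for 35 days, then 68.25 on the application date.
application_date = date(2019, 2, 5)
rates = {application_date - timedelta(days=k): 65.0 for k in range(1, 36)}
rates[application_date] = 68.25

ratio = rate_change(rates, application_date)   # 68.25 / 65.0
category = trend(ratio)
```

Since the feature is a ratio, it stays in roughly the same range regardless of how far the absolute rate drifts after training.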


The available microeconomic factors are the region in which the borrower works, the type of organization, and the profession.


At first glance, the region of work is an individual factor rather than a general economic one. However, general economic information can be added to the data by grouping regions. The Rosstat website provides various economic indicators for each region. The average wage in the region, the cost of a fixed set of products, and the amount of overdue loan payments per capita turned out to influence the probability of default. Regions were grouped with agglomerative clustering. Ward's method, which merges clusters so that the increase in variance is minimal, was used as the linkage criterion. The resulting clusters are shown in the three-dimensional plot.



Table of grouped regions
Cluster 1: Belgorod region, Bryansk region, Vladimir region, Voronezh region, Ivanovo region, Kostroma region, Kursk region, Lipetsk region, Oryol region, Tambov region, Yaroslavl region, Pskov region, Republic of Adygea, Volgograd region, Kabardino-Balkarian Republic, Karachay-Cherkess Republic, Republic of North Ossetia - Alania, Stavropol region, Mari El Republic, Republic of Mordovia, Penza region, Saratov region, Altai Republic, Altai region, Republic of Khakassia, Transbaikal region, Kemerovo region, Omsk region, Tomsk region, Primorsky Krai
Cluster 2: Moscow region, Moscow, Komi Republic, Murmansk region, St. Petersburg, Kamchatka Krai, Sakhalin region
Cluster 3: Kaluga region, Republic of Karelia, Arkhangelsk region, Leningrad region, Perm region, Sverdlovsk region, Krasnoyarsk region, Irkutsk region, Novosibirsk region, Khabarovsk region, Amur region
Cluster 4: Ryazan region, Smolensk region, Tver region, Tula region, Vologda region, Kaliningrad region, Novgorod region, Republic of Kalmykia, Krasnodar region, Astrakhan region, Rostov region, Republic of Bashkortostan, Republic of Tatarstan, Udmurtia, Chuvash Republic, Kirov region, Nizhny Novgorod region, Orenburg region, Samara region, Ulyanovsk region, Kurgan region, Chelyabinsk region, Republic of Buryatia, Tyva Republic
Cluster 5: Tyumen region, Republic of Sakha (Yakutia), Magadan region
Cluster 6: Republic of Crimea, Sevastopol, Republic of Dagestan, Republic of Ingushetia, Chechen Republic
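To make the grouping of regions more concrete, here is a small pure-Python sketch of agglomerative clustering with Ward's criterion. It is not the code used for the model: the four toy points stand in for standardized (wage, product set cost, overdue debt per capita) vectors and are invented for illustration. In practice a library implementation such as scikit-learn's AgglomerativeClustering with linkage="ward" would be the natural choice.

```python
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def ward_cost(a, b):
    """Increase in the within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2


def ward_clustering(points, n_clusters):
    """Greedily merge the pair of clusters with the smallest Ward cost."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]   # i < j, so index i stays valid
        del clusters[j]
    return clusters


# Hypothetical standardized indicators for four regions.
points = [(0.0, 0.1, 0.0), (0.1, 0.0, 0.1), (5.0, 5.1, 5.0), (5.1, 5.0, 4.9)]
clusters = ward_clustering(points, n_clusters=2)
```

The indicators should be brought to a common scale before clustering, otherwise the one with the largest numeric range dominates the distance.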

Another important microeconomic factor is the profession. The figure below shows the share of defaulted customers by profession in the training data set.



The graph clearly shows that the probability of default depends on the profession. To group borrowers, it is desirable to use one of the classifications generally accepted by economists. The breakdown into personnel categories from the Rosstat website correlates well with the data in the graph.


Division of employees by personnel category
By personnel category, employees are divided into managers, specialists, other employees, and workers.
  • Managers are employees who hold the positions of heads of organizations and structural divisions, and their deputies (directors; chiefs of departments, divisions, shifts, etc.; heads of production, a canteen, a section, a warehouse, a laundry, a club, a hostel, a luggage room, and others; managers, chairmen, captains, chief accountants and chief engineers, foremen, etc.).
  • Specialists are employees whose work usually requires higher or secondary vocational education: engineers, doctors, teachers, economists, accountants, geologists, dispatchers, inspectors, proofreaders, mathematicians, nurses, mechanics, rate setters, programmers, psychologists, editors, auditors, etc. Specialists also include the assistants of the specialists listed above.
  • Other employees are employees who prepare and process documentation and handle record keeping, control, and housekeeping services, in particular agents, archivists, attendants, clerks, cashiers, controllers (those not classified as workers), commandants, copyists of technical documentation, secretary-typists, caretakers, statisticians, stenographers, timekeepers, tally clerks, and draftsmen.
  • Workers are persons directly involved in creating material goods, as well as those engaged in repairs, moving goods, carrying passengers, providing material services, etc.
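One simple way to apply this classification to free-text job titles is a lookup table. The sketch below is hypothetical: the constant names and the handful of titles are mine, the titles are taken from the Rosstat descriptions above except "driver", which I assign to workers as an assumption, and unknown titles fall back to "not specified".

```python
# Category codes: 0 not specified, 1 managers, 2 specialists,
# 3 other employees, 4 workers.
NOT_SPECIFIED, MANAGER, SPECIALIST, OTHER_EMPLOYEE, WORKER = range(5)

# Illustrative lookup only; a real mapping would need far more titles.
TITLE_TO_CATEGORY = {
    "director": MANAGER,
    "chief accountant": MANAGER,
    "foreman": MANAGER,
    "engineer": SPECIALIST,
    "nurse": SPECIALIST,
    "programmer": SPECIALIST,
    "cashier": OTHER_EMPLOYEE,
    "archivist": OTHER_EMPLOYEE,
    "timekeeper": OTHER_EMPLOYEE,
    "driver": WORKER,   # assumption: "carrying passengers" falls under workers
}


def personnel_category(title) -> int:
    """Map a job title to a personnel category code (0 if unknown or empty)."""
    if not title:
        return NOT_SPECIFIED
    return TITLE_TO_CATEGORY.get(title.strip().lower(), NOT_SPECIFIED)
```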


Frequently encountered professions, such as driver, manager, accountant, etc., can characterize a borrower in different ways, depending on the specific area or type of organization. For example, a driver working in a taxi and a driver working in the city administration are completely different borrowers.


To add this information to the model, we divide borrowers by the type of organization in which they work:



To check whether this split adds information, let's look at the chart “Share of defaulted borrowers grouped by profession and type of organization”.



Designation of professions and types of organizations
Code  Profession             Code  Type of work
0     not specified          0     not specified
1     managers (executives)  1     commercial
2     specialists            2     state
3     other employees        3     un, self employed
4     workers                4     not working
                             5     other

The graph shows that for some professions the type of organization in which the borrower works makes a significant difference. Unexpected results appear when borrowers state that they do not work but still indicate a profession. Additional analysis showed that this behavior is typical for pensioners.
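The comparison behind such a chart is a default share computed per (profession, type of organization) pair. A minimal sketch, assuming records are triples of the two codes from the designation table and a 0/1 default flag; the toy records are invented:

```python
from collections import defaultdict


def default_share(records):
    """Share of defaulted borrowers per (profession_code, org_type_code).

    `records` is an iterable of (profession_code, org_type_code, defaulted),
    where `defaulted` is 0 or 1.
    """
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for profession, org_type, defaulted in records:
        totals[(profession, org_type)] += 1
        defaults[(profession, org_type)] += defaulted
    return {key: defaults[key] / totals[key] for key in totals}


# Toy records: two specialists in commercial organizations, one specialist
# in a state organization, one worker who reports not working.
records = [(2, 1, 0), (2, 1, 1), (2, 2, 0), (4, 4, 1)]
shares = default_share(records)
```

Groups with very few borrowers give noisy shares, so it is worth looking at the group sizes next to the rates before drawing conclusions.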


The last general economic factor used in the model is the day of the month on which the loan application was submitted. Its influence is probably related to the customary wage payment dates in Russia (for example, the 10th and the 25th). The days of the month are divided into two periods: from the 9th to the 21st inclusive, and the rest of the month.
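As a sketch, the day-of-month feature reduces to a single comparison (the function name is hypothetical):

```python
from datetime import date


def applied_mid_month(application_date: date) -> int:
    """1 if the application falls on the 9th to the 21st inclusive, else 0."""
    return int(9 <= application_date.day <= 21)
```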


Individual factors


Demographic


My data contains only four demographic characteristics: marital status, number of family members, length of service at the last place of work, and age.



Financial


The borrower data includes wages and additional income. Customers often overstate these values, so they do not give an accurate picture of the borrower's financial position, but they do allow a rough estimate.


Psychological


The selected borrowers have no previous loans, so the basic psychological (behavioral) information is missing. However, for 90% of clients we know the number of credit history requests over the last year, quarter, month, week, day, and hour. This makes it possible to estimate both the current and the historical need for a loan. The number of loan applications submitted within a short period also says something about the borrower's psychological type. Either the borrower submits one application, waits for the decision, and submits a second one only after a refusal, in which case there are few requests in the last hour but many in the last day. Or the borrower applies to several organizations at once and waits for all the decisions.
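In my data these counts come ready-made, but if only raw request timestamps were available, they could be derived roughly as follows. The window lengths for month, quarter, and year (30, 91, and 365 days) and all names are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Look-back windows; month, quarter and year are approximated in days.
WINDOWS = {
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
    "quarter": timedelta(days=91),
    "year": timedelta(days=365),
}


def inquiry_counts(inquiries, now):
    """Count credit history requests inside each window ending at `now`."""
    return {
        name: sum(1 for t in inquiries if now - span <= t <= now)
        for name, span in WINDOWS.items()
    }


# Toy example: four requests at different distances from the application time.
now = datetime(2019, 1, 10, 12, 0)
inquiries = [
    now - timedelta(minutes=30),
    now - timedelta(hours=5),
    now - timedelta(days=3),
    now - timedelta(days=200),
]
counts = inquiry_counts(inquiries, now)
```

A borrower who applies sequentially tends to have a small "hour" count next to a large "day" count, while one who applies everywhere at once has the two counts close to each other.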


Contact Information


When submitting an application, borrowers must fill in their own contact information. They are also asked to provide contact details of two close acquaintances. This gives two additional binary features: whether the second contact is filled in and whether the third contact is filled in.
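The two contact flags can be sketched like this. The field names are hypothetical, and treating a whitespace-only value as empty is my assumption.

```python
def contact_filled(value) -> int:
    """1 if the contact field holds something other than whitespace, else 0."""
    return int(bool(value and value.strip()))


# Hypothetical application form: contact_1 is the borrower's own contact,
# contact_2 and contact_3 are the two acquaintances.
application = {
    "contact_1": "+7 900 000-00-00",
    "contact_2": "   ",
    "contact_3": "+7 900 000-00-01",
}

features = {
    "contact_2_filled": contact_filled(application.get("contact_2")),
    "contact_3_filled": contact_filled(application.get("contact_3")),
}
```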



As a result, we obtain the following features:


  1. Change in the ruble exchange rate, numeric feature
  2. Region of employment, categorical feature (6 categories)
  3. Profession, categorical feature (5 categories)
  4. Type of organization in which the borrower works, categorical feature (5 categories)
  5. Day of the month on which the application is submitted, binary feature: whether or not it falls between the 9th and the 21st
  6. Number of credit history requests over the last:
    • hour
    • day
    • week
    • month
    • quarter
    • year
  7. Marital status, categorical feature (8 categories)
  8. Number of family members, numeric feature
  9. Length of service at the last place of work, numeric feature
  10. Borrower's age, numeric feature
  11. Monthly income, numeric feature
  12. Additional income, numeric feature
  13. Whether contact 2 is filled in, binary feature
  14. Whether contact 3 is filled in, binary feature

All of the above data is economically sound and easy to collect. Although it does not give complete information about the borrower, it is enough to build a cost-effective, working model.


I will describe the architecture selection process and the results in the next article.
I hope this was interesting and useful.


Panenko Mark, Devim



Source: https://habr.com/ru/post/454574/

