
Predicting the outflow of bank customers (problem solving using Python)

I want to share my experience of solving a machine learning and data analysis problem from Kaggle. This article is intended as a guide for beginners, built around a task that is not entirely simple.

Data retrieval


The data sample contains about 10,000 rows and 15 columns. Here are some of the parameters:


Task


  1. Find the parameters that most affect the outflow of customers.
  2. Build a hypothesis (model) that predicts the outflow of bank customers.

Tools



Import Libraries


    import pandas as pd
    # train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed in scikit-learn 0.20
    from sklearn.model_selection import train_test_split
    from sklearn import svm
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.metrics import mean_squared_error
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

Download and view data


    dataframe = pd.read_csv("../input/Churn_Modelling.csv")
    dataframe.head()



Data conversion


For the classifier to work correctly, categorical attributes must be converted into numerical ones. Two such attributes stand out in the data shown above: "Gender" and "Geography" (geographical location). We convert them as follows:

    # Map each category to an integer code and assign the result back to the column
    # (in-place replace on a selected column is unreliable in recent pandas versions)
    dataframe['Geography'] = dataframe['Geography'].map({"France": 1, "Spain": 2, "Germany": 3})
    dataframe['Gender'] = dataframe['Gender'].map({"Female": 0, "Male": 1})
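Coding the countries as 1, 2, 3 gives the classifier an ordering between them that has no real meaning. One common alternative is one-hot encoding. The sketch below illustrates it on a tiny made-up frame: the `toy` data and the `encoded` name are assumptions for illustration, and only the column names `Geography` and `Gender` come from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the dataset; the values are illustrative only.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# Gender has only two values, so a single 0/1 column is enough.
toy["Gender"] = toy["Gender"].map({"Female": 0, "Male": 1})

# One indicator column per country instead of an arbitrary 1/2/3 code.
encoded = pd.get_dummies(toy, columns=["Geography"], dtype=int)
print(encoded)
```

Whether this helps a particular classifier is an empirical question; it simply removes the artificial ordering from the input.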

Creating a correlation matrix


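    # numeric_only=True skips any remaining text columns, which corr() cannot handle
    correlation = dataframe.corr(numeric_only=True)
    plt.figure(figsize=(15, 15))
    sns.heatmap(correlation, vmax=1, square=True, annot=True, cmap='cubehelix')
    plt.title('Correlation between different features')
    plt.show()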



The correlation matrix shows which parameters are linearly related to the result. Three positive correlations with customer outflow stand out immediately: "Account Balance", "Age", and "Geographical Position".
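Instead of reading the values off the heatmap, the correlations with the target can also be listed and sorted directly. The sketch below does this on synthetic data, so it runs on its own. The target column name `Exited`, the `Noise` column, and the generated values are assumptions for illustration, not results from the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data: churn is made to depend on age only.
age = rng.integers(18, 80, size=n)
balance = rng.uniform(0, 200_000, size=n)
noise = rng.normal(size=n)
exited = (age + rng.normal(0, 10, size=n) > 55).astype(int)

toy = pd.DataFrame(
    {"Age": age, "Balance": balance, "Noise": noise, "Exited": exited}
)

# Take the target's column of the correlation matrix, drop the target's
# correlation with itself, and sort from most positive to most negative.
target_corr = toy.corr()["Exited"].drop("Exited").sort_values(ascending=False)
print(target_corr)
```

Note that this measures only linear relationships, so a parameter with a weak correlation can still matter to a non-linear model.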

Splitting into training and test sets


To detect overfitting, we split our data set into a training part and a test part (40% of the rows are held out for testing):

    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.4, random_state=0)
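The article does not show how `X` and `Y` are built. A minimal sketch of one way to do it follows, assuming the target column is called `Exited` and every other column is a feature. The `toy` frame is made up for illustration, and `stratify=Y` is an optional addition that keeps the share of churned customers similar in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature frame; the values are illustrative only.
toy = pd.DataFrame({
    "Age": [25, 40, 58, 33, 61, 47, 29, 52, 38, 66],
    "Balance": [0.0, 8.5e4, 1.2e5, 0.0, 9.9e4, 1.4e5, 5.0e4, 0.0, 7.6e4, 1.1e5],
    "Exited": [0, 0, 1, 0, 1, 0, 0, 1, 0, 1],
})

X = toy.drop(columns=["Exited"])  # features: everything except the target
Y = toy["Exited"]                 # target: 1 if the customer left

# Hold out 40% of the rows; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.4, random_state=0, stratify=Y)
print(X_train.shape, X_test.shape)
```

On the real data, identifier-like columns (row numbers, customer IDs, surnames) would probably be dropped from `X` as well, since they carry no information about churn.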

Forecast


    clf = GaussianNB()
    clf = clf.fit(X_train, y_train)
    clf.score(X_test, y_test)


The prediction accuracy on the test set was ~78%, which is a good result.
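Accuracy can be hard to interpret when one class is much more common than the other, so it may help to compare the score with a trivial baseline that always predicts the most frequent class. The sketch below does this on synthetic data. The generated features, the class imbalance, and the names `baseline`, `baseline_acc`, `model_acc`, and `cm` are assumptions for illustration, and the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n = 1000

# Synthetic, imbalanced data: only the first feature carries signal.
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Baseline: always predict the majority class seen during training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# The same classifier as in the article, scored on the same test set.
clf = GaussianNB().fit(X_train, y_train)
model_acc = clf.score(X_test, y_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, clf.predict(X_test))

print("baseline accuracy:", baseline_acc)
print("GaussianNB accuracy:", model_acc)
print(cm)
```

If the model's accuracy is close to the baseline, the confusion matrix shows whether it identifies any departing customers at all.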

Source: https://habr.com/ru/post/329334/

