
Open machine learning course. Topic 1. Primary data analysis with Pandas



The course consists of:



The current run of the course starts October 1, 2018, in English (link to the participation survey; fill it in English). Follow the announcements in the VK group and join the OpenDataScience community.


List of articles in the series


  1. Primary data analysis with Pandas
  2. Visual data analysis with Python
  3. Classification, decision trees, and the nearest neighbors method
  4. Linear classification and regression models
  5. Ensembles: bagging, random forest
  6. Feature engineering and feature selection. Applications to text, image, and geodata tasks
  7. Unsupervised learning: PCA, clustering
  8. Learning on gigabytes of data with Vowpal Wabbit
  9. Time series analysis with Python
  10. Gradient boosting

Plan for this article


  1. About the course
  2. Homework assignments in the course
  3. Demonstration of the main methods of Pandas
  4. First attempts to predict churn
  5. Homework number 1
  6. Overview of useful resources

1. About the course


We do not set ourselves the task of developing yet another comprehensive introductory course on machine learning or data analysis (so it is not a substitute for the Yandex and MIPT specialization, HSE continuing education, or other fundamental online and offline programs and books). The purpose of this series of articles is to quickly refresh your knowledge or help you find topics for further study. The approach is roughly the same as that of the authors of the book Deep Learning, which opens with a review of mathematics and the basics of machine learning: brief, very dense, and with an abundance of references to sources.


If you plan to complete the course, a warning: when selecting topics and creating materials, we assume that our students know mathematics at the level of a second-year student of a technical university and can at least program in Python. These are not strict admission criteria, only recommendations; you can enroll in the course without knowing mathematics or Python and catch up in parallel.



The course is also described in this announcement.


What software is needed


To complete the course you need a number of Python packages; most of them come with the Anaconda distribution (Python 3.6). Other libraries will be needed a bit later, and this will be discussed in due course. The full list can be found in the Dockerfile.


You can also use a Docker container with all the necessary software already installed. Details are on the repository's Wiki page.


How to connect to the course


No formal registration is required; you can join the course at any time after the start (October 1, 2018), but the homework deadlines are strict.
But so that we know more about you:



2. Homework assignments in the course



3. Demonstration of the main methods of Pandas


All code can be reproduced in this Jupyter notebook.


Pandas is a Python library that provides extensive facilities for data analysis. The data that data scientists work with is often stored as tables, for example in .csv, .tsv, or .xlsx files. The Pandas library makes it very convenient to load, process, and analyze such tabular data using SQL-like queries. And in conjunction with the Matplotlib and Seaborn libraries, Pandas provides ample opportunities for visual analysis of tabular data.


The main data structures in Pandas are the Series and DataFrame classes. The first is a one-dimensional indexed array of data of some fixed type. The second is a two-dimensional data structure, a table, in which each column contains data of one type. You can think of it as a dictionary of Series objects. The DataFrame structure is well suited for representing real data: rows correspond to feature descriptions of individual objects, and columns correspond to features.


 # import Pandas and NumPy
 import pandas as pd
 import numpy as np
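As a quick illustration (a minimal sketch added here; the names and values are made up, not taken from the dataset below), this is how a Series and a DataFrame can be constructed by hand:

 # a Series: a one-dimensional indexed array of a fixed type
 ages = pd.Series([25, 32, 47], index=['Ann', 'Bob', 'Carl'])

 # a DataFrame: a table whose columns behave like Series objects
 people = pd.DataFrame({'age': [25, 32, 47],
                        'city': ['Riga', 'Oslo', 'Bern']},
                       index=['Ann', 'Bob', 'Carl'])
 people['age']  # selecting one column returns a Series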

We will demonstrate the main methods in action by analyzing a dataset on the churn of a telecom operator's customers (no need to download it, it is in the repository). Let's read the data (with the read_csv method) and look at the first 5 rows using the head method:


 df = pd.read_csv('../../data/telecom_churn.csv') 

 df.head() 



On displaying a DataFrame in a Jupyter notebook

In Jupyter notebooks, Pandas DataFrames are rendered as nicely formatted tables, whereas print(df.head()) looks worse.
By default, Pandas displays only 20 columns and 60 rows, so if your DataFrame is larger, use the set_option function:


 pd.set_option('display.max_columns', 100)
 pd.set_option('display.max_rows', 100)

Each row represents one client; this is the object of study.
The columns are the features of this object.


Feature description

| Name | Description | Type |
|------|-------------|------|
| State | Two-letter state code | nominal |
| Account length | How long the customer has been with the company | quantitative |
| Area code | Phone number prefix | quantitative |
| International plan | International roaming (enabled/disabled) | binary |
| Voice mail plan | Voicemail (enabled/disabled) | binary |
| Number vmail messages | Number of voicemail messages | quantitative |
| Total day minutes | Total duration of daytime calls | quantitative |
| Total day calls | Total number of daytime calls | quantitative |
| Total day charge | Total charge for daytime services | quantitative |
| Total eve minutes | Total duration of evening calls | quantitative |
| Total eve calls | Total number of evening calls | quantitative |
| Total eve charge | Total charge for evening services | quantitative |
| Total night minutes | Total duration of nighttime calls | quantitative |
| Total night calls | Total number of nighttime calls | quantitative |
| Total night charge | Total charge for nighttime services | quantitative |
| Total intl minutes | Total duration of international calls | quantitative |
| Total intl calls | Total number of international calls | quantitative |
| Total intl charge | Total charge for international calls | quantitative |
| Customer service calls | Number of calls to the service center | quantitative |

The target variable: Churn, a binary churn indicator (1 means the client was lost, i.e. churned). Later we will build models that predict this feature from the others, which is why we call it the target.


Let's look at the size of the data, the names of features and their types.


 print(df.shape) 

 (3333, 20) 

We see that the table has 3333 rows and 20 columns. Let's print the column names:


 print(df.columns) 

 Index(['State', 'Account length', 'Area code', 'International plan',
        'Voice mail plan', 'Number vmail messages', 'Total day minutes',
        'Total day calls', 'Total day charge', 'Total eve minutes',
        'Total eve calls', 'Total eve charge', 'Total night minutes',
        'Total night calls', 'Total night charge', 'Total intl minutes',
        'Total intl calls', 'Total intl charge', 'Customer service calls',
        'Churn'],
       dtype='object')

To view general information on the data frame and all features, use the info method:


 print(df.info()) 

 <class 'pandas.core.frame.DataFrame'>
 RangeIndex: 3333 entries, 0 to 3332
 Data columns (total 20 columns):
 State                     3333 non-null object
 Account length            3333 non-null int64
 Area code                 3333 non-null int64
 International plan        3333 non-null object
 Voice mail plan           3333 non-null object
 Number vmail messages     3333 non-null int64
 Total day minutes         3333 non-null float64
 Total day calls           3333 non-null int64
 Total day charge          3333 non-null float64
 Total eve minutes         3333 non-null float64
 Total eve calls           3333 non-null int64
 Total eve charge          3333 non-null float64
 Total night minutes       3333 non-null float64
 Total night calls         3333 non-null int64
 Total night charge        3333 non-null float64
 Total intl minutes        3333 non-null float64
 Total intl calls          3333 non-null int64
 Total intl charge         3333 non-null float64
 Customer service calls    3333 non-null int64
 Churn                     3333 non-null bool
 dtypes: bool(1), float64(8), int64(8), object(3)
 memory usage: 498.1+ KB
 None

bool, int64, float64, and object are the feature types. We see that 1 feature is boolean (bool), 3 features are of type object, and 16 are numeric. The info method also makes it easy to spot missing values; in our case there are none, each column contains 3333 observations.
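If you prefer the missing-value check as explicit numbers rather than reading it off the info output, a one-line sketch using the standard isnull/sum combination does the job:

 # number of missing values per column (all zeros for this dataset)
 df.isnull().sum()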


You can change a column's type with the astype method. Let's apply this method to the Churn feature and convert it to int64:


 df['Churn'] = df['Churn'].astype('int64') 

The describe method shows the basic statistical characteristics of the data for each numerical feature (types int64 and float64): the number of non-missing values, the mean, the standard deviation, the range, the median, and the 0.25 and 0.75 quartiles.


 df.describe() 



To see statistics for non-numerical features, you have to explicitly list the types of interest in the include parameter.


 df.describe(include=['object', 'bool']) 

| | State | International plan | Voice mail plan |
|---|---|---|---|
| count | 3333 | 3333 | 3333 |
| unique | 51 | 2 | 2 |
| top | WV | No | No |
| freq | 106 | 3010 | 2411 |


For categorical (type object) and boolean (type bool) features, you can use the value_counts method. Let's look at the distribution of our target variable, Churn:


 df['Churn'].value_counts() 

 0    2850
 1     483
 Name: Churn, dtype: int64

2850 out of 3333 users are loyal; their value of the Churn variable is 0.


Let's look at the distribution of users over the Area code variable. Specify normalize=True to see relative frequencies instead of absolute ones.


 df['Area code'].value_counts(normalize=True) 

 415    0.496550
 510    0.252025
 408    0.251425
 Name: Area code, dtype: float64

Sorting


A DataFrame can be sorted by the value of any of its features. In our case, for example, by Total day charge (ascending=False sorts in descending order):


 df.sort_values(by='Total day charge', ascending=False).head() 



You can also sort by a group of columns:


 df.sort_values(by=['Churn', 'Total day charge'], ascending=[True, False]).head() 

Thanks to makkos for the note about the deprecated sort method.




Indexing and data retrieval


A DataFrame can be indexed in a variety of ways. Accordingly, let's consider various ways of indexing and extracting the data we need from a DataFrame, using simple questions as examples.


To select a single column, you can use a construction of the form DataFrame['Name']. Let's use this to answer a question: what is the proportion of disloyal users in our DataFrame?


 df['Churn'].mean()  # 0.14491449144914492

14.5% is a rather alarming figure for a company; with such a churn rate it could go broke.


Logical indexing of a DataFrame by a single column is very convenient. It looks like this: df[P(df['Name'])], where P is some logical condition checked for each element of the Name column. The result of such indexing is a DataFrame consisting only of the rows that satisfy the condition P on the Name column.


Let's use this to answer the question: what are the mean values of the numerical features among the disloyal users?


 df[df['Churn'] == 1].mean() 

 Account length            102.664596
 Number vmail messages       5.115942
 Total day minutes         206.914079
 Total day calls           101.335404
 Total day charge           35.175921
 Total eve minutes         212.410145
 Total eve calls           100.561077
 Total eve charge           18.054969
 Total night minutes       205.231677
 Total night calls         100.399586
 Total night charge          9.235528
 Total intl minutes         10.700000
 Total intl calls            4.163561
 Total intl charge           2.889545
 Customer service calls      2.229814
 Churn                       1.000000
 dtype: float64

Combining the previous two types of indexing, let's answer the question: how much time (on average) do disloyal users spend on the phone during the day?


 df[df['Churn'] == 1]['Total day minutes'].mean()  # 206.91407867494823

What is the maximum length of international calls among loyal users ( Churn == 0 ) who do not use the international roaming service ( 'International plan' == 'No' )?


 df[(df['Churn'] == 0) & (df['International plan'] == 'No')]['Total intl minutes'].max()  # 18.899999999999999

A DataFrame can be indexed by the name of a column or row or by its ordinal number. The loc method is used for indexing by name, iloc for indexing by number.


In the first case we say "give us the values for the rows with index 0 through 5 and for the columns from State to Area code", and in the second, "give us the values of the first five rows in the first three columns".


A handy note: when we pass a slice object to iloc, the DataFrame is sliced as usual, with the end of the slice excluded. In the case of loc, however, both the start and the end of the slice are included (link to the documentation; thanks to arkane0906 for the remark).

 df.loc[0:5, 'State':'Area code'] 

| | State | Account length | Area code |
|---|---|---|---|
| 0 | KS | 128 | 415 |
| 1 | OH | 107 | 415 |
| 2 | NJ | 137 | 415 |
| 3 | OH | 84 | 408 |
| 4 | OK | 75 | 415 |
| 5 | AL | 118 | 510 |


 df.iloc[0:5, 0:3] 

| | State | Account length | Area code |
|---|---|---|---|
| 0 | KS | 128 | 415 |
| 1 | OH | 107 | 415 |
| 2 | NJ | 137 | 415 |
| 3 | OH | 84 | 408 |
| 4 | OK | 75 | 415 |


If we need the first or last row of the DataFrame, we can use the df[:1] or df[-1:] construction:


 df[-1:] 



Apply functions to cells, columns, and rows


Applying a function to each column: apply


 df.apply(np.max) 

 State                        WY
 Account length              243
 Area code                   510
 International plan          Yes
 Voice mail plan             Yes
 Number vmail messages        51
 Total day minutes         350.8
 Total day calls             165
 Total day charge          59.64
 Total eve minutes         363.7
 Total eve calls             170
 Total eve charge          30.91
 Total night minutes         395
 Total night calls           175
 Total night charge        17.77
 Total intl minutes           20
 Total intl calls             20
 Total intl charge           5.4
 Customer service calls        9
 Churn                      True
 dtype: object

The apply method can also be used to apply a function to each row. To do this, specify axis=1.
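For instance (an example of ours, not from the original article), apply with axis=1 can sum the four charge columns for every row; note that the same result is available more directly via sum(axis=1):

 # apply a function to each row by passing axis=1
 charge_cols = ['Total day charge', 'Total eve charge',
                'Total night charge', 'Total intl charge']
 df[charge_cols].apply(lambda row: row.sum(), axis=1).head()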


Applying a function to each cell of a column: map


For example, the map method can be used to replace the values in a column by passing it a dictionary of the form {old_value: new_value} as an argument:


 d = {'No': False, 'Yes': True}
 df['International plan'] = df['International plan'].map(d)
 df.head()



The same operation can be accomplished with the replace method:


 df = df.replace({'Voice mail plan': d})
 df.head()



Grouping data


In general, grouping data in Pandas is as follows:


 df.groupby(by=grouping_columns)[columns_to_show].function() 

  1. The groupby method is applied to the DataFrame, splitting the data by grouping_columns, a feature or a set of features.
  2. The columns we need are selected (columns_to_show).
  3. A function, or several functions, is applied to the resulting groups.

Let's group the data by the Churn feature and display statistics for three columns in each group:


 columns_to_show = ['Total day minutes', 'Total eve minutes', 'Total night minutes']
 df.groupby(['Churn'])[columns_to_show].describe(percentiles=[])



Let's do the same thing a little differently, by passing a list of functions to agg:


 columns_to_show = ['Total day minutes', 'Total eve minutes', 'Total night minutes']
 df.groupby(['Churn'])[columns_to_show].agg([np.mean, np.std, np.min, np.max])



Pivot tables


Suppose we want to see how the observations in our sample are distributed with respect to two features, Churn and International plan. To do this, we can build a contingency table using the crosstab method:


 pd.crosstab(df['Churn'], df['International plan']) 

| Churn \ International plan | No | Yes |
|---|---|---|
| 0 | 2664 | 186 |
| 1 | 346 | 137 |


 pd.crosstab(df['Churn'], df['Voice mail plan'], normalize=True) 

| Churn \ Voice mail plan | No | Yes |
|---|---|---|
| 0 | 0.602460 | 0.252625 |
| 1 | 0.120912 | 0.024002 |


We see that the majority of users are loyal and do not use the additional services (international roaming / voicemail).


Advanced Excel users will surely remember such a feature as pivot tables. In Pandas, the pivot_table method handles pivot tables; it takes the following parameters:

  - values: a list of variables for which to compute statistics,
  - index: a list of variables to group the data by,
  - aggfunc: which statistics to compute for each group, e.g. sum, mean, or maximum.

Let's look at the average number of day, evening, and night calls for each area code:


 df.pivot_table(['Total day calls', 'Total eve calls', 'Total night calls'], ['Area code'], aggfunc='mean').head(10) 

| Area code | Total day calls | Total eve calls | Total night calls |
|---|---|---|---|
| 408 | 100.496420 | 99.788783 | 99.039379 |
| 415 | 100.576435 | 100.503927 | 100.398187 |
| 510 | 100.097619 | 99.671429 | 100.601190 |


Transforming DataFrames


Like many things in Pandas, adding columns to a DataFrame is possible in several ways.


For example, suppose we want to count the total number of calls for all users. Let's create a total_calls object of type Series and insert it into the DataFrame:


 total_calls = df['Total day calls'] + df['Total eve calls'] + \
               df['Total night calls'] + df['Total intl calls']
 # loc is the index of the column where the Series is inserted;
 # we pass len(df.columns) to insert it at the very end
 df.insert(loc=len(df.columns), column='Total calls', value=total_calls)
 df.head()



Adding a column computed from existing ones is possible, and easier, without creating an intermediate Series:


 df['Total charge'] = df['Total day charge'] + df['Total eve charge'] + \
                      df['Total night charge'] + df['Total intl charge']
 df.head()



To delete columns or rows, use the drop method, passing the required indices as an argument along with the appropriate value of the axis parameter (1 to delete columns, 0 or nothing to delete rows):


 # get rid of the columns we just created
 df = df.drop(['Total charge', 'Total calls'], axis=1)
 # and here is how rows can be deleted
 df.drop([1, 2]).head()



4. First attempts to predict churn


Let's see how churn relates to the International plan feature. We will do this with a crosstab contingency table, and also with Seaborn illustrations (how exactly to build such plots and analyze them is the material of the next article).


 pd.crosstab(df['Churn'], df['International plan'], margins=True) 

| Churn \ International plan | False | True | All |
|---|---|---|---|
| 0 | 2664 | 186 | 2850 |
| 1 | 346 | 137 | 483 |
| All | 3010 | 323 | 3333 |
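The article shows a Seaborn bar chart at this point, which is not reproduced here. As a rough sketch of how such a picture could be drawn (countplot is our choice of plot; building and analyzing such plots is covered properly in the next article):

 import seaborn as sns
 import matplotlib.pyplot as plt

 # churn counts, split by whether the international plan is enabled
 sns.countplot(x='International plan', hue='Churn', data=df)
 plt.show()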




We see that the churn rate is much higher when roaming is enabled, an interesting observation! Perhaps large and poorly controlled roaming expenses are a major source of conflict, leading to dissatisfaction with the telecom operator and, consequently, to churn.


Next, let's look at another important feature, the number of calls to the service center (Customer service calls). Let's also build a contingency table and a picture.


 pd.crosstab(df['Churn'], df['Customer service calls'], margins=True) 

| Churn \ Customer service calls | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | All |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 605 | 1059 | 672 | 385 | 90 | 26 | 8 | 4 | 1 | 0 | 2850 |
| 1 | 92 | 122 | 87 | 44 | 76 | 40 | 14 | 5 | 1 | 2 | 483 |
| All | 697 | 1181 | 759 | 429 | 166 | 66 | 22 | 9 | 2 | 2 | 3333 |
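Again, the article's plot is not reproduced here; a sketch along the same lines as the previous one:

 import seaborn as sns
 import matplotlib.pyplot as plt

 # churn counts for each number of service-center calls
 sns.countplot(x='Customer service calls', hue='Churn', data=df)
 plt.show()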




Perhaps this is not so obvious from the summary table (or it is tedious to scan through the rows of numbers), but the plot clearly shows that the churn rate rises sharply starting at 4 calls to the service center.


Now let's add a binary feature to our DataFrame: the result of the comparison Customer service calls > 3. And once again, let's see how it relates to churn.


 df['Many_service_calls'] = (df['Customer service calls'] > 3).astype('int')
 pd.crosstab(df['Many_service_calls'], df['Churn'], margins=True)

| Many_service_calls \ Churn | 0 | 1 | All |
|---|---|---|---|
| 0 | 2721 | 345 | 3066 |
| 1 | 129 | 138 | 267 |
| All | 2850 | 483 | 3333 |




Let's combine the conditions discussed above and build a contingency table for this combination and churn.


 pd.crosstab(df['Many_service_calls'] & df['International plan'], df['Churn'])

| row_0 \ Churn | 0 | 1 |
|---|---|---|
| False | 2841 | 464 |
| True | 9 | 19 |


So, by predicting churn when the number of service-center calls is greater than 3 and roaming is enabled (and predicting loyalty otherwise), we can expect about 85.8% correct answers (we are wrong only 464 + 9 times). These 85.8%, obtained with very simple reasoning, are a good starting point (baseline) for the machine learning models we will build later.
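To double-check the arithmetic, here is a small sketch computing the accuracy of this hand-made rule (it assumes the transformations above: Churn converted to int, International plan mapped to booleans, Many_service_calls added):

 # predict churn when there are many service calls AND roaming is enabled
 pred = (df['Many_service_calls'] & df['International plan']).astype('int')
 # share of correct predictions: (2841 + 19) / 3333 = 0.858...
 (pred == df['Churn']).mean()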


In general, this is roughly what data analysis looked like before the advent of machine learning. Let's recap:



5. Homework number 1


From here on, the course is conducted in English (the articles are on Medium as well). The next launch is October 1, 2018.


As a warm-up/practice exercise, you are invited to analyze demographic data with Pandas. You need to fill in the missing code in the Jupyter template and choose the correct answers in the web form (there you will also find the solution).


6. Overview of useful resources



The article was written in collaboration with yorko (Yuri Kashnitsky).

Source: https://habr.com/ru/post/322626/

