The course consists of:
The current session of the course runs from October 1, 2018, in English (link to the sign-up survey; fill it in in English). Follow the announcements in the VK group and join the OpenDataScience community.
We do not set ourselves the task of developing yet another comprehensive introductory course on machine learning or data analysis (that is, it is not a substitute for the Yandex/MIPT specialization, the HSE continuing-education programs, or other fundamental online and offline programs and books). The purpose of this series of articles is to quickly refresh your knowledge or help you find topics for further study. The approach is roughly the same as that of the authors of the book Deep Learning, which opens with a review of mathematics and the basics of machine learning: brief, very dense, and with an abundance of references to sources.
If you plan to take the course, be warned: when selecting topics and creating materials, we assume that our students know mathematics at the level of a second-year student at a technical university and can at least program in Python. These are not strict entry criteria, just recommendations: you can enroll without knowing mathematics or Python and catch up in parallel:
The course is also described in this announcement.
To complete the course you will need a number of Python packages, most of which come with the Anaconda distribution (Python 3.6). A few other libraries will be needed later; this will be discussed in due course. The full list can be viewed in the Dockerfile.
You can also use a Docker container in which all the necessary software is already installed. Details are on the Wiki page of the repository.
No formal registration is required; you can join the course at any time after the start (October 1, 2018), but the homework deadlines are strict.
But to help us get to know you better:
All code can be reproduced in this Jupyter notebook.
Pandas is a Python library that provides extensive capabilities for data analysis. The data that data scientists work with is often stored in tabular form, for example in .csv, .tsv, or .xlsx formats. The Pandas library makes it very convenient to load, process, and analyze such tabular data using SQL-like queries. In conjunction with the Matplotlib and Seaborn libraries, Pandas provides ample opportunities for visual analysis of tabular data.
The main data structures in Pandas are the Series and DataFrame classes. The first is a one-dimensional indexed array of data of some fixed type. The second is a two-dimensional data structure, a table in which each column contains data of one type. You can think of it as a dictionary of Series objects. The DataFrame structure is great for representing real data: rows correspond to feature descriptions of individual objects, and columns correspond to features.
# import Pandas and NumPy
import pandas as pd
import numpy as np
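As a minimal illustration (this snippet is ours, not from the original notebook), here is how the two structures can be constructed by hand:

# a Series is a one-dimensional indexed array of a fixed type
ages = pd.Series([128, 107, 137], name='Account length')
# a DataFrame can be thought of as a dictionary of Series objects
mini_df = pd.DataFrame({'State': ['KS', 'OH', 'NJ'], 'Account length': ages})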
We will demonstrate the main methods in action by analyzing a dataset on customer churn at a telecom operator (no need to download it, it is in the course repository). Let's read the data (the read_csv method) and look at the first 5 rows using the head method:
df = pd.read_csv('../../data/telecom_churn.csv')
df.head()
In Jupyter notebooks, Pandas DataFrames are rendered as nice-looking tables like the one above, whereas print(df.head()) looks worse.
By default, Pandas displays only 20 columns and 60 rows, so if your DataFrame is bigger, use the set_option function:
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
Each row represents one client, the object of study. The columns are the features of this object.
Name | Description | Type |
---|---|---|
State | Letter code of the state | nominal |
Account length | How long the customer has been served by the company | quantitative |
Area code | Phone number prefix | quantitative |
International plan | International roaming (connected / not connected) | binary |
Voice mail plan | Voicemail (connected / not connected) | binary |
Number vmail messages | Number of voicemail messages | quantitative |
Total day minutes | Total duration of daytime calls | quantitative |
Total day calls | Total number of daytime calls | quantitative |
Total day charge | Total charge for daytime services | quantitative |
Total eve minutes | Total duration of evening calls | quantitative |
Total eve calls | Total number of evening calls | quantitative |
Total eve charge | Total charge for evening services | quantitative |
Total night minutes | Total duration of night calls | quantitative |
Total night calls | Total number of night calls | quantitative |
Total night charge | Total charge for night services | quantitative |
Total intl minutes | Total duration of international calls | quantitative |
Total intl calls | Total number of international calls | quantitative |
Total intl charge | Total charge for international calls | quantitative |
Customer service calls | Number of calls to the service center | quantitative |
Target variable: Churn, a binary churn indicator (1 means the client was lost, i.e. churned). Later we will build models that predict this feature from the others, which is why we call it the target.
Let's look at the size of the data, the names of features and their types.
print(df.shape)
(3333, 20)
We see that the table contains 3333 rows and 20 columns. Let's print the column names:
print(df.columns)
Index(['State', 'Account length', 'Area code', 'International plan', 'Voice mail plan', 'Number vmail messages', 'Total day minutes', 'Total day calls', 'Total day charge', 'Total eve minutes', 'Total eve calls', 'Total eve charge', 'Total night minutes', 'Total night calls', 'Total night charge', 'Total intl minutes', 'Total intl calls', 'Total intl charge', 'Customer service calls', 'Churn'], dtype='object')
To see general information about the DataFrame and all its features, use the info method:
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 20 columns):
State                     3333 non-null object
Account length            3333 non-null int64
Area code                 3333 non-null int64
International plan        3333 non-null object
Voice mail plan           3333 non-null object
Number vmail messages     3333 non-null int64
Total day minutes         3333 non-null float64
Total day calls           3333 non-null int64
Total day charge          3333 non-null float64
Total eve minutes         3333 non-null float64
Total eve calls           3333 non-null int64
Total eve charge          3333 non-null float64
Total night minutes       3333 non-null float64
Total night calls         3333 non-null int64
Total night charge        3333 non-null float64
Total intl minutes        3333 non-null float64
Total intl calls          3333 non-null int64
Total intl charge         3333 non-null float64
Customer service calls    3333 non-null int64
Churn                     3333 non-null bool
dtypes: bool(1), float64(8), int64(8), object(3)
memory usage: 498.1+ KB
None
bool, int64, float64, and object are the feature types. We see that 1 feature is boolean (bool), 3 features are of type object, and 16 are numeric. The info method is also handy for a quick check for missing values; there are none in our case, since every column contains 3333 observations.
You can change a column's type with the astype method. Let's apply it to the Churn feature and convert it to int64:
df['Churn'] = df['Churn'].astype('int64')
The describe method shows the main statistical characteristics of the data for each numerical feature (types int64 and float64): the number of non-missing values, the mean, the standard deviation, the range, the median, and the 0.25 and 0.75 quartiles.
df.describe()
To see statistics on non-numerical features, explicitly specify the types of interest in the include parameter:
df.describe(include=['object', 'bool'])
State | International plan | Voice mail plan | |
---|---|---|---|
count | 3333 | 3333 | 3333 |
unique | 51 | 2 | 2 |
top | WV | No | No |
freq | 106 | 3010 | 2411 |
For categorical (type object) and boolean (type bool) features, you can use the value_counts method. Let's look at the distribution of our target variable, Churn:
df['Churn'].value_counts()
0    2850
1     483
Name: Churn, dtype: int64
2850 of the 3333 users are loyal: their value of the Churn variable is 0.
Let's look at the distribution of users over the Area code variable. Specify normalize=True to see relative frequencies instead of absolute ones:
df['Area code'].value_counts(normalize=True)
415    0.496550
510    0.252025
408    0.251425
Name: Area code, dtype: float64
A DataFrame can be sorted by the value of any of the features. In our case, for example, by Total day charge (ascending=False sorts in descending order):
df.sort_values(by='Total day charge', ascending=False).head()
You can also sort by a group of columns:
df.sort_values(by=['Churn', 'Total day charge'], ascending=[True, False]).head()
Thanks to makkos for the remark about the deprecated sort method.
A DataFrame can be indexed in several ways. Below we look at different ways of indexing and extracting the data we need from a DataFrame, using simple questions as examples.
To retrieve a single column, you can use a construction of the form DataFrame['Name']. Let's use this to answer a question: what proportion of users in our DataFrame are disloyal?
df['Churn'].mean()  # 0.14491449144914492
14.5% is quite a bad indicator for a company; with such a churn rate the business could go bankrupt.
Boolean indexing of a DataFrame by a single column is very convenient. It looks like this: df[P(df['Name'])], where P is some logical condition checked for each element of the Name column. The result of such indexing is a DataFrame consisting only of the rows that satisfy the condition P on the Name column.
Let's use this to answer the question: what are the mean values of the numerical features among the disloyal users?
df[df['Churn'] == 1].mean()
Account length            102.664596
Number vmail messages       5.115942
Total day minutes         206.914079
Total day calls           101.335404
Total day charge           35.175921
Total eve minutes         212.410145
Total eve calls           100.561077
Total eve charge           18.054969
Total night minutes       205.231677
Total night calls         100.399586
Total night charge          9.235528
Total intl minutes         10.700000
Total intl calls            4.163561
Total intl charge           2.889545
Customer service calls      2.229814
Churn                       1.000000
dtype: float64
Combining the previous two types of indexing, let's answer the question: how long, on average, do disloyal users talk on the phone during the day?
df[df['Churn'] == 1]['Total day minutes'].mean()  # 206.91407867494823
What is the maximum duration of international calls among loyal users (Churn == 0) who do not use the international roaming service ('International plan' == 'No')?
df[(df['Churn'] == 0) & (df['International plan'] == 'No')]['Total intl minutes'].max()  # 18.899999999999999
DataFrames can be indexed by column or row name, or by ordinal position. The loc method is used for indexing by name, iloc for indexing by number.
In the first case we are saying, "give us the values of the rows with indexes from 0 to 5 and of the columns from State to Area code", and in the second, "give us the values of the first five rows in the first three columns".
A note worth remembering: when we pass a slice object to iloc, the DataFrame is sliced as usual, with the end excluded. With loc, however, both the start and the end of the slice are included (link to the documentation; thanks to arkane0906 for the remark).
df.loc[0:5, 'State':'Area code']
State | Account length | Area code | |
---|---|---|---|
0 | KS | 128 | 415
1 | OH | 107 | 415
2 | NJ | 137 | 415
3 | OH | 84 | 408
4 | OK | 75 | 415
5 | AL | 118 | 510
df.iloc[0:5, 0:3]
State | Account length | Area code | |
---|---|---|---|
0 | KS | 128 | 415
1 | OH | 107 | 415
2 | NJ | 137 | 415
3 | OH | 84 | 408
4 | OK | 75 | 415
If we need the first or the last row of the DataFrame, we can use the df[:1] or df[-1:] construction:
df[-1:]
Applying a function to each column: apply
df.apply(np.max)
State                       WY
Account length             243
Area code                  510
International plan         Yes
Voice mail plan            Yes
Number vmail messages       51
Total day minutes        350.8
Total day calls            165
Total day charge         59.64
Total eve minutes        363.7
Total eve calls            170
Total eve charge         30.91
Total night minutes        395
Total night calls          175
Total night charge       17.77
Total intl minutes          20
Total intl calls            20
Total intl charge          5.4
Customer service calls       9
Churn                     True
dtype: object
The apply method can also be used to apply a function to each row; to do this, specify axis=1.
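For instance, a minimal sketch (the particular lambda and columns are our choice, not from the original article) that sums the daytime and evening call counts for every row:

# apply the function to each row (axis=1) rather than to each column
df.apply(lambda row: row['Total day calls'] + row['Total eve calls'], axis=1).head()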
Applying a function to each cell of a column: map
For example, the map method can be used to replace the values in a column by passing it a dictionary of the form {old_value: new_value} as an argument:
d = {'No': False, 'Yes': True}
df['International plan'] = df['International plan'].map(d)
df.head()
The same operation can be carried out with the replace method:
df = df.replace({'Voice mail plan': d})
df.head()
In general, grouping data in Pandas works as follows:
df.groupby(by=grouping_columns)[columns_to_show].function()
1. First, the groupby method divides the data according to grouping_columns, a feature or a list of features.
2. Then the columns we want to see are selected (columns_to_show).
3. Finally, one or several functions are applied to the resulting groups.

Let's group the data by the value of the Churn feature and display statistics on three columns in each group:
columns_to_show = ['Total day minutes', 'Total eve minutes', 'Total night minutes']
df.groupby(['Churn'])[columns_to_show].describe(percentiles=[])
Let's do the same, but slightly differently, by passing a list of functions to agg:
columns_to_show = ['Total day minutes', 'Total eve minutes', 'Total night minutes']
df.groupby(['Churn'])[columns_to_show].agg([np.mean, np.std, np.min, np.max])
Suppose we want to see how the observations in our sample are distributed with respect to two features, Churn and International plan. To do this, we can build a contingency table using the crosstab method:
pd.crosstab(df['Churn'], df['International plan'])
International plan | No | Yes |
---|---|---|
Churn | ||
0 | 2664 | 186 |
1 | 346 | 137
pd.crosstab(df['Churn'], df['Voice mail plan'], normalize=True)
Voice mail plan | No | Yes |
---|---|---|
Churn | ||
0 | 0.602460 | 0.252625 |
1 | 0.120912 | 0.024002
We see that most of the users are loyal and do not use the additional services (international roaming / voicemail).
Advanced Excel users will surely remember such a feature as pivot tables. In Pandas, pivot tables are handled by the pivot_table method, which takes the following parameters:

- values: a list of variables for which we want to calculate statistics,
- index: a list of variables to group the data by,
- aggfunc: what we actually need to compute for each group: the sum, mean, maximum, minimum, or something else.

Let's look at the average number of day, evening, and night calls for each Area code:
df.pivot_table(['Total day calls', 'Total eve calls', 'Total night calls'],
               ['Area code'], aggfunc='mean').head(10)
Total day calls | Total eve calls | Total night calls | |
---|---|---|---|
Area code | |||
408 | 100.496420 | 99.788783 | 99.039379 |
415 | 100.576435 | 100.503927 | 100.398187 |
510 | 100.097619 | 99.671429 | 100.601190 |
Like many things in Pandas, adding columns to a DataFrame is possible in several ways.
For example, suppose we want to count the total number of calls for all users. Let's create a Series object total_calls and insert it into the DataFrame:
total_calls = df['Total day calls'] + df['Total eve calls'] + \
              df['Total night calls'] + df['Total intl calls']
# loc is the position after which we insert the Series;
# len(df.columns) puts the new column at the very end
df.insert(loc=len(df.columns), column='Total calls', value=total_calls)
df.head()
It is possible, and easier, to add a column computed from existing ones without creating an intermediate Series:
df['Total charge'] = df['Total day charge'] + df['Total eve charge'] + \
                     df['Total night charge'] + df['Total intl charge']
df.head()
To delete columns or rows, use the drop method, passing the required indexes as an argument along with the axis parameter (1 to delete columns, nothing or 0 to delete rows):
# get rid of the columns we have just created
df = df.drop(['Total charge', 'Total calls'], axis=1)
# and this is how rows can be deleted
df.drop([1, 2]).head()
Let's see how churn is related to the International plan feature (whether international roaming is connected). We will do this with a crosstab contingency table and also with Seaborn plots (how exactly to build such pictures and analyze them is the subject of the next article).
pd.crosstab(df['Churn'], df['International plan'], margins=True)
International plan | False | True | All |
---|---|---|---|
Churn | |||
0 | 2664 | 186 | 2850 |
1 | 346 | 137 | 483
All | 3010 | 323 | 3333 |
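The original post shows a Seaborn plot at this point; here is a minimal sketch of how such a picture could be drawn (the exact plotting code is our assumption, and proper visualization is covered in the next article):

import matplotlib.pyplot as plt
import seaborn as sns

# counts of loyal vs. churned clients for each value of International plan
sns.countplot(x='International plan', hue='Churn', data=df)
plt.show()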
We see that the churn rate is much higher when roaming is connected, an interesting observation! It may be that large, poorly controlled roaming expenses are a frequent source of conflict, leading to dissatisfaction with the telecom operator and, accordingly, to churn.
Next, let's look at another important feature, Customer service calls (the number of calls to the service center). Let's also build a contingency table and a picture.
pd.crosstab(df['Churn'], df['Customer service calls'], margins=True)
Customer service calls | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | All |
---|---|---|---|---|---|---|---|---|---|---|---|
Churn |||||||||||
0 | 605 | 1059 | 672 | 385 | 90 | 26 | 8 | 4 | 1 | 0 | 2850
1 | 92 | 122 | 87 | 44 | 76 | 40 | 14 | 5 | 1 | 2 | 483
All | 697 | 1181 | 759 | 429 | 166 | 66 | 22 | 9 | 2 | 2 | 3333
Perhaps it is not so clearly visible in the table (or it is tedious to crawl through the rows of numbers), but the picture shows eloquently that the churn rate increases sharply starting at 4 calls to the service center.
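To see the same effect in numbers rather than in a picture, one can normalize the crosstab by column (this check is ours, not from the original notebook):

# share of churned clients for each number of service calls
pd.crosstab(df['Churn'], df['Customer service calls'], normalize='columns')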
Now let's add a binary feature to our DataFrame: the result of the comparison Customer service calls > 3. And once again, let's see how it is related to churn.
df['Many_service_calls'] = (df['Customer service calls'] > 3).astype('int')
pd.crosstab(df['Many_service_calls'], df['Churn'], margins=True)
Churn | 0 | 1 | All |
---|---|---|---|
Many_service_calls | |||
0 | 2721 | 345 | 3066 |
1 | 129 | 138 | 267
All | 2850 | 483 | 3333 |
Let's combine the conditions discussed above and build a contingency table for this combination and churn:
pd.crosstab(df['Many_service_calls'] & df['International plan'], df['Churn'])
Churn | 0 | 1 |
---|---|---|
row_0 | ||
False | 2841 | 464 |
True | 9 | 19
So, by predicting churn whenever the number of calls to the service center is greater than 3 and roaming is connected (and predicting loyalty otherwise), we can expect about 85.8% accuracy (we are wrong only 464 + 9 times out of 3333). This 85.8%, obtained with very simple reasoning, is a good starting point (baseline) for the machine learning models we will build later.
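As a quick sanity check of that figure (this snippet is ours, not from the original notebook):

# predict churn when there are many service calls AND roaming is connected
pred = (df['Many_service_calls'] & df['International plan']).astype('int')
print((pred == df['Churn']).mean())  # ~0.858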
By and large, this is what data analysis looked like before the advent of machine learning. Let's summarize:
From now on, the course will be conducted in English (the articles are available on Medium as well). The next session launches on October 1, 2018.
As a warm-up / practice, you are invited to analyze demographic data with Pandas. You need to fill in the missing code in a Jupyter notebook template and choose the correct answers in a web form (where you will also find the solution).
The article was written in collaboration with yorko (Yuri Kashnitsky).
Source: https://habr.com/ru/post/322626/