📜 ⬆️ ⬇️

Open machine learning course. Topic 2: Data Visualization with Python


The second lesson is devoted to data visualization in Python. First, we will look at the basic methods of the Seaborn and Plotly libraries, then we will analyze the data set familiar to us from the first article on the outflow of customers of the telecom operator and look into the n-dimensional space using the t-SNE algorithm. There is also a video of a lecture based on this article as part of the second launch of the open course (September-November 2017).


UPD: now the course is in English under the brand mlcourse.ai with articles on Medium, and materials on Kaggle ( Dataset ) and on GitHub .


Now the article will be much longer. Ready? Go!



Plan for this article



Demonstration of the main methods of Seaborn and Plotly


At the beginning, as always, we set up the environment: we import all the necessary libraries and slightly adjust the default display of images.


#   Anaconda import warnings warnings.simplefilter('ignore') #      jupyter'e %matplotlib inline import seaborn as sns import matplotlib.pyplot as plt #  svg    %config InlineBackend.figure_format = 'svg' #    from pylab import rcParams rcParams['figure.figsize'] = 8, 5 import pandas as pd 

After that we will load data with which we will work into DataFrame . For the examples, I chose the sales and rating data for video games from Kaggle Datasets .


 df = pd.read_csv('../../data/video_games_sales.csv') df.info() 

 <class 'pandas.core.frame.DataFrame'> RangeIndex: 16719 entries, 0 to 16718 Data columns (total 16 columns): Name 16717 non-null object Platform 16719 non-null object Year_of_Release 16450 non-null float64 Genre 16717 non-null object Publisher 16665 non-null object NA_Sales 16719 non-null float64 EU_Sales 16719 non-null float64 JP_Sales 16719 non-null float64 Other_Sales 16719 non-null float64 Global_Sales 16719 non-null float64 Critic_Score 8137 non-null float64 Critic_Count 8137 non-null float64 User_Score 10015 non-null object User_Count 7590 non-null float64 Developer 10096 non-null object Rating 9950 non-null object dtypes: float64(9), object(7) memory usage: 2.0+ MB 

Some features that pandas counted as object will be explicitly cast to float or int types.


 df['User_Score'] = df.User_Score.astype('float64') df['Year_of_Release'] = df.Year_of_Release.astype('int64') df['User_Count'] = df.User_Count.astype('int64') df['Critic_Count'] = df.Critic_Count.astype('int64') 

Data is not available for all games, so let's leave only those records that do not have gaps using the dropna method.


 df = df.dropna() print(df.shape) 

 (6825, 16) 

In total, there are 6825 objects in the table and 16 features for them. Let's look at the first few entries using the head method to make sure that everything is correct. For convenience, I left only those signs that we will continue to use.


 useful_cols = ['Name', 'Platform', 'Year_of_Release', 'Genre', 'Global_Sales', 'Critic_Score', 'Critic_Count', 'User_Score', 'User_Count', 'Rating' ] df[useful_cols].head() 

img


Before we get into the methods of the seaborn and plotly , let's discuss the simplest and often convenient way to visualize data from the pandas DataFrame - use the plot function.
For example, let's build a graph of sales of video games in different countries depending on the year. To begin with, we filter only the columns we need, then calculate the total sales by year and call the plot function with no parameters for the resulting DataFrame .


 sales_df = df[[x for x in df.columns if 'Sales' in x] + ['Year_of_Release']] sales_df.groupby('Year_of_Release').sum().plot() 

The implementation of the plot function in pandas based on the matplotlib library.


img


Using the kind parameter, you can change the type of chart, for example, to bar chart . Matplotlib allows you to customize graphics very flexibly. On the graph, you can change almost anything, but you will need to rummage through the documentation and find the necessary parameters. For example, the parameter rot is responsible for the slope of the labels to the x axis.


 sales_df.groupby('Year_of_Release').sum().plot(kind='bar', rot=45) 

img


Seaborn


Now let's go to the seaborn library. Seaborn is essentially a higher level API based on the matplotlib library. Seaborn contains more adequate default settings for graphic design. Also in the library there are quite complex types of visualization, which in matplotlib would require a large amount of code.


Let's get acquainted with the first such "complex" type of graphs of the pair plot ( scatter plot matrix ). This visualization will help us to look at one picture, as different signs are interconnected.


 cols = ['Global_Sales', 'Critic_Score', 'Critic_Count', 'User_Score', 'User_Count'] sns_plot = sns.pairplot(df[cols]) sns_plot.savefig('pairplot.png') 

As you can see, histograms of the sign distributions are located on the diagonal of the matrix of graphs. The rest of the graphs are the usual scatter plots for the corresponding pairs of signs.


To save graphs to files, use the savefig method.


img


With the help of seaborn you can also build dist plot distribution. For example, let's look at the distribution of Critic_Score critics' Critic_Score . By default, a histogram and kernel density estimation are displayed on the graph.


 sns.distplot(df.Critic_Score) 

img


In order to take a closer look at the relationship of two numerical features, there is also a joint plot - this is a hybrid of the scatter plot and histogram . Let's look at how the critique of Critic_Score and Critic_Score are related to each other.


img


Another useful type of graph is box plot . Let's compare the ratings of games from critics for the top 5 largest gaming platforms.


 top_platforms = df.Platform.value_counts().sort_values(ascending = False).head(5).index.values sns.boxplot(y="Platform", x="Critic_Score", data=df[df.Platform.isin(top_platforms)], orient="h") 

img


I think it is worth discussing a little more in detail how to understand box plot . Box plot consists of a box (which is why it is called a box plot ), antenna bars and points. The box shows the interquartile range of distribution, that is, respectively 25% ( Q1 ) and 75% ( Q3 ) percentile. The dash inside the box indicates the median of the distribution.
With the box figured out, let's go to the mustache. The whiskers reflect the entire range of points except for outliers, that is, the minimum and maximum values ​​that fall in the gap (Q1 - 1.5*IQR, Q3 + 1.5*IQR) , where IQR = Q3 - Q1 is the interquartile range. The dots on the graph indicate outliers (those values ​​that do not fit into the range of values ​​given by the graph mustache).


It’s better to see once to understand, so here’s a picture from Wikipedia :
img


And one more type of graphs (the last of those we consider in this article) is the heat map . Heat map allows you to look at the distribution of a numerical trait by two categorical. We visualize the total sales of games by genre and gaming platforms.


 platform_genre_sales = df.pivot_table( index='Platform', columns='Genre', values='Global_Sales', aggfunc=sum).fillna(0).applymap(float) sns.heatmap(platform_genre_sales, annot=True, fmt=".1f", linewidths=.5) 

img


Plotly


We looked at visualizations based on the matplotlib library. However, this is not the only option for python graphing. We will also plotly acquainted with the plotly library. Plotly is an open-source library that allows you to build interactive graphics in jupyter.notebook'e without having to dig into javascript code.


The beauty of interactive graphs is that you can see the exact numerical value when you hover the mouse, hide uninteresting rows in the visualization, zoom in on a certain part of the graph, etc.


Before we begin, we import all the necessary modules and initialize plotly using the init_notebook_mode .


 from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot import plotly import plotly.graph_objs as go init_notebook_mode(connected=True) 

To begin with, we will build a line plot with the dynamics of the number of games released and their sales by year.


 #          years_df = df.groupby('Year_of_Release')[['Global_Sales']].sum().join( df.groupby('Year_of_Release')[['Name']].count() ) years_df.columns = ['Global_Sales', 'Number_of_Games'] #       trace0 = go.Scatter( x=years_df.index, y=years_df.Global_Sales, name='Global Sales' ) #       trace1 = go.Scatter( x=years_df.index, y=years_df.Number_of_Games, name='Number of games released' ) #      title   layout data = [trace0, trace1] layout = {'title': 'Statistics of video games'} # c  Figure    fig = go.Figure(data=data, layout=layout) iplot(fig, show_link=False) 

In plotly visualization of the object Figure , which consists of data (an array of lines, which are called traces in the library) and design / style, is responsible for the layout object. In simple cases, you can call the iplot function and just from the traces array.


The show_link parameter show_link responsible for links to the plot.ly online platform on the graphs. Since usually this functionality is not needed, I prefer to hide it to prevent accidental clicks.


img


You can immediately save the graph as an html file.


 plotly.offline.plot(fig, filename='years_stats.html', show_link=False) 

We will also look at the market share of gaming platforms, calculated by the number of games released and by total revenue. To do this, build a bar chart .


 #         platforms_df = df.groupby('Platform')[['Global_Sales']].sum().join( df.groupby('Platform')[['Name']].count() ) platforms_df.columns = ['Global_Sales', 'Number_of_Games'] platforms_df.sort_values('Global_Sales', ascending=False, inplace=True) #  traces   trace0 = go.Bar( x=platforms_df.index, y=platforms_df.Global_Sales, name='Global Sales' ) trace1 = go.Bar( x=platforms_df.index, y=platforms_df.Number_of_Games, name='Number of games released' ) #       title     x  layout data = [trace0, trace1] layout = {'title': 'Share of platforms', 'xaxis': {'title': 'platform'}} #   Figure    fig = go.Figure(data=data, layout=layout) iplot(fig, show_link=False) 

img


In plotly you can build a box plot . Consider the distribution of critics' ratings depending on the game genre.


 #  Box trace       data = [] for genre in df.Genre.unique(): data.append( go.Box(y=df[df.Genre==genre].Critic_Score, name=genre) ) #   iplot(data, show_link = False) 

img


With the help of plotly you can build other types of visualizations. The graphics are pretty nice with default settings. However, the library allows you to flexibly customize various rendering parameters: colors, fonts, captions, annotations, and more.


An example of visual data analysis


We read in DataFrame data on the outflow of customers of the telecom operator familiar to us from the first article .


 df = pd.read_csv('../../data/telecom_churn.csv') 

Check whether everything was considered normal - let's look at the first 5 lines ( head method).


 df.head() 



Number of rows (clients) and columns (features):


 df.shape 

 (3333, 20) 

Let's look at the signs and see that there are no gaps in any of them - there are 3333 entries everywhere.


 df.info() 

 <class 'pandas.core.frame.DataFrame'> RangeIndex: 3333 entries, 0 to 3332 Data columns (total 20 columns): State 3333 non-null object Account length 3333 non-null int64 Area code 3333 non-null int64 International plan 3333 non-null object Voice mail plan 3333 non-null object Number vmail messages 3333 non-null int64 Total day minutes 3333 non-null float64 Total day calls 3333 non-null int64 Total day charge 3333 non-null float64 Total eve minutes 3333 non-null float64 Total eve calls 3333 non-null int64 Total eve charge 3333 non-null float64 Total night minutes 3333 non-null float64 Total night calls 3333 non-null int64 Total night charge 3333 non-null float64 Total intl minutes 3333 non-null float64 Total intl calls 3333 non-null int64 Total intl charge 3333 non-null float64 Customer service calls 3333 non-null int64 Churn 3333 non-null bool dtypes: bool(1), float64(8), int64(8), object(3) memory usage: 498.1+ KB 

Feature description
TitleDescriptionType of
StateState letter codecategorical
Account lengthHow long does a customer have been served by the company?quantitative
Area codePhone number prefixquantitative
International planInternational roaming (connected / not connected)binary
Voice mail planVoicemail (connected / not connected)binary
Number vmail messagesNumber of voice messagesquantitative
Total day minutesTotal duration of conversations in the afternoonquantitative
Total day callsTotal number of calls during the dayquantitative
Total day chargeTotal amount of payment for services in the afternoonquantitative
Total eve minutesThe total duration of conversations in the eveningquantitative
Total eve callsTotal number of calls in the eveningquantitative
Total eve chargeThe total amount of payment for services in the eveningquantitative
Total night minutesThe total duration of conversations at nightquantitative
Total night callsTotal number of calls at nightquantitative
Total night chargeTotal payment for services at nightquantitative
Total intl minutesTotal duration of international callsquantitative
Total intl callsTotal international callsquantitative
Total intl chargeTotal amount of payment for international callsquantitative
Customer service callsThe number of calls to the service centerquantitative

Target variable: Churn - Outflow symptom, binary (1 - client loss, i.e. outflow). Then we will build models that predict this sign for the rest, so we called it target.


Let's look at the distribution of the target class - customer churn.


 df['Churn'].value_counts() 

 False 2850 True 483 Name: Churn, dtype: int64 

 df['Churn'].value_counts().plot(kind='bar', label='Churn') plt.legend() plt.title('  '); 



Select the following groups of signs (among all but Churn ):



Let's look at the correlation of quantitative traits. From the colored correlation matrix, one can see that signs such as Total day charge are counted according to the minutes spoken ( Total day minutes ). That is, 4 signs can be thrown out, they do not carry useful information.


 corr_matrix = df.drop(['State', 'International plan', 'Voice mail plan', 'Area code'], axis=1).corr() 

 sns.heatmap(corr_matrix); 


')

Now let's look at the distribution of all quantitative traits of interest to us. We will look at binary / categorical / ordinal signs separately.


 features = list(set(df.columns) - set(['State', 'International plan', 'Voice mail plan', 'Area code', 'Total day charge', 'Total eve charge', 'Total night charge', 'Total intl charge', 'Churn'])) df[features].hist(figsize=(20,12)); 



We see that most of the signs are distributed normally. Exceptions are the number of calls to the service center ( Customer service calls ) (the Poisson distribution is more appropriate here) and the number of voice messages ( Number vmail messages , the peak at zero, that is, those whose voicemail is not connected). The distribution of the number of international calls ( Total intl calls ) is also shifted.


It is also useful to build such pictures, where the distribution of signs are drawn on the main diagonal, and outside the main diagonal there are scatter diagrams for pairs of signs. It happens that this leads to some conclusions, but in this case, everything is approximately clear, without surprises.


 sns.pairplot(df[features + ['Churn']], hue='Churn'); 



Next, we will look at how the signs are related to the target - to the outflow.


Construct boxplot-s, describing the statistics of the distribution of quantitative traits in two groups: among loyal and departed customers.


 fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(16, 10)) for idx, feat in enumerate(features): sns.boxplot(x='Churn', y=feat, data=df, ax=axes[idx / 4, idx % 4]) axes[idx / 4, idx % 4].legend() axes[idx / 4, idx % 4].set_xlabel('Churn') axes[idx / 4, idx % 4].set_ylabel(feat); 



By eye, we see the biggest difference for the signs of Total day minutes , Customer service calls and Number vmail messages . Subsequently, we will learn to determine the importance of signs in a classification task using random forest (or gradient boosting), and it turns out that the first two are really very important signs for predicting outflow.


Let's look separately at the pictures with the distribution of the number of minutes spoken in the afternoon among loyal / departed. On the left - familiar to us boxboxes, on the right - smoothed histograms of the distribution of a numerical attribute in two groups (rather, rather, a beautiful picture, everything is clear from boxsplot).


An interesting observation: on average, retired customers use more communication. Perhaps they are dissatisfied with tariffs, and one of the measures to combat the outflow will be a reduction in tariff rates (the cost of mobile communication). But this is the company will need to conduct additional economic analysis, whether such measures will be justified.


 _, axes = plt.subplots(1, 2, sharey=True, figsize=(16,6)) sns.boxplot(x='Churn', y='Total day minutes', data=df, ax=axes[0]); sns.violinplot(x='Churn', y='Total day minutes', data=df, ax=axes[1]); 



Now we will depict the distribution of the number of calls to the service center (we built this picture in the first article). There are not many unique values ​​of the feature (the feature can be considered both quantitative integer and ordinal), and more graphically depict the distribution using countplot . Observation: the share of churn increases greatly from 4 calls to the service center.


 sns.countplot(x='Customer service calls', hue='Churn', data=df); 



Now let's look at the connection between the binary features of the International plan and the Voice mail plan with the outflow. Observation : when roaming is connected, the outflow rate is much higher, i.e. International roaming is a strong sign. About voice mail this can not be said.


 _, axes = plt.subplots(1, 2, sharey=True, figsize=(16,6)) sns.countplot(x='International plan', hue='Churn', data=df, ax=axes[0]); sns.countplot(x='Voice mail plan', hue='Churn', data=df, ax=axes[1]); 



Finally, let’s see how the categorical attribute State is associated with outflow. It is not so pleasant to work with him, since the number of unique states is quite large - 51. You can build a summary tablet at the beginning or calculate the percentage of outflow for each state. But the data for each state is not enough (there are only 3 to 17 customers left in each state), so perhaps the State sign should not be added to the classification models because of the risk of retraining (but we will check for cross-validation , stay tuned!).


Outflow shares for each state:


 df.groupby(['State'])['Churn'].agg([np.mean]).sort_values(by='mean', ascending=False).T 




It can be seen that in New Jersey and California, the share of outflow is above 25%, while in Hawaii and Alaska it is less than 5%. But these conclusions are based on too modest statistics and it is possible that these are just the features of the available data (here you can also test hypotheses about the correlation of Matthews and Cramer, but this is beyond the scope of this article).


Peeping into n-dimensional space with t-SNE


Construct a t-SNE representation of the same outflow data. The name of the method is complex - t-distributed Stohastic Neighbor Embedding, mathematics is also cool (and we will not go into it, but for those who want it - this is the original article by D. Hinton and his graduate student in JMLR), but the basic idea is simple, like a door: we will find such a display from the multidimensional feature space to the plane (or in 3D, but almost always 2D is chosen), so that points that are far from each other on the plane also appear remote, and close points also map to close ones. That is, neighbor embedding is a kind of search for a new view of data, in which the neighborhood is preserved.


A few details: let's throw out the states and the sign of churn, binary Yes / No-signs translate into numbers ( pd.factorize ). You also need to scale the sample - subtract its average from each attribute and divide it by the standard deviation, StandardScaler does.


 from sklearn.manifold import TSNE from sklearn.preprocessing import StandardScaler 

 #     ,   X = df.drop(['Churn', 'State'], axis=1) X['International plan'] = pd.factorize(X['International plan'])[0] X['Voice mail plan'] = pd.factorize(X['Voice mail plan'])[0] scaler = StandardScaler() X_scaled = scaler.fit_transform(X) 

 %%time tsne = TSNE(random_state=17) tsne_representation = tsne.fit_transform(X_scaled) 

 CPU times: user 20 s, sys: 2.41 s, total: 22.4 s Wall time: 21.9 s 

 plt.scatter(tsne_representation[:, 0], tsne_representation[:, 1]); 



Let us color the resulting t-SNE outflow data representation (blue - loyal, orange - gone customers).


 plt.scatter(tsne_representation[:, 0], tsne_representation[:, 1], c=df['Churn'].map({0: 'blue', 1: 'orange'})); 



We see that the departed customers are mostly "handful" in some areas of the attribute space.


To better understand the picture, you can also paint it on the other binary features - roaming and voice mail. The blue areas correspond to objects with this binary sign.


 _, axes = plt.subplots(1, 2, sharey=True, figsize=(16,6)) axes[0].scatter(tsne_representation[:, 0], tsne_representation[:, 1], c=df['International plan'].map({'Yes': 'blue', 'No': 'orange'})); axes[1].scatter(tsne_representation[:, 0], tsne_representation[:, 1], c=df['Voice mail plan'].map({'Yes': 'blue', 'No': 'orange'})); axes[0].set_title('International plan'); axes[1].set_title('Voice mail plan'); 



Now it is clear that, for example, a lot of departed clients are grouped in the left cluster of people with roaming turned on, but without voice mail.


Finally, we note the disadvantages of t-SNE (yes, it is also better to write a separate article on it):



And a couple of pictures. With the help of t-SNE, you can really get a good idea of ​​the data (as in the case of handwritten numbers, here's a good article), or you can just draw a Christmas tree toy.




Homework number 2


Actual homework is announced during the regular session of the course, you can follow in the VC group and in the repository of the course.


In order to consolidate the material, we propose to perform this task - to conduct a visual analysis of data on publications in Habrahabr. You can check yourself by sending the answers in a web form (there you will also find a solution).


Overview of useful resources



yorko ( ).

Source: https://habr.com/ru/post/323210/


All Articles