A little about the movie or how to do interactive visualization in python

Introduction

In this post I want to talk about how you can quite easily build interactive charts in Jupyter Notebook'e using the plotly library. Moreover, to build them you do not need to raise your server and write code in javascript. Another big plus of the proposed approach is that visualizations will work in NBViewer , i.e. It will be easy to share your results with colleagues. Here, for example, my code for this note.

For examples, I took the data about the films downloaded in April (year of release, ratings on KinoPoisk and IMDb, genres, etc.). I downloaded the data for all films that had at least 100 ratings - only 36417 films. About how to download and parse data KinoPoisk, I told in a previous post .

Python and plotly visualization

In python, there are many libraries for visualization: matplotlib , seaborn , ported from R ggplot and others (you can read more about the tools here or here ). Among them are those that allow you to build interactive graphics, for example, bokeh , pygal and plotly , which is actually discussed.

Plotly positioned as an online platform where you can create and publish your graphics. However, this library can also be used in Jupyter Notebook'e . In addition, the library has offline-mode, which allows it to be used without registering and publishing data and graphs on the plotly server ( documentation ).

In general, I liked the library very much: there are detailed documentation with examples , different types of graphs are supported (scatter plots, box plots, 3D graphics, bar charts, heatmaps, dendrograms, etc.) and the graphics are quite nice.

Examples

Now it's time to go directly to the examples. As I said above, all code and interactive graphics are available in NBViewer .

The library is easy to install using the command: pip install plotly .

First of all, it is necessary to make imports, call the init_notebook_mode command to initialize plot.ly and load the data with which we will work into pandas.DataFrame .

 from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot import plotly.graph_objs as go init_notebook_mode(connected=True) df = pd.read_csv('kp_all_movies.csv') #   df.head()

How many films came out in different years?

To begin with, we will build a simple bar chart showing the distribution of films by release year.

 count_year_df = df.groupby('movie_year', as_index = False).movie_id.count() trace = go.Bar( x = count_year_df.movie_year, y = count_year_df.movie_id ) layout = go.Layout( title='  ', ) fig = go.Figure(data = [trace], layout = layout) iplot(fig)

As a result, we get an interactive graph that shows the value when you hover for a year and the rather expected conclusion that over the years, films have become more.

Have you started making better movies over the years?

To answer this question, we construct a graph of the average rating for KinoPoisk and IMDb depending on the year of production.

 rating_year_df = df.groupby('movie_year', as_index = False)[['kp_rating', 'imdb_rating']].mean() trace_kp = go.Scatter( x = rating_year_df.movie_year, y = rating_year_df.kp_rating, mode = 'lines', name = u'' ) trace_imdb = go.Scatter( x = rating_year_df.movie_year, y = rating_year_df.imdb_rating, mode = 'lines', name = 'IMDb' ) layout = go.Layout( title=' ', ) fig = go.Figure(data = [trace_kp, trace_imdb], layout = layout) iplot(fig)

The Kinopodisk and IMDb estimates show a trend towards a decrease in the average rating depending on the year of production. But, in fact, it is impossible to draw an unequivocal conclusion from this that earlier better films were shot. The fact is that if people watch old films and evaluate them at KinoPoisk, they choose cult cinema with obviously higher ratings (I think few people watch passing films released in 1940, at least I don’t watch).

Are there any differences in ratings depending on the genre of the film?

To compare the estimates depending on the genre, we construct box plot. It is worth remembering that each film may belong to several genres, so the films will be counted in several groups.

Code

 #     genres  data frame       def parse_list(lst_str): return filter(lambda y: y != '', map(lambda x: x.strip(), re.sub(r'[\[\]]', '', lst_str).split(','))) df['genres'] = df['genres'].fillna('[]') genres_data = [] for record in df.to_dict(orient = 'records'): genres_lst = parse_list(record['genres']) for genre in genres_lst: copy = record.copy() copy['genre'] = genre copy['weight'] = 1./len(genres_lst) genres_data.append(copy) genres_df = pd.DataFrame.from_dict(genres_data) #  -10  top_genres = genres_df.groupby('genre')[['movie_id']].count()\ .sort_values('movie_id', ascending = False)\ .head(10).index.values.tolist() N = float(len(top_genres)) # c    c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(0, 360, N)] data = [{ 'y': genres_df[genres_df.genre == top_genres[i]].kp_rating, 'type':'box', 'marker':{'color': c[i]}, 'name': top_genres[i] } for i in range(len(top_genres))] layout = go.Layout( title=' ', yaxis = {'title': ' '} ) fig = go.Figure(data = data, layout = layout) iplot(fig)

The graph shows that the most notable are low marks for horror films.

What genres are most often side by side?

As I said above, one film most often refers to several genres at once. In order to look at which genres are more often found together, let's build a heatmap.

Code

 genres_coincidents = {} for item in df.genres: parsed_genres = parse_list(item) for genre1 in parsed_genres: if genre1 not in genres_coincidents: genres_coincidents[genre1] = defaultdict(int) for genre2 in parsed_genres: genres_coincidents[genre1][genre2] += 1 genres_coincidents_df = pd.DataFrame.from_dict(genres_coincidents).fillna(0) #        genres_coincidents_df_norm = genres_coincidents_df\ .apply(lambda x: x/genres_df.groupby('genre').movie_id.count(), axis = 1) heatmap = go.Heatmap( z = genres_coincidents_df_norm.values, x = genres_coincidents_df_norm.index.values, y = genres_coincidents_df_norm.columns ) layout = go.Layout( title = ' ' ) fig = go.Figure(data = [heatmap], layout = layout) iplot(fig)

Read the schedule as follows: 74.7% of historical films also have a drama tag.

How did the ratings of films change depending on the genre?

Let us return once more to the example in which we looked at the dependence of the average grade on the year of production and construct such graphs for various genres. At the same time plotly get acquainted with one more plotly chip: you can configure the drop-down menu and change the schedule depending on the selected option.

Code

 genre_rating_year_df = genres_df.groupby(['movie_year', 'genre'], as_index = False)[['kp_rating', 'imdb_rating']].mean() N = len(top_genres) data = [] drop_menus = [] #      for i in range(N): genre = top_genres[i] genre_df = genre_rating_year_df[genre_rating_year_df.genre == genre] trace_kp = go.Scatter( x = genre_df.movie_year, y = genre_df.kp_rating, mode = 'lines', name = genre + ' ', visible = (i == 0) ) trace_imdb = go.Scatter( x = genre_df.movie_year, y = genre_df.imdb_rating, mode = 'lines', name = genre + ' IMDb', visible = (i == 0) ) data.append(trace_kp) data.append(trace_imdb) #    for i in range(N): drop_menus.append( dict( args=['visible', [False]*2*i + [True]*2 + [False]*2*(N-1-i)], label= top_genres[i], method='restyle' ) ) layout = go.Layout( title='  ', updatemenus=list([ dict( x = -0.1, y = 1, yanchor = 'top', buttons = drop_menus ) ]), ) fig = go.Figure(data = data, layout = layout) iplot(fig)

As a conclusion

In this post, we met using the plotly library to build various interactive graphs in python. I think this is a very useful tool for analytical work, because it allows you to make interactive visualizations and easily share them with your colleagues.

Interested advise to look at other examples of the use of plot.ly.

All code and data live on github

Source: https://habr.com/ru/post/308162/

All Articles

A little about the movie or how to do interactive visualization in python

Introduction

Python and plotly visualization

Examples

How many films came out in different years?

Have you started making better movies over the years?

Are there any differences in ratings depending on the genre of the film?

What genres are most often side by side?

How did the ratings of films change depending on the genre?

As a conclusion

More articles: