In this post I want to talk about how you can quite easily build interactive charts in Jupyter Notebook'e using the plotly
library. Moreover, to build them you do not need to raise your server and write code in javascript. Another big plus of the proposed approach is that visualizations will work in NBViewer , i.e. It will be easy to share your results with colleagues. Here, for example, my code for this note.
For examples, I took the data about the films downloaded in April (year of release, ratings on KinoPoisk and IMDb, genres, etc.). I downloaded the data for all films that had at least 100 ratings - only 36417 films. About how to download and parse data KinoPoisk, I told in a previous post .
In python, there are many libraries for visualization: matplotlib
, seaborn
, ported from R ggplot
and others (you can read more about the tools here or here ). Among them are those that allow you to build interactive graphics, for example, bokeh
, pygal
and plotly
, which is actually discussed.
Plotly
positioned as an online platform where you can create and publish your graphics. However, this library can also be used in Jupyter Notebook'e
. In addition, the library has offline-mode, which allows it to be used without registering and publishing data and graphs on the plotly
server ( documentation ).
In general, I liked the library very much: there are detailed documentation with examples , different types of graphs are supported (scatter plots, box plots, 3D graphics, bar charts, heatmaps, dendrograms, etc.) and the graphics are quite nice.
Now it's time to go directly to the examples. As I said above, all code and interactive graphics are available in NBViewer .
The library is easy to install using the command: pip install plotly
.
First of all, it is necessary to make imports, call the init_notebook_mode
command to initialize plot.ly
and load the data with which we will work into pandas.DataFrame
.
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot import plotly.graph_objs as go init_notebook_mode(connected=True) df = pd.read_csv('kp_all_movies.csv') # df.head()
To begin with, we will build a simple bar chart showing the distribution of films by release year.
count_year_df = df.groupby('movie_year', as_index = False).movie_id.count() trace = go.Bar( x = count_year_df.movie_year, y = count_year_df.movie_id ) layout = go.Layout( title=' ', ) fig = go.Figure(data = [trace], layout = layout) iplot(fig)
As a result, we get an interactive graph that shows the value when you hover for a year and the rather expected conclusion that over the years, films have become more.
To answer this question, we construct a graph of the average rating for KinoPoisk and IMDb depending on the year of production.
rating_year_df = df.groupby('movie_year', as_index = False)[['kp_rating', 'imdb_rating']].mean() trace_kp = go.Scatter( x = rating_year_df.movie_year, y = rating_year_df.kp_rating, mode = 'lines', name = u'' ) trace_imdb = go.Scatter( x = rating_year_df.movie_year, y = rating_year_df.imdb_rating, mode = 'lines', name = 'IMDb' ) layout = go.Layout( title=' ', ) fig = go.Figure(data = [trace_kp, trace_imdb], layout = layout) iplot(fig)
The Kinopodisk and IMDb estimates show a trend towards a decrease in the average rating depending on the year of production. But, in fact, it is impossible to draw an unequivocal conclusion from this that earlier better films were shot. The fact is that if people watch old films and evaluate them at KinoPoisk, they choose cult cinema with obviously higher ratings (I think few people watch passing films released in 1940, at least I don’t watch).
To compare the estimates depending on the genre, we construct box plot. It is worth remembering that each film may belong to several genres, so the films will be counted in several groups.
# genres data frame def parse_list(lst_str): return filter(lambda y: y != '', map(lambda x: x.strip(), re.sub(r'[\[\]]', '', lst_str).split(','))) df['genres'] = df['genres'].fillna('[]') genres_data = [] for record in df.to_dict(orient = 'records'): genres_lst = parse_list(record['genres']) for genre in genres_lst: copy = record.copy() copy['genre'] = genre copy['weight'] = 1./len(genres_lst) genres_data.append(copy) genres_df = pd.DataFrame.from_dict(genres_data) # -10 top_genres = genres_df.groupby('genre')[['movie_id']].count()\ .sort_values('movie_id', ascending = False)\ .head(10).index.values.tolist() N = float(len(top_genres)) # c c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(0, 360, N)] data = [{ 'y': genres_df[genres_df.genre == top_genres[i]].kp_rating, 'type':'box', 'marker':{'color': c[i]}, 'name': top_genres[i] } for i in range(len(top_genres))] layout = go.Layout( title=' ', yaxis = {'title': ' '} ) fig = go.Figure(data = data, layout = layout) iplot(fig)
The graph shows that the most notable are low marks for horror films.
As I said above, one film most often refers to several genres at once. In order to look at which genres are more often found together, let's build a heatmap.
genres_coincidents = {} for item in df.genres: parsed_genres = parse_list(item) for genre1 in parsed_genres: if genre1 not in genres_coincidents: genres_coincidents[genre1] = defaultdict(int) for genre2 in parsed_genres: genres_coincidents[genre1][genre2] += 1 genres_coincidents_df = pd.DataFrame.from_dict(genres_coincidents).fillna(0) # genres_coincidents_df_norm = genres_coincidents_df\ .apply(lambda x: x/genres_df.groupby('genre').movie_id.count(), axis = 1) heatmap = go.Heatmap( z = genres_coincidents_df_norm.values, x = genres_coincidents_df_norm.index.values, y = genres_coincidents_df_norm.columns ) layout = go.Layout( title = ' ' ) fig = go.Figure(data = [heatmap], layout = layout) iplot(fig)
Read the schedule as follows: 74.7% of historical films also have a drama tag.
Let us return once more to the example in which we looked at the dependence of the average grade on the year of production and construct such graphs for various genres. At the same time plotly
get acquainted with one more plotly
chip: you can configure the drop-down menu and change the schedule depending on the selected option.
genre_rating_year_df = genres_df.groupby(['movie_year', 'genre'], as_index = False)[['kp_rating', 'imdb_rating']].mean() N = len(top_genres) data = [] drop_menus = [] # for i in range(N): genre = top_genres[i] genre_df = genre_rating_year_df[genre_rating_year_df.genre == genre] trace_kp = go.Scatter( x = genre_df.movie_year, y = genre_df.kp_rating, mode = 'lines', name = genre + ' ', visible = (i == 0) ) trace_imdb = go.Scatter( x = genre_df.movie_year, y = genre_df.imdb_rating, mode = 'lines', name = genre + ' IMDb', visible = (i == 0) ) data.append(trace_kp) data.append(trace_imdb) # for i in range(N): drop_menus.append( dict( args=['visible', [False]*2*i + [True]*2 + [False]*2*(N-1-i)], label= top_genres[i], method='restyle' ) ) layout = go.Layout( title=' ', updatemenus=list([ dict( x = -0.1, y = 1, yanchor = 'top', buttons = drop_menus ) ]), ) fig = go.Figure(data = data, layout = layout) iplot(fig)
In this post, we met using the plotly
library to build various interactive graphs in python. I think this is a very useful tool for analytical work, because it allows you to make interactive visualizations and easily share them with your colleagues.
Interested advise to look at other examples of the use of plot.ly.
All code and data live on github
Source: https://habr.com/ru/post/308162/
All Articles