Good afternoon, dear readers.
In the last article I described displaying data with Pandas and matplotlib. Today I would like to show another way of displaying analysis results, using Vincent. It also integrates easily with Pandas, although it takes a few more steps than matplotlib.
Introduction
Vincent is a module that translates Python data into Vega, a JavaScript visualization grammar built on D3.js, which in turn offers rich possibilities for interactive data visualization.
That is, we can perform the analysis in Python and render the charts for the results in JavaScript. This is convenient, for example, for visualizing geographic data on a map. In addition, vincent integrates with IPython Notebook and, like matplotlib, can display charts directly in it.
As a demonstration of the capabilities of this module, I propose to solve 2 tasks:
- Show the dynamics of average per capita income in the Central and Volga federal districts
- Show on a map of the Russian Federation the distribution of per capita income by region for 2010
We take statistics from the Rosstat site as the source data.
Data analysis
First, let's load the data and decide whether additional processing is needed.
import pandas as pd
import vincent

stat = pd.read_html('Data/AVGPeopleProfit.htm', header=0, index_col=0)[0]
So this time we load the data with the read_html() function (available in pandas since version 0.12). It returns a list of DataFrames, one per table found on the page, which is why we take element [0]. In our case, 3 arguments are passed:
- Path or URL of the HTML page
- Number of the row containing the column names
- Number of the column to be used as the index
After loading, we got the following table:
| | 1990.0 | 2000.0 | 2001.0 | 2002.0 | 2003.0 | 2004.0 | 2005.0 | 2006.0 | 2007.0 | 2008.0 | 2009.0 | 2010.0 | 2011.0 | nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Russian Federation | NaN | 2281 | 3062 | 3947 | 5167 | 6399 | 8088 | 10155 | 12540 | 14864 | 16895 | 18951 | 20755 | NaN |
| Central Federal District | NaN | 3231 | 4300 | 5436 | 7189 | 8900 | 10902 | 13570 | 16631 | 18590 | 21931 | 24645 | 27091 | 1 |
| Belgorod region | NaN | 1555 | 2121 | 2762 | 3357 | 4069 | 5276 | 7083 | 9399 | 12749 | 14147 | 16993 | 18800 | 24 |
| Bryansk region | NaN | 1312 | 1818 | 2452 | 3136 | 3725 | 4788 | 6171 | 7626 | 10083 | 11484 | 13358 | 15348 | 52 |
| Vladimir region | NaN | 1280 | 1666 | 2158 | 2837 | 3363 | 4107 | 5627 | 7015 | 9480 | 10827 | 12956 | 14312 | 64 |
As you can see, the table needs a little processing: it has one unnamed column and one column with empty values. Let's select the columns we need (the 2nd through the 13th) and place them in a new DataFrame.
stat = stat[stat.columns[1:13]]
Now we have a data set suitable for work. The column names are admittedly an eyesore, but they will not get in the way of solving our tasks.
| | 2000.0 | 2001.0 | 2002.0 | 2003.0 | 2004.0 | 2005.0 | 2006.0 | 2007.0 | 2008.0 | 2009.0 | 2010.0 | 2011.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Russian Federation | 2281 | 3062 | 3947 | 5167 | 6399 | 8088 | 10155 | 12540 | 14864 | 16895 | 18951 | 20755 |
| Central Federal District | 3231 | 4300 | 5436 | 7189 | 8900 | 10902 | 13570 | 16631 | 18590 | 21931 | 24645 | 27091 |
| Belgorod region | 1555 | 2121 | 2762 | 3357 | 4069 | 5276 | 7083 | 9399 | 12749 | 14147 | 16993 | 18800 |
| Bryansk region | 1312 | 1818 | 2452 | 3136 | 3725 | 4788 | 6171 | 7626 | 10083 | 11484 | 13358 | 15348 |
| Vladimir region | 1280 | 1666 | 2158 | 2837 | 3363 | 4107 | 5627 | 7015 | 9480 | 10827 | 12956 | 14312 |
So, let's get down to the first task: visualizing the data for 2 districts. To obtain the data for the chart, we need to select the districts of interest (Central and Volga) and then transpose the resulting table. You can do it like this:
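# The names must match the index labels exactly as they appear in the loaded table
fo = [u'Central Federal District', u'Volga Federal District']
fostat = stat[stat.index.isin(fo)].transpose()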
In the code above, we first filter our data set down to the districts we need with the isin() function, which checks whether each value is contained in the given list (analogous to the IN operator in SQL). Then we use the transpose() function to transpose the result and write it to a new DataFrame.
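The filter-and-transpose pattern is easy to try on a toy frame first. The sketch below uses a few values from the tables in this article; the names toy, wanted, mask and by_year are made up for illustration.

```python
import pandas as pd

# Toy data: rows are territories, columns are years
toy = pd.DataFrame(
    {2010.0: [24645, 16993, 15840], 2011.0: [27091, 18800, 17282]},
    index=['Central Federal District', 'Belgorod region', 'Volga Federal District'],
)

wanted = ['Central Federal District', 'Volga Federal District']

# isin() builds a boolean mask over the index; indexing with it keeps the matching rows
mask = toy.index.isin(wanted)
selected = toy[mask]

# transpose() swaps rows and columns, so the years become the index
by_year = selected.transpose()
print(by_year)
```

Applied to the full table, the same two calls give the following: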
| | Central Federal District | Volga Federal District |
|---|---|---|
| 2000 | 3231 | 1726 |
| 2001 | 4300 | 2319 |
| 2002 | 5436 | 3035 |
| 2003 | 7189 | 3917 |
| 2004 | 8900 | 4787 |
| 2005 | 10902 | 6229 |
| 2006 | 13570 | 8014 |
| 2007 | 16631 | 9959 |
| 2008 | 18590 | 12392 |
| 2009 | 21931 | 13962 |
| 2010 | 24645 | 15840 |
| 2011 | 27091 | 17282 |
As you can see, the index labels in the table are now year numbers in numeric format. This is not very convenient, so let's change the index to dates:
# 12 yearly timestamps, one per row (2000 through 2011)
fostat.set_index(pd.date_range('2000-01-01', '2011-01-01', freq='AS'), inplace=True)
The set_index() function sets a new index on the DataFrame. In our case, it is passed 2 parameters:
- The list of new index values (a column name can also be given)
- inplace=True means that the index is replaced in the current DataFrame; with False, a modified copy is returned and the original stays unchanged
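An alternative that avoids hard-coding the date range is to derive the dates from the existing year labels. This is a sketch on a toy frame, not the code used in the article; frame, dates and dated are illustrative names.

```python
import pandas as pd

# Toy frame indexed by years stored as floats, as in the transposed table
frame = pd.DataFrame({'Central Federal District': [3231, 4300, 5436]},
                     index=[2000.0, 2001.0, 2002.0])

# Build one timestamp per existing label: 2000.0 -> '2000-01-01'
dates = pd.to_datetime(['%d-01-01' % year for year in frame.index])

# Without inplace=True, set_index returns a new frame and leaves `frame` untouched
dated = frame.set_index(dates)
print(dated.index)
```

Because the dates are built from the labels themselves, the new index always has exactly as many entries as the frame has rows.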
Now the data is completely ready for plotting. If you are working in IPython Notebook and want to see the result right away, you first need to call the initialize_notebook() function for integration. It looks like this:
vincent.core.initialize_notebook()
Now we need to create an object corresponding to the chart type (a full list of objects can be found in the documentation). In our case, it will be a line chart. The code is as follows:
line = vincent.Line(fostat)
You can display the chart using the display() function:
line.display()
As a result, we will see the following:
Cartogram building
Well, we have coped with the first task. Now let's move on to the second one. To solve it, we need a TopoJSON file with a map of the Russian Federation, as well as a lookup table of regions. Details on what these are and how to get them can be read here. To begin with, let's load the lookup table using read_csv, described in one of the previous articles:
spr = pd.read_csv('Data/russia-region-names.tsv', sep='\t', index_col=0, header=None,
                  names=['name', 'code'], encoding='utf-8')
As you can see, several additional parameters appear here:
- index_col - the number of the column to be used as the index
- header - None in our case, meaning that no line of the file is used for column headers
- names - a list whose elements become the column names
- encoding - the encoding in which the file is stored
If we look carefully at our stat dataset, we can see that some index labels carry footnote marks like '1)' and '2)', which read_html() parsed as ordinary characters and appended to the corresponding labels. In addition, city names in our set are preceded by the prefix 'г. ' (the Russian abbreviation for "city"), which the lookup table does not have. Because of these small differences, when we join the statistics with the lookup table to attach codes to the regions, some regions will not get a code.
You can fix this as follows:
new_index = stat.index.to_series()
new_index = new_index.str.replace(u'(2\))|(1\))|(г\. )', '')
The first line extracts the index into a separate Series. The second line replaces every match of the regular expression with an empty string.
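A quick way to check such a pattern before applying it to the index is to run it on a few strings with the standard re module. The sketch below assumes the city prefix is the Russian 'г. ' and escapes the dot; the sample labels are made up for illustration.

```python
import re

# Footnote marks "1)" / "2)" and the "г. " city prefix, with the dot escaped
pattern = re.compile(r'(2\))|(1\))|(г\. )')

# Made-up sample labels
samples = ['г. Москва', 'Тюменская область1)', 'Belgorod region']
cleaned = [pattern.sub('', label) for label in samples]
print(cleaned)

# With a bare dot, the group matches ANY character followed by a space
greedy = re.sub(r'(. )', '', 'Belgorod region')
print(greedy)
```

The last two lines show why the dot matters: an unescaped '. ' also eats the final letter of every word that is followed by a space, which would silently corrupt region names.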
Now we need to replace the index values with the values from the new Series. As shown above, you can do it like this:
stat.set_index(new_index, inplace=True)
Now we can combine our dataset with a directory to get the regional codes:
RegionProfit = stat.join(spr, how='inner')
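Since an inner join silently drops rows whose labels have no match, it can be worth checking which labels were lost. A minimal sketch on toy data (the frames income and codes are illustrative, with values borrowed from this article):

```python
import pandas as pd

# Income figures indexed by region name
income = pd.DataFrame({2011.0: [18800, 15348, 20755]},
                      index=['Belgorod region', 'Bryansk region', 'Russian Federation'])

# Lookup table: region name -> code; 'Voronezh region' has no income row here
codes = pd.DataFrame({'code': ['RU-BEL', 'RU-BRY', 'RU-VOR']},
                     index=['Belgorod region', 'Bryansk region', 'Voronezh region'])

# An inner join keeps only the labels present in BOTH indexes
joined = income.join(codes, how='inner')
print(joined)

# Labels from the statistics that found no code
unmatched = income.index.difference(codes.index)
print(list(unmatched))
```

Here the country-level row is dropped as intended, but a region name with a leftover footnote mark would show up in unmatched in exactly the same way.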
Our data after all the manipulations look like this:
| | 2000.0 | 2001.0 | 2002.0 | 2003.0 | 2004.0 | 2005.0 | 2006.0 | 2007.0 | 2008.0 | 2009.0 | 2010.0 | 2011.0 | code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Belgorod region | 1555 | 2121 | 2762 | 3357 | 4069 | 5276 | 7083 | 9399 | 12749 | 14147 | 16993 | 18800 | RU-BEL |
| Bryansk region | 1312 | 1818 | 2452 | 3136 | 3725 | 4788 | 6171 | 7626 | 10083 | 11484 | 13358 | 15348 | RU-BRY |
| Vladimir region | 1280 | 1666 | 2158 | 2837 | 3363 | 4107 | 5627 | 7015 | 9480 | 10827 | 12956 | 14312 | RU-VLA |
| Voronezh region | 1486 | 2040 | 2597 | 3381 | 4104 | 5398 | 6862 | 8307 | 10587 | 11999 | 13883 | 15871 | RU-VOR |
| Ivanovo region | 1038 | 1298 | 1778 | 2292 | 2855 | 3480 | 4457 | 5684 | 8343 | 9351 | 11124 | 13006 | RU-IVA |
So let's proceed to building the map itself and plotting the data on it. To start, we need to create a dictionary describing our map:
geo_data = [{'name': 'rus',
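For orientation, vincent's own map examples describe each map layer with three keys: name, url and feature. A sketch of a complete entry might look like the following; the file path and the feature name are hypothetical placeholders and depend on your TopoJSON file.

```python
# Hypothetical values: adjust the path and feature name to your own TopoJSON file
geo_data = [{
    'name': 'rus',               # label referenced later in map_key
    'url': 'russia.topo.json',   # assumed path to the TopoJSON file
    'feature': 'russia',         # assumed name of the object inside that file
}]
```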
Now let's create our map object and bind our data to it. This is done with the Map() constructor:
vis = vincent.Map(data=RegionProfit, geo_data=geo_data, scale=700,
                  projection='conicEqualArea', rotate=[-105, 0], center=[-10, 65],
                  data_bind=2011, data_key='code',
                  map_key={'rus': 'properties.region'})
The constructor takes the following arguments:
- data - the data set
- geo_data - the object describing our map
- projection - the projection in which our map will be displayed
- rotate, center, scale - projection parameters
- data_bind - the column with the data to be displayed
- data_key - the field with the code by which the map and the data are linked
- map_key - a dictionary of the form {'name of the map object': 'name of the property by which the binding is performed'}
But here a surprise awaits us: in the stock version of vincent, the rotate parameter can only be an integer. To display our map properly, the parameter must be able to take a list. To fix this, open the file %PYTHON_PATH%\lib\site-packages\vincent\transforms.py and replace the piece of code responsible for the type check:
@grammar(int)
def rotate(value):
    """The rotation of the projection"""
    if value < 0:
        raise ValueError('The rotation cannot be negative.')
with:
@grammar(list)
def rotate(value):
    if len(value) != 2:
        raise ValueError('len(rotate) must = 2')
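The check itself can be tried in isolation, without vincent's @grammar decorator. The function name check_rotate and the extra type checks are assumptions added for illustration, not part of vincent.

```python
def check_rotate(value):
    """Accept a rotation given as a two-element list of numbers, e.g. [-105, 0]."""
    if not isinstance(value, list):
        raise ValueError('rotate must be a list')
    if len(value) != 2:
        raise ValueError('len(rotate) must = 2')
    if not all(isinstance(v, (int, float)) for v in value):
        raise ValueError('rotate values must be numbers')
    return value


print(check_rotate([-105, 0]))
```

Note that, unlike the original integer check, negative values are allowed here, which is what the rotation [-105, 0] used for our map requires.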
Now our object will be created correctly. It remains to tune it. To begin, let's make the boundaries between regions less visible. For this we use marks, which are the basic building blocks of a visualization. Details about them are given in the Vega documentation. In our case, the code looks like this:
vis.marks[0].properties.enter.stroke_opacity = vincent.ValueRef(value=0.5)
Now let's color our values differently depending on the group they fall into. This is done with scales, objects that translate data values (numbers, strings, dates, etc.) into display values (pixels, colors, sizes). The code is below:
vis.scales['color'].type = 'threshold'
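A threshold scale splits the value range at fixed breakpoints and assigns one output to each bucket. The idea can be sketched in plain Python; this illustrates the concept only and is not vincent's or Vega's code, and the breakpoints and colors are hypothetical.

```python
from bisect import bisect_right

# Hypothetical income breakpoints and colors; n thresholds need n + 1 colors
domain = [10000, 15000, 20000]
colors = ['#fee5d9', '#fcae91', '#fb6a4a', '#cb181d']


def threshold_color(value):
    """Return the color of the bucket that `value` falls into."""
    return colors[bisect_right(domain, value)]


print(threshold_color(16993))
```

A value equal to a breakpoint lands in the bucket above it, so 15000 gets the same color as 16993 in this sketch.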
Well, the map is set up, and now we can look at the result. As mentioned above, you can use the display() function for this, but for reasons unknown it did not work for me, so I first exported the map to a JSON file using the to_json() function:
vis.to_json('example_map.json', html_out=True, html_path='example_map.html')
It takes 3 parameters:
- the name of the output file
- html_out - indicates that an HTML wrapper file should be created as well
- html_path - the path to the HTML file
To view our HTML file you need a simple HTTP server, such as the one included with Python. To start it, run the following on the command line (this is the Python 2 form; in Python 3 the module is called http.server):
python -m SimpleHTTPServer 8000
As a result, our map will look like this:
Conclusion
Today I tried to show another way to visualize data when using pandas. I would also like to note that the module in question is relatively young and is being actively developed. Among the shortcomings, I would note that not all objects are displayed when you try to output them directly in IPython, and that there is no way to export just a picture rather than a JSON file, even though such tools have been developed for Vega.