Building a simple cartogram with Pandas + Vincent

Good afternoon, dear readers.
In the previous article, I described how to display data with Pandas and matplotlib. Today I would like to show another way of presenting the results of an analysis, using Vincent. It also integrates very easily with Pandas, although it takes a few more steps than matplotlib does.

Introduction

Vincent is a module designed to pass data from Python to the JavaScript rendering libraries D3.js and Vega, which in turn offer rich possibilities for interactive data visualization.
In other words, we can perform the analysis in Python and build the graphics for the results in JavaScript. This is convenient, for example, when you want to visualize geographic data and plot it on a map. In addition, Vincent integrates with IPython Notebook and, like matplotlib, can display graphics directly in it.
To demonstrate the capabilities of this module, I propose to solve 2 tasks:

Show the dynamics of average per capita income in the Central and Volga federal districts
Show the distribution of per capita income across the regions of the Russian Federation for 2010 on a map

As the initial data, we take statistics from the Rosstat site.

Data analysis

First, let's load the data and decide whether additional processing is needed.

 import pandas as pd
 import vincent

 # read_html returns a list of DataFrames; take the first table on the page
 stat = pd.read_html('Data/AVGPeopleProfit.htm', header=0, index_col=0)[0]

This time we load the data with the read_html() function (available in pandas since version 0.12). In our case, it is passed 3 arguments:

The path or URL of the HTML page
The number of the row containing the column names
The number of the column to be used as the index

After loading, we got the following table:

	1990.0	2000.0	2001.0	2002.0	2003.0	2004.0	2005.0	2006.0	2007.0	2008.0	2009.0	2010.0	2011.0	nan
Russian Federation	NaN	2281	3062	3947	5167	6399	8088	10155	12540	14864	16895	18951	20755	NaN
Central Federal District	NaN	3231	4300	5436	7189	8900	10902	13570	16631	18590	21931	24645	27091	1
Belgorod region	NaN	1555	2121	2762	3357	4069	5276	7083	9399	12749	14147	16993	18800	24
Bryansk region	NaN	1312	1818	2452	3136	3725	4788	6171	7626	10083	11484	13358	15348	52
Vladimir region	NaN	1280	1666	2158	2837	3363	4107	5627	7015	9480	10827	12956	14312	64

As you can see, the table needs a little processing, since it has one column with empty values and one column with no name. Let's select the columns we need (from 2 to 13) and place them in a new DataFrame.

 stat = stat[stat.columns[1:13]]

Now we have a data set suitable for work. Of course, the column names are an eyesore, but they will not get in the way of solving our tasks.

	2000.0	2001.0	2002.0	2003.0	2004.0	2005.0	2006.0	2007.0	2008.0	2009.0	2010.0	2011.0
Russian Federation	2281	3062	3947	5167	6399	8088	10155	12540	14864	16895	18951	20755
Central Federal District	3231	4300	5436	7189	8900	10902	13570	16631	18590	21931	24645	27091
Belgorod region	1555	2121	2762	3357	4069	5276	7083	9399	12749	14147	16993	18800
Bryansk region	1312	1818	2452	3136	3725	4788	6171	7626	10083	11484	13358	15348
Vladimir region	1280	1666	2158	2837	3363	4107	5627	7015	9480	10827	12956	14312
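
The positional column selection used above can be tried on a tiny frame first. This is only a sketch: the two rows are copied from the table, the frame is cut down to three year columns, and the name "rank" for the trailing column is my assumption (in the real table that column has no name).

```python
import pandas as pd

nan = float("nan")

# Cut-down stand-in for the table returned by read_html:
# an empty first column, three year columns and a trailing column.
raw = pd.DataFrame(
    {
        1990.0: [nan, nan],
        2000.0: [1555, 1312],
        2001.0: [2121, 1818],
        2002.0: [2762, 2452],
        "rank": [24, 52],  # hypothetical name for the unnamed column
    },
    index=["Belgorod region", "Bryansk region"],
)

# Slice the column labels by position: skip column 0, stop before column 4.
trimmed = raw[raw.columns[1:4]]
```

The slice works on positions, not on labels, so the odd float column names do not matter here.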

So, let's get down to the first task: visualizing the data for 2 districts. To obtain the data for the graph, we need to select the districts of interest (Central and Volga) and then transpose the resulting table. You can do it like this:

 # the labels must match the index values exactly
 fo = [u'Central Federal District', u'Volga Federal District']
 fostat = stat[stat.index.isin(fo)].transpose()

In the code above, we first filter our data set down to the districts we need with the isin() function, which checks whether each index value is present in a given list (analogous to the IN operator in SQL). Then we use the transpose() function to transpose the resulting data set and write the result to a new DataFrame.
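
To see what each of the two steps produces, here is a small self-contained sketch. The frame is a cut-down sample with values taken from the tables in this article; the variable names are mine.

```python
import pandas as pd

stat_sample = pd.DataFrame(
    {2000.0: [2281, 3231, 1555, 1726], 2001.0: [3062, 4300, 2121, 2319]},
    index=[
        "Russian Federation",
        "Central Federal District",
        "Belgorod region",
        "Volga Federal District",
    ],
)
wanted = ["Central Federal District", "Volga Federal District"]

# isin() returns one boolean flag per index label.
mask = stat_sample.index.isin(wanted)

# Boolean filtering keeps the flagged rows; transpose() swaps rows and columns,
# so the years become the index and the districts become the columns.
selected = stat_sample[mask].transpose()
```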

	Central Federal District	Volga Federal District
2000	3231	1726
2001	4300	2319
2002	5436	3035
2003	7189	3917
2004	8900	4787
2005	10902	6229
2006	13570	8014
2007	16631	9959
2008	18590	12392
2009	21931	13962
2010	24645	15840
2011	27091	17282

As you can see, the index labels in the table are now the year numbers in numeric format. This is not very convenient, so let's change the index to a date format:

 # one timestamp per row: 1 January of each year from 2000 to 2011
 fostat.set_index(pd.date_range('2000-01-01', '2011-01-01', freq='AS'), inplace=True)

The set_index() function is used to set a new index on the DataFrame. In our case, it is passed 2 parameters:

List of new index values (it may also be a column name)
inplace=True means that the index is replaced in the current DataFrame; if it is False, the DataFrame itself is left unchanged and a new one with the new index is returned
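
The difference between the two values of inplace is easy to check on a three-row sample. Note that in this sketch I build the dates from the year labels with pd.to_datetime instead of date_range; that is my substitution, chosen so that the number of dates always matches the number of rows.

```python
import pandas as pd

years = pd.DataFrame(
    {"Central Federal District": [3231, 4300, 5436]},
    index=[2000.0, 2001.0, 2002.0],
)

# Turn the numeric year labels into timestamps (2000.0 -> 2000-01-01).
new_index = pd.to_datetime(years.index.astype(int).astype(str), format="%Y")

# inplace=False (the default): `years` is left untouched, a new frame is returned.
copy_with_dates = years.set_index(new_index)
index_before = list(years.index)  # still the numeric labels

# inplace=True: `years` itself is modified and the call returns None.
result = years.set_index(new_index, inplace=True)
```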

Now the data is completely ready for plotting. If you are working in IPython Notebook and want to see the result right away, you need to call the initialize_notebook() function for the integration. It looks like this:

 vincent.core.initialize_notebook()

Now we need to create an object corresponding to the chart type (a full list of objects can be found in the documentation). In our case, it will be a line chart. The code is as follows:

 line = vincent.Line(fostat)          # create the line chart object
 line.axis_titles(x=u'', y=u'. ')     # set the axis titles
 line.legend(title=u' vs ')           # add a legend with a title

You can display the chart using the display() function:

 line.display()

As a result, we will see the following:

Cartogram building

Well, we have coped with the first task. Now let's move on to the second one. To solve it, we need a TopoJSON file with a map of the Russian Federation, as well as a directory of regions. You can read the details on how to get them, and what they are, here. To begin with, let's load the directory of regions using read_csv, described in one of the previous articles:

 spr = pd.read_csv('Data/russia-region-names.tsv', sep='\t', index_col=0, header=None, names=['name', 'code'], encoding='utf-8')

As you can see, several additional parameters have appeared:

index_col - sets the number of the column to be used as the index
header - in our case, None means that no line of the file is used for the column headers
names - takes a list whose elements become the column names
encoding - sets the encoding in which the file is stored
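
To check how these parameters interact without the real file, you can feed read_csv a couple of lines from memory. This is a sketch: the two rows follow the name/code layout of the directory, and the encoding argument is left out because the text is already decoded.

```python
import io
import pandas as pd

# Two rows in the same layout as the region directory: name <TAB> code, no header line.
tsv_text = "Belgorod region\tRU-BEL\nBryansk region\tRU-BRY\n"

directory = pd.read_csv(
    io.StringIO(tsv_text),
    sep="\t",
    index_col=0,             # the first column (the name) becomes the index
    header=None,             # no line of the input is treated as a header
    names=["name", "code"],  # column names supplied by hand
)
```

The result has a single data column, code, and an index named name, which is exactly the shape needed for joining by region name later.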

If we look carefully at our stat data set, we can see that some of its entries contain footnote markers such as '1)' and '2)'. When read_html() parsed the page, these were turned into ordinary characters and appended to the corresponding labels in the index. In addition, the names of cities in our set are prefixed with 'g. ', while in the directory they are not. Because of these small differences, when we join the statistics with the directory to attach the codes to the regions, some regions would be left without a code.
You can fix this as follows:

 new_index = stat.index.to_series()
 # strip the footnote markers '1)' and '2)' and the city prefix 'g. '
 new_index = new_index.str.replace(r'(2\))|(1\))|(g\. )', '')

The first line copies the index into a separate new Series. The second line replaces every match of the regular expression with an empty string.
Now we need to replace the index values with the values from the new Series. As shown above, you can do it like this:

 stat.set_index(new_index, inplace=True)
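
Before applying a cleanup pattern to the whole index, it can be tried on a few plain strings with the standard re module. This is a sketch under assumptions: the sample labels are made up for illustration, and the city prefix is written as 'g. ', as in the text above.

```python
import re

# Footnote markers "1)" / "2)" and the city prefix "g. ".
pattern = re.compile(r"(2\))|(1\))|(g\. )")

# Made-up labels covering each case plus one that needs no cleaning.
raw_names = ["Some region1)", "g. Moscow", "Other region2)", "Bryansk region"]

cleaned = [pattern.sub("", name) for name in raw_names]
```

One thing to keep in mind if you run the pandas version on a recent release: starting with pandas 2.0, Series.str.replace treats the pattern as a literal string unless you pass regex=True.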

Now we can combine our dataset with a directory to get the regional codes:

 RegionProfit = stat.join(spr, how='inner')

Our data after all the manipulations look like this:

	2000.0	2001.0	2002.0	2003.0	2004.0	2005.0	2006.0	2007.0	2008.0	2009.0	2010.0	2011.0	code
Belgorod region	1555	2121	2762	3357	4069	5276	7083	9399	12749	14147	16993	18800	RU-BEL
Bryansk region	1312	1818	2452	3136	3725	4788	6171	7626	10083	11484	13358	15348	RU-BRY
Vladimir region	1280	1666	2158	2837	3363	4107	5627	7015	9480	10827	12956	14312	RU-VLA
Voronezh region	1486	2040	2597	3381	4104	5398	6862	8307	10587	11999	13883	15871	RU-VOR
Ivanovo region	1038	1298	1778	2292	2855	3480	4457	5684	8343	9351	11124	13006	RU-IVA
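
The effect of how='inner' is easiest to see on a tiny example. In this sketch the values are taken from the tables above; the frame names are mine.

```python
import pandas as pd

income = pd.DataFrame(
    {2011.0: [27091, 18800, 15348]},
    index=["Central Federal District", "Belgorod region", "Bryansk region"],
)
codes = pd.DataFrame(
    {"code": ["RU-BEL", "RU-BRY"]},
    index=["Belgorod region", "Bryansk region"],
)

# how='inner' keeps only the index labels present in both frames,
# so the district total, which has no code in the directory, drops out.
joined = income.join(codes, how="inner")
```

This is also why the footnote cleanup matters: a label that differs from the directory by a single character is silently dropped by an inner join.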

So let's proceed to building the map itself and putting the data on it. To start, we need to create a dictionary describing our map:

 geo_data = [{'name': 'rus',                 # name of the map object
              'url': 'RusMap/russia.json',   # path to the TopoJSON file
              'feature': 'russia'}]          # name of the feature inside the TopoJSON file

Now let's create our map object and bind our data to it. This can be done with the Map() constructor:

 vis = vincent.Map(data=RegionProfit, geo_data=geo_data, scale=700,
                   projection='conicEqualArea', rotate=[-105, 0], center=[-10, 65],
                   data_bind=2011, data_key='code',
                   map_key={'rus': 'properties.region'})

It takes the following arguments:

data - the data set
geo_data - the object describing our map
projection - the projection in which our map will be displayed
rotate, center, scale - the projection parameters
data_bind - the column with the data to be displayed
data_key - the field with the code by which the map and the data are linked
map_key - a dictionary of the form {'name of the map object': 'name of the property used for the binding'}
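
To make the roles of data_key, data_bind and map_key more tangible, here is a plain-Python picture of the matching they describe. It is only a conceptual sketch, not the way vincent or Vega implement the binding: the feature dictionaries are heavily trimmed, the code RU-XXX is made up, and the helper resolve is hypothetical.

```python
# Trimmed stand-ins for map features; real TopoJSON objects also carry geometry.
features = [
    {"properties": {"region": "RU-BEL"}},
    {"properties": {"region": "RU-BRY"}},
    {"properties": {"region": "RU-XXX"}},  # made-up code with no data row
]

# Data rows keyed by the data_key column ('code'), holding the data_bind value (2011).
values_by_code = {"RU-BEL": 18800, "RU-BRY": 15348}


def resolve(obj, dotted_path):
    """Follow a dotted path such as 'properties.region' through nested dicts."""
    for part in dotted_path.split("."):
        obj = obj[part]
    return obj


map_key = "properties.region"

# For every feature, look up the value whose code equals the feature's property.
bound = {}
for feature in features:
    code = resolve(feature, map_key)
    bound[code] = values_by_code.get(code)  # None when no data row matches
```

A feature whose code is missing from the data gets no value, which is why the codes in the directory must agree with the ones stored in the map file.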

But a surprise awaits us here: in the original version of vincent, the rotate parameter can only be an integer. To display our map properly, this parameter must be able to take a list. To fix this, open the file %PYTHON_PATH%\lib\site-packages\vincent\transforms.py and replace the piece of code responsible for checking the type of the value:

 @grammar(int)
 def rotate(value):
     """The rotation of the projection"""
     if value < 0:
         raise ValueError('The rotation cannot be negative.')

with:

 @grammar(list)
 def rotate(value):
     """The rotation of the projection"""
     if len(value) != 2:
         raise ValueError('len(rotate) must = 2')
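
If you want to check the logic of the patched validator without touching the installed package, it can be reproduced as an ordinary function. This is a sketch: check_rotate is a hypothetical name, and the explicit isinstance test is my assumption about what the @grammar(list) decorator enforces.

```python
def check_rotate(value):
    """Standalone version of the patched check (hypothetical helper)."""
    # Assumed equivalent of the type check done by @grammar(list).
    if not isinstance(value, list):
        raise ValueError("rotate must be a list")
    if len(value) != 2:
        raise ValueError("len(rotate) must = 2")
    return value


accepted = check_rotate([-105, 0])

# A list of the wrong length is rejected.
try:
    check_rotate([-105, 0, 0])
    rejected = False
except ValueError:
    rejected = True
```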

Now our object will be created correctly. It remains to tune it. To begin with, let's make the boundaries between the map objects less visible. For this we have marks, which are the main building element; details about them can be found in the Vega documentation. In our case, the code looks like this:

 vis.marks[0].properties.enter.stroke_opacity = vincent.ValueRef(value=0.5)

Now let's make our values be shown in different colors depending on the group they fall into. This can be done with Scales objects, which translate data values (numbers, strings, dates, etc.) into display values (pixels, colors, sizes). The code is below:

 vis.scales['color'].type = 'threshold'                             # use a threshold scale
 vis.scales['color'].domain = [10000, 15000, 20000, 25000, 30000]   # boundaries of the value groups
 vis.legend(title=u' .')                                            # add a legend
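
How a threshold scale assigns a value to a group can be sketched in a few lines of plain Python. This follows the way D3-style threshold scales are usually described: n boundaries split the values into n + 1 groups. The bucket labels are placeholders of mine, where a real scale would return colors.

```python
from bisect import bisect_right

domain = [10000, 15000, 20000, 25000, 30000]

# Five boundaries give six groups; a real scale would map them to six colors.
buckets = ["b0", "b1", "b2", "b3", "b4", "b5"]


def threshold_scale(value):
    """Return the bucket of `value`: count how many boundaries are <= value."""
    return buckets[bisect_right(domain, value)]


# Three 2011 values from the tables above.
examples = {v: threshold_scale(v) for v in (13358, 18800, 27091)}
```

So with this domain, Bryansk region (13358) and Belgorod region (18800) fall into different groups and would be painted in different colors.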

Well, the map is set up, and now we can see what we got. As mentioned above, you can use the display() function for this, but for reasons unknown to me it did not work, so I first exported the map to a json file using the to_json() function:

 vis.to_json('example_map.json', html_out=True, html_path='example_map.html')

It is passed 3 parameters:

name of the output file
html_out - indicates that an HTML wrapper file should be created as well
html_path - sets the path to the HTML file

To view our HTML file, you need a simple HTTP server, and one is included with Python. To start it, run the following command (on Python 3 the module is called http.server):

 python -m SimpleHTTPServer 8000

As a result, our map will look like this:

Conclusion

Today I tried to show another way to visualize data when using pandas. I would also like to note that the module in question is relatively young and is under active development. Among the shortcomings, I would note that not all objects are displayed when you try to output them directly to IPython, and that it is impossible to export just a picture rather than a json file, especially since such tools have been developed for Vega.

Source: https://habr.com/ru/post/198974/
