
Building a simple cartogram with Pandas + Vincent

Good afternoon, dear readers.

The previous article introduced working with data in Pandas and matplotlib. Today I would like to show another way of displaying the results of an analysis, using Vincent. It also integrates easily with Pandas, although it takes a little more effort than matplotlib.



Introduction



Vincent is a module that translates Python data into specifications for the JavaScript visualization libraries D3.js and Vega, which in turn offer great possibilities for interactive data visualization.

That is, we can perform the analysis in Python and render the charts of the results in JavaScript. For example, this is convenient for visualizing geographic data and plotting it on a map. In addition, Vincent integrates with IPython Notebook and, like matplotlib, can display charts directly in it.

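As a demonstration of the capabilities of this module, I propose to solve 2 tasks:

  1. Plot the figures for two federal districts as a line chart

  2. Build a cartogram of the 2011 figures by region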



As the source data, we take statistics from the Rosstat site.



Data analysis



First, let's load the data and decide whether additional processing is needed.



import pandas as pd
import vincent

stat = pd.read_html('Data/AVGPeopleProfit.htm', header=0, index_col=0)[0]


So, to load the data this time we use the read_html() function (available in pandas since version 0.12). In our case, it is given 3 arguments:

  1. The path or URL of the HTML page

  2. The number of the row containing the column names

  3. The number of the column to be used as the index

Since read_html() returns a list of all the tables found on the page, we take the first element with [0].



After loading, we got the following table:

                           1990.0  2000.0  2001.0  2002.0  2003.0  2004.0  2005.0  2006.0  2007.0  2008.0  2009.0  2010.0  2011.0     nan
Russian Federation            NaN    2281    3062    3947    5167    6399    8088   10155   12540   14864   16895   18951   20755     NaN
Central Federal District      NaN    3231    4300    5436    7189    8900   10902   13570   16631   18590   21931   24645   27091       1
Belgorod region               NaN    1555    2121    2762    3357    4069    5276    7083    9399   12749   14147   16993   18800      24
Bryansk region                NaN    1312    1818    2452    3136    3725    4788    6171    7626   10083   11484   13358   15348      52
Vladimir region               NaN    1280    1666    2158    2837    3363    4107    5627    7015    9480   10827   12956   14312      64


As you can see, the table needs a little processing, since it has one column with no name and one column with empty values. So let's select the columns we need (the 2nd through the 13th) and put them in a new DataFrame.


 stat = stat[stat.columns[1:13]] 


Now we have a dataset suitable for work. Of course, the column names are an eyesore, but they will not get in the way of solving our tasks.

                           2000.0  2001.0  2002.0  2003.0  2004.0  2005.0  2006.0  2007.0  2008.0  2009.0  2010.0  2011.0
Russian Federation           2281    3062    3947    5167    6399    8088   10155   12540   14864   16895   18951   20755
Central Federal District     3231    4300    5436    7189    8900   10902   13570   16631   18590   21931   24645   27091
Belgorod region              1555    2121    2762    3357    4069    5276    7083    9399   12749   14147   16993   18800
Bryansk region               1312    1818    2452    3136    3725    4788    6171    7626   10083   11484   13358   15348
Vladimir region              1280    1666    2158    2837    3363    4107    5627    7015    9480   10827   12956   14312
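The column selection can be tried on a small stand-in table. The sketch below is only an illustration: the frame `raw`, its string column labels and the label `extra` are hypothetical, and the label-based variant is an alternative not used in the article.

```python
import pandas as pd

# Hypothetical miniature of the loaded table: an empty first column,
# three data columns and one trailing column we do not need.
nan = float("nan")
raw = pd.DataFrame(
    {
        "1990": [nan, nan],
        "2000": [2281, 3231],
        "2001": [3062, 4300],
        "2002": [3947, 5436],
        "extra": [nan, 1.0],
    },
    index=["Russian Federation", "Central Federal District"],
)

# Positional selection: skip column 0 and stop before the last column.
by_position = raw[raw.columns[1:4]]

# Label-based alternative: drop the all-NaN column, then the extra one.
by_label = raw.dropna(axis=1, how="all").drop(columns=["extra"])

print(by_position.columns.tolist())
```

The positional slice is shorter, but it silently depends on the column order of the source page; the label-based form fails loudly if a column is missing.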


So, let's get started on the first task, visualizing the data for 2 districts. To obtain the data on which the chart is based, we need to select the districts of interest (the Central and Volga federal districts) and then transpose the resulting table. You can do it like this:



# district names as they appear in the table index
fo = [u'Central Federal District', u'Volga Federal District']
fostat = stat[stat.index.isin(fo)].transpose()


In the code above, we first filter our dataset by the districts we need using the isin() function, which checks whether each index value is in a given list (analogous to the IN operator in SQL). Then we use the transpose() function to transpose the resulting dataset and write the result to a new DataFrame.

      Central Federal District  Volga Federal District
2000                      3231                    1726
2001                      4300                    2319
2002                      5436                    3035
2003                      7189                    3917
2004                      8900                    4787
2005                     10902                    6229
2006                     13570                    8014
2007                     16631                    9959
2008                     18590                   12392
2009                     21931                   13962
2010                     24645                   15840
2011                     27091                   17282
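The filter-then-transpose step is easy to check on a few rows. The sketch below uses a hypothetical frame `stat_demo` with string column labels; the figures are copied from the tables in this article.

```python
import pandas as pd

# Three rows and two years taken from the tables in the article.
stat_demo = pd.DataFrame(
    {"2000": [2281, 3231, 1726], "2001": [3062, 4300, 2319]},
    index=["Russian Federation", "Central Federal District", "Volga Federal District"],
)

wanted = ["Central Federal District", "Volga Federal District"]
mask = stat_demo.index.isin(wanted)    # one boolean flag per row
subset = stat_demo[mask].transpose()   # years become rows, districts become columns

print(subset)
```

After the transpose each district is a column, which is the layout a line chart needs: one series per column, one point per row.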


As you can see, the index labels in the table are now the years in numeric form. This is not very convenient, so let's convert the index to dates:



# one entry per year start, 2000 through 2011, to match the 12 rows of the table
fostat.set_index(pd.date_range('2000-01-01', '2011-01-01', freq='AS'), inplace=True)


The set_index() function sets a new index on the DataFrame. In our case, it is given 2 parameters:

  1. The list of new index values (may also be a column name)

  2. inplace=True means that the index is replaced in the current DataFrame; if it is False, the current DataFrame is left unchanged and a new one is returned
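The difference between the two modes of set_index() can be seen on a small example. This is a sketch under assumptions: the frame `demo` is hypothetical, and the dates are derived from the existing labels with pd.to_datetime, which is a different technique from the date_range call used in the article.

```python
import pandas as pd

# Hypothetical three-row stand-in for the transposed table.
demo = pd.DataFrame({"Central Federal District": [3231, 4300, 5436]},
                    index=[2000.0, 2001.0, 2002.0])

# Derive one timestamp per existing label, so the lengths always match.
years = pd.to_datetime(demo.index.astype(int).astype(str), format="%Y")

returned = demo.set_index(years)      # default inplace=False: demo keeps its old index
old_labels = list(demo.index)

demo.set_index(years, inplace=True)   # inplace=True: demo itself now has the new index
print(old_labels, demo.index[0])
```

Deriving the dates from the labels avoids a length mismatch if the number of year columns in the source table ever changes.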



Now the data is completely ready for plotting. If you are working in IPython Notebook and want to see the result right away, you need to call the initialize_notebook() function for the integration. It looks like this:

vincent.core.initialize_notebook()

Now we need to create an object corresponding to the chart type (the full list of objects can be found in the documentation). In our case, it will be a line chart. The code is as follows:



line = vincent.Line(fostat)               # create the line chart object
line.axis_titles(x=u'', y=u'. ')          # set the axis titles
line.legend(title=u' vs ')                # add a legend with a title


You can display the chart using the display() function:



 line.display() 


As a result, we will see the following:





Building the cartogram





Well, we have coped with the first task. Now let's move on to the second one. To solve it, we need a TopoJSON file with a map of the Russian Federation, as well as a reference table of regions. Details on what these are and how to get them can be found here. To begin with, let's load the reference table of regions using read_csv, described in one of the previous articles:



spr = pd.read_csv('Data/russia-region-names.tsv', sep='\t', index_col=0, header=None, names=['name', 'code'], encoding='utf-8')


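As you can see, several additional parameters appear here, among them header=None (the file has no header row), names (the column names to assign) and encoding (the encoding of the file).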





If we look carefully at our stat dataset, we can see that some of its entries carry footnote markers such as '1)' and '2)', which read_html() parsed as ordinary characters and appended to the corresponding values in the index column. In addition, the names of cities in our dataset are prefixed with 'g. ', while in the reference table they are not. Because of these details, when we join the statistics with the reference table to attach the codes to the regions, some regions would be left without a code.

You can fix this as follows:



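new_index = stat.index.to_series()
# the dot is escaped so that it matches a literal period; regex=True is needed on recent pandas
new_index = new_index.str.replace(r'(2\))|(1\))|(g\. )', '', regex=True)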


The first line extracts the index into a separate new Series. The second line replaces every substring matching the regular expression with an empty string.
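The clean-up pattern can be tested in isolation with the standard re module before applying it to the whole index. This is a minimal sketch: the helper `clean_name` and the sample names are hypothetical, and the prefix 'g. ' is taken from the description above.

```python
import re

# Footnote markers "1)" / "2)" and the "g. " city prefix.
# The dot is escaped so that it matches a literal period only.
NOISE = re.compile(r"(2\))|(1\))|(g\. )")

def clean_name(name):
    """Remove footnote markers and the city prefix, then trim whitespace."""
    return NOISE.sub("", name).strip()

# Hypothetical examples of the three cases: footnote, city prefix, clean name.
samples = ["Tyumen region1)", "g. Moscow", "Belgorod region"]
print([clean_name(s) for s in samples])
```

Escaping the dot matters: an unescaped `.` followed by a space matches any character before a space, so it would also eat letters inside ordinary region names.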

Now we need to replace the index values with the values from the new Series. As shown above, you can do it like this:



stat.set_index(new_index, inplace=True)


Now we can combine our dataset with a directory to get the regional codes:



 RegionProfit = stat.join(spr, how='inner') 


Our data after all the manipulations look like this:

                           2000.0  2001.0  2002.0  2003.0  2004.0  2005.0  2006.0  2007.0  2008.0  2009.0  2010.0  2011.0    code
Belgorod region              1555    2121    2762    3357    4069    5276    7083    9399   12749   14147   16993   18800  RU-BEL
Bryansk region               1312    1818    2452    3136    3725    4788    6171    7626   10083   11484   13358   15348  RU-BRY
Vladimir region              1280    1666    2158    2837    3363    4107    5627    7015    9480   10827   12956   14312  RU-VLA
Voronezh region              1486    2040    2597    3381    4104    5398    6862    8307   10587   11999   13883   15871  RU-VOR
Ivanovo region               1038    1298    1778    2292    2855    3480    4457    5684    8343    9351   11124   13006  RU-IVA
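An inner join silently drops every row whose name has no exact match, so it is worth checking what was lost. The sketch below uses hypothetical frames `income` and `codes`; the figures and codes come from the table above, and the footnote marker on the last name is added on purpose to show a mismatch.

```python
import pandas as pd

income = pd.DataFrame(
    {"2011": [18800, 15348, 14312]},
    index=["Belgorod region", "Bryansk region", "Vladimir region1)"],
)
codes = pd.DataFrame(
    {"code": ["RU-BEL", "RU-BRY", "RU-VLA"]},
    index=["Belgorod region", "Bryansk region", "Vladimir region"],
)

joined = income.join(codes, how="inner")           # keeps only names present in both
unmatched = income.index.difference(codes.index)   # names that found no code

print(len(joined), list(unmatched))
```

An empty `unmatched` index after the clean-up is a quick sign that every region received its code.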




So let's proceed to building the map itself and plotting the data on it. To start, we need to create a list with a dictionary describing our map:



geo_data = [{'name': 'rus',                # name of the map dataset
             'url': 'RusMap/russia.json',  # path to the TopoJSON file
             'feature': 'russia'}]         # name of the object inside the TopoJSON file
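Before binding data to the map, it can help to confirm that the TopoJSON file really contains the named object and to see which keys its regions carry. The sketch below works on a tiny hand-written stand-in, because the real RusMap/russia.json is not shown here; the helper `list_feature_keys` is hypothetical, and the assumption that region codes live under properties.region is taken from the map_key used in the Map call.

```python
import json

def list_feature_keys(topology, feature, prop):
    """Return properties[prop] of every geometry in one TopoJSON object."""
    if feature not in topology["objects"]:
        raise KeyError("no object named %r; available: %s"
                       % (feature, sorted(topology["objects"])))
    geometries = topology["objects"][feature].get("geometries", [])
    return [g.get("properties", {}).get(prop) for g in geometries]

# Tiny stand-in for the real file: arcs omitted, two regions only.
sample = json.loads("""
{"type": "Topology",
 "objects": {"russia": {"type": "GeometryCollection",
   "geometries": [
     {"type": "Polygon", "arcs": [], "properties": {"region": "RU-BEL"}},
     {"type": "Polygon", "arcs": [], "properties": {"region": "RU-BRY"}}]}},
 "arcs": []}
""")

print(list_feature_keys(sample, "russia", "region"))
```

For the real file, `json.load(open(path))` would replace the inline string; comparing the returned keys with the `code` column shows which regions would stay uncolored.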


Now let's create our map object and bind our data to it. This can be done with the Map() constructor:



vis = vincent.Map(data=RegionProfit, geo_data=geo_data, scale=700,
                  projection='conicEqualArea', rotate=[-105, 0], center=[-10, 65],
                  data_bind=2011, data_key='code',
                  map_key={'rus': 'properties.region'})


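The constructor takes the following arguments: data is the DataFrame with our values, geo_data is the map description created above, scale, projection, rotate and center set up the map projection, data_bind is the column whose values are plotted, data_key is the column used to match the data to the map, and map_key names the map property that holds the same key.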





But here a surprise awaits us: in the stock version of vincent, the rotate parameter can only be an integer. For our map to be displayed properly, this parameter must be able to accept a list. To fix this, open the file %PYTHON_PATH%\lib\site-packages\vincent\transforms.py and replace the piece of code responsible for checking the type:



@grammar(int)
def rotate(value):
    """The rotation of the projection"""
    if value < 0:
        raise ValueError('The rotation cannot be negative.')


with:



@grammar(list)
def rotate(value):
    if len(value) != 2:
        raise ValueError('len(center) must = 2')




Now our object will be created correctly. It remains to fine-tune it. To begin, let's make the borders between the map objects less visible. For this we use marks, which are the main building blocks of a visualization. They are described in detail in the Vega documentation. In our case, the code looks like this:



 vis.marks[0].properties.enter.stroke_opacity = vincent.ValueRef(value=0.5) 


Now let's color our values differently depending on the group they fall into. This can be done with scales, objects designed to translate data values (numbers, strings, dates, etc.) into display values (pixels, colors, sizes). The code is below:



vis.scales['color'].type = 'threshold'                             # use a threshold scale
vis.scales['color'].domain = [10000, 15000, 20000, 25000, 30000]  # boundaries between the groups
vis.legend(title=u' .')                                            # add a legend with a title
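To get a feel for which group a region falls into, the threshold lookup can be reproduced in plain Python. This is only a sketch of how a threshold scale assigns classes, not vincent or Vega code; the function `bucket` is hypothetical, and the sample values are 2011 figures from the tables above.

```python
from bisect import bisect_right

DOMAIN = [10000, 15000, 20000, 25000, 30000]   # breakpoints from the scale above

def bucket(value, domain=DOMAIN):
    """Index of the color class for value, from 0 to len(domain)."""
    return bisect_right(domain, value)

# 2011 values: Ivanovo region, Belgorod region, Russian Federation, Central Federal District
for income in (13006, 18800, 20755, 27091):
    print(income, "->", bucket(income))
```

Five breakpoints produce six classes: one below the first breakpoint, one above the last, and four in between, so the color range needs one more entry than the domain.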


Well, the map is set up, and now we can see the result. As mentioned above, the display() function can be used for this, but for reasons unknown it did not work for me, so I first exported the map to a JSON file using the to_json() function:



 vis.to_json('example_map.json', html_out=True, html_path='example_map.html') 


It is given 3 parameters:

  1. The name of the output JSON file

  2. html_out indicates that an HTML wrapper file should also be created

  3. html_path sets the path to the HTML file





To view our HTML file, we need the simple HTTP server included with Python. To start it, run the following on the command line (this is the Python 2 module; in Python 3 it is called http.server):



 python -m SimpleHTTPServer 8000 


As a result, our map will look like this:







Conclusion



Today I tried to show another way to visualize data when working with pandas. I would also like to note that the module in question is relatively young and is under active development. Among its shortcomings, I would note that not all objects are displayed when you try to output them directly in IPython, and that it is not possible to export just a picture rather than a JSON file, even though such tools exist for Vega.

Source: https://habr.com/ru/post/198974/


