Today we will talk about the Pandas package. This package turns Python into a powerful data analysis tool: it lets you build pivot tables, perform groupings, and provides convenient access to tabular data; together with the matplotlib package, it can also plot graphs of the resulting data sets. Below we will cover the basics of working with the package: loading data, accessing fields, filtering, and building pivot tables.
Basic data structures and their loading
First, let's say a couple of words about the storage structure in Pandas. The main ones are Series and DataFrame.
Series is an indexed one-dimensional array of values. It is similar to an ordinary dict, where the dictionary key corresponds to the index and the dictionary value to the record value.
A DataFrame is an indexed multidimensional array of values; accordingly, each column of a DataFrame is a Series structure.
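As a quick illustration, both structures can be built directly from Python dictionaries (the values here are invented for the example):

```python
import pandas as pd

# A Series is an indexed one-dimensional array: dictionary keys
# become the index, dictionary values become the records
s = pd.Series({'a': 1, 'b': 2, 'c': 3})

# A DataFrame is a table of such columns; each column is itself a Series
df = pd.DataFrame({'shop': [427, 707], 'qty': [3, 4]})
```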
So much for the structures; let us proceed directly to working with the package. To start analyzing any data, you first need to load it. Pandas supports a wide range of data sources, for example:
- SQL
- Text files
- Excel files
- HTML
More information about them can be found in the documentation.
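As a sketch of the non-CSV loaders, here is `read_sql` fed by an in-memory SQLite database invented for this example; the file names in the comments are likewise hypothetical:

```python
import sqlite3
import pandas as pd

# read_sql loads the result of an SQL query into a DataFrame
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE shops (shop INTEGER, qty INTEGER)")
con.executemany("INSERT INTO shops VALUES (?, ?)", [(427, 3), (707, 4)])
con.commit()
sql_df = pd.read_sql("SELECT * FROM shops", con)

# read_excel and read_html work in the same spirit:
#   xl_df = pd.read_excel("report.xlsx")            # first sheet of a workbook
#   tables = pd.read_html("http://host/page.html")  # list of all <table>s on a page
```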
As an example, let us load two text files. This can be done with the read_csv() function:

```python
from pandas import read_csv

df1 = read_csv("df1.txt")
df2 = read_csv("df2.txt", sep=";")
```
Now we have two data sets. The first, df1, contains stores and shipment counts:

shop | qty
---|---
427 | 3
707 | 4
957 | 2
437 | 1
And df2, containing each store and its city:

shop | name
---|---
347 | Kiev
427 | Samara
707 | Minsk
957 | Irkutsk
437 | Moscow
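The contents of the two text files are not included in the article, so if you want to follow along without creating them, equivalent DataFrames can be constructed directly (the values are taken from the tables above):

```python
import pandas as pd

# Same data as df1.txt: stores and shipment counts
df1 = pd.DataFrame({'shop': [427, 707, 957, 437],
                    'qty':  [3, 4, 2, 1]})

# Same data as df2.txt: stores and their cities
df2 = pd.DataFrame({'shop': [347, 427, 707, 957, 437],
                    'name': ['Kiev', 'Samara', 'Minsk', 'Irkutsk', 'Moscow']})
```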
Basic dataset operations
Various operations can be performed on data sets: merging, adding columns, adding records, filtering, building aggregates, and more. To demonstrate the capabilities listed above, let us perform the following tasks:
- add a `country` field to the set with store cities and fill it with the appropriate country for each store
- select the Ukrainian store and change its number
- add the store obtained in the previous step back to the general list
- pull the shipment quantities from df1 into df2
- build a pivot table of shipment counts by country
To add a new column to a data set, there is the insert() function:

```python
country = ['Ukraine', 'RF', 'Belarus', 'RF', 'RF']
df2.insert(1, 'country', country)
```
In our case, the function is passed three arguments:
- the position at which the new column will be inserted
- the name of the new column
- an array of column values (here, an ordinary list)
This is what df2 looks like after the operation:

shop | country | name
---|---|---
347 | Ukraine | Kiev
427 | RF | Samara
707 | Belarus | Minsk
957 | RF | Irkutsk
437 | RF | Moscow
Now we need to select the store whose country equals `Ukraine`. There are two ways to access columns of a DataFrame:
- dot notation: dataset.field_name
- square brackets: dataset['field_name']
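A minimal sketch of the two access styles (dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute; bracket notation always works):

```python
import pandas as pd

df2 = pd.DataFrame({'shop': [347, 427], 'name': ['Kiev', 'Samara']})

by_attr = df2.shop     # dot notation
by_key = df2['shop']   # square brackets

# Both forms return the same Series object for the column
same = by_attr.equals(by_key)
```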
To filter a data set, you can also use square brackets containing a condition: dataset[condition]. The condition must reference field names, written in one of the formats described above. Thus, you can select the store of interest and replace its number like this:

```python
# .copy() avoids modifying a view of df2 (SettingWithCopyWarning)
t = df2[df2.country == 'Ukraine'].copy()
t['shop'] = 345
```
Executing this code produces a new intermediate data set t containing a single record:

shop | country | name
---|---|---
345 | Ukraine | Kiev
To add the record obtained in the previous step, execute the append() function, which takes as an argument the data set to be appended to the source:

```python
df2 = df2.append(t)
```
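Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on recent versions the equivalent of the call above is pd.concat (sketched here with stand-in data):

```python
import pandas as pd

df2 = pd.DataFrame({'shop': [347, 427],
                    'country': ['Ukraine', 'RF'],
                    'name': ['Kiev', 'Samara']})
t = pd.DataFrame({'shop': [345],
                  'country': ['Ukraine'],
                  'name': ['Kiev']})

# Equivalent of df2 = df2.append(t) on pandas >= 2.0
df2 = pd.concat([df2, t], ignore_index=True)
```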
Data aggregation
Now the quantities from the df1 data set can be pulled into our main store list df2. This can be done with the merge() function, which joins two data sets (an analogue of JOIN in SQL):

```python
res = df2.merge(df1, 'left', on='shop')
```
The function takes the following parameters:
- the data set to join to the source
- the join type
- the join field
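To see how the join type affects the result, here is a small sketch comparing a 'left' and an 'inner' join on the same data:

```python
import pandas as pd

df1 = pd.DataFrame({'shop': [427, 707], 'qty': [3, 4]})
df2 = pd.DataFrame({'shop': [347, 427, 707],
                    'name': ['Kiev', 'Samara', 'Minsk']})

# 'left' keeps every row of df2; shops missing from df1 get NaN in qty
left = df2.merge(df1, 'left', on='shop')

# 'inner' keeps only shops present in both sets
inner = df2.merge(df1, 'inner', on='shop')
```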
More information about the parameters can be found in the documentation. Before the final operation, the data set looks like this:

shop | country | name | qty
---|---|---|---
347 | Ukraine | Kiev | NaN
427 | RF | Samara | 3
707 | Belarus | Minsk | 4
957 | RF | Irkutsk | 2
437 | RF | Moscow | 1
345 | Ukraine | Kiev | NaN
It remains to build a pivot table to see how much each country shipped. For this there is the pivot_table() function. In our example, it takes the following parameters:
- the list of columns over which the aggregated value is calculated
- the list of columns that will become the rows of the pivot table
- the function used for aggregation
- a parameter that replaces empty values with 0
The code for building the pivot table looks like this:

```python
res.pivot_table(['qty'], ['country'], aggfunc='sum', fill_value=0)
```
The final table looks like this:

country | qty
---|---
Belarus | 4
RF | 6
Ukraine | 0
Conclusion
In conclusion, I would like to say that Pandas is a good alternative to Excel when working with large amounts of data. The functions shown here are just the tip of the iceberg that is Pandas. In the future, I plan to write a series of articles showing the full power of this package.