Today we will talk about the Pandas package. This package turns Python into a powerful data analysis tool: it lets you build pivot tables, perform groupings, and provides convenient access to tabular data; together with the matplotlib package, it can also plot graphs of the resulting data sets. Below we will cover the basics of working with the package: loading data, accessing fields, filtering, and building pivot tables.
Basic data structures and their loading
First, let's say a couple of words about the storage structure in Pandas. The main ones are Series and DataFrame.
Series is an indexed one-dimensional array of values. It is similar to an ordinary dict, where the dictionary key corresponds to the index and the dictionary value to the record value.
A DataFrame is an indexed multidimensional array of values; accordingly, each column of a DataFrame is a Series structure.
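As a quick illustration, both structures can be built directly from Python dictionaries (the values here are invented for the example):

```python
import pandas as pd

# A Series is an indexed one-dimensional array: dictionary keys
# become the index, dictionary values become the records
s = pd.Series({'a': 1, 'b': 2, 'c': 3})

# A DataFrame is a table of such columns; each column is itself a Series
df = pd.DataFrame({'shop': [427, 707], 'qty': [3, 4]})
```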
So much for the structures; let us proceed directly to working with the package. To start analyzing any data, you first need to load it. Pandas supports a wide range of data sources, for example:
- SQL
- Text files
- Excel files
- HTML
More information about them can be found in the documentation.
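As a sketch of the non-CSV loaders, here is `read_sql` fed by an in-memory SQLite database invented for this example; the file names in the comments are likewise hypothetical:

```python
import sqlite3
import pandas as pd

# read_sql loads the result of an SQL query into a DataFrame
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE shops (shop INTEGER, qty INTEGER)")
con.executemany("INSERT INTO shops VALUES (?, ?)", [(427, 3), (707, 4)])
con.commit()
sql_df = pd.read_sql("SELECT * FROM shops", con)

# read_excel and read_html work in the same spirit:
#   xl_df = pd.read_excel("report.xlsx")            # first sheet of a workbook
#   tables = pd.read_html("http://host/page.html")  # list of all <table>s on a page
```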
As an example, let us load two text files. This can be done with the read_csv() function:

```python
from pandas import read_csv

df1 = read_csv("df1.txt")
df2 = read_csv("df2.txt", sep=";")
```
Now we have two data sets. The first, df1, contains stores and shipment counts:

shop | qty
---|---
427 | 3
707 | 4
957 | 2
437 | 1
And df2, containing each store and its city:

shop | name
---|---
347 | Kiev
427 | Samara
707 | Minsk
957 | Irkutsk
437 | Moscow
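The contents of the two text files are not included in the article, so if you want to follow along without creating them, equivalent DataFrames can be constructed directly (the values are taken from the tables above):

```python
import pandas as pd

# Same data as df1.txt: stores and shipment counts
df1 = pd.DataFrame({'shop': [427, 707, 957, 437],
                    'qty':  [3, 4, 2, 1]})

# Same data as df2.txt: stores and their cities
df2 = pd.DataFrame({'shop': [347, 427, 707, 957, 437],
                    'name': ['Kiev', 'Samara', 'Minsk', 'Irkutsk', 'Moscow']})
```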
Basic dataset operations
Various operations can be performed on data sets: merging, adding columns, adding records, filtering, building aggregates, and more. To demonstrate the capabilities listed above, let us perform the following tasks:
- add a `country` field to the set with store cities and fill it with the appropriate country for each store
- select the Ukrainian store and change its number
- add the store obtained in the previous step back to the general list
- pull the shipment quantities from df1 into df2
- build a pivot table of shipment counts by country
To add a new column to a data set, there is the insert() function:

```python
country = ['Ukraine', 'RF', 'Belarus', 'RF', 'RF']
df2.insert(1, 'country', country)
```
In our case, the function is passed three arguments:
- the position at which the new column will be inserted
- the name of the new column
- an array of column values (here, an ordinary list)
This is what df2 looks like after the operation:

shop | country | name
---|---|---
347 | Ukraine | Kiev
427 | RF | Samara
707 | Belarus | Minsk
957 | RF | Irkutsk
437 | RF | Moscow
Now we need to select the store whose country equals `Ukraine`. There are two ways to access columns of a DataFrame:
- dot notation: dataset.field_name
- square brackets: dataset['field_name']
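A minimal sketch of the two access styles (dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute; bracket notation always works):

```python
import pandas as pd

df2 = pd.DataFrame({'shop': [347, 427], 'name': ['Kiev', 'Samara']})

by_attr = df2.shop     # dot notation
by_key = df2['shop']   # square brackets

# Both forms return the same Series object for the column
same = by_attr.equals(by_key)
```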
To filter a data set, you can also use square brackets containing a condition: dataset[condition]. The condition must reference field names, written in one of the formats described above. Thus, you can select the store of interest and replace its number like this:

```python
# .copy() avoids modifying a view of df2 (SettingWithCopyWarning)
t = df2[df2.country == 'Ukraine'].copy()
t['shop'] = 345
```
Executing this code produces a new intermediate data set t containing a single record:

shop | country | name
---|---|---
345 | Ukraine | Kiev
To add the record obtained in the previous step, execute the append() function, which takes as an argument the data set to be appended to the source:

```python
df2 = df2.append(t)
```
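Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on recent versions the equivalent of the call above is pd.concat (sketched here with stand-in data):

```python
import pandas as pd

df2 = pd.DataFrame({'shop': [347, 427],
                    'country': ['Ukraine', 'RF'],
                    'name': ['Kiev', 'Samara']})
t = pd.DataFrame({'shop': [345],
                  'country': ['Ukraine'],
                  'name': ['Kiev']})

# Equivalent of df2 = df2.append(t) on pandas >= 2.0
df2 = pd.concat([df2, t], ignore_index=True)
```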
Data aggregation
Now the quantities from the df1 data set can be pulled into our main store list df2. This can be done with the merge() function, which joins two data sets (an analogue of JOIN in SQL):

```python
res = df2.merge(df1, 'left', on='shop')
```
The function takes the following parameters:
- the data set to join to the source
- the join type
- the join field
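To see how the join type affects the result, here is a small sketch comparing a 'left' and an 'inner' join on the same data:

```python
import pandas as pd

df1 = pd.DataFrame({'shop': [427, 707], 'qty': [3, 4]})
df2 = pd.DataFrame({'shop': [347, 427, 707],
                    'name': ['Kiev', 'Samara', 'Minsk']})

# 'left' keeps every row of df2; shops missing from df1 get NaN in qty
left = df2.merge(df1, 'left', on='shop')

# 'inner' keeps only shops present in both sets
inner = df2.merge(df1, 'inner', on='shop')
```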
More information about the parameters can be found in the documentation. Before the final operation, the data set looks like this:

shop | country | name | qty
---|---|---|---
347 | Ukraine | Kiev | NaN
427 | RF | Samara | 3
707 | Belarus | Minsk | 4
957 | RF | Irkutsk | 2
437 | RF | Moscow | 1
345 | Ukraine | Kiev | NaN
It remains to build a pivot table to see how much each country shipped. For this there is the pivot_table() function. In our example, it takes the following parameters:
- the list of columns over which the aggregated value is calculated
- the list of columns that will become the rows of the pivot table
- the function used for aggregation
- a parameter that replaces empty values with 0
The code for building the pivot table looks like this:

```python
res.pivot_table(['qty'], ['country'], aggfunc='sum', fill_value=0)
```
The final table looks like this:

country | qty
---|---
Belarus | 4
RF | 6
Ukraine | 0
Conclusion
In conclusion, I would like to say that Pandas is a good alternative to Excel when working with large amounts of data. The functions shown here are just the tip of the iceberg that is Pandas. In the future, I plan to write a series of articles showing the full power of this package.