📜 ⬆️ ⬇️

Introduction to data analysis with Pandas

Today we will talk about the package Pandas . This package makes Python a powerful data analysis tool. The package allows you to build pivot tables, perform groupings, provides convenient access to tabular data, and, if available, the matplotlib package allows you to draw graphs on the resulting data sets. Next, the basics of working with the package will be shown, such as loading data, accessing fields, filtering and building pivot ones.

Basic data structures and their loading


First, let's say a couple of words about the storage structure in Pandas. The main ones are Series and DataFrame.
Series is an indexed one-dimensional array of values. It is similar to a simple dict type dictionary, where the element name will correspond to the index, and the value to the record value.
A DataFrame is an indexed multidimensional array of values, and therefore each DataFrame column is a Series structure.
So, with the structures a bit sorted out. We proceed directly to working with the package. To start analyzing any data you need to download them. Pandas provides a wide range of data sources, for example:

More information about them can be found in the documentation .
For example, download 2 text files. This can be done with the read_csv () function:
from pandas import read_csv df1 = read_csv("df1.txt") df2 = read_csv("df2.txt",";") #    

Now we have 2 df1 datasets containing stores and the number of shipments:
shopqty
4273
707four
9572
437one

And the df2 containing the store and its city:
shopname
347Kiev
427Samara
707Minsk
957Irkutsk
437Moscow

Basic dataset operations


You can perform various actions on datasets, such as merging, adding columns, adding records, filtering, building aggregates, and others. Let us now, in order to demonstrate all the possibilities described above, the following tasks:
  1. in the set with the cities of the stores, add the field `country` and fill in the relevant countries
  2. choose the Ukrainian store and change its number
  3. add the store, obtained in the previous step, to the general list
  4. add the amount of df1 to the df2 set
  5. we will build a summary table by country and number of shipments

So, to add a new column to the data set, there is an insert () command:
 country = [u'',u'',u'',u'',u''] df2.insert(1,'country',country) 

In our case, the function is passed 3 arguments:
  1. position number where new column will be inserted
  2. new column name
  3. an array of column values ​​(in our case, this is a regular list)

View df2 after performing the above operations:
shopcountryname
347UkraineKiev
427RFSamara
707BelarusMinsk
957RFIrkutsk
437RFMoscow

Now you need to choose a store whose country will be equal to `Ukraine`. There are 2 ways to access the columns in the DataFrame:

In order to filter the dataset, you can use the square brackets inside which there will be a Dataset [condition] condition . The condition must contain the field names, in the format described above, and the condition imposed on them. Thus, you can choose to take the store of interest and replace its number with the following:
 t = df2[df2.country == u''] t.shop = 345 

The result of the execution of this code will be a new intermediate data set t, containing one record:
shopcountryname
345UkraineKiev

In order to add the record obtained in the previous step, you need to execute the function append (), which takes as an argument the data set that needs to be added to the source:
 df2 = df2.append(t) 

Data aggregation


Now to our main df2 store list, you can pull up the quantity from the df1 dataset. This can be done using the merge () function, which combines two data sets (analog of join in SQL):
 res = df2.merge(df1, 'left', on='shop') 

As parameters, the function takes:

More information about the parameters can be found in the documentation. The data set before the final operation looks like this:
shopcountrynameqty
347UkraineKievNaN
427RFSamara3
707BelarusMinskfour
957RFIrkutsk2
437RFMoscowone
345UkraineKievNaN

It remains to build a pivot table in order to understand how much each country has shipped. For this there is a function pivot_table () . In our example, the function takes as parameters:

The code for building a summary looks like this:
 res.pivot_table(['qty'],['country'], aggfunc='sum', fill_value = 0) 

The final table will look like this:
countryqty
Belarusfour
RF6
Ukraine0

Conclusion


As a conclusion, I would like to say that Pandas is a good alternative to Excel when working with large amounts of data. The functions shown are just the tip of the iceberg called Pandas. In the future, I plan to write a series of articles in which the full power of this package will be shown.

')

Source: https://habr.com/ru/post/196980/


All Articles