
Using Python and Excel to process and analyze data. Part 1: Import Data and Set Up Environment

If you are just starting to explore what Python can do and your knowledge is still at a beginner level, this material is for you. In this article we describe how to extract information from data stored in Excel files and work with it using basic library functionality. In the first part of the article we cover installing the necessary libraries and setting up the environment. In the second part, we will give an overview of libraries that can be used to load tables from files and write them back using Python, and show how to work with libraries such as pandas, openpyxl, xlrd, xlutils, and pyexcel.

At some point you will inevitably encounter the need to work with Excel data, and there is no guarantee that working with such data storage formats will bring you pleasure. Therefore, Python developers have implemented a convenient way to read, edit, and perform other manipulations not only with Excel files, but also with other file types.

Starting point - data availability

TRANSLATION
Original article: www.datacamp.com/community/tutorials/python-excel-tutorial
Author: Karlijn Willems
When you start a data analysis project, you often deal with statistics that have already been collected, perhaps by analytics counters, or with data exported from services like Kaggle, Quandl, and so on. But a lot of data still lives in Google searches or in repositories shared by other users. This data can be in Excel format or in a file with a .csv extension.

There is data, a lot of data, and you do not feel like analyzing it at all. Where to begin? The first step in data analysis is verification. In other words, you need to check the quality of the incoming data.
If the data is stored in a table, you need not only to confirm its quality (to be sure the table data will answer the question posed for the study), but also to evaluate whether you can trust this data.

Checking the quality of the table

To check the quality of a table, a simple checklist is usually used. Does the data in the table meet the following conditions:

- the table represents static data rather than live links to external sources;
- the table does not mix data, calculations, and reporting;
- the data in the table is complete and consistent;
- the worksheets in the file are structured systematically;
- the formulas used in the table, if any, are valid.
Answers to these simple questions will let you understand whether your table contradicts the standard. Of course, this checklist is not exhaustive: there are many other rules you can check the table data against to make sure the table is not an “ugly duckling”. However, the checklist above is most relevant if you want to make sure that the table contains quality data.

Best practices for tabular data

Reading these tables with Python is all well and good. But we also want to edit the data. Moreover, the data in the table should be edited so that it meets the following conditions:

- the first row of the table is reserved for headers, and the first column is used to identify the sampled unit;
- names, values, and fields with spaces are avoided, otherwise each word may be interpreted as a separate variable;
- names are kept short;
- names avoid symbols such as ?, $, %, ^, &, *, (, ), -, #, <, >, /, |, \, [, ], {, and };
- any comments made in the file are deleted, to avoid extra columns or NA values;
- all missing values in the dataset are marked as NA.
After making the necessary changes (or when you carefully review your data), make sure that the changes made are saved. This is important because it will allow you to take another look at the data, edit, add or make changes, if necessary, keeping the formulas that may have been used for the calculation.

If you are working with Microsoft Excel, you probably know that there are a large number of options for saving a file in addition to the default extensions .xls or .xlsx: go to “File” → “Save As” and choose another extension (the extensions most commonly used to save data for analysis are .csv and .txt). Depending on the saving option, the data fields will be separated by tabs or commas, which make up the “delimiter” of the fields. So, the data is checked and saved. Now we can start preparing the workspace.

Workspace preparation

Preparing the workspace is one of the first things to do to be confident in the quality of the analysis results.

The first step is to check the working directory.

When you work in a terminal, you can first go to the directory where your file is located, and then start Python. In this case, you need to make sure that the file is located in the directory from which you want to work.

To verify, give the following commands:

# Import `os`
import os

# Retrieve current working directory (`cwd`)
cwd = os.getcwd()
cwd

# Change directory
os.chdir("/path/to/your/folder")

# List all files and directories in current directory
os.listdir('.')

These commands are important not only for loading data, but also for further analysis. So, you have passed all the checks, you have saved the data and prepared the workspace. Can I start reading data in Python? :) Unfortunately, not yet. One more thing to do.

Installing packages for reading and writing Excel files

Even though you do not yet know which libraries you will need to import the data, you should make sure everything is ready to install them. If you have Python 2 >= 2.7.9 or Python 3 >= 3.4 installed, there is no cause for concern: these versions usually come with pip already prepared. So just make sure you have updated to the latest version :)

To do this, run the following command on your computer:

# For Linux/OS X
pip install -U pip setuptools

# For Windows
python -m pip install -U pip setuptools

If you have not installed pip yet, run the get-pip.py Python script, which you can find here (installation instructions and help are there too).

Anaconda installation

Installing the Anaconda Python distribution is an alternative option if you use Python for data analysis. It is an easy and fast way to get started with data analysis, since there is no need to install the packages needed for data science one by one.

This is especially convenient for newcomers, but even experienced developers often take this route, because Anaconda is a convenient way to quickly test things without having to install each package separately.

Anaconda includes 100 of the most popular Python, R, and Scala libraries for data analysis, along with several open-source development environments such as Jupyter and Spyder. If you want to get started with Jupyter Notebook, go here.

To install Anaconda, go here.

Loading Excel files as Pandas DataFrames

Well, we did everything to set up the environment! Now it's time to start importing files.

One of the ways that you will often use to import files for data analysis is to import using the Pandas library (Pandas is a Python software library for processing and analyzing data). Pandas works with data on top of the NumPy library, which is a lower level tool. Pandas is a powerful and flexible library and it is very often used to structure data in order to facilitate analysis.
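To get a feel for what a DataFrame is, here is a minimal sketch; the column names and values are made up purely for illustration:

# Import pandas
import pandas as pd

# A tiny, made-up table: two named columns, two rows
df = pd.DataFrame({
    "name": ["Anna", "Boris"],
    "score": [85, 92]
})

# Inspect the tabular view and the column types pandas inferred
print(df)
print(df.dtypes)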

If you already have Pandas through Anaconda, you can simply load the files into Pandas DataFrames with pd.ExcelFile():

# Import pandas
import pandas as pd

# Assign spreadsheet filename to `file`
file = 'example.xlsx'

# Load spreadsheet
xl = pd.ExcelFile(file)

# Print the sheet names
print(xl.sheet_names)

# Load a sheet into a DataFrame by name: df1
df1 = xl.parse('Sheet1')

If you did not install Anaconda, simply run pip install pandas to install the Pandas package in your environment, and then run the commands above.
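As an alternative sketch, recent versions of Pandas can also load a single sheet in one call with read_excel(); the example.xlsx and Sheet1 names below are the same example names used above:

# Import pandas
import pandas as pd

# Read one sheet of the workbook straight into a DataFrame
df1 = pd.read_excel('example.xlsx', sheet_name='Sheet1')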

For reading .csv files there is a similar function that loads the data into a DataFrame: read_csv(). Here is an example of how you can use this function:

# Import pandas
import pandas as pd

# Load csv
df = pd.read_csv("example.csv")

The separator this function uses is a comma by default, but you can specify an alternative separator if you wish. Go to the documentation if you want to know what other arguments you can specify for the import.
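For example, here is a minimal sketch for a tab-separated file, assuming a hypothetical example.txt saved with tab delimiters:

# Import pandas
import pandas as pd

# Override the default comma with a tab separator
df = pd.read_csv("example.txt", sep="\t")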

How to write Pandas DataFrame to Excel file

Suppose that, after analyzing the data, you want to write the data to a new file. There is a way to write Pandas DataFrames to a file (using the to_excel function). But before using this function, make sure you have XlsxWriter installed if you want to write your data to several sheets of an .xlsx file:

# Install `XlsxWriter` first (a shell command, not Python):
# pip install XlsxWriter

# Import pandas
import pandas as pd

# Specify a writer
writer = pd.ExcelWriter('example.xlsx', engine='xlsxwriter')

# Write your DataFrame (`yourData`) to a file
yourData.to_excel(writer, 'Sheet1')

# Save the result
writer.save()

Notice that the code snippet uses an ExcelWriter object to output the DataFrame. In other words, you pass the writer variable to the to_excel() function and specify the sheet name. This way, you add a sheet of data to an existing workbook. You can also use ExcelWriter to save several different DataFrames to one workbook.
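A minimal sketch of that, where df1 and df2 are hypothetical DataFrames standing in for your own data:

# Import pandas
import pandas as pd

# Two hypothetical DataFrames standing in for your data
df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'b': [3, 4]})

# One writer, several sheets in the same workbook
writer = pd.ExcelWriter('example.xlsx', engine='xlsxwriter')
df1.to_excel(writer, 'Sheet1')
df2.to_excel(writer, 'Sheet2')

# Save the result (newer versions of pandas use writer.close() instead)
writer.save()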

That is, if you just want to save a single DataFrame to a file, you can do without installing the XlsxWriter library: simply do not specify the engine argument passed to the pd.ExcelWriter() function; the remaining steps stay unchanged.
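A minimal sketch of that shortcut, with df again standing in for your DataFrame; Pandas will pick an installed engine by itself:

# Import pandas
import pandas as pd

# A hypothetical DataFrame standing in for your data
df = pd.DataFrame({'a': [1, 2]})

# Write a single sheet without creating an ExcelWriter explicitly
df.to_excel('example.xlsx', sheet_name='Sheet1')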

Like the functions used for reading .csv files, there is also a to_csv() function for writing results back to a comma-delimited file. It works the same way as the function we used to read the file:

# Write the DataFrame to csv
df.to_csv("example.csv")

If you want a tab-separated file instead, you can pass \t to the sep argument. Note that there are various other functions that can be used to output files. They can be found here.
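A minimal sketch, where df is the DataFrame from the example above and example.txt is an assumed output name:

# Write the DataFrame to a tab-separated file
df.to_csv("example.txt", sep="\t")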

Using the virtual environment

The general advice on installing libraries is to install them into a virtual Python environment, separate from the system libraries. You can use virtualenv to create isolated Python environments: it creates a folder containing everything needed to use the libraries you want for Python.

To get started with virtualenv, you first need to install it. Then go to the directory where your project will live. Create a virtualenv in this folder and, if needed, point it at a specific version of Python. After that, activate the virtual environment. Now you can install the other libraries and start working with them.

Remember to turn off the environment when you're done!

# Install virtualenv
$ pip install virtualenv

# Go to the folder of your project
$ cd my_folder

# Create a virtual environment `venv`
$ virtualenv venv

# Indicate the Python interpreter to use for `venv`
$ virtualenv -p /usr/bin/python2.7 venv

# Activate `venv`
$ source venv/bin/activate

# Deactivate `venv`
$ deactivate

Note that a virtual environment may seem troublesome at first if you are taking your first steps in data analysis with Python. Especially if you have only one project, you may not understand why you need a virtual environment at all.

But what if you have multiple projects running at the same time and you don't want them to use the same Python installation? Or what if your projects have conflicting requirements? In such cases, a virtual environment is the ideal solution.

In the second part of the article we will talk about the main libraries for data analysis.
To be continued…

Source: https://habr.com/ru/post/331746/

