
Retrieving machine learning data

Want to learn about three methods of obtaining data for your next ML project? Then read this translation of Rebecca Vickery's article, published on the Towards Data Science blog on Medium. It will be especially interesting for beginners.





Obtaining high-quality data is the first and most important step in any machine learning project. Data science specialists use various methods to obtain datasets. They may use publicly available data, data accessed through an API, or data obtained from various databases, but most often they combine these methods.



The purpose of this article is to give a brief overview of three methods of data extraction using Python. I will show how to do this in a Jupyter Notebook. In my previous article I wrote about using some commands run in the terminal.


SQL



If you need to get data from a relational database, you will most likely work with SQL. The SQLAlchemy library lets you connect the code in your notebook to the most common types of databases. The SQLAlchemy documentation lists which databases are supported and how to connect to each type.



You can use SQLAlchemy to browse tables and query data, or to write raw queries. To connect to a database you will need a URL with your credentials, which you then pass to the create_engine function to create a connection.



from sqlalchemy import create_engine

engine = create_engine('dialect+driver://username:password@host:port/database')
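
If you want to look around the database before writing queries, SQLAlchemy can list the tables it can see. A small sketch, assuming the engine created above:

from sqlalchemy import inspect

# List the tables available in the connected database
inspector = inspect(engine)
print(inspector.get_table_names())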


Now you can write database queries and get results.



connection = engine.connect()
result = connection.execute("select * from my_table")
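
If the goal is analysis in pandas, the same query can also be read straight into a data frame. A minimal sketch, assuming the engine created above and a table called my_table:

import pandas as pd

# Read the query results directly into a DataFrame
df = pd.read_sql("select * from my_table", engine)
print(df.head())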


Web scraping



Web scraping is used to download data from websites and extract the information you need from their pages. There are many Python libraries available for this, but the simplest one is Beautiful Soup.



You can install the package via pip.



 pip install BeautifulSoup4 


Let's take a simple example to figure out how to use it. We are going to use Beautiful Soup and the urllib library to scrape hotel names and prices from the TripAdvisor website.



First we import all the libraries we are going to work with.



from bs4 import BeautifulSoup
import urllib.request


Now we load the content of the page that we are going to scrape. I want to collect data on hotel prices on the Greek island of Crete, so I take the URL of the page listing the hotels there.



The code below stores the URL in a variable, uses the urllib library to open the page, and uses Beautiful Soup to parse it and print the result in a readable format.



URL = 'https://www.tripadvisor.co.uk/Hotels-g189413-Crete-Hotels.html'
page = urllib.request.urlopen(URL)
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())


Now let's get a list of the hotel names on the page. We will use the find_all function, which extracts the parts of the document we are interested in. You can filter in different ways by passing find_all a string, a regular expression, or a list. You can also filter on one of a tag's attributes, and that is exactly the method we apply here. If you are not familiar with HTML tags and attributes, refer to this article for a brief overview.
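
For illustration, the three kinds of filters look like this when applied to the soup object created above:

import re

# A plain string matches tag names exactly
all_divs = soup.find_all('div')

# A regular expression matches any tag name fitting the pattern
headings = soup.find_all(re.compile('^h[1-6]$'))

# A list matches any of the given tag names
links_and_images = soup.find_all(['a', 'img'])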



To understand how best to access the data inside the tag, we need to inspect the code for that element on the page. We find the code for a hotel name by right-clicking the name in the list and choosing Inspect.



After clicking Inspect, the page code appears, with the section containing the hotel name highlighted.



We can see that the hotel name is the only piece of text inside the element with the class listing_title. So we pass that class name and attribute to the find_all function, along with the div tag.



content_name = soup.find_all('div', attrs={'class': 'listing_title'})
print(content_name)


Each piece of code containing a hotel name is returned in a list.



To extract the hotel names from this code, we will use Beautiful Soup's getText function.



content_name_list = []
for div in content_name:
    content_name_list.append(div.getText().split('\n')[0])
print(content_name_list)


Hotel names are returned as a list.



We obtain the price data in the same way. Inspecting a price element shows that it sits in a div with the class price-wrap.



As you can see, the code is very similar to the one we used for the hotel names.



content_price = soup.find_all('div', attrs={'class': 'price-wrap'})
print(content_price)


In the case of prices there is a small complication. You can see it by running the following code:



content_price_list = []
for div in content_price:
    content_price_list.append(div.getText().split('\n')[0])
print(content_price_list)


For hotels that are advertising a price reduction, the returned text contains both the original price and the discounted price, along with some extra text. To fix this, we simply keep the price that applies today.



We can use some simple logic to take the last price given in the text.



content_price_list = []
for a in content_price:
    a_split = a.getText().split('\n')[0]
    if len(a_split) > 5:
        content_price_list.append(a_split[-4:])
    else:
        content_price_list.append(a_split)
print(content_price_list)


This leaves us with a clean list of current prices.
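
With both lists in hand, a natural next step is to put the names and prices side by side. A minimal sketch, assuming the two lists built above line up one to one:

import pandas as pd

# Combine the scraped names and prices into one table
hotels_df = pd.DataFrame({
    'hotel': content_name_list,
    'price': content_price_list,
})
print(hotels_df.head())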



API



API stands for application programming interface. From a data-retrieval point of view, it is a web-based system that provides data endpoints you can communicate with programmatically. The data is usually returned in JSON or XML format.



This method is likely to come in handy in your machine learning work. I'll give a simple example of retrieving weather data from the Dark Sky public API. To connect to it you need to register; you then get 1,000 free calls per day, which should be enough for this example.



To access the data from Dark Sky, I will use the requests library. First I need to get the correct URL for the request. In addition to forecasts, Dark Sky provides historical weather data. In this example I'll use the historical data, taking the correct URL from the documentation.



The structure of this URL is:



 https://api.darksky.net/forecast/[key]/[latitude],[longitude],[time] 
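
For example, the URL can be assembled from its parts with an f-string. The key and coordinates below are illustrative placeholders, not real values:

# Hypothetical placeholders: substitute your own key and coordinates
API_KEY = 'your-dark-sky-key'
latitude, longitude = 35.3378, 25.3741
timestamp = '2019-07-01T12:00:00'

request_url = f'https://api.darksky.net/forecast/{API_KEY}/{latitude},{longitude},{timestamp}'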


We will use the requests library to get results for a specific latitude and longitude, as well as a date and time. Imagine that, having extracted daily data on hotel prices in Crete, we decided to find out whether the pricing policy is related to the weather.



For example, let's take the coordinates of one of the hotels in our list, Mitsis Laguna Resort & Spa.



First we create a URL with the correct coordinates and the requested date and time, then use the requests library to fetch the data in JSON format.



import requests

request_url = 'https://api.darksky.net/forecast/fd82a22de40c6dca7d1ae392ad83eeb3/35.3378,-25.3741,2019-07-01T12:00:00'
result = requests.get(request_url).json()
result
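
In practice it is worth checking that the call succeeded before parsing the response. A small, more defensive variation on the request above:

response = requests.get(request_url)
response.raise_for_status()  # raises an exception on 4xx/5xx responses
result = response.json()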


To make the results easier to read and analyze, we can convert the data into a data frame.



import pandas as pd

# json_normalize flattens the nested JSON into dotted column names
df = pd.json_normalize(result)
df.head()
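
From the flattened frame you can pull out individual fields. For instance, assuming the response includes a currently block, as described in the Dark Sky documentation, the temperature at the requested time is available like this:

# Nested objects become dotted column names, e.g. 'currently.temperature'
print(df['currently.temperature'])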


There are many more ways to automate data extraction with these methods. In the case of web scraping, you can write functions to automate the process and to retrieve data for more days and/or places. In this article I wanted to give an overview with enough code examples. In the following materials I will go into more detail and explain how to create large datasets and analyze them using the methods described above.



Thank you for your attention!

Source: https://habr.com/ru/post/460675/


