
Python: an assistant for finding cheap air tickets for those who love to travel

The author of this article, a translation of which we publish today, says her goal is to walk through the development of a Python web scraper that uses Selenium to search for airline ticket prices. The search uses flexible dates (±3 days relative to the specified dates). The scraper saves the results to an Excel file and emails the person who launched it a summary of what it managed to find. The objective of the project is to help travelers find the best deals.



If, while working through the material, you feel lost, take a look at this article.

What are we looking for?


You are free to use the system described here however you like. For example, I used it to search for weekend tours and for tickets to my hometown. If you are serious about finding the best tickets, you can run the script on a server (a simple server for 130 rubles a month is quite enough for this) and have it run once or twice a day. The search results will be emailed to you. In addition, I recommend setting everything up so that the script saves the Excel files with the results to a Dropbox folder, which lets you view them from anywhere at any time.

I haven't come across erroneous fares yet, but I believe it is possible

As already mentioned, the search uses "flexible dates": the script finds offers that fall within three days of the given dates. Although the script searches for offers in only one direction when run, it is easy to modify it to collect data for several flight directions. With it you can even look for erroneous fares; such finds can be very interesting.

Why do we need another web scraper?


When I first started web scraping, frankly, it did not interest me much. I wanted to do more projects in predictive modeling, financial analysis, and perhaps sentiment analysis of texts. But it turned out that figuring out how to create a program that collects data from websites is very interesting. As I delved into the topic, I realized that web scraping is the "engine" of the Internet.

Perhaps you think this is too bold a statement. But consider that Google started with a web scraper that Larry Page created using Java and Python. Google's robots have been crawling the Internet ever since, trying to provide their users with the best answers to their questions. Web scraping has an endless number of applications, and even if your interest in Data Science lies elsewhere, you will need some scraping skills to acquire data for analysis.

I found some of the techniques used here in an excellent book about web scraping that I recently acquired. It contains many simple examples and ideas for practical application of what you learn. There is also a very interesting chapter on bypassing reCaptcha checks. This was news to me: I did not know that there were special tools and even entire services for solving such problems.

Do you like to travel?!


The simple and fairly innocuous question in the heading of this section is often answered positively, accompanied by a couple of stories from the journeys of the person asked. Most of us would agree that traveling is a great way to immerse yourself in new cultural environments and broaden your own horizons. However, if you ask someone whether they like searching for flights, I am sure the answer will be far less enthusiastic. This is where Python comes to the rescue.

The first task on the way to creating a search system for air ticket information is choosing a suitable platform to take the information from. This was not easy for me, but in the end I chose the Kayak service. I tried Momondo, Skyscanner, Expedia, and a few others, but the anti-robot protection on those resources was impenetrable. After several attempts in which, trying to convince the systems that I was human, I had to deal with traffic lights, pedestrian crossings, and bicycles, I decided that Kayak suited me best, even though it, too, starts running checks if too many pages are loaded in a short time. I managed to have the bot send requests to the site at intervals of 4 to 6 hours, and everything worked fine. Difficulties arise periodically when working with Kayak, but if it starts pestering you with checks, you either need to deal with them manually and then start the bot, or wait a few hours until the checks stop. If necessary, you can easily adapt the code for another platform, and if you do, please report it in the comments.

If you are just getting acquainted with web scraping and do not know why some websites fight against it, then before starting your first project in this area, do yourself a favor and search Google for "web scraping etiquette". Your experiments may end sooner than you think if you scrape unwisely.

Getting started


Here is a general overview of what will happen in the code of our web scraper:


It should be noted that every Selenium project starts with a web driver. I use Chromedriver and work with Google Chrome, but there are other options: PhantomJS and Firefox are also popular. After downloading the driver, you need to place it in the appropriate folder, and that completes the preparation for its use. In the first lines of our script, a new Chrome tab is opened.

Keep in mind that in this story I am not trying to break new ground in the search for lucrative flight deals; there are much more advanced methods of searching for such offers. I just want to offer the readers of this material a simple but practical way to solve the problem.

Here is the code we talked about above.

from time import sleep, strftime
from random import randint
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import smtplib
from email.mime.multipart import MIMEMultipart

# Don't forget to point this at your own copy of chromedriver!
chromedriver_path = 'C:/{YOUR PATH HERE}/chromedriver_win32/chromedriver.exe'

driver = webdriver.Chrome(executable_path=chromedriver_path)  # opens a new Chrome window
sleep(2)

At the beginning of the code you can see the package import commands used throughout the project. randint is used to make the bot "fall asleep" for a random number of seconds before starting a new search operation; usually no bot can do without this. If you run the above code, a Chrome window will open, which the bot will use to work with the sites.

Let's do a small experiment and open the kayak.com website in a separate window. We will choose the city we are going to fly from and the city we want to fly to, as well as the flight dates. When choosing dates, make sure to use the ±3 day range. I wrote the code taking into account what the site returns for such requests. If, for example, you need to search for tickets for exact dates only, it is highly likely that you will have to modify the bot's code. As I talk through the code I give explanations, but if you feel confused, let me know.

Now click the button to start the search and look at the link in the address bar. It should be similar to the link I use in the example below, where the kayak variable storing the URL is declared and the web driver's get method is called. After you click the search button, the results should appear on the page.
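To make this concrete, here is a minimal sketch of that step. The city codes and dates are placeholder values (the same example values used at the end of the article), and the full URL construction appears in the start_kayak function discussed below.

city_from, city_to = 'LIS', 'SIN'                  # IATA airport codes, example values
date_start, date_end = '2019-08-21', '2019-09-07'  # YYYY-MM-DD
kayak = ('https://www.kayak.com/flights/' + city_from + '-' + city_to +
         '/' + date_start + '-flexible/' + date_end + '-flexible?sort=bestflight_a')
driver.get(kayak)  # open the search results page in the Chrome window started earlier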


When I used the get command more than two or three times within a few minutes, I was asked to pass a reCaptcha check. You can pass this check manually and continue experimenting until the system decides to arrange a new one. When I tested the script, I had the feeling that the first search session always runs without problems, so if you want to experiment with the code, you only need to check in manually from time to time and let the code run, using long intervals between search sessions. And, if you think about it, a person is unlikely to need ticket price information collected at 10-minute intervals between searches.

Working with the page using XPath


So, we have opened a window and loaded the site. In order to get prices and other information, we need to use XPath or CSS selectors. I decided to stick with XPath and did not feel the need to use CSS selectors, but it is quite possible to work that way. Navigating a page with XPath can be tricky, and even if you use the techniques I described in this article, where copying of the corresponding identifiers from the page code was used, I realized that this is actually not the best way to refer to the necessary elements. By the way, in this book you can find an excellent description of the basics of working with pages using XPath and CSS selectors. This is what the corresponding web driver method looks like.
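Since the original screenshot is not reproduced here, the following is only a minimal sketch of the call in question; the selector is the same "section duration" class that the page_scrape function uses later.

# Find every element matching an XPath expression and read its visible text
xp_sections = '//*[@class="section duration"]'
sections = driver.find_elements_by_xpath(xp_sections)
sections_list = [value.text for value in sections]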


So, we continue working on the bot. Let's use its capabilities to select the cheapest tickets. In the following image, the XPath selector code is highlighted in red. To view the code of an element, right-click on the page element you are interested in and select the Inspect command from the menu that appears. This command can be invoked for different page elements, whose code will then be displayed and highlighted in the code window.


View page code

In order to confirm my reasoning about the shortcomings of copying selectors from code, pay attention to the following features.

Here's what happens when you copy the code:

 //*[@id="wtKI-price_aTab"]/div[1]/div/div/div[1]/div/span/span 

To copy something similar, right-click on the code section you are interested in and select Copy > Copy XPath from the menu that appears.

Here is what I used to define the Cheapest button:

 cheap_results = '//a[@data-code = "price"]' 


Copy> Copy XPath command

Obviously, the second option looks much simpler. When it is used, the search is for an element a that has a data-code attribute equal to price. When the first option is used, the search is for an element whose id equals wtKI-price_aTab, and the XPath path to the element looks like /div[1]/div/div/div[1]/div/span/span. Such an XPath query to the page will do its job, but only once. I can say right away that the id will change the next time the page loads. The wtKI character sequence changes dynamically each time the page is loaded, so code that relies on it becomes useless after the next page reload. So take some time to figure out XPath. This knowledge will serve you well.
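For example, the attribute-based selector can be used to switch the results to the cheapest sorting; this exact call also appears later in start_kayak, and it keeps working after page reloads because it does not depend on the dynamically generated id.

cheap_results = '//a[@data-code = "price"]'
driver.find_element_by_xpath(cheap_results).click()  # switch to the "Cheapest" tab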

However, it should be noted that copying XPath selectors can be useful when working with fairly simple sites, and if this suits you, there is nothing wrong with that.

Now let's think about what to do if we want to get all the search results in several lines, inside a list. It is very simple: each result is located inside an object with the resultWrapper class. All results can be loaded in a loop that resembles the one shown below.

It should be noted that if you understand the above, you should understand most of the code we will analyze without any problems. As this code runs, we address what we need (in fact, the element in which a result is wrapped) using some mechanism for specifying a path (XPath). This is done in order to get the text of the element and put it into an object from which data can be read (flight_containers is used first, then flights_list).
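The original listing for this loop is not reproduced here, so the following is only a minimal sketch, assuming the results are wrapped in div elements with the resultWrapper class mentioned above.

# Collect every result container and keep its visible text
flight_containers = driver.find_elements_by_xpath('//div[@class = "resultWrapper"]')
flights_list = [flight.text for flight in flight_containers]

# Print the first three results as a quick sanity check
for flight in flights_list[:3]:
    print(flight)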


When the first three results are printed, we can clearly see everything we need. However, we also have more interesting ways of obtaining information: we need to take the data from each element separately.

To work!


It is easiest to start by writing a function that loads additional results, so let's begin with it. I want to maximize the number of flights the program collects information about, while not arousing the service's suspicions and triggering a check, so I click the Load more results button once each time a page is displayed. In this code, pay attention to the try block, which I added because sometimes the button does not load properly. If you run into this too, comment out the calls to this function inside the start_kayak function, which we will discuss below.

# Load more results to maximise the amount of data we collect
def load_more():
    try:
        more_results = '//a[@class = "moreButton"]'
        driver.find_element_by_xpath(more_results).click()
        # Print a message and sleep for a while so the page has time to load
        # and so that we don't bombard the site with requests
        print('sleeping.....')
        sleep(randint(45, 60))
    except:
        pass

Now, after this long discussion of that function (sometimes I get carried away), we are ready to declare the function that will do the actual scraping of the page.

I have already collected most of what is needed in the following function, called page_scrape. Sometimes the returned data about the legs of the trip come combined, so I use a simple method to separate them; this is, for example, where the variables section_a_list and section_b_list are used for the first time. The function returns a data frame flights_df, which allows us to separate the results obtained with different sorting methods and combine them later.

def page_scrape():
    """This function takes care of the scraping part"""

    xp_sections = '//*[@class="section duration"]'
    sections = driver.find_elements_by_xpath(xp_sections)
    sections_list = [value.text for value in sections]
    section_a_list = sections_list[::2]   # outbound legs (even indices)
    section_b_list = sections_list[1::2]  # return legs (odd indices)

    # If a reCaptcha check appears, there are no results to scrape.
    # In that case section_a_list will be empty, so the if statement below
    # raises SystemExit to stop the script instead of continuing with no data.
    if section_a_list == []:
        raise SystemExit

    # Durations and city pairs for legs A (outbound) and B (return)
    a_duration = []
    a_section_names = []
    for n in section_a_list:
        # The first two tokens are the duration, the next ones are the city names
        a_section_names.append(''.join(n.split()[2:5]))
        a_duration.append(''.join(n.split()[0:2]))
    b_duration = []
    b_section_names = []
    for n in section_b_list:
        b_section_names.append(''.join(n.split()[2:5]))
        b_duration.append(''.join(n.split()[0:2]))

    xp_dates = '//div[@class="section date"]'
    dates = driver.find_elements_by_xpath(xp_dates)
    dates_list = [value.text for value in dates]
    a_date_list = dates_list[::2]
    b_date_list = dates_list[1::2]
    # Split each date string into the day and the weekday
    a_day = [value.split()[0] for value in a_date_list]
    a_weekday = [value.split()[1] for value in a_date_list]
    b_day = [value.split()[0] for value in b_date_list]
    b_weekday = [value.split()[1] for value in b_date_list]

    # Ticket prices
    xp_prices = '//a[@class="booking-link"]/span[@class="price option-text"]'
    prices = driver.find_elements_by_xpath(xp_prices)
    prices_list = [price.text.replace('$', '') for price in prices if price.text != '']
    prices_list = list(map(int, prices_list))

    # Number of stops: take the first character; 'n' (nonstop) becomes 0
    xp_stops = '//div[@class="section stops"]/div[1]'
    stops = driver.find_elements_by_xpath(xp_stops)
    stops_list = [stop.text[0].replace('n', '0') for stop in stops]
    a_stop_list = stops_list[::2]
    b_stop_list = stops_list[1::2]

    xp_stops_cities = '//div[@class="section stops"]/div[2]'
    stops_cities = driver.find_elements_by_xpath(xp_stops_cities)
    stops_cities_list = [stop.text for stop in stops_cities]
    a_stop_name_list = stops_cities_list[::2]
    b_stop_name_list = stops_cities_list[1::2]

    # Times and carriers sit in the same block, separated by a newline
    xp_schedule = '//div[@class="section times"]'
    schedules = driver.find_elements_by_xpath(xp_schedule)
    hours_list = []
    carrier_list = []
    for schedule in schedules:
        hours_list.append(schedule.text.split('\n')[0])
        carrier_list.append(schedule.text.split('\n')[1])
    # Split the combined lists into legs a and b
    a_hours = hours_list[::2]
    a_carrier = carrier_list[::2]
    b_hours = hours_list[1::2]
    b_carrier = carrier_list[1::2]

    cols = (['Out Day', 'Out Time', 'Out Weekday', 'Out Airline', 'Out Cities', 'Out Duration', 'Out Stops', 'Out Stop Cities',
             'Return Day', 'Return Time', 'Return Weekday', 'Return Airline', 'Return Cities', 'Return Duration', 'Return Stops', 'Return Stop Cities',
             'Price'])

    flights_df = pd.DataFrame({'Out Day': a_day,
                               'Out Weekday': a_weekday,
                               'Out Duration': a_duration,
                               'Out Cities': a_section_names,
                               'Return Day': b_day,
                               'Return Weekday': b_weekday,
                               'Return Duration': b_duration,
                               'Return Cities': b_section_names,
                               'Out Stops': a_stop_list,
                               'Out Stop Cities': a_stop_name_list,
                               'Return Stops': b_stop_list,
                               'Return Stop Cities': b_stop_name_list,
                               'Out Time': a_hours,
                               'Out Airline': a_carrier,
                               'Return Time': b_hours,
                               'Return Airline': b_carrier,
                               'Price': prices_list})[cols]

    flights_df['timestamp'] = strftime("%Y%m%d-%H%M")  # when the data was collected
    return flights_df

I tried to name the variables so that the code would be understandable. Remember that variables starting with a refer to the first leg of the trip, and those starting with b to the second. Let's move on to the next function.

Auxiliary mechanisms


We now have a function that loads additional search results and a function to process those results. This article could have ended here, since these two functions provide everything needed for scraping pages that you can open yourself. But we have not yet considered some of the auxiliary mechanisms mentioned above, for example, the code for sending emails and a few other things. All of this can be found in the start_kayak function, which we will now consider.

This function needs information about cities and dates. Using that information, it forms a link in the kayak variable, which is used to open the page with the search results sorted by best match to the query. After the first scraping session, we work with the prices in the table at the top of the page: we find the minimum ticket price and the average price. All of this, together with the prediction issued by the site, will be sent by e-mail. On the page, the corresponding table should be in the upper left corner. Working with this table, by the way, can cause an error when searching with exact dates, since in that case the table is not displayed on the page.

def start_kayak(city_from, city_to, date_start, date_end):
    """City codes - it's the IATA codes!
    Date format - YYYY-MM-DD"""

    kayak = ('https://www.kayak.com/flights/' + city_from + '-' + city_to +
             '/' + date_start + '-flexible/' + date_end + '-flexible?sort=bestflight_a')
    driver.get(kayak)
    sleep(randint(8, 10))

    # If a popup appears right after the page loads, the try block closes it
    try:
        xp_popup_close = '//button[contains(@id,"dialog-close") and contains(@class,"Button-No-Standard-Style close ")]'
        driver.find_elements_by_xpath(xp_popup_close)[5].click()
    except Exception as e:
        pass
    sleep(randint(60, 95))
    print('loading more.....')

    load_more()  # load additional results

    print('starting first scrape.....')
    df_flights_best = page_scrape()
    df_flights_best['sort'] = 'best'
    sleep(randint(60, 80))

    # Take the prices from the matrix at the top of the page
    # to get the minimum and the average price
    matrix = driver.find_elements_by_xpath('//*[contains(@id,"FlexMatrixCell")]')
    matrix_prices = [price.text.replace('$', '') for price in matrix]
    matrix_prices = list(map(int, matrix_prices))
    matrix_min = min(matrix_prices)
    matrix_avg = sum(matrix_prices) / len(matrix_prices)

    print('switching to cheapest results.....')
    cheap_results = '//a[@data-code = "price"]'
    driver.find_element_by_xpath(cheap_results).click()
    sleep(randint(60, 90))
    print('loading more.....')

    load_more()  # load additional results

    print('starting second scrape.....')
    df_flights_cheap = page_scrape()
    df_flights_cheap['sort'] = 'cheap'
    sleep(randint(60, 80))

    print('switching to quickest results.....')
    quick_results = '//a[@data-code = "duration"]'
    driver.find_element_by_xpath(quick_results).click()
    sleep(randint(60, 90))
    print('loading more.....')

    load_more()  # load additional results

    print('starting third scrape.....')
    df_flights_fast = page_scrape()
    df_flights_fast['sort'] = 'fast'
    sleep(randint(60, 80))

    # Save a fresh Excel file with everything collected during this run
    final_df = df_flights_cheap.append(df_flights_best).append(df_flights_fast)
    final_df.to_excel('search_backups//{}_flights_{}-{}_from_{}_to_{}.xlsx'.format(strftime("%Y%m%d-%H%M"),
                                                                                   city_from, city_to,
                                                                                   date_start, date_end), index=False)
    print('saved df.....')

    # Grab Kayak's advice text and price prediction to include them in the email
    xp_loading = '//div[contains(@id,"advice")]'
    loading = driver.find_element_by_xpath(xp_loading).text
    xp_prediction = '//span[@class="info-text"]'
    prediction = driver.find_element_by_xpath(xp_prediction).text
    print(loading + '\n' + prediction)

    # Sometimes the advice element contains only a shrug emoticon;
    # replace it with "Not sure" so the email still reads sensibly
    weird = 'ĀÆ\\_(惄)_/ĀÆ'
    if loading == weird:
        loading = 'Not sure'

    username = 'YOUREMAIL@hotmail.com'
    password = 'YOUR PASSWORD'
    server = smtplib.SMTP('smtp.outlook.com', 587)
    server.ehlo()
    server.starttls()
    server.login(username, password)
    msg = ('Subject: Flight Scraper\n\n\
Cheapest Flight: {}\nAverage Price: {}\n\nRecommendation: {}\n\nEnd of message'.format(matrix_min, matrix_avg, (loading + '\n' + prediction)))
    # Note: the MIMEMultipart object is created here, but it is the plain msg string
    # that actually gets sent
    message = MIMEMultipart()
    message['From'] = 'YOUREMAIL@hotmail.com'
    message['to'] = 'YOUROTHEREMAIL@domain.com'
    server.sendmail('YOUREMAIL@hotmail.com', 'YOUROTHEREMAIL@domain.com', msg)
    print('sent email.....')

I tested this script using an Outlook account (hotmail.com). I did not check that it works correctly with a Gmail account; that mail system is very popular, but there are many possible options. If you use a Hotmail account, you only need to enter your details into the code for it to work.

If you want to figure out what exactly happens in particular parts of this function's code, copy them out and experiment with them. Experimenting with the code is the only way to truly understand it.

The finished system


Now that everything described above has been done, we can create a simple loop that calls our functions. The script asks the user for the cities and dates. When testing with constant restarts of the script, you are unlikely to want to enter this data manually each time, so for the duration of testing the corresponding lines can be commented out and the lines below them uncommented, in which the required data are hard-coded.

city_from = input('From which city? ')
city_to = input('Where to? ')
date_start = input('Search around which departure date? Please use YYYY-MM-DD format only ')
date_end = input('Return when? Please use YYYY-MM-DD format only ')

# city_from = 'LIS'
# city_to = 'SIN'
# date_start = '2019-08-21'
# date_end = '2019-09-07'

for n in range(0, 5):
    start_kayak(city_from, city_to, date_start, date_end)
    print('iteration {} was complete @ {}'.format(n, strftime("%Y%m%d-%H%M")))

    # Wait 4 hours before the next search
    sleep(60 * 60 * 4)
    print('sleep finished.....')

Here is the test run of the script.

Test run of the script

Results


If you have made it this far, congratulations! You now have a working web scraper, although I can already see many ways to improve it. For example, it could be integrated with Twilio so that it sends text messages instead of emails. You could use a VPN or something else to receive results from several servers simultaneously. There is also the periodically recurring problem of checking whether the site's user is a person, but this problem can be solved. In any case, you now have a base that you can expand if you wish, for example, by making the Excel file go to the user as an email attachment.
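As an illustration of that last idea, here is a rough sketch of sending the saved Excel file as an attachment using only the standard library. The file name, addresses, and password are placeholders, and the SMTP settings simply mirror the Outlook setup used in start_kayak above.

import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.base import MIMEBase
from email import encoders

def send_excel(filename, username, password, recipient):
    # Build a multipart message and attach the Excel file to it
    msg = MIMEMultipart()
    msg['From'] = username
    msg['To'] = recipient
    msg['Subject'] = 'Flight Scraper results'
    with open(filename, 'rb') as f:
        part = MIMEBase('application', 'octet-stream')
        part.set_payload(f.read())
    encoders.encode_base64(part)
    part.add_header('Content-Disposition', 'attachment', filename=filename)
    msg.attach(part)
    # Same SMTP settings as in start_kayak (Outlook / hotmail.com)
    server = smtplib.SMTP('smtp.outlook.com', 587)
    server.ehlo()
    server.starttls()
    server.login(username, password)
    server.sendmail(username, recipient, msg.as_string())
    server.quit()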

Source: https://habr.com/ru/post/451872/

