
Watir: easy parsing of complex sites

Anyone who writes parsers knows that you can parse a hundred sites and then get stuck for several days on the hundred-and-first. The structure of the next nasty site can be arbitrarily complex, and once compressed JavaScript and AJAX requests are involved, decoding them and extracting the data with plain curl and regexps can cost more than the information itself is worth.

Roughly speaking, the problem is that JavaScript runs in the browser, not on the server. You either need a JS interpreter written in one of the server-side languages (jParser and jTokenizer), or you put a browser on the server, send requests to it and pull out the resulting DOM tree.

In the old days we reinvented the wheel for such cases: on a separate machine we ran a browser with a JS script in it that constantly polled the server and received tasks (jobs) from it. The target site itself was loaded into an iframe, and the script sent the DOM tree from the outside back to the server.

Now there are more advanced tools: xulrunner (crowbar) and watir. The first is a headless Firefox. Crowbar even has a Firefox plugin for visually selecting the data you need, which generates special parser JS code; however, cookies are not supported there, and I did not feel like patching it. Watir is positioned by its developers as a debugging tool, but we will use it for its intended purpose and, as an example, pull some data from travelocity.com.

Watir is a Ruby gem through which you interact with the browser. There are versions for different platforms: watir, firewatir and safariwatir. Despite the detailed installation manual, I had problems on both Windows and Ubuntu. On Windows (IE6), watir does not work with Ruby 1.9.1; I had to install version 1.8.6, and then it worked. On Ubuntu, for FireWatir (or regular watir via Firefox) to work, you need to install the jssh plugin in the browser. But the version offered for FireWatir on the install page did not work with my Firefox 3.6 on Ubuntu 10.04.

To check whether jssh works for you, launch firefox -jssh and then send something to port 9997 (telnet localhost 9997). If the port does not open, or Firefox crashes (as mine did), you need to build your own jssh; detailed build instructions are here.
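The same check can be done from Ruby instead of telnet. This is a minimal sketch using only the standard library; the helper name `port_open?` and the two-second timeout are assumptions made for illustration, and only port 9997 comes from the text above.

```ruby
require "socket"

# Port that jssh listens on, as used with telnet in the article.
JSSH_PORT = 9997

# Hypothetical helper: true if something accepts TCP connections on host:port.
def port_open?(host, port, timeout_sec = 2)
  # Socket.tcp closes the socket itself and returns the value of the block.
  Socket.tcp(host, port, connect_timeout: timeout_sec) { true }
rescue SystemCallError, SocketError
  # Connection refused, timed out, unknown host and similar failures.
  false
end

# Usage, after starting "firefox -jssh":
#   puts port_open?("localhost", JSSH_PORT) ? "jssh is listening" : "jssh is not reachable"
```

An open port only shows that something is listening there; whether jssh then stays alive under load still has to be checked by actually driving the browser.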

Let's start writing a hotel parser for travelocity.com. As an example, we will collect the room prices of all hotels for the destination New York, NY, USA for today. We will work with FireWatir on Ubuntu 10.04.

Launch the browser and load the page with the form:

    require "rubygems"
    require "firewatir"

    ff = FireWatir::Firefox.new
    ff.goto("http://www.travelocity.com/Hotels")

Fill in the form with the necessary values and submit it:

    # set is Watir's standard way to fill a text field
    ff.text_field(:id, "HO_to").set("New York, NY, USA")
    ff.text_field(:id, "HO_fromdate").set(Time.now.strftime("%m/%d/%Y"))
    # Time.tomorrow is not core Ruby, so add one day in seconds instead
    ff.text_field(:id, "HO_todate").set((Time.now + 24 * 60 * 60).strftime("%m/%d/%Y"))
    ff.form(:name, "formHO").submit

Wait for the page to finish loading:

    ff.wait_until { ff.div(:id, "resultsList").div(:class, "module").exists? }

wait_until is a very important instruction. Submitting the form triggers several redirects followed by an AJAX request. You need to wait until the final page has loaded and only AFTER that work with the DOM tree. How do you know that the page has loaded? Look at which elements appear on the page after the AJAX call. In our case, after the request to /pub/gwt/hotel/esf/hotelresultlist.gwt-rpc, several <div class="module"> elements appear inside resultsList, so we wait until they show up. Note that some commands, such as text_field and submit, already include this kind of wait, so wait_until is not needed before them.
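The idea behind such a wait is plain polling: re-check a condition until it holds or a deadline passes. The sketch below is not Watir's implementation; `wait_for` and its `timeout` and `interval` parameters are hypothetical names chosen for illustration.

```ruby
# Hypothetical helper, not part of Watir: calls the block repeatedly until it
# returns a truthy value, and raises if that does not happen within `timeout` seconds.
def wait_for(timeout: 30, interval: 0.5)
  # A monotonic clock is not affected by system time adjustments.
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# Usage with a browser object such as ff from the article:
#   wait_for(timeout: 60) { ff.div(:id, "resultsList").div(:class, "module").exists? }
```

The important design choice is what to poll for: an element that appears only after the last AJAX response, not one that is already present in the initial page.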

Now we go through the result pages:

    while true do
      ff.wait_until { ff.div(:id, "resultsList").div(:class, "module").exists? }
      ...
      next_link = ff.div(:id, "resultcontrol-top").link(:text, "Next")
      if next_link.exists? then next_link.click else break end
    end

The ellipsis in the code marks the place where the data is actually extracted. It is tempting to use watir here as well, for example to iterate over all the divs in resultsList like this:

    ff.div(:id, "resultsList").divs.each do |div|
      next if div.class_name != "module"
      ...
    end

And pull the hotel name and price out of each div:

    m = div.h2(:class, "property-name").html.match(/propertyId=(\d+)[^<>]*>([^<>]*)<\/a[^<>]*>/)
    data["id"] = m[1] unless m.nil?
    data["name"] = m[2] unless m.nil?
    data["price"] = div.h3(:class, "price").text

But you should not do this. Every watir call that touches a DOM element is an extra request to the browser, and each one takes about a second. It is much more efficient to spend that same second pulling out the whole DOM at once and then parse it instantly with ordinary regular expressions:

    ff.div(:id, "resultsList").html.split(/<div[^<>]*class\s*=\s*["']?module["']?[^<>]*>/).each do |str|
      m = str.match(/<a[^<>]*propertyId=(\d+)[^<>]*>([\s\S]*?)<\/a[^<>]*>/)
      data["id"] = m[1] unless m.nil?
      data["name"] = m[2] unless m.nil?
      m = str.match(/<h3[^<>]*class\s*=\s*["']?price["']?[^<>]*>([\s\S]*?)<\/h3[^<>]*>/)
      data["price"] = m[1] unless m.nil?
    end

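To try this approach without a browser, the same regular expressions can be run against a string. This is a sketch only: the sample markup is invented for illustration and is much simpler than the real page, and the `parse_hotels` helper (which collects one hash per hotel instead of reusing a single `data` variable) is an assumption, not code from the article.

```ruby
# Invented sample markup; the real travelocity.com HTML is more complex.
SAMPLE_HTML = <<~'HTML'
  <div id="resultsList">
  <div class="module"><h2 class="property-name"><a href="/hotel?propertyId=101&x=1">Hotel Alpha</a></h2><h3 class="price">$120</h3></div>
  <div class="module"><h2 class="property-name"><a href="/hotel?propertyId=202">Hotel Beta</a></h2><h3 class="price">$95</h3></div>
  </div>
HTML

# The same patterns as in the article's snippet.
MODULE_RE = /<div[^<>]*class\s*=\s*["']?module["']?[^<>]*>/
LINK_RE   = /<a[^<>]*propertyId=(\d+)[^<>]*>([\s\S]*?)<\/a[^<>]*>/
PRICE_RE  = /<h3[^<>]*class\s*=\s*["']?price["']?[^<>]*>([\s\S]*?)<\/h3[^<>]*>/

# Hypothetical helper: returns one hash per hotel block found in the markup.
def parse_hotels(html)
  html.split(MODULE_RE).map do |chunk|
    link = chunk.match(LINK_RE)
    next unless link # skips the chunk before the first module div
    price = chunk.match(PRICE_RE)
    { "id" => link[1], "name" => link[2].strip, "price" => price && price[1].strip }
  end.compact
end

# Usage:
#   parse_hotels(SAMPLE_HTML).each { |hotel| puts hotel.inspect }
```

With a live browser, the input would be the string returned by `ff.div(:id, "resultsList").html`, fetched once per results page.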
I advise using watir only where it is necessary: filling in and submitting forms, waiting until the browser has executed the JS code, and then getting the final HTML. Yes, reading element values through watir seems more reliable than parsing a stream of markup without any DOM structure, and pulling out the contents of a div that may contain other divs takes a hard-to-read regular expression. But it is still much faster. If there are many such divs, the simplest solution is a small recursive function that splits the whole markup by tag nesting level. I wrote something like that in one of my PHP classes.
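The author's PHP class is not shown, so what follows is not that implementation. It is a minimal Ruby sketch of the nesting-level idea that uses an iterative depth counter instead of recursion. The helper name `inner_div` is hypothetical, and the sketch assumes reasonably well-formed markup with no self-closing divs.

```ruby
# Hypothetical helper: returns the inner HTML of the first <div> whose opening
# tag matches open_re, counting nested <div>...</div> pairs to find its close.
def inner_div(html, open_re)
  start = html.match(open_re)
  return nil unless start

  tag_re = /<div\b[^<>]*>|<\/div\s*>/i
  depth = 1               # we are inside the matched opening tag
  pos = start.end(0)
  while (m = tag_re.match(html, pos))
    depth += m[0].start_with?("</") ? -1 : 1
    # Depth zero means this closing tag belongs to the div we started from.
    return html[start.end(0)...m.begin(0)] if depth.zero?
    pos = m.end(0)
  end
  nil # unbalanced markup: the matching </div> was never found
end

# Usage:
#   inner_div(page_html, /<div[^<>]*id="resultsList"[^<>]*>/)
```

A simple pattern finds only the opening tag, and the depth counter does the work that a single regular expression cannot do reliably for nested tags.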

Source: https://habr.com/ru/post/109835/

