How hard can it be to parse a site? Webdriver API basics

Looking for housing, product information, job vacancies or new acquaintances, comparing a company's products with its competitors', researching reviews online: all of these tasks need data from websites.

The Internet holds a lot of useful information, and the ability to extract data from it helps both in life and at work. In this article you will learn how to get information using the webdriver API. I will give two examples whose code is available on GitHub. At the end of the article there is a screencast showing how the program drives the browser.

With the help of a web driver, a program or script controls your browser: it types text, follows links, clicks on elements, extracts data from the page and its elements, takes screenshots of the site, and so on.
For webdriver to work you need two components: a server for the browser that speaks the protocol, and a client in the form of a library for your programming language.

You can use the webdriver API from different programming languages and virtual machines: there are official webdriver clients for Java, C#, Ruby, Python and Javascript (Node), as well as community clients for Perl, Perl 6, PHP, Haskell, Objective-C, Javascript, R, Dart and Tcl.

Webdriver is currently a W3C standard that is still being worked on. The webdriver API first appeared in the Selenium project for testing purposes, as an evolution of the Selenium RC API.
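To make the protocol less abstract, here is a minimal Java sketch of the kind of HTTP requests a client library sends on your behalf. The endpoint shapes follow the W3C specification as I understand it. The server address, the port and the session id "abc123" are assumptions made up for illustration, and an older server may expect a different body for the new-session call. The requests are only built and printed, never sent, and the sketch needs Java 11 or later for java.net.http.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch only: builds (but never sends) requests shaped like the ones a
// webdriver client library issues against a webdriver server.
final class WireProtocolSketch {
    // Assumption: a webdriver server listening locally; the port is arbitrary.
    static final String SERVER = "http://localhost:8910";

    // Minimal JSON string quoting, enough for this illustration.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    static HttpRequest post(String path, String json) {
        return HttpRequest.newBuilder(URI.create(SERVER + path))
                .header("Content-Type", "application/json; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    // POST /session: ask the server to start a browser session.
    static HttpRequest newSession(String browserName) {
        return post("/session",
                "{\"capabilities\":{\"alwaysMatch\":{\"browserName\":" + quote(browserName) + "}}}");
    }

    // POST /session/{id}/url: open a page in that session.
    static HttpRequest navigate(String sessionId, String url) {
        return post("/session/" + sessionId + "/url", "{\"url\":" + quote(url) + "}");
    }

    // POST /session/{id}/element: look up one element by XPath.
    static HttpRequest findElement(String sessionId, String xpath) {
        return post("/session/" + sessionId + "/element",
                "{\"using\":\"xpath\",\"value\":" + quote(xpath) + "}");
    }

    // DELETE /session/{id}: end the session.
    static HttpRequest quit(String sessionId) {
        return HttpRequest.newBuilder(URI.create(SERVER + "/session/" + sessionId))
                .DELETE()
                .build();
    }

    public static void main(String[] args) {
        String sessionId = "abc123"; // hypothetical; a real id comes back from the new-session call
        HttpRequest[] conversation = {
                newSession("firefox"),
                navigate(sessionId, "http://example.org"),
                findElement(sessionId, "//a"),
                quit(sessionId)
        };
        for (HttpRequest request : conversation) {
            System.out.println(request.method() + " " + request.uri().getPath());
        }
    }
}
```

A client library wraps exactly this kind of conversation in methods such as get, findElement and quit, so in everyday code you never build the requests by hand.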

The server is a separate process that "understands" the protocol. This process controls your browser. The following drivers are available:

Two drivers stand out from this list:

The essence of technology ...

Minimum theory for further work. Sequence of actions in the client

So, our choice is PhantomJS. It is a full-fledged browser controlled through the webdriver protocol. You can run many of its processes at the same time, it needs no graphics subsystem, and JavaScript is fully executed inside it (unlike htmlunit with its restrictions). If you write scripts in JavaScript and pass them as a startup parameter, PhantomJS can execute them without the webdriver protocol, and you can even debug them from another browser.

What is described below refers mostly to the Java/Groovy API. In clients for other languages the set of functions and parameters should be similar.

Get a server with a web driver

String phantomJsPath = PhantomJsDowloader.getPhantomJsPath() 

Downloads PhantomJS from a Maven repository, unpacks it and returns the path to the browser executable. To use it, add the Maven library com.github.igor-suhorukov:phantomjs-runner:1.1 to your project.

You can skip this step if you have previously installed a web driver for your browser in the local file system.

Create a client, connect to the server

 WebDriver driver = new PhantomJSDriver(settings) 

Configures the port for interaction via webdriver protocol, starts the phantomjs process and connects to it.

Open the desired page in the browser


The driver's get method, called as driver.get(randomUrl) in Example 1 below, opens the page at the given address in the browser.

Get the information of interest

 WebElement leftmenu = driver.findElement(By.id("leftmenu"))
 List<WebElement> linkList = leftmenu.findElements(By.tagName("a"))

Both the driver instance and any element obtained from it have two useful methods: findElement and findElements. The first returns an element or throws NoSuchElementException if nothing matches. The second returns a collection of elements, which is empty if nothing matches.

Elements can be selected with the following org.openqa.selenium.By queries: id, name, className, tagName, cssSelector, linkText, partialLinkText and xpath.

I will mostly use id, tagName and xpath. If you are not familiar with XPath, I recommend looking at examples or articles first, and only then moving on to the specification.
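For a first taste of XPath without starting a browser, here is a small sketch that runs a few expressions with the XPath engine bundled in the JDK. The page fragment is hypothetical and written as well-formed XML so that the JDK parser accepts it. Real pages are messier, and a browser driver evaluates the expressions against the live DOM instead.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class XPathAxisDemo {
    // Hypothetical page fragment, invented for this illustration.
    static final String PAGE =
            "<div id='content'>"
          + "<h3>Demo project</h3>"
          + "<table><tbody>"
          + "<tr><td>HomePage</td><td>http://example.org</td></tr>"
          + "<tr><td>License</td><td>Apache</td></tr>"
          + "</tbody></table>"
          + "<a href='/next'>Next</a>"
          + "</div>";

    // Returns the string value of the first node matched by the expression.
    static String eval(String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(PAGE)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Attribute of a link anywhere below the element with id "content"
        System.out.println(eval("//div[@id='content']//a/@href"));               // prints /next
        // The cell that follows the cell whose text is "License"
        System.out.println(eval("//td[text()='License']/following-sibling::*")); // prints Apache
        // First cell of the first row of the table that follows the heading
        System.out.println(eval("//h3/following-sibling::table/tbody/tr[1]/td[1]")); // prints HomePage
    }
}
```

The following-sibling axis is what makes "label cell, value cell" layouts easy to read: you locate the label by its text and step sideways to the value.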

Perform actions on elements and on the page

You can do the following with an element:


The driver's getScreenshotAs method, used in Example 1 below, takes a snapshot of the browser window. As a useful addition to the standard snapshot function I recommend the aShot library: it can take a snapshot of just one element in the window and compare elements as images.

Screenshots can be obtained as a Base64 string, a byte array or a file (OutputType.BASE64, OutputType.BYTES and OutputType.FILE).

Close browser connection


Calling driver.quit(), as Example 1 below does, closes the protocol connection and, in our case, stops the phantomjs process.

Example 1: A Groovy script that walks social network profiles

Run the command:

 java -jar groovy-grape-aether- crawler.groovy http://??.com/catalog.php 

Links to the necessary files to run:

The script prints to the console the path to an HTML file built from information taken from the social network. On that page you will see the user's name, the time of their last visit to the social network and a screenshot of the whole profile page.
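The trick that lets a single HTML file carry the screenshot is a data URI: the image, already encoded as Base64, is placed directly in the src attribute. Below is a minimal Java sketch of that idea, independent of webdriver. The four bytes stand in for a real PNG, the class and method names are mine, and the HTML escaping of the text fields is an extra precaution that the script below does not take.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

final class ScreenshotReport {
    // Escapes the characters that would otherwise be read as markup.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Builds a page that embeds a Base64-encoded PNG through a data URI.
    static String html(String name, String lastVisited, String pngBase64) {
        return "<html><head><meta http-equiv=\"content-type\" content=\"text/html; charset=UTF-8\"></head><body>"
             + "<p>" + escape(name) + "</p><p>" + escape(lastVisited) + "</p>"
             + "<img alt=\"Embedded Image\" src=\"data:image/png;base64," + pngBase64 + "\">"
             + "</body></html>";
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes (the start of the PNG signature), not a real image.
        byte[] fakePng = {(byte) 0x89, 'P', 'N', 'G'};
        String base64 = Base64.getEncoder().encodeToString(fakePng);
        System.out.println(base64); // prints iVBORw==

        Path report = Files.createTempFile("report", ".html");
        Files.write(report, html("Some User", "today at 10:00", base64).getBytes(StandardCharsets.UTF_8));
        System.out.println("html " + report + " created");
    }
}
```

With a real driver, the Base64 string would come from the screenshot call rather than from hand-made bytes.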

Why run the script with groovy-grape-aether-
I recently described the groovy-grape-aether build of Groovy in the article "Street Magic in Scripts, or what connects Groovy, Ivy and Maven?". Its main difference from groovy-all-2.4.5.jar is that the Grape mechanism works with repositories more correctly than it does with Ivy, by using the Aether library, and that the build already contains the classes for accessing repositories.

 package com.github.igorsuhorukov.phantomjs

 @Grab(group='commons-io', module='commons-io', version='2.2')
 import org.apache.commons.io.IOUtils
 @Grab(group='com.github.detro', module='phantomjsdriver', version='1.2.0')
 import org.openqa.selenium.*
 import org.openqa.selenium.phantomjs.PhantomJSDriver
 import org.openqa.selenium.phantomjs.PhantomJSDriverService
 import org.openqa.selenium.remote.DesiredCapabilities
 @Grab(group='com.github.igor-suhorukov', module='phantomjs-runner', version='1.1')
 import com.github.igorsuhorukov.phantomjs.PhantomJsDowloader

 public class Crawler {

     public static final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"

     public static void run(String baseUrl) {
         // Download PhantomJS if necessary and get the path to its executable
         def phantomJsPath = PhantomJsDowloader.getPhantomJsPath()

         DesiredCapabilities settings = new DesiredCapabilities()
         settings.setJavascriptEnabled(true)
         settings.setCapability("takesScreenshot", true)
         settings.setCapability("userAgent", USER_AGENT)
         settings.setCapability(PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY, phantomJsPath)

         // Start the phantomjs process and connect to it
         WebDriver driver = new PhantomJSDriver(settings)

         def randomUrl = null
         def lastVisited = null
         def name = null
         boolean pass = true
         // Keep trying random pages until one contains the profile elements
         while (pass) {
             try {
                 randomUrl = getUrl(driver, baseUrl)
                 driver.get(randomUrl)
                 def titleElement = driver.findElement(By.id("title"))
                 lastVisited = titleElement.findElement(By.id("profile_time_lv")).getText()
                 name = titleElement.findElement(By.tagName("a")).getText()
                 pass = false
             } catch (org.openqa.selenium.NoSuchElementException e) {
                 // Fully qualified, because java.util has an exception with the same name
                 System.out.println(e.getMessage() + ". Try again.")
             }
         }

         // Save the name, last visit time and a screenshot into one HTML file
         String screenshotAs = driver.getScreenshotAs(OutputType.BASE64)
         File resultFile = File.createTempFile("phantomjs", ".html")
         OutputStreamWriter streamWriter = new OutputStreamWriter(new FileOutputStream(resultFile), "UTF-8")
         IOUtils.write("""<html><head><meta http-equiv="content-type" content="text/html; charset=UTF-8"></head><body>
 <p>${name}</p><p>${lastVisited}</p>
 <img alt="Embedded Image" src="data:image/png;base64,${screenshotAs}"></body>
 </html>""", streamWriter)
         IOUtils.closeQuietly(streamWriter)
         println "html ${resultFile} created"
         driver.quit()
     }

     static String getUrl(WebDriver driver, String baseUrl) {
         driver.get(baseUrl)
         def elements = driver.findElements(By.xpath("//div[@id='content']//a"))
         // Pick a uniformly random link: nextInt(n) returns an index in 0..n-1
         def element = elements.get(new Random().nextInt(elements.size()))
         String randomUrl = element.getAttribute("href")
         // Catalog pages only list further links, so descend until a profile link is found
         randomUrl.contains("catalog") ? getUrl(driver, randomUrl) : randomUrl
     }
 }

 Crawler.run(args[0])

Seasoned Groovy developers will notice that Geb is a more elegant solution. But since Geb hides all the work with webdriver behind its DSL, it does not suit our educational purposes. From an aesthetic point of view, I agree with you!

Example 2: A Java program that extracts project data from java-source

The example is available here. Java 8 is needed to run it, since it uses streams and try-with-resources.

 git clone https://github.com/igor-suhorukov/java-webdriver-example.git
 cd java-webdriver-example
 mvn clean package -Dexec.args="http://java-source.net"

In this example I use XPath and its axes to extract information from the page. As an example, here is a fragment of the Project class.

 WebElement main = driver.findElement(By.id("main"));
 name = main.findElement(By.tagName("h3")).getText();
 description = main.findElement(By.xpath("//h3/following-sibling::table/tbody/tr/td[1]")).getText();
 link = main.findElement(By.xpath("//td[text()='HomePage']/following-sibling::*")).getText();
 license = main.findElement(By.xpath("//td[text()='License']/following-sibling::*")).getText();
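One XPath convention is worth keeping in mind when an expression is evaluated from an element rather than from the driver: a path that starts with // is absolute and searches the whole document, while .// stays inside the current element. Webdriver follows the same standard convention. The sketch below demonstrates it with the JDK's own XPath engine on a hypothetical two-block page, so no browser is needed. Whether the difference matters on a real page depends on what else that page contains.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class RelativeXPathDemo {
    // Hypothetical page: a heading in the sidebar and another one in the main block.
    static final String PAGE =
            "<body>"
          + "<div id='sidebar'><h3>Sponsored links</h3></div>"
          + "<div id='main'><h3>Demo project</h3></div>"
          + "</body>";

    // Evaluates the expression with the element whose id is "main" as the context node.
    static String fromMain(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(PAGE)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node main = (Node) xpath.evaluate("//div[@id='main']", doc, XPathConstants.NODE);
            return xpath.evaluate(expression, main);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Absolute path: starts at the document root even though the context is "main"
        System.out.println(fromMain("//h3"));  // prints Sponsored links
        // Relative path: searches only below "main"
        System.out.println(fromMain(".//h3")); // prints Demo project
    }
}
```

If a page has only one matching element, both forms return the same result; the relative form simply makes the intent explicit.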

Part of the data extracted from the site: the projects.xml file, the result of running the program.
This is how the same example works with the ChromeDriver driver (org.seleniumhq.selenium:selenium-chrome-driver:2.48.2). Unlike with PhantomJS, in this case you can watch what happens while the program runs: links being followed and pages being rendered.


The webdriver API can be used from different programming languages. Writing a script or program that controls the browser and extracts information from pages is quite simple: data is conveniently retrieved by id, tag name, CSS selector or XPath expression. You can take pictures of the page and of individual elements on it. Starting from the examples and the documentation, you can develop scripts of almost any complexity for working with a site. For development and debugging it is better to use a regular browser with its web driver, while PhantomJS is better suited to fully automatic operation.

Good luck in extracting open and useful information from the web!

