“Parse the site”: just six months ago, that phrase filled me with despair. Familiar problems with setting up PhantomJS or fussing with Selenium immediately rushed through my head. The thought of possibly having to spoof the user agent, handle pagination, and perform other actions during parsing made me postpone the task indefinitely...
But all that changed when I met Goose. The world of parsing sparkled with new colors. Below the cut I want to show a few simple examples that can help you parse difficult sites.
By the way, having written the parser, Goose decided to make a movie about it. For now, you can enjoy the trailer.
How to set Goose loose on a site
Everyone knows that Goose likes to nip: grass, grandma, other geese... and, of course, sites. To get Goose to put his own business aside and nip a site for you, you just need to show him the way.
At the moment, Goose can parse:
inside Node.js using PhantomJS;
right in the browser (ideal if you are writing a browser plugin);
using Selenium.
Each method has its own advantages and disadvantages. For example, PhantomJS can run on a server, but debugging in it is not very convenient. Running Goose in the browser requires an external tool that puts Goose onto the site. Selenium is a fairly universal solution, but for now Goose is only learning how to use it.
So, to run Goose on a site, you first need to choose and create a habitat, that is, an environment. Possible environments:
PhantomEnvironment
BrowserEnvironment
SeleniumEnvironment
In this article I will use PhantomEnvironment, as it is the most mature one at the moment.
import { PhantomEnvironment, Parser } from 'goose-parser';
const env = new PhantomEnvironment({ url: 'http://www.gooseplanet.ru/' });
const parser = new Parser({env});
The environment defines the entry point for Goose: the URL of the site's initial page.
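To make the idea of interchangeable environments a little more concrete, here is a minimal sketch in plain JavaScript. It is not how Goose is implemented: FakeEnvironment, TinyParser, their methods, and the page contents are all hypothetical names and data invented for illustration, and the "page" is an in-memory object rather than a real browser. A real environment would also work asynchronously; the sketch is synchronous to stay short.

```javascript
// Hypothetical sketch: these are NOT goose-parser's real classes.
// An "environment" hides where the page lives; the parser only calls its methods.
class FakeEnvironment {
  constructor({ url, pages }) {
    this.url = url;     // entry point, like the url option in the article
    this.pages = pages; // in-memory stand-in for a real browser (made-up data)
  }

  // Returns the text stored under a selector on the entry page, or null.
  evaluate(selector) {
    const page = this.pages[this.url] || {};
    return selector in page ? page[selector] : null;
  }
}

class TinyParser {
  constructor({ env }) {
    this.env = env; // any object with an evaluate(selector) method will do
  }

  // rules maps a result field name to a selector.
  parse(rules) {
    const result = {};
    for (const [name, selector] of Object.entries(rules)) {
      result[name] = this.env.evaluate(selector);
    }
    return result;
  }
}

const fakeEnv = new FakeEnvironment({
  url: 'http://www.gooseplanet.ru/',
  pages: {
    'http://www.gooseplanet.ru/': { h1: 'Goose Planet', '.motto': 'Honk' },
  },
});
const tinyParser = new TinyParser({ env: fakeEnv });
const parsed = tinyParser.parse({ title: 'h1', motto: '.motto', missing: '.nope' });
console.log(parsed);
```

The point of the split is that the parser never needs to know whether the page sits in PhantomJS, a browser tab, or Selenium; swapping the environment object is enough.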
Before parsing
Often, before we start parsing, we need to perform some actions on the page, for example, run a search on a goose dating site. Goose will do the searching for you: just ask.
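One common way to describe such pre-parse steps is a declarative list of actions that are executed in order. The sketch below illustrates only that idea: the action types, the field names (type, scope, text), the selectors, and the runActions helper are assumptions made up for this example and are not taken from the Goose documentation. The "page" is a fake object that simply records what was done to it.

```javascript
// Hypothetical sketch: action names and fields are invented for illustration.
const actions = [
  { type: 'type', scope: 'input.search', text: 'goose' },
  { type: 'click', scope: 'button.find' },
  { type: 'wait', scope: '.results' },
];

// A fake page that just records what happened to it.
function createFakePage() {
  return { log: [], fields: {} };
}

// Each handler receives the page and one action description.
const handlers = {
  type(page, action) {
    page.fields[action.scope] = action.text;
    page.log.push(`typed "${action.text}" into ${action.scope}`);
  },
  click(page, action) {
    page.log.push(`clicked ${action.scope}`);
  },
  wait(page, action) {
    page.log.push(`waited for ${action.scope}`);
  },
};

// Runs the actions strictly in order, failing loudly on an unknown type.
function runActions(page, actionList) {
  for (const action of actionList) {
    if (!Object.prototype.hasOwnProperty.call(handlers, action.type)) {
      throw new Error(`Unknown action type: ${action.type}`);
    }
    handlers[action.type](page, action);
  }
  return page;
}

const searchedPage = runActions(createFakePage(), actions);
console.log(searchedPage.log.join('\n'));
```

Keeping the steps as data rather than code makes them easy to store, reorder, and reuse across sites.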
Of course, this article only introduces Goose and does not describe all of its features. Details about what Goose can do are in the documentation. Still, let's list some of the tricks Goose keeps in reserve for difficult situations:
Goose is not afraid of pagination: he can scroll pages or click links to load new content. You can even teach Goose custom pagination;
Goose can walk between pages, with dignity and grace;
Goose can perform the necessary keyboard and mouse actions to get at a specific piece of information: he has very nimble feet;
Goose can transform the results he obtains;
Goose is a smart animal: you can teach him new parsing tricks and he will start using them, since the Goose parser API is easy to extend;
Goose is a social animal: thanks to his love for Node.js, he can easily live on a goose farm and parse millions of sites at a time;
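Two items from this list, pagination and result transformation, can be sketched as a simple loop. This is a generic illustration under stated assumptions, not Goose's pagination API: fetchPage, parseAllPages, the maxPages and transform options, and the fake site data are all hypothetical. Here fetchPage stands in for "click the next link or scroll, then re-read the page".

```javascript
// Hypothetical sketch: none of these names come from goose-parser.
// Made-up data standing in for three pages of a paginated listing.
const fakeSite = [
  { items: [' Greylag goose ', 'Snow goose'], hasNext: true },
  { items: ['Canada goose'], hasNext: true },
  { items: [' Swan goose'], hasNext: false },
];

// Stand-in for loading page number `index`; returns undefined past the end.
function fetchPage(index) {
  return fakeSite[index];
}

// Collects items page by page until there is no next page or the limit is hit.
// `transform` is applied to every item before it is stored.
function parseAllPages({ maxPages = 10, transform = (item) => item } = {}) {
  const results = [];
  for (let index = 0; index < maxPages; index += 1) {
    const page = fetchPage(index);
    if (!page) break;
    results.push(...page.items.map(transform));
    if (!page.hasNext) break;
  }
  return results;
}

// Normalize names: strip stray whitespace and lowercase them.
const geese = parseAllPages({ transform: (name) => name.trim().toLowerCase() });
console.log(geese);
```

The maxPages limit is there so that a site with endless pagination cannot keep the loop running forever.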