
How to parse the Internet the goose way

“Parse the site” is a phrase that filled me with despair only six months ago. Familiar problems with setting up PhantomJS or fussing with Selenium immediately rushed through my head. The thought of having to spoof the user agent, handle pagination and perform other actions during parsing made me put the task off indefinitely...

But all of that changed when I met Goose. The world of parsing took on new colors. Below I want to show a few simple examples that can help with parsing difficult sites.

By the way, having written the parser, Goose decided to make a film about it. For now, only the trailer is out.




How to set Goose on a site


Everyone knows that Goose likes to pinch things: grass, grannies, lady geese... and, of course, websites. To get Goose to drop what he is doing and pinch a site for you, you just need to show him the way.

At the moment, Goose can parse in three ways: in PhantomJS, in the browser, and through Selenium.


Each of these methods has its own advantages and disadvantages. For example, PhantomJS can run on a server but is not very convenient to debug, while running Goose in the browser requires an external tool that puts Goose on the site. Selenium is a fairly universal solution, but Goose is still only learning to use it.

So, to run Goose on a site, you first need to choose and create an environment for him to live in. Possible environments:



In this article I will use PhantomEnvironment, the most mature one at the moment.

    import { PhantomEnvironment, Parser } from 'goose-parser';

    const env = new PhantomEnvironment({
        url: 'http://www.gooseplanet.ru/'
    });
    const parser = new Parser({ env });


The environment defines Goose's entry point: the URL of the site's initial page.

Before parsing


Often we need to perform some actions on the page before parsing can start, for example run a search on a goose dating site. Goose will do the searching for you; just ask.

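    const actions = [
        {
            type: 'type', // action type: type text into a field
            text: '', // the text to type
            scope: '.field[name=search]' // selector of the input field
        },
        {
            type: 'click',
            scope: 'button[type=submit]',
            waitForPage: true // wait for the new page to load after the click
        }
    ];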


Declarativeness is everything


Goose is declarative. He is simple and concise.
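One practical upside of describing actions as plain data is that they can be inspected before anything touches a real page. Here is a minimal sketch of that idea. The `describeActions` helper is hypothetical and not part of goose-parser, it only knows the two action types shown in this article, and the search text `'geese'` is a placeholder.

```javascript
// Hypothetical helper, not part of goose-parser: turns a declarative
// action list into readable steps so it can be reviewed before a real run.
function describeActions(actions) {
  return actions.map((action, index) => {
    const step = `${index + 1}.`;
    switch (action.type) {
      case 'type':
        return `${step} type "${action.text}" into ${action.scope}`;
      case 'click': {
        const suffix = action.waitForPage ? ', then wait for the next page' : '';
        return `${step} click ${action.scope}${suffix}`;
      }
      default:
        // Only the two action types shown in this article are covered.
        throw new Error(`Unknown action type: ${action.type}`);
    }
  });
}

const actions = [
  // 'geese' is a placeholder search query chosen for this sketch.
  { type: 'type', text: 'geese', scope: '.field[name=search]' },
  { type: 'click', scope: 'button[type=submit]', waitForPage: true }
];

console.log(describeActions(actions).join('\n'));
```

A typo in an action type shows up here as an immediate error, rather than halfway through a run in a headless browser.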

Time to pinch


So, now we know how to find the geese. It's time to get their addresses. Suppose the markup of the search results looks like this:
    <ul class="goose-babes">
        <li class="goose-babe">
            <img src="https://habrastorage.org/getpro/habr/post_images/3df/cbd/088/3dfcbd088e8f5b4ba060a73f8d5e3788.jpg" alt="" class="photo">
            <span class="name"></span>
            <div>
                <address> №5</address>
            </div>
        </li>
        <li class="goose-babe">
            <img src="https://habrastorage.org/getpro/habr/post_images/a48/283/87b/a4828387bbe8658749cc7a42d53ddcd9.jpg" alt="" class="photo">
            <span class="name"></span>
            <div>
                <address></address>
            </div>
        </li>
    </ul>


Point Goose in the right direction by writing rules that say where the data sits in this markup:
    const rules = {
        scope: '.goose-babe',
        collection: [[
            { name: 'name', scope: '.name' },
            { name: 'address', scope: 'address' }
        ]]
    };
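One way to read these rules: the outer scope selects every list item, and each inner rule selects one field inside that item. The sketch below mimics this reading on a hand-made stand-in for the page. It is not how goose-parser is implemented. The `page` object, the `applyRules` function and all sample values are assumptions made for illustration.

```javascript
// Toy stand-in for the page: each element matched by '.goose-babe' is
// modelled as a map from child selector to that child's text.
// All values are made up for this sketch.
const page = {
  '.goose-babe': [
    { '.name': 'First goose', 'address': 'Pond 5' },
    { '.name': 'Second goose', 'address': 'Meadow 2' }
  ]
};

const rules = {
  scope: '.goose-babe',
  collection: [[
    { name: 'name', scope: '.name' },
    { name: 'address', scope: 'address' }
  ]]
};

// Hypothetical evaluator: one result object per element matched by the
// outer scope, one property per inner field rule.
function applyRules(rules, page) {
  const [fieldRules] = rules.collection; // unwrap the inner array of field rules
  const elements = page[rules.scope] || [];
  return elements.map((element) => {
    const row = {};
    for (const field of fieldRules) {
      row[field.name] = element[field.scope] ?? null; // null when nothing matched
    }
    return row;
  });
}

console.log(applyRules(rules, page));
```

The rules themselves never mention loops or DOM traversal, and that is the whole point of the declarative style.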


Launch the Goose!!


 parser.parse({ actions, rules }).then(console.log); 


And we get the result:
    [
        {
            "name": "",
            "address": " №5"
        },
        {
            "name": "",
            "address": ""
        }
    ]
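Note that the first address comes back with a leading space, exactly as it appears in the markup. Below is a minimal sketch of tidying such records in plain JavaScript after parsing. goose-parser may well have its own means for this, so check its documentation. The `cleanRecord` function is hypothetical, and the names in the sample data are placeholders.

```javascript
// Sample data in the same shape as the result above; names are placeholders.
const results = [
  { name: ' First goose ', address: ' №5' },
  { name: 'Second goose', address: '' }
];

// Hypothetical post-processing step: trim every string field and
// replace empty strings with null so that missing data is explicit.
function cleanRecord(record) {
  const cleaned = {};
  for (const [key, value] of Object.entries(record)) {
    const text = typeof value === 'string' ? value.trim() : value;
    cleaned[key] = text === '' ? null : text;
  }
  return cleaned;
}

const cleanedResults = results.map(cleanRecord);
console.log(cleanedResults);
```

Turning empty strings into null is a design choice of this sketch. It makes "nothing was found" easy to tell apart from real text further down the pipeline.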


And this is just the beginning.


Of course, this article is only an introduction to Goose and does not describe all of his features. Details of what Goose can do are in the documentation. Still, let's list some of the tricks Goose keeps in reserve for difficult situations:



Give Goose a star, and he will take over the world for you: github.com/redco/goose-parser

Source: https://habr.com/ru/post/271425/

