Using puppeteer, you can create programs for automatic data collection from websites, so-called web scrapers, that imitate the actions of a regular user. In such scenarios, a browser without a user interface, so-called "Headless Chrome", can be used. With puppeteer you can also control a browser that is running normally, which is especially useful when debugging programs.

The author of this material aimed to make the article interesting to the widest possible audience of programmers: both web developers who already have some experience with puppeteer and those who are encountering the concept of "Headless Chrome" for the first time should find something useful here.
To get started, install puppeteer. Along with it, the current version of Chromium will be installed, which is guaranteed to work with the API of interest to us. You can do this with the following command:

```shell
npm install --save puppeteer
```
To get acquainted with the basics of puppeteer, let's take a simple example. It repeats, with minor changes, an example from the library documentation. The code we are going to look at now takes a screenshot of a given web page.

Create a file test.js and put the following into it:

```javascript
const puppeteer = require('puppeteer');

async function getPic() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://google.com');
  await page.screenshot({path: 'google.png'});
  await browser.close();
}

getPic();
```
Let's analyze this example.

```javascript
const puppeteer = require('puppeteer');
```

In this line we connect the puppeteer library as a dependency.

```javascript
async function getPic() {
  ...
}
```

Next comes the main function, getPic(). This function contains the code that automates the work with the browser.

```javascript
getPic();
```

Here we call the getPic() function, that is, execute it.
Note that the getPic() function is asynchronous: it is defined with the async keyword and uses the async/await construction from ES2017. Since getPic() is an asynchronous function, calling it returns a Promise object. Such objects are usually called "promises". When a function defined with the async keyword finishes and returns a value, the promise is either resolved (if the operation completes successfully) or rejected (if an error occurs).

Thanks to the async keyword in the function definition, we can prefix calls to other functions inside it with the await keyword. It suspends the execution of the function and waits for the corresponding promise to be resolved, after which the function continues. If you do not fully understand all this yet, just read on and everything will gradually fall into place.
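The promise mechanics described above can be observed without any browser at all. Here is a minimal sketch (the function names are invented for illustration) showing that an async function returns a promise, that return resolves it, and that throw rejects it:

```javascript
// A return inside an async function resolves the returned promise;
// a throw rejects it.
async function fetchGreeting(fail) {
  if (fail) {
    throw new Error('something went wrong');
  }
  return 'hello';
}

async function main() {
  // await suspends main() until the promise settles.
  const greeting = await fetchGreeting(false);
  console.log(greeting); // prints "hello"

  try {
    await fetchGreeting(true);
  } catch (err) {
    // A rejected promise surfaces as an exception under await.
    console.log('caught: ' + err.message); // prints "caught: something went wrong"
  }
}

main();
```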
Now let's analyze the code of the getPic() function.

```javascript
const browser = await puppeteer.launch();
```

Here we launch puppeteer. In fact, this means that we launch an instance of the Chrome browser and store a reference to it in the just-created browser constant. Since the await keyword is used in this line, execution of the main function is suspended until the corresponding promise is resolved. In this case, that means waiting either for the successful launch of the Chrome instance or for an error to occur.

```javascript
const page = await browser.newPage();
```
Here we open a new page in the controlled browser and store a reference to it in the page constant.

```javascript
await page.goto('https://google.com');
```
Using the page variable created in the previous line, we can command the page to navigate to the specified URL. In this example, we go to https://google.com. As in the previous lines, execution pauses until the operation completes.

```javascript
await page.screenshot({path: 'google.png'});
```
Here we ask puppeteer to take a screenshot of the current page, represented by the page constant. The screenshot() method takes an object as a parameter, in which you can specify the path where the screenshot should be saved in .png format. Again, the await keyword is used here, suspending the function until the operation completes.

```javascript
await browser.close();
```
We have reached the end of the getPic() function, and here we close the browser.

The script, stored in the test.js file, can be run using Node as follows:

```shell
node test.js
```
Now replace the line

```javascript
const browser = await puppeteer.launch();
```

with

```javascript
const browser = await puppeteer.launch({headless: false});
```

and run the script again with node test.js. By passing the {headless: false} object as a parameter when launching the browser, we can observe how the code controls the operation of Google Chrome.

By default, the screenshot covers only the visible part of the page. The size of the viewport can be changed with the following call:

```javascript
await page.setViewport({width: 1000, height: 500})
```

Here is the full example with these changes:

```javascript
const puppeteer = require('puppeteer');

async function getPic() {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://google.com');
  await page.setViewport({width: 1000, height: 500})
  await page.screenshot({path: 'google.png'});
  await browser.close();
}

getPic();
```
Now that we have looked at the basics of puppeteer, let's move on to a more complex example, in which we will collect data from web pages.

First, take a look at the puppeteer documentation. You will notice a huge number of methods that allow us not only to simulate mouse clicks on page elements, but also to fill out forms and read data from pages.

As before with the test.js file, create a scrape.js file and paste the following blank into it:

```javascript
const puppeteer = require('puppeteer');

let scrape = async () => {
  // actual scraping will go here...
  // return a value...
};

scrape().then((value) => {
  console.log(value); // output the result!
});
```
In the first line, we connect the puppeteer library. Next comes the scrape() function, to which we will add the scraping code below; this function will return some value. Finally, we call the scrape() function and work with what it returns. In this case, we simply output it to the console.

Let's test the scrape() function:

```javascript
let scrape = async () => {
  return 'test';
};
```
Run the script with node scrape.js. The word test should appear in the console. We have confirmed that the code works: the expected value ends up in the console. Now we can move on to web scraping. Start by filling scrape() with the following blank:

```javascript
let scrape = async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();

  await page.goto('http://books.toscrape.com/');
  await page.waitFor(1000);

  // scraping will go here...

  browser.close();
  return result;
};
```
Here we launch a browser instance with the headless parameter set to false, which allows us to observe what is happening, and open a new page:

```javascript
const browser = await puppeteer.launch({headless: false});
const page = await browser.newPage();
```

Then we navigate to the test site:

```javascript
await page.goto('http://books.toscrape.com/');
```

The call

```javascript
await page.waitFor(1000);
```

pauses the script for 1000 milliseconds, giving the page time to load. Finally,

```javascript
browser.close();
return result;
```

closes the browser and returns the result (the result constant will be filled in below).
In the puppeteer documentation you can find a method that allows you to simulate mouse clicks on the page:

```javascript
page.click(selector[, options])
```

Here selector <string> is a selector used to find the element that needs to be clicked. If several elements match the selector, the first one is clicked.

To find the selector of the element of interest (in this case, the image of the first book), right-click it in the browser and choose the Inspect command. The developer tools open with the Elements panel, which displays the page code, with the fragment corresponding to our element highlighted. Then click the button with three dots on the left and, in the menu that appears, choose the Copy → Copy selector command.

Now we can pass the copied selector to the click method and insert it into the program. Here is what it will look like:

```javascript
await page.click('#default > div > div > div > div > section > div:nth-child(2) > ol > li:nth-child(1) > article > div.image_container > a > img');
```
To extract data from the page, we will use the page.evaluate() method. This method allows you to use regular JavaScript DOM methods, like querySelector(), inside the page.

Call the page.evaluate() method and assign the value it returns to the result constant:

```javascript
const result = await page.evaluate(() => {
  // data extraction will go here
});
```
Right-click the title of the book and choose the Inspect command. In the Elements panel you can see that the title of the book is an ordinary first-level heading, h1. This element can be selected using the following code:

```javascript
let title = document.querySelector('h1');
```

Since we need the text contained in the element, we add the .innerText property. As a result, we arrive at the following construction:

```javascript
let title = document.querySelector('h1').innerText;
```
Inspecting the price in the same way shows that the element containing it has the price_color class. We can use this class to select the element and read the text contained in it:

```javascript
let price = document.querySelector('.price_color').innerText;
```
Now that both values are available, we return them from page.evaluate() as an object:

```javascript
return {
  title,
  price
}
```

Putting it together, the whole call looks like this:

```javascript
const result = await page.evaluate(() => {
  let title = document.querySelector('h1').innerText;
  let price = document.querySelector('.price_color').innerText;
  return {
    title,
    price
  }
});
```
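Note that return { title, price } relies on ES2015 shorthand property names: when a variable name matches the desired key, { title, price } is just a shorter way of writing { title: title, price: price }. A quick plain-JavaScript illustration (the sample values are made up):

```javascript
let title = 'A Light in the Attic';
let price = '£51.77';

// Shorthand: the key names are taken from the variable names.
const shorthand = { title, price };
// Exactly equivalent to the long form:
const longForm = { title: title, price: price };

console.log(shorthand); // { title: 'A Light in the Attic', price: '£51.77' }
```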
The returned object ends up in the result constant. All that remains is to return it from scrape() so that its contents are output to the console:

```javascript
return result;
```
Here is the full code of the script:

```javascript
const puppeteer = require('puppeteer');

let scrape = async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();

  await page.goto('http://books.toscrape.com/');
  await page.click('#default > div > div > div > div > section > div:nth-child(2) > ol > li:nth-child(1) > article > div.image_container > a > img');
  await page.waitFor(1000);

  const result = await page.evaluate(() => {
    let title = document.querySelector('h1').innerText;
    let price = document.querySelector('.price_color').innerText;
    return {
      title,
      price
    }
  });

  browser.close();
  return result;
};

scrape().then((value) => {
  console.log(value); // output the result!
});
```

Run it with node scrape.js. The following should appear in the console:

```
{ title: 'A Light in the Attic', price: '£51.77' }
```
What if we want to collect the titles and prices of all the books on the page, rather than just one? In that case, the page.evaluate() callback turns into the following blank:

```javascript
const result = await page.evaluate(() => {
  let data = []; // the array that will hold the collected data
  let elements = document.querySelectorAll('xxx'); // select all book containers

  // loop over the elements
  //   extract the title
  //   extract the price

  data.push({title, price}); // push an object with the data into the array

  return data; // return the array
});
```
After filling in the blanks, the final script looks like this:

```javascript
const puppeteer = require('puppeteer');

let scrape = async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();

  await page.goto('http://books.toscrape.com/');

  const result = await page.evaluate(() => {
    let data = []; // the array that will hold the collected data
    let elements = document.querySelectorAll('.product_pod'); // select all book containers

    for (var element of elements) { // loop over the list of books
      let title = element.childNodes[5].innerText; // extract the title
      let price = element.childNodes[7].children[0].innerText; // extract the price

      data.push({title, price}); // push an object with the data into the array
    }

    return data; // return the array
  });

  browser.close();
  return result; // return the collected data
};

scrape().then((value) => {
  console.log(value); // output the result!
});
```
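The for...of loop above is the imperative form of a common collect-and-extract pattern: select all matching nodes, pull a few fields out of each, and accumulate one object per item. The same result can be produced with Array.from and a mapping function. The sketch below shows the pattern on plain mock objects (the fields are invented stand-ins for the innerText reads, used only for illustration):

```javascript
// Mock stand-ins for the nodes that querySelectorAll('.product_pod') would return.
const elements = [
  { titleText: 'Book A', priceText: '£10.00' },
  { titleText: 'Book B', priceText: '£12.50' },
];

// One {title, price} object per element, same shape as the scraper's output.
const data = Array.from(elements, (element) => ({
  title: element.titleText, // in the real script: element.childNodes[5].innerText
  price: element.priceText, // in the real script: element.childNodes[7].children[0].innerText
}));

console.log(data);
```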
Source: https://habr.com/ru/post/341348/