
Web crawler using Python and Chrome

Good afternoon, dear friends.

Recently, sitting on the couch, I decided I wanted to make my own spider that could download things from websites. But it should not just fetch pages naively; it should behave like a real, well-mannered browser (i.e., execute JavaScript).

Familiar tools came to mind: Selenium, PhantomJS, Splash, and so on. None of them really appealed to me, each for its own reasons.

Now, to the point. Headless Chrome was released recently, which means we can now use Chrome itself as a crawler (more or less). However, I did not find any decent utilities for driving it that way. From the list of third-party clients I found only chrome-remote-interface; the rest looked extremely dull and incomprehensible at first glance.

After running through the documentation and the existing projects, and making sure that nobody had really implemented a client in Python, I decided to write my own.

The Chrome remote debugging protocol is quite simple. First, we need to start Chrome with the following flags:

chrome --incognito --remote-debugging-port=9222 --headless 
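
If you would rather not start the browser by hand, it is easy to launch it from Python. A minimal sketch, assuming the binary is called chrome and is on your PATH (on many systems it is google-chrome or chromium instead):

import subprocess
import time

# Launch Headless Chrome with the remote debugging port open.
# The binary name is an assumption; adjust it for your system.
chrome = subprocess.Popen([
    'chrome',
    '--incognito',
    '--remote-debugging-port=9222',
    '--headless',
])
time.sleep(1)  # give Chrome a moment to open the port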


We now have an HTTP API available at http://127.0.0.1:9222/json/ , in which I discovered methods such as list, new, activate, and version, which are used to manage tabs.
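
To illustrate, here is a small sketch of calling these endpoints from the standard library (nothing beyond urllib is needed; the endpoint names are the ones listed above):

import json
import urllib.request

def devtools(endpoint):
    # The tab-management API is plain HTTP that returns JSON.
    with urllib.request.urlopen('http://127.0.0.1:9222/json/' + endpoint) as resp:
        return json.loads(resp.read().decode())

for tab in devtools('list'):  # enumerate open tabs
    print(tab['id'], tab.get('url'), tab.get('webSocketDebuggerUrl'))

print(devtools('version'))  # browser and protocol versions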

Also, if we simply open http://127.0.0.1:9222/ , we get a full web debugger that completely mimics the standard one. It is very convenient for keeping track of how Chrome's API methods behave (the debugging pane on the right is emulated inside the window, and the browser viewport is drawn on a canvas).

In fact, in the output of the list method we can find the address of the WebSocket through which we can communicate with a tab.

Then we connect to the desired tab via its WebSocket and talk to it by exchanging JSON messages: we can call methods from the protocol domains (Page, Network, Runtime, and so on) and subscribe to the events they emit.
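To make the exchange concrete, here is a minimal sketch that drives a tab over a raw WebSocket, assuming the third-party websockets package is installed (the library described below hides all of this plumbing):

import asyncio
import json
import urllib.request

import websockets  # pip install websockets

async def main():
    # Take the WebSocket URL of the first tab from the list method.
    with urllib.request.urlopen('http://127.0.0.1:9222/json/list') as resp:
        ws_url = json.loads(resp.read().decode())[0]['webSocketDebuggerUrl']
    async with websockets.connect(ws_url) as ws:
        # A command is a JSON object with an id, a method and params.
        await ws.send(json.dumps({
            'id': 1,
            'method': 'Page.navigate',
            'params': {'url': 'https://example.com'},
        }))
        # Responses and events arrive as JSON messages on the same socket.
        print(await ws.recv())

asyncio.get_event_loop().run_until_complete(main())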
After a few days of torment writing the functionality I needed, I ended up with a library of my own.

What is in it: tab management, the protocol methods exposed as Python attributes (tab.Page.navigate, tab.Network.enable and so on), a callback system for protocol events, and small helpers such as unpack_response_body.


Here is what a program looks like that loads a page and prints the length of each response to a request:

import asyncio
import chrome_remote_interface

if __name__ == '__main__':
    class callbacks:
        async def start(tabs):
            # Called once we are connected to Chrome: open a new tab.
            await tabs.add()
        async def tab_start(tabs, tab):
            # Called for every new tab: enable events and navigate.
            await tab.Page.enable()
            await tab.Network.enable()
            await tab.Page.navigate(url='http://github.com')
        async def network__loading_finished(tabs, tab, requestId, **kwargs):
            # Fired when a network response has finished loading.
            try:
                body = tabs.helpers.unpack_response_body(
                    await tab.Network.get_response_body(requestId=requestId))
                print('body length:', len(body))
            except tabs.FailReponse as e:
                print('fail:', e)
        async def page__frame_stopped_loading(tabs, tab, **kwargs):
            # The page has finished loading: stop the crawler.
            print('finish')
            tabs.terminate()
        async def any(tabs, tab, callback_name, parameters):
            # Catch-all for events without a dedicated handler.
            pass  # print('Unknown event fired', callback_name)
    asyncio.get_event_loop().run_until_complete(
        chrome_remote_interface.Tabs.run('localhost', 9222, callbacks))

Here we use the callback system. The most interesting callbacks are start and any: start is called once the connection to Chrome is established (here it opens a new tab), and any is fired for every event that has no dedicated handler of its own.


My code seems fairly elegant to me, and I will keep using it, even though the protocol is still a bit raw. If you want to discuss it, write here, on GitHub, or email me.

However, there is one "but" over which I still cry: with the remote API it is impossible to intercept and modify requests and responses. As I understand it, this is possible through Mojo, which allows Chrome to be used as a library.

However, I decided that compiling an unstable Chrome, together with the lack of a Python binding, would be too much misfortune for me (at the moment only C++ and JavaScript bindings are in development).

I hope the article was helpful. Thanks.


Source: https://habr.com/ru/post/328800/

