
Load distribution when parsing websites and connecting additional cloud resources

This post describes a library that registers nodes and routes incoming requests from outside to a specific node.

How did the idea for this project come about?


When I needed to parse sites in large quantities, I first tried to do it with Selenium Grid and then switched to Selenoid. Selenoid worked, but it had a lot that I didn't need, such as browser versions and options, and, most importantly, it had no auto scaling (though that is not what Selenoid is for). The cluster is idle 90% of the time, and then a large load arrives that the server cannot cope with. The result is a lot of money spent on hardware that sits idle almost all the time and still cannot cope at peak. I thought it would be great if the number of running browsers grew as the load arrived and the browsers were removed again once the load disappeared. Fortunately, this can be implemented, for example, with AWS EC2.


A little about the structure



Option 1. Request from a node


The node makes a request to the hub with the token header set to the token from its environment variable. The hub checks the token from the request against its own, and if they match, it registers the node. The hub then starts pinging this node every 4 seconds. If 5 ping attempts fail, the node is removed as having lost its connection. The node, in turn, sends its own ping to the hub once every 10 seconds in case the connection with the hub was lost. This way the cluster restores its own state after a connection breaks.
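
The registration and ping rules above can be sketched in Go roughly as follows. This is only an illustration, not the code from the repository: the names Registry, Register and ReportPing are hypothetical, and I assume here that the 5 failed pings must be consecutive, which the description above does not state.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// Timings taken from the text above; the constant names are hypothetical.
const (
	hubPingInterval  = 4 * time.Second  // hub pings each node this often
	nodePingInterval = 10 * time.Second // node re-announces itself this often
	maxFailedPings   = 5                // failed pings before a node is dropped
)

// ErrBadToken is returned when the node's token does not match the hub's.
var ErrBadToken = errors.New("token mismatch")

// Registry is a hypothetical in-memory list of registered nodes.
type Registry struct {
	mu     sync.Mutex
	token  string
	failed map[string]int // node address -> failed pings in a row (assumption)
}

// NewRegistry creates a registry that accepts nodes presenting the given token.
func NewRegistry(token string) *Registry {
	return &Registry{token: token, failed: make(map[string]int)}
}

// Register remembers the node if its token matches the hub's token.
// Registering a known node again simply resets its failure counter,
// which is what lets a node rejoin after a lost connection.
func (r *Registry) Register(addr, token string) error {
	if token != r.token {
		return ErrBadToken
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	r.failed[addr] = 0
	return nil
}

// ReportPing records the result of one hub-to-node ping and reports
// whether the node was removed because it reached maxFailedPings.
func (r *Registry) ReportPing(addr string, ok bool) (removed bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, known := r.failed[addr]; !known {
		return false
	}
	if ok {
		r.failed[addr] = 0
		return false
	}
	r.failed[addr]++
	if r.failed[addr] >= maxFailedPings {
		delete(r.failed, addr)
		return true
	}
	return false
}

// Has reports whether the node is currently registered.
func (r *Registry) Has(addr string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	_, known := r.failed[addr]
	return known
}
```

In a real hub, a goroutine driven by time.NewTicker(hubPingInterval) would call ReportPing for every node, and each node would call the registration endpoint every nodePingInterval, so a node that was dropped gets registered again on its next ping.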

Option 2. Request from a user


The user makes a request to the hub with the token and number headers set. The token is needed so that only trusted clients can use the cluster, and the number lets us create several different sessions from the same client IP. Each session has its own unique number.

For each request, the hub checks whether a route has already been created. If it has, the request is simply forwarded to the corresponding node. If there is no such route, the user's request is placed in a queue to wait for a node to be released. As soon as one of the nodes is released, the hub creates a route between the user session and the freed node. From then on, all requests for this session go to that specific node.
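
The routing logic described above could look something like this in Go. It is a simplified sketch under my own assumptions: the types Router and SessionKey are hypothetical names, the queue is a plain first-in, first-out slice, and a waiting request just gets ok == false back instead of blocking until a node is free.

```go
package main

import "sync"

// SessionKey identifies a user session: client IP plus the "number" header.
type SessionKey struct {
	ClientIP string
	Number   string
}

// Router is a hypothetical route table of the hub.
type Router struct {
	mu      sync.Mutex
	routes  map[SessionKey]string // session -> node address
	free    []string              // nodes with no session assigned
	waiting []SessionKey          // sessions queued for a node, oldest first
}

// NewRouter creates a router in which all given nodes are free.
func NewRouter(nodes ...string) *Router {
	return &Router{
		routes: make(map[SessionKey]string),
		free:   append([]string(nil), nodes...),
	}
}

// Resolve returns the node for the session. An existing route is reused;
// otherwise a free node is taken. If no node is free, the session is
// queued (once) and ok is false.
func (r *Router) Resolve(key SessionKey) (node string, ok bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if n, found := r.routes[key]; found {
		return n, true
	}
	if len(r.free) > 0 {
		n := r.free[0]
		r.free = r.free[1:]
		r.routes[key] = n
		return n, true
	}
	for _, w := range r.waiting {
		if w == key {
			return "", false // already in the queue
		}
	}
	r.waiting = append(r.waiting, key)
	return "", false
}

// Release drops the session's route. The freed node goes to the oldest
// waiting session, or back to the free list if nobody is waiting.
func (r *Router) Release(key SessionKey) {
	r.mu.Lock()
	defer r.mu.Unlock()
	n, found := r.routes[key]
	if !found {
		return
	}
	delete(r.routes, key)
	if len(r.waiting) > 0 {
		next := r.waiting[0]
		r.waiting = r.waiting[1:]
		r.routes[next] = n
		return
	}
	r.free = append(r.free, n)
}
```

With a single node, the first session gets it immediately, a second session from the same IP with a different number is queued, and after Release of the first session the second one resolves to that node.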

One minute after the user closes the connection, the node is released and handed over to another user request.

Link to the project repository

Summary


The post turned out more like a usage guide, but I still believe this project may be useful.

P.S. Some clarifications


This is the first project I have written in Go, so if anyone has suggestions or remarks, please leave them in the comments (I'm not even counting on a PR, but that would be super cool!)

Source: https://habr.com/ru/post/420303/

