
Distributed computing for housing search

Everyone has heard of distributed computing projects tackling large-scale tasks: the search for extraterrestrial life, drugs against AIDS and cancer, prime numbers, unique Sudoku solutions. It is all very entertaining, but no more than that, because the person who shares their computer's resources gets no practical benefit out of it.

Today I will talk about distributed computing that solves your own problems. Well, not all of them, of course, only some of those related to the search for housing. I recently wrote about the Sobnik project, a Chrome extension that detects middlemen on classified-ad boards. Two weeks ago a new version of the program was released, in which the work of scanning and analyzing ads is distributed across users' computers. Since then about a million ads from more than a thousand Russian cities have been processed, and this is only the beginning. Details, technical specifics, and a few more numbers await you under the cut.



Why distributed computing?


To identify an intermediary, the program needs to load the advertisement, parse it, analyze the photos, add the results to the database, and run several kinds of searches, in particular by phone number. Initially the user had to open a specific ad for the plugin to perform this analysis. That is inconvenient: there are hundreds of ads, and nearly all of them come from intermediaries. The manual steps had to go, because computers exist to do things automatically! A rough outline of the per-ad pipeline is sketched below.
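
The sketch below only mirrors the steps named above; every function in it is a hypothetical placeholder rather than real Sobnik code, and the final criterion (a phone number that keeps reappearing across ads) is a deliberate simplification.

```go
package main

import "fmt"

// Hypothetical placeholders mirroring the steps from the text; they are
// not the real Sobnik implementation.
type adPage struct {
	HTML   string
	Photos [][]byte
}

func loadAd(url string) adPage             { return adPage{} } // load the advertisement
func parsePhone(p adPage) string           { return "" }       // parse the ad text
func phoneFromPhotos(p adPage) string      { return "" }       // analyze the photos
func saveResult(url, phone string)         {}                  // add the results to the database
func phoneSeenInManyAds(phone string) bool { return false }    // search, e.g. by phone number

// checkAd strings the steps together for a single advertisement.
func checkAd(url string) bool {
	page := loadAd(url)
	phone := parsePhone(page)
	if phone == "" {
		phone = phoneFromPhotos(page)
	}
	saveResult(url, phone)
	return phoneSeenInManyAds(phone) // true means "likely an intermediary"
}

func main() {
	fmt.Println(checkAd("https://example.org/ad/1"))
}
```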

The first plan was a centralized ad scanner (hereafter, the crawler): take PhantomJS, adapt the plugin's JS code to run under it, and do the scanning on the server. Avito (the only classified-ad board the plugin works with) would, of course, be unhappy about a large-scale scan and would block the server's IP, but in that case I planned to buy more IP addresses or use a pile of proxies. Yet the longer I worked and thought in this direction, the clearer it became that the approach was a bad one:
  1. Servers cost money I did not want to spend: the project is free, and I dream of keeping it that way.
  2. There is no reliability: I had no desire to compete with Avito in the hunt for unbanned proxies.
  3. The scalability of such a solution came down to spending even more money, see point 1.

Soon I came to the conclusion that the only solution free of these shortcomings would be an automatic crawler built into the plugin, working in users' browsers even when they are not busy searching for housing.

Distributed crawler


Only a small share of users actively work with Avito at any given moment, yet a browser with the plugin installed runs almost all day. What if each user's browser scanned, say, one ad per minute? The combined capacity would be enough to quickly scan the ads that active users are currently working with, and Avito would have no reason to be upset, because no single IP would ever produce excessive load. Probably not everyone would want to share their computer's resources, but I believed there would still be enough "conscious citizens". The idea looked very tempting.

Digging into the technical side of the new approach, I realized that the existing server simply could not handle the increased load. Photo processing for phone-number detection already consumed almost the entire CPU, and what would happen if the flow of ads grew by an order of magnitude? The free AWS Free Tier virtual server had to cope, otherwise the project's life was in danger. So the first step was to move image processing to the scalable part of the system: the client. The plugin's code is open source, and of course I feared that publishing the algorithm would make it easier for malicious spammers to find weak points and ways to fool it. But I saw no other way for the project to develop and decided: come what may.

So now Sobnik opens a separate tab in your browser. Periodically, about once a minute, it requests a task from the server (the address of an ad to scan), loads that ad in its tab, parses it, processes the images, and sends the results to the server. Scan tasks are created when one of the users opens a list of ads: nothing is scanned in advance or in vain, ads are a perishable commodity.
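
Neither the exact protocol nor the message format is spelled out here, so the following is only a rough Go sketch of how such a task hand-out might look on the server side; the /task path, the Task structure, the queue size, and the 204 convention are my assumptions.

```go
package main

import (
	"encoding/json"
	"net/http"
)

// Task is a hypothetical unit of crawler work: the URL of an ad that some
// user's browser should load, parse, and report back on.
type Task struct {
	ID  int64  `json:"id"`
	URL string `json:"url"`
}

// tasks is a bounded queue of pending scan jobs, filled when a user opens
// a list of ads and drained by idle clients polling the server.
var tasks = make(chan Task, 10000)

// enqueueTasks is called when a user opens a list of ads: each ad URL
// becomes a task for some idle crawler; overflow is simply dropped,
// since ads are perishable anyway.
func enqueueTasks(urls []string) {
	for i, u := range urls {
		select {
		case tasks <- Task{ID: int64(i), URL: u}:
		default: // queue is full, drop the task
		}
	}
}

// getTask hands one task to a polling client, or answers 204 No Content
// so the client simply tries again about a minute later.
func getTask(w http.ResponseWriter, r *http.Request) {
	select {
	case t := <-tasks:
		json.NewEncoder(w).Encode(t)
	default:
		w.WriteHeader(http.StatusNoContent)
	}
}

func main() {
	http.HandleFunc("/task", getTask)
	http.ListenAndServe(":8080", nil)
}
```

In this sketch a client that receives 204 simply waits and polls again, which matches the once-a-minute rhythm described above.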

There are not yet that many conscious citizens willing to share their PC's resources, only a couple of hundred, but that is quite enough to process 60 thousand ads per day and cover the needs of thousands of plugin users. A list of hundreds of ads opened on Avito is usually processed within a minute.

It all sounds fairly simple, but I am still fascinated by the ability to use hundreds of other people's computers to solve a common problem. The browser is an ideal platform for this kind of application:
  1. JavaScript can do almost anything these days.
  2. JS code is cross-platform (except for calls to the browser API).
  3. The extension installs in one click.
  4. The Web Store distributes software updates.
  5. The browser runs most of the day.

If you have dreamed up your own SETI@home, take a closer look: perhaps all you need is to learn the API of your favorite browser. Much has been written about this already, but the heyday of distributed computing that solves the practical problems of the volunteers themselves, rather than those of abstract human civilization, is still nowhere in sight.

One nuance deserves attention: people get very annoyed when browser tabs open on their own. If you have to open them, spend enough effort to make that behavior expected for users (instructions, an FAQ, clear informational messages). Taking this opportunity, I apologize to everyone who was surprised by the plugin's behavior after the new version was released; I should have spent more time informing you about such a significant change.

Efficient server


A server that coordinates work in a distributed computing system must be very efficient, because it is not so easy to scale. The Sobnik server not only coordinates but also solves the applied task itself, searching the database, so the requirements on it are even higher. And while the client-side functionality was only slightly extended, the server was redone almost completely (as a reminder, it is written in Go).

Initially MongoDB was responsible for storing the data and performing the searches; now all working data is kept in RAM. Two processes run on the server: a frontend and a backend. The frontend talks to clients, maintains the queue of tasks for the distributed crawler, a queue of ads to hand over to the backend, and a small cache with the status of ads (intermediary / owner). The backend keeps its data in RAM (a digest of the incoming ads), performs searches over it, asynchronously writes incoming ads to MongoDB, and asynchronously serves the frontend's queries. The two processes communicate over ZeroMQ.
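
The text names ZeroMQ but not the socket pattern or the Go binding, so the sketch below is only an illustration of the idea: a PUSH/PULL pair over the pebbe/zmq4 binding, with the endpoint address and message format assumed as well.

```go
package ipc

import (
	"log"

	zmq "github.com/pebbe/zmq4"
)

// RunBackendConsumer pulls serialized ads pushed by the frontend process
// and feeds them to the in-memory index (handle is a placeholder).
func RunBackendConsumer(handle func([]byte)) error {
	sock, err := zmq.NewSocket(zmq.PULL)
	if err != nil {
		return err
	}
	defer sock.Close()
	if err := sock.Bind("tcp://127.0.0.1:5555"); err != nil {
		return err
	}
	for {
		msg, err := sock.RecvBytes(0)
		if err != nil {
			log.Println("recv:", err)
			continue
		}
		handle(msg)
	}
}

// NewFrontendPusher connects to the backend once and returns a function
// that sends one serialized ad without blocking; if the backend cannot
// keep up, the send fails and the caller may queue or drop the ad.
func NewFrontendPusher() (func([]byte) error, error) {
	sock, err := zmq.NewSocket(zmq.PUSH)
	if err != nil {
		return nil, err
	}
	if err := sock.Connect("tcp://127.0.0.1:5555"); err != nil {
		return nil, err
	}
	return func(payload []byte) error {
		_, err := sock.SendBytes(payload, zmq.DONTWAIT)
		return err
	}, nil
}
```

The important property is that the frontend never waits for the backend: a send either succeeds immediately or fails, and the frontend falls back to its own queue.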

MongoDB is used only in its most efficient modes: sequential writes of incoming ads, and a sequential read in a single stream when the backend restarts. On startup the backend reads the current ads from the DBMS, runs the same searches and analysis as it does for data arriving from clients, and fills its in-memory database. This approach makes it easy to modify the intermediary-detection algorithm and apply it to the entire database simply by restarting the backend.
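
A minimal sketch of such a replay on startup, assuming the mgo driver that was common in Go at the time; the database and collection names and the Ad fields are placeholders, not the project's real schema.

```go
package main

import (
	"log"

	"gopkg.in/mgo.v2"
)

// Ad is a hypothetical digest of a stored advertisement.
type Ad struct {
	ID    string `bson:"_id"`
	Phone string `bson:"phone"`
	City  string `bson:"city"`
}

// reloadFromMongo sequentially reads all stored ads in one stream and
// feeds them through the same analysis used for live traffic, so a
// restart re-applies the current detection algorithm to the whole base.
func reloadFromMongo(analyze func(Ad)) error {
	session, err := mgo.Dial("localhost")
	if err != nil {
		return err
	}
	defer session.Close()

	iter := session.DB("sobnik").C("ads").Find(nil).Sort("_id").Iter()
	var ad Ad
	n := 0
	for iter.Next(&ad) {
		analyze(ad) // same code path as for ads arriving from clients
		n++
	}
	log.Printf("replayed %d ads into the in-memory index", n)
	return iter.Close()
}

func main() {
	if err := reloadFromMongo(func(a Ad) { /* feed the in-memory index */ }); err != nil {
		log.Fatal(err)
	}
}
```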

While the backend is starting up and filling its database, the frontend keeps serving clients: incoming ads are placed in a queue, requests for ad status are answered from the cache, and tasks are handed out from their own queue.

In a system with nearly a dozen parallel threads (goroutines) there is not a single mutex. Each goroutine has exclusive access to its own data and serves the other goroutines through channels (having these built into Go is simply a delight).
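
A toy illustration of that ownership style, not the actual Sobnik code: a single goroutine owns the status map and answers lookups and updates arriving over channels, so no mutex is ever needed.

```go
package main

import "fmt"

// statusQuery asks the owning goroutine for one ad's status and receives
// the answer on its own reply channel.
type statusQuery struct {
	adID  string
	reply chan string
}

// statusOwner is the only goroutine that ever touches the map; everyone
// else talks to it through the two channels.
func statusOwner(updates <-chan [2]string, queries <-chan statusQuery) {
	statuses := make(map[string]string)
	for {
		select {
		case u := <-updates:
			statuses[u[0]] = u[1] // [id, status]
		case q := <-queries:
			q.reply <- statuses[q.adID]
		}
	}
}

func main() {
	updates := make(chan [2]string)
	queries := make(chan statusQuery)
	go statusOwner(updates, queries)

	updates <- [2]string{"ad-42", "intermediary"}
	q := statusQuery{adID: "ad-42", reply: make(chan string)}
	queries <- q
	fmt.Println(<-q.reply) // prints "intermediary"
}
```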

All inter-process and inter-thread communication is non-blocking. If the receiving side cannot keep up, the data is either queued (incoming ads on the frontend) or discarded (other requests). Every queue, and the cache, has a size limit, and when it fills up the oldest records are dropped: the freshest information is the most valuable to Sobnik's clients. If a request cannot be served (the data channel is busy), the client receives a corresponding response and retries later, at increasing intervals.
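
One way to express the "never block, drop the oldest" rule with a plain buffered channel; again a sketch of the idea rather than the project's real queue.

```go
package main

import "fmt"

// enqueueDropOldest adds an item to a bounded queue without ever blocking:
// if the buffer is full, the oldest element is evicted to make room.
func enqueueDropOldest(queue chan string, item string) {
	for {
		select {
		case queue <- item:
			return // enqueued
		default:
			select {
			case <-queue: // queue is full: discard the oldest record
			default:
			}
		}
	}
}

func main() {
	q := make(chan string, 2)
	enqueueDropOldest(q, "ad-1")
	enqueueDropOldest(q, "ad-2")
	enqueueDropOldest(q, "ad-3") // evicts "ad-1"
	fmt.Println(<-q, <-q)        // ad-2 ad-3
}
```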

This architecture was built to provide maximum reliability and availability of the service, and so far it has paid off: with nearly a million ads in the database and up to a million HTTP requests processed per day, the load average on the virtual server stays steadily below 10%.

We are looking for housing in a new way!


Sobnik has successfully survived its rebirth: it has gone distributed and grown much stronger. A month ago it was just an interesting idea, a prototype that many installed out of curiosity. Now it is a working tool that is convenient to use and genuinely helps filter out spam. This became possible thanks to you, dear users, because it is your computers that are solving the common task.

Of course, the system is far from perfect: there are plenty of questions about the photo detector, and there are problems with analyzing the text of ads. But it is now clear that Sobnik has enough resources to solve these problems. All that remains is to feed the author's enthusiasm, which you can do with criticism, suggestions, and valuable comments.

Source: https://habr.com/ru/post/240941/
