
Prehistory
To be honest, I have never bought used items on bulletin boards before. But when another economic crisis happened and the need made, I had to turn my attention to Avito.
When looking at the offers, it immediately caught my eye that some of the ads look dubious for a number of signs: a low price, an inaccurate description, etc., according to which it seemed that they were selling something other than what they were selling, or they did not know what they were selling. An image from the past, a flea market of the 90s, from which it was possible to return with both purchases and empty pockets or a cut bag, immediately appeared in my memory.
As you know, Avito does not provide information about other ads of the seller, so the ad text often uses keywords or hashtags that the buyer can use to search. But it should be noted that the Avito administration does not welcome the presence in the ad text of any information that is not relevant to the sale item, as stated in the placement rules. If the placement rules are violated, Avito can block both the ad and the merchant account.
')
The search engine gave me a few sites, parasitizing on Avito and other bulletin boards, mainly related to used cars. Most of them offer to search ads by phone number of the seller in the database of pre-downloaded ads. All of them have shortcomings that are not important for me: an intricate interface, a paid service, an update with a certain lag.
Sharing Steven Levy’s opinion on free access to information, it was decided to analyze Avito to develop its own sravnito.ru service with blackjack and all the tasks, namely, with a simple interface and free access.
Based on the analysis, the main attributes of the ad were identified:
- title
- publication date
- price
- location
- phone number (either a number image, or a hash from it, or a recognized OCR value)
Architecture
Now directly about the architecture of the software package, on which the search service of all the ads of the seller is based:

Storage
MySQL (Maria DB) with the InnoDB engine is used as storage for storing ads and their screenshots.
DB1 stores ads with basic attributes. To reduce the amount of memory occupied by the data, the VARCHAR type is used for text fields, as their length varies from ad to ad. About half a million lines are added every day, among which are the ads themselves, as well as logs and service information. With this dynamic, the storage can rightfully be attributed to Big Data. Of the features of the settings, you can select the following parameters:
max_heap_table_size = 512M innodb_buffer_pool_size = 3G
The tables in the heap are used to optimize queries from several large tables when first the data is selected in a temporary table:
CREATE TEMPORARY TABLE _temp_table ENGINE=MEMORY AS ( SELECT field FROM table WHERE key = i_key LIMIT i_limit);
which then connects with another:
SELECT table.* FROM table JOIN _temp_table ON table.field = _temp_table.field;
The DB2 screenshots of the ads are stored as they appear to the browser user. Before recording, the screenshots are compressed in JPEG with quality = 5, which provides an image file size of approximately 20Kb. It is believed that with a BLOB size of no more than 200Kb, the storage performance of files in MySQL is not inferior to NoSQL-storage, which allows you to remain in the comfort zone of a relational database with all its advantages. Such a compressed screenshot, despite the minimal image quality, allows the user to make sure that the specified ad actually existed, and to see at least a schematic image of the product.
All logic is implemented in stored procedures to encapsulate code-dependent code in the DBMS itself. Thus, DBMS clients have authority only to access stored procedures that are idempotent. As an additional plus, we get the lack of the ability to implement SQL injections.
Storage API
The M1 server is implemented as a golang microservice and provides a RESTful API for saving announcements, screenshots, and also for reading data displayed on the advertisement search service page. There is no reason to use any frameworks or external libraries to implement RESTful on golang, so only standard libraries are used, except for one:
import ( "database/sql" "encoding/json" "net/http" _ "github.com/go-sql-driver/mysql" )
GET and POST requests from clients are processed and the corresponding stored procedures of DB1, DB2 are called.
Ad loaders
Ads are downloaded from the Avito website by the S (1) -S (N) downloaders, written in Java using the Selenium WebDriver library. After receiving the ad attributes and a screenshot from Avito, the loader contacts the M1 server for data transfer. Also, feedback is implemented to control the bootloaders, who periodically poll the server M1 for commands, such as “stop”, “start”.
Captcha
To solve the captcha that Avito sometimes requests, there is a C1 server, similar to the M1 server implemented in golang and providing a RESTful API. Solving captcha is carried out in two ways:
- using the service rucaptcha.com
- manually in the application for Android
To communicate with rucaptcha.com, their API is used. For manual guessing, an Android application written in Java is used, which displays the image and accepts the answer. The C1 server decides whether to redirect captcha to rucaptcha.com or to the Android application, depending on the number of requests that have accumulated in the queue. Having received a captcha solution, the C1 server sends the response to the loader that has requested it.
Monitoring
Similar to the Android application for solving a captcha, there is an application for monitoring. The Android application for monitoring accesses the server M1, which in turn refers to the database where the logs are aggregated, on the basis of which one can judge the number of downloaded ads, malfunctions, etc.
Conclusion
Further development of the service may be as follows:
- Creation of loaders for other bulletin boards, which will allow you to cross-search all the ads of one seller
- Using the complex to process other data, for example, downloading from social networks to search for all posts by account name
In this and in another case, only loaders that are easily integrated with servers M1 and C1 are subject to development. Refinement of other parts of the system is not required.
If you make the API of servers M1, C1 public with authorization and add a key field to the database structure for dividing by clients, you can provide a service for storing data and processing captcha as SaaS. Client data can be stored in the BLOB in the form of JSON, while finalizing the database with stored procedures and the API.
What can be said about the selected technologies:
- MySQL - a classic of the genre, no comment
- Java - loaders can run on coffee makers and refrigerators
- golang - microservices are developed very quickly, easily deployed by copying a single binary to the server
I would appreciate comments and discussion.