📜 ⬆️ ⬇️

Add-on to Avito. Startup or architecture example



Prehistory


To be honest, I have never bought used items on bulletin boards before. But when another economic crisis happened and the need made, I had to turn my attention to Avito.

When looking at the offers, it immediately caught my eye that some of the ads look dubious for a number of signs: a low price, an inaccurate description, etc., according to which it seemed that they were selling something other than what they were selling, or they did not know what they were selling. An image from the past, a flea market of the 90s, from which it was possible to return with both purchases and empty pockets or a cut bag, immediately appeared in my memory.

As you know, Avito does not provide information about other ads of the seller, so the ad text often uses keywords or hashtags that the buyer can use to search. But it should be noted that the Avito administration does not welcome the presence in the ad text of any information that is not relevant to the sale item, as stated in the placement rules. If the placement rules are violated, Avito can block both the ad and the merchant account.
')
The search engine gave me a few sites, parasitizing on Avito and other bulletin boards, mainly related to used cars. Most of them offer to search ads by phone number of the seller in the database of pre-downloaded ads. All of them have shortcomings that are not important for me: an intricate interface, a paid service, an update with a certain lag.

Sharing Steven Levy’s opinion on free access to information, it was decided to analyze Avito to develop its own sravnito.ru service with blackjack and all the tasks, namely, with a simple interface and free access.

Based on the analysis, the main attributes of the ad were identified:


Architecture


Now directly about the architecture of the software package, on which the search service of all the ads of the seller is based:



Storage


MySQL (Maria DB) with the InnoDB engine is used as storage for storing ads and their screenshots.

DB1 stores ads with basic attributes. To reduce the amount of memory occupied by the data, the VARCHAR type is used for text fields, as their length varies from ad to ad. About half a million lines are added every day, among which are the ads themselves, as well as logs and service information. With this dynamic, the storage can rightfully be attributed to Big Data. Of the features of the settings, you can select the following parameters:

max_heap_table_size = 512M innodb_buffer_pool_size = 3G 

The tables in the heap are used to optimize queries from several large tables when first the data is selected in a temporary table:

 CREATE TEMPORARY TABLE _temp_table ENGINE=MEMORY AS ( SELECT field FROM table WHERE key = i_key LIMIT i_limit); 

which then connects with another:

 SELECT table.* FROM table JOIN _temp_table ON table.field = _temp_table.field; 

The DB2 screenshots of the ads are stored as they appear to the browser user. Before recording, the screenshots are compressed in JPEG with quality = 5, which provides an image file size of approximately 20Kb. It is believed that with a BLOB size of no more than 200Kb, the storage performance of files in MySQL is not inferior to NoSQL-storage, which allows you to remain in the comfort zone of a relational database with all its advantages. Such a compressed screenshot, despite the minimal image quality, allows the user to make sure that the specified ad actually existed, and to see at least a schematic image of the product.

All logic is implemented in stored procedures to encapsulate code-dependent code in the DBMS itself. Thus, DBMS clients have authority only to access stored procedures that are idempotent. As an additional plus, we get the lack of the ability to implement SQL injections.

Storage API


The M1 server is implemented as a golang microservice and provides a RESTful API for saving announcements, screenshots, and also for reading data displayed on the advertisement search service page. There is no reason to use any frameworks or external libraries to implement RESTful on golang, so only standard libraries are used, except for one:

 import ( "database/sql" "encoding/json" "net/http" _ "github.com/go-sql-driver/mysql" ) 

GET and POST requests from clients are processed and the corresponding stored procedures of DB1, DB2 are called.

Ad loaders


Ads are downloaded from the Avito website by the S (1) -S (N) downloaders, written in Java using the Selenium WebDriver library. After receiving the ad attributes and a screenshot from Avito, the loader contacts the M1 server for data transfer. Also, feedback is implemented to control the bootloaders, who periodically poll the server M1 for commands, such as “stop”, “start”.

Captcha


To solve the captcha that Avito sometimes requests, there is a C1 server, similar to the M1 server implemented in golang and providing a RESTful API. Solving captcha is carried out in two ways:


To communicate with rucaptcha.com, their API is used. For manual guessing, an Android application written in Java is used, which displays the image and accepts the answer. The C1 server decides whether to redirect captcha to rucaptcha.com or to the Android application, depending on the number of requests that have accumulated in the queue. Having received a captcha solution, the C1 server sends the response to the loader that has requested it.

Monitoring


Similar to the Android application for solving a captcha, there is an application for monitoring. The Android application for monitoring accesses the server M1, which in turn refers to the database where the logs are aggregated, on the basis of which one can judge the number of downloaded ads, malfunctions, etc.

Conclusion


Further development of the service may be as follows:


In this and in another case, only loaders that are easily integrated with servers M1 and C1 are subject to development. Refinement of other parts of the system is not required.

If you make the API of servers M1, C1 public with authorization and add a key field to the database structure for dividing by clients, you can provide a service for storing data and processing captcha as SaaS. Client data can be stored in the BLOB in the form of JSON, while finalizing the database with stored procedures and the API.

What can be said about the selected technologies:


I would appreciate comments and discussion.

Source: https://habr.com/ru/post/317598/


All Articles