Last semester, Computer Science Center students Sergey Gorbatyuk and Peter Karol worked on deduplicating ads on Yandex.Real Estate under project manager Vladimir Gorovoy. Here they describe how the project works and what results they achieved.

Project Task
Yandex.Real Estate is a service for listings about the sale and rental of apartments, rooms, houses, and land plots. Listings are placed by private individuals, developers, and agencies, so the same property is often represented by several offers. Most often several agencies are trying to sell an apartment at once, and sometimes the owner as well.
Duplicate ads annoy users at best and mislead them at worst. They also prevent the Yandex team from collecting per-apartment analytics and counting how many apartments are being sold or have been sold. So we want to learn to find duplicates and merge them into a single offer.
The flow of ads is far too large for manual moderation, so we need an algorithm that finds as many duplicates as possible with high precision. Precision is important because the cost of an error is high: merging different ads together will make users complain.
Tasks with such strict requirements and such a complex data structure are traditionally solved with machine learning, so in practice the task was formulated as training a state-of-the-art classifier.
Problems
- The subject area was new to us, with its own difficulties and peculiarities.
- There was no labeled data at all.
- There was no explicit machine learning formulation: what are the factors and the target variable here?
The last point is relatively simple: the factors are information about a pair of objects from different listings, and the target variable is whether they are in reality one object or two different ones. But studying the workings of the real estate market and labeling the data took up most of the project time.
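To make the formulation concrete, here is a minimal sketch of how one training example could be assembled from a pair of listings; the field names and the particular features are our illustration, not the actual schema used in the project:

```python
# Illustrative only: hypothetical listing fields, not the real schema.

def pair_features(offer_a: dict, offer_b: dict) -> dict:
    """Turn two listings into one feature vector for a duplicate classifier."""
    return {
        "price_diff": abs(offer_a["price"] - offer_b["price"]),
        "price_ratio": min(offer_a["price"], offer_b["price"])
                       / max(offer_a["price"], offer_b["price"]),
        "same_floor": int(offer_a["floor"] == offer_b["floor"]),
        "area_diff": abs(offer_a["area"] - offer_b["area"]),
        "same_rooms": int(offer_a["rooms"] == offer_b["rooms"]),
    }

a = {"price": 9_500_000, "floor": 7, "area": 54.0, "rooms": 2}
b = {"price": 9_450_000, "floor": 7, "area": 54.0, "rooms": 2}
x = pair_features(a, b)   # the factors
y = 1                     # the target: 1 if both ads describe the same flat
```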
Data labeling
We received part of the database of offers for the sale of apartments in Moscow. The basic data describing them:
- General structured data: area in square meters, price, floor, number of rooms and bathrooms, ceiling height, meta-information about the seller, and so on.
- Text description of the object.
- Photos of the object.
Yandex already had a duplicate classifier trained on the factors from the first item, without any labeled data: a clustering algorithm that treats offers falling into the same cluster as duplicates. It had fairly high precision but rather low recall, meaning it was rarely mistaken but found only a small share of the duplicates.
We used the idea of comparing offers to each other through differences and ratios of key indicators, such as price or floor, to obtain an empirical measure of how dissimilar two ads are. We came up with a function that maps a pair of ads to a single number: a measure of how much the two ads differ in the primary data. This score helped us build a balanced sample and at least roughly control the distribution of examples: more nearly identical pairs, clearly different pairs, or hard examples somewhere in the middle.
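As a rough sketch of such a function (the set of indicators and the equal weighting are our illustrative assumptions, not the exact formula used in the project):

```python
# A sketch of an empirical dissimilarity score: it folds differences and
# ratios of key indicators into a single number. Keys and weights are
# illustrative assumptions.

def dissimilarity(a: dict, b: dict, keys=("price", "area", "floor")) -> float:
    score = 0.0
    for key in keys:
        lo, hi = sorted((a[key], b[key]))
        # Relative difference: 0 for identical values, approaching 1
        # as the values diverge.
        score += 1.0 - lo / hi if hi else 0.0
    return score / len(keys)

# The score can then stratify pair sampling: near-identical pairs,
# clearly different pairs, and hard cases in the middle.
print(dissimilarity({"price": 9_500_000, "area": 54, "floor": 7},
                    {"price": 9_450_000, "area": 54, "floor": 7}))
```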
Labeling turned out to be a much harder task than we had expected, and here is why:
- Identical, uninformative descriptions of similar objects, especially in new developments: development companies post listings in batches, and only in rare cases can they be told apart, say by the lot number.
- Intentional data corruption. Real estate specialists explained to us that sellers sometimes want to hide the real floor or the apartment's actual appearance.
- Uninformative exterior photos, or similar photos of different objects.
- Different photos of the same object. Below is one of the simpler examples, but at some photos you have to stare for a long time like a detective, applying the full power of the deductive method just to decide whether it is one apartment or two different ones.


Supervised baseline
We labeled the data and trained a Random Forest on the factors from the first item alone: categorical and continuous indicators such as price, square footage, and so on. The predictors were differences and ratios of these factors, plus additional features built from placement and update times, seller information, etc. On the test data this classifier was 5-8% more precise than the conservative clustering algorithm, and its recall exceeded the previous result by 30-35%.
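Schematically, the baseline looks like this; the data below is synthetic, since the real labeled pairs cannot be reproduced here:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labeled pairs: rows are difference/ratio
# features of a listing pair, the label says whether it is the same object.
rng = np.random.default_rng(0)
X = rng.random((1000, 6))
y = (X[:, 0] < 0.1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(precision_score(y_te, pred), recall_score(y_te, pred))
```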
Encouraged by this result, we turned to the two remaining kinds of data: the textual descriptions and the photos. With the latter we largely failed, in part because we obtained the images rather late. We tried ordinary hashes to screen out common exterior photos, perceptual hashes to combat watermarks, and outputs of the upper layers of a convolutional network (ResNet-18) as additional factors, but to our surprise we did not get a significant gain in quality.
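For reference, here is roughly what those two image treatments look like, assuming the third-party imagehash package and torchvision's ResNet-18; the file names and the hash threshold are illustrative:

```python
from PIL import Image
import imagehash
import torch
from torchvision import models, transforms

img_a = Image.open("photo_a.jpg").convert("RGB")  # illustrative paths
img_b = Image.open("photo_b.jpg").convert("RGB")

# Perceptual hash: a small Hamming distance survives watermarks and resizing.
if imagehash.phash(img_a) - imagehash.phash(img_b) <= 8:
    print("photos are likely near-duplicates")

# ResNet-18 without its classification head as an embedding extractor.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()
prep = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
with torch.no_grad():
    emb_a = resnet(prep(img_a).unsqueeze(0))   # 512-dim feature vector
    emb_b = resnet(prep(img_b).unsqueeze(0))
cosine = torch.nn.functional.cosine_similarity(emb_a, emb_b).item()
```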
In our opinion, image analysis in this domain needs an even more thorough approach: careful image preprocessing, other architectures, and special loss functions all deserve attention.
The text data fared better. We applied TF-IDF vectorization to the lemmatized descriptions and used the resulting vectors as primary features. Various metrics over these vectors gave a more impressive boost in prediction quality, and the single best factor turned out to be the probability predicted by a logistic regression trained separately on these vectors.
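A condensed sketch of that text pipeline; the toy texts and the elementwise product as a pair representation are our assumptions, one common choice rather than necessarily the one used in the project:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for (already lemmatized) description pairs and labels.
texts_a = ["bright two room flat near metro", "studio with renovation"]
texts_b = ["two room flat near the metro bright", "house with large plot"]
labels = [1, 0]  # 1 = same object

vec = TfidfVectorizer()
vec.fit(texts_a + texts_b)
va, vb = vec.transform(texts_a), vec.transform(texts_b)

# Pair representation: elementwise product of the two TF-IDF vectors.
pairs = va.multiply(vb).tocsr()
meta = LogisticRegression().fit(pairs, labels)
duplicate_prob = meta.predict_proba(pairs)[:, 1]  # feature for the final model
```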
Final model
The final model, which aggregated all the features and the outputs of the other models, was CatBoost: a Yandex gradient boosting library, here trained with a special loss function, a modified F-measure. CatBoost has established itself as one of the best tools for classification tasks and integrates easily into the existing infrastructure. On the test sample the algorithm achieves 98% precision and 93% recall.
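A minimal sketch of this final stage; the project's modified F-measure loss is not reproduced here, so standard training with F1 as the evaluation metric stands in for it, and the features are synthetic:

```python
import numpy as np
from catboost import CatBoostClassifier, Pool

# Synthetic stand-in for the aggregated pairwise features, including the
# text-model probability and image similarities described above.
rng = np.random.default_rng(1)
X = rng.random((500, 10))
y = (X[:, 0] + X[:, 1] < 0.4).astype(int)

model = CatBoostClassifier(iterations=300, eval_metric="F1", verbose=False)
model.fit(Pool(X, y))
print(model.predict_proba(X[:3])[:, 1])  # duplicate probabilities
```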
We consider this a good result; whether it is just as good from a business point of view is for the marketing experts to decide :)