
Searching for housing without intermediaries in the 21st century

I guess all of us have looked for housing at some point. Some to buy, most probably to rent. Anyone who has ever tried to find genuine offers on bulletin boards knows that it is practically impossible: there is probably no other field with this much spam. After plunging into this hell, your hands usually start itching to put your IT skills to use for the good of your fellow man. For me the result was the Sobnik project, which I want to talk about.

Sobnik is a Chrome plugin that marks ads from intermediaries on bulletin boards. So far it works only with Avito.ru; in the near future I will add Irr.ru and other large boards. Anyone who is sitting on their suitcases and cannot wait to try it is welcome to get it from the Chrome Web Store. Under the cut I will talk about the technical side of the project, its prospects, and my observations of the enemy, the intermediaries. Fans of criticizing other people's JS code are welcome too: the source code of the client part of the plugin is available on GitHub.


For sticklers for accuracy: formally, Sobnik is an “extension” and not a plugin, but I am too used to the latter term.

Why all this?


I hope the “benefit to society” is obvious, so I will go straight to the question of what is in it for me personally. The last time I looked for housing, after cursing the spam that fills the Internet and seeing enough of inventive realtors, I felt a pang of conscience. How can this be? It is the 21st century, and ships are already plowing through space. Are we programmers really incapable of coping with pathetic spammers?

Upon reflection, I ventured to assume that we are capable. Looking through a few hundred ads was enough to understand that intermediaries are easy to identify: either by the content of the ad, which is too suspicious or obviously from an agency, or by many offers sharing the same phone number. What remained was to choose the technology for testing this idea: the ads had to be parsed, stored somewhere, and analyzed. I chose Google Chrome as the parser, because getting at all the necessary information on bulletin boards requires a full-fledged browser engine with working JavaScript. For the server side I decided to try Go and MongoDB. All three were new to me, so it was a great opportunity to broaden my horizons. The result was Sobnik.

How to identify agents?


At first glance, quite simply. An accessible and reliable indicator is a phone number that appears in many ads. After all, an agent will not buy a new SIM card for every ad! In addition, some ads directly mention that the author is a realtor and wants a commission. Simple in theory, of course; in practice I had to solve many small problems:
  1. Avito and many other boards publish the phone number as an image, so the number has to be recognized.
  2. Agents actively hide their real phone numbers. They write the number in the ad text using words, letters, and special characters. All of this disguise has to be detected and undone.
  3. Some owners post many ads for the same apartment. To avoid classifying them as realtors, I have to work out whether different ads describe different properties or the same one. I did not get into address recognition; I use the ready-made geographic coordinates available on many boards.
  4. The most advanced intermediaries draw their real phone numbers on the photos of apartments. These are the hardest to identify. I did not find a reliable and easy-to-use OCR solution capable of recognizing numbers in photos, so I had to tinker and come up with a simple algorithm that determines whether a photo contains any text, and treat such ads as agency ads.
  5. The ad text often directly mentions that the author is an agent. However, since computers have not yet learned to understand speech, I have not come up with a reliable way to use this information fully. So far the plugin detects only a few of the most common and unambiguous phrases, since this criterion merely supplements the main detector based on phone numbers.
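
Items 2 and 3 of the list above can be illustrated with a rough sketch. It is written in Python for brevity and is not Sobnik's actual code: the word and look-alike tables, the function names, the coordinate rounding, and the threshold of "more than two distinct places per phone" are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical table of Russian number words; a real detector would need far more.
WORD_DIGITS = {
    "ноль": "0", "один": "1", "два": "2", "три": "3", "четыре": "4",
    "пять": "5", "шесть": "6", "семь": "7", "восемь": "8", "девять": "9",
}
# Lower-case letters that are often typed instead of similar-looking digits (assumed).
LOOKALIKES = str.maketrans({"o": "0", "о": "0", "з": "3", "l": "1", "i": "1"})


def _normalize(digits):
    """Return a 10-digit number without the leading 7/8, or None if it is not a phone."""
    if len(digits) == 11 and digits[0] in "78":
        digits = digits[1:]
    return digits if len(digits) == 10 else None


def extract_phone(text):
    """Find a phone number written with digits, number words and look-alike letters."""
    run = ""
    for token in re.findall(r"\w+", text.lower()):
        if token in WORD_DIGITS:
            run += WORD_DIGITS[token]
            continue
        candidate = token.translate(LOOKALIKES)
        # Treat a token as numeric only if it already contains a real digit
        # and becomes purely numeric after replacing look-alike letters.
        if re.search(r"[0-9]", token) and re.fullmatch(r"[0-9]+", candidate):
            run += candidate
            continue
        phone = _normalize(run)
        if phone:
            return phone
        run = ""  # an ordinary word ends the current run of digits
    return _normalize(run)


def find_agent_phones(ads, max_objects=2, precision=3):
    """ads: iterable of (phone, lat, lon). Flag phones that advertise many distinct places."""
    places = defaultdict(set)
    for phone, lat, lon in ads:
        # Rounded coordinates stand in for "the same apartment".
        places[phone].add((round(lat, precision), round(lon, precision)))
    return {phone for phone, locs in places.items() if len(locs) > max_objects}
```

Grouping by rounded coordinates means that an owner who posts the same apartment several times still counts as one property, while a phone attached to several different locations is flagged.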

Together, these techniques make it possible to identify most intermediaries automatically. This is what Avito looks like while spammers are active (the red and green circles are the result of Sobnik's work):
[image: Avito ad list marked by Sobnik]

Technical side of the project


The plugin is written in JavaScript, since the Chrome API provides everything needed for the task. The only difficulty was getting the image of the phone number. Avito serves it only for requests with the correct Referer. This header cannot be forged in the browser, and the cross-origin policy does not allow reading the data of an image loaded by the Avito page. It turned out that this protection is easy to get around: I save the page in MHTML format through the appropriate API, and then cut out of the resulting string the piece I need, which contains the image in base64 encoding. I get access to the apartment photos in the same way.
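
To show why the MHTML trick works: an MHTML file is a MIME multipart document, so every image on the saved page sits in its own part, base64-encoded and labelled with its URL. The sketch below parses such a document with Python's standard `email` module. It is only an illustration, not the plugin's code (the plugin does this in JavaScript by cutting the string); the sample document, its URLs, and the function name are made up.

```python
import base64
from email import message_from_string

# A tiny hand-made MHTML document; real saved pages are much larger.
PNG_BYTES = b"\x89PNG\r\n\x1a\n fake image data"
SAMPLE_MHTML = "\n".join([
    "MIME-Version: 1.0",
    'Content-Type: multipart/related; boundary="BOUNDARY"',
    "",
    "--BOUNDARY",
    "Content-Type: text/html",
    "Content-Location: https://example.com/ad.html",
    "",
    "<html><body><img src='https://example.com/phone.png'></body></html>",
    "--BOUNDARY",
    "Content-Type: image/png",
    "Content-Transfer-Encoding: base64",
    "Content-Location: https://example.com/phone.png",
    "",
    base64.b64encode(PNG_BYTES).decode("ascii"),
    "--BOUNDARY--",
    "",
])


def extract_images(mhtml):
    """Return {Content-Location: raw bytes} for every image part of an MHTML document."""
    message = message_from_string(mhtml)
    images = {}
    for part in message.walk():
        if part.get_content_maintype() == "image":
            # decode=True undoes the base64 Content-Transfer-Encoding.
            images[part.get("Content-Location")] = part.get_payload(decode=True)
    return images
```

Calling `extract_images(SAMPLE_MHTML)` gives the image bytes keyed by URL, without a second HTTP request and therefore without needing a Referer at all.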

Next, the information is sent to the server, which runs a program written in Go. In fact there are two programs, since all requests are handled asynchronously: one program writes every request to a queue, and the second processes the queued requests. The client has logic built in to slow down its calls to the server if the server cannot keep up with the requests. This approach should smooth out load spikes (I very much hope they will appear today). The data is stored in MongoDB.
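
The split into "accept and enqueue" and "process later", plus a client that slows down under load, might look roughly like the sketch below. It is Python rather than Go and is not the real server: the bounded in-memory queue, the "queue is full" overload signal, and the doubling and halving of the delay are assumptions, since the article does not say how the server reports that it is falling behind.

```python
import queue


def accept(requests_queue, ad):
    """First program: only enqueue the request; report False when the queue is full."""
    try:
        requests_queue.put_nowait(ad)
        return True
    except queue.Full:
        return False  # the client should treat this as "server is busy"


def drain(requests_queue, handle):
    """Second program: process everything currently queued, return how many were handled."""
    processed = 0
    while True:
        try:
            ad = requests_queue.get_nowait()
        except queue.Empty:
            return processed
        handle(ad)
        processed += 1


class Throttle:
    """Client-side pacing: back off while the server is busy, speed up when it recovers."""

    def __init__(self, base_delay=1.0, max_delay=60.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.delay = base_delay

    def update(self, server_busy):
        """Return the number of seconds to wait before the next request."""
        if server_busy:
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            self.delay = max(self.delay / 2, self.base_delay)
        return self.delay
```

Because the accepting side never does the heavy work itself, a burst of requests only lengthens the queue, and the growing client delay keeps the burst from turning into a flood.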

I have put all of this on Amazon AWS (another thing I wanted to try). So far the “Free Tier” is enough, so I do not pay for hosting.

The server API is publicly accessible, with no authorization. I suspect there will be people who want to fool around and break things, so I will introduce some kind of protection in the near future. In the end I will almost certainly arrive at registering the plugin's users, but for now I do not want to put extra barriers in front of those who want to try it.

The source code of the plugin is open. First, you cannot hide it anyway. Second, it is immediately visible what information the plugin collects, so people who understand the code will have no questions about privacy. And finally, perhaps one day there will be enthusiasts who want to take part in the development.

There is no centralized crawler for data collection. First, Avito blocks IP addresses that open on the order of a couple of hundred pages per hour. Second, I hope that once there are many users, a distributed crawler will emerge by itself: everyone opens a couple of ads, and the database fills up. However, while there are no active users, the database is empty. The main benefit of the plugin is that you do not need to open agents' ads, and if the database is empty you have to open everything. So, to give the system at least some initial momentum, I made another plugin for internal use, which quietly scans Avito's apartment rental offers in Moscow at roughly one page per minute. It cannot keep up with the spammers during peak hours, but you, dear reader, will still be able to see how Sobnik works: install it, open that section on Avito, and enjoy. I would be glad to hear suggestions on how to organize scanning of Avito on a more serious scale. I can give the crawling plugin to anyone who wants to help the project or to scan another city or section.

Realtor Observations


While running the rental scan in Moscow, I made some useful observations. All of them are quite logical and seem obvious, but Sobnik made it possible to check and confirm them visually:
  1. On business days, about 80% of ads belong to agents. Avito, by the way, actively bans many ads: of 30 ads posted per minute, at most 10 remain after an hour. Of those ten, however, the vast majority are still intermediaries.
  2. Late in the evening (after 10-11 p.m.) and on weekends there are almost no agents. Apparently they are resting after their hard days of spamming.
  3. Paid ads (on Avito they are highlighted in yellow) almost always belong to owners. So far I have seen only one agent who did not begrudge a hundred rubles to advertise an elite apartment. It is possible that this was actually the owner, who decided to pose as an agent with an exclusive listing and make some extra money (judging by rumors, such people exist).
  4. If an ad has only one or two photos, it is almost certainly an agent. Three photos: 50/50. Owners either post with no photos at all or, if they make the effort, upload at least five.
  5. If the phone number is shown on a photo or “encrypted” in the ad text, it is almost certainly an agent. Avito itself pushes them to encrypt this way, since it charges money for posting a large number of ads under the same phone number.

This list, in general, lets you filter out almost all of the garbage by eye, so if you are too lazy to install Sobnik, use it.
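
For those who prefer code to eyeballing, the photo, payment, and hidden-phone observations can be written down as a toy rule set. This is only a Python sketch of the rules of thumb listed here, not Sobnik's detector; the `Ad` fields and the order in which the rules are checked are assumptions, and the time-of-day observations are left out.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    photos: int                 # number of photos in the ad
    paid: bool = False          # highlighted (paid) ad
    phone_hidden: bool = False  # phone drawn on a photo or "encrypted" in the text


def looks_like_agent(ad):
    """Return True (probably an agent), False (probably an owner) or None (undecided)."""
    if ad.phone_hidden:
        return True
    if ad.paid:
        return False
    if ad.photos in (1, 2):
        return True
    if ad.photos == 0 or ad.photos >= 5:
        return False
    return None  # e.g. three photos: 50/50
```

For example, `looks_like_agent(Ad(photos=2))` returns `True`, while an ad with three photos stays undecided.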

Disclaimer: I have nothing against realtors. Avito has a special check box for them: tick it, and everyone immediately understands that you are an agent. And of course I am aware that in many cases an agent is simply necessary. Sobnik fights only those who spam and try to deceive you.

Prospects


I plan to develop the project in two directions:
  1. Add new boards (the next one will probably be “Hand to Hand”).
  2. Improve the accuracy and reliability of the detector.

In theory, once many boards are being actively scanned, Sobnik will be able to find the owner's original ad among its copies published by agents on other boards. Whether it will be possible to reach these heights, time will tell, along with your valuable comments, of course.

I do not plan to publish the collected ad database; it would be too brazen to steal and distribute this information. However, since Avito's financial plans do not allow them to filter out spammers themselves, Sobnik will take care of it.

I will be very glad to hear your wishes and suggestions.

UPDATE:


Since October 10 the problem of filling the database has been solved: the installed plugin automatically scans, in a separate tab, the ads that users currently need. In effect, Sobnik is now a large computer network in which every node works for a common cause. As a result, any fresh list of ads for any region is processed within a couple of minutes. Thanks to everyone who offered help, free servers, IP addresses, and Internet channels; your willingness to help makes me very happy. However, Sobnik now copes with this on its own.

Source: https://habr.com/ru/post/237869/

