We are publishing a review of the first day of Data Science Week 2017, during which our speakers talked about the use of data analysis in real estate.

CIAN
The first case study came from Pavel Tarasov, head of machine learning at CIAN, the largest service for renting and selling real estate. The service publishes more than 65,000 new advertisements per day, of which 500 to 1,000 are fraudulent. The attackers' main goal is to collect as many calls as possible in order to pressure the client into transferring money or, in the case of unscrupulous realtors, to sell them some other product.
To solve this problem, the company actively applies machine learning using a large number of factors, from the ad description to the price, with photos being the most important features. A vivid example:

Therefore, image search algorithms are needed to detect ads with stolen photos and non-existent apartments. There are 3 main approaches:
- Locality-sensitive hashing (aHash, pHash, dHash ...)
- ORB / SIFT / SURF descriptors
- Neural network descriptors
Perceptual hashing is the most common algorithm for such problems. The idea is to compress the image to 32x32, mark each pixel as brighter or darker than the average, and then compare pictures by the Hamming distance between the resulting bit strings. Besides compression, it is also necessary to discard color, brightness, and contrast information so that, for example, a slight change in the brightness of a photo does not flip pixels relative to the average. The algorithm copes well with changes in color and brightness and with cropped photos, but performs worse with rotations, which is expected: rotation changes the arrangement of pixels.
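As an illustration, here is a minimal sketch of this idea in Python (closer to an average hash than a full pHash, which additionally applies a DCT); Pillow is assumed to be available, and the threshold is made up:

```python
from PIL import Image

HASH_SIZE = 32  # the talk mentions compressing to 32x32


def average_hash(path, size=HASH_SIZE):
    """Compress to size x size, drop color, threshold each pixel by the mean brightness."""
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return sum(b1 != b2 for b1, b2 in zip(h1, h2))


# Two photos are considered near-duplicates if the distance is below a threshold
# (the threshold here is illustrative, not CIAN's actual value).
if hamming(average_hash("ad_1.jpg"), average_hash("ad_2.jpg")) < 50:
    print("Possible duplicate photo")
```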

ORB descriptors are an algorithm based on the idea of finding key points in the pictures, calculating descriptors (hashes) for them, and then comparing the photos with each other using these points:

This approach also handles cropped photos well and copes better with rotations, but it is computationally more expensive. The main problem of the algorithm is that it relies very strongly on the geometry of the object: to it, all houses look the same, since they have a triangular roof, several windows, and so on, which results in a large number of false positives.
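A rough sketch of this matching scheme with OpenCV's ORB implementation (the number of features, the match distance, and the decision rule are assumptions for illustration):

```python
import cv2


def orb_matches(path1, path2, max_distance=40):
    """Match ORB descriptors of two images and return the number of close matches."""
    img1 = cv2.imread(path1, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(path2, cv2.IMREAD_GRAYSCALE)

    orb = cv2.ORB_create(nfeatures=500)
    _, des1 = orb.detectAndCompute(img1, None)
    _, des2 = orb.detectAndCompute(img2, None)
    if des1 is None or des2 is None:
        return 0

    # ORB descriptors are binary, so Hamming distance is the natural metric.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    return sum(1 for m in matches if m.distance < max_distance)


# Many close keypoint matches suggest the photos show the same scene,
# but, as noted above, similar house geometry can also trigger this.
print(orb_matches("ad_1.jpg", "ad_2.jpg"))
```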
The next algorithm is based on deep learning: neural network descriptors. A multilayer neural network trained on a labeled dataset is taken, and each image is run through it, producing a set of numbers for each picture at each network layer. These sets of numbers are the descriptors (as a rule, the last several layers are used).
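A minimal sketch of extracting such descriptors with a pretrained GoogLeNet from torchvision (the model choice anticipates the transfer learning discussed below; the weights argument follows the current torchvision API, and comparing photos by cosine similarity is my own illustrative choice):

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Pretrained GoogLeNet; replacing the final classifier with an identity
# turns the network into a feature extractor: its output is the activation
# of the last hidden layer, which we use as the descriptor.
model = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
model.fc = nn.Identity()
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


def descriptor(path):
    """Return a 1024-dimensional descriptor for one photo."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(x).squeeze(0)


# Compare two photos by the cosine similarity of their descriptors.
sim = torch.nn.functional.cosine_similarity(
    descriptor("ad_1.jpg"), descriptor("ad_2.jpg"), dim=0)
print(float(sim))
```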
The problem with neural network descriptors is that training a deep network requires hundreds of thousands of labeled images, several thousand per class, the houses should differ from one another, and so on; and even meeting these conditions does not guarantee that the network will not treat many different houses as the same and produce identical numbers on the last layers.
Thus, when a new ad arrives, one of the approaches described above tells us whether its photo was published on our service earlier. But a further problem arises: if an ad with this photo is already in our database, that does not always mean it is fraudulent. For example, builders of identical new buildings throughout the city may use one photo for all their ads. What do we do then?
Here neural networks and transfer learning come to the rescue again: we take an already trained network (for example, GoogleNet) and freeze the weights of all layers except the last few (depending on our learning strategy). Because GoogleNet was trained to recognize cats and dogs and cannot tell a house from a shed, we collect a sample of houses and apartments, label the data, and fine-tune the network on it. As a result, it learns to recognize what is really in the photo and can distinguish the repeated standard layouts from genuinely stolen photos.
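A hedged sketch of that transfer learning step: all pretrained layers are frozen and only a new classification head is trained on a labeled sample. The class count, dataset path, and training schedule are placeholders, not CIAN's actual setup.

```python
import torch
import torch.nn as nn
from torchvision import models, datasets, transforms

NUM_CLASSES = 4  # e.g. interior, floor plan, facade, courtyard; illustrative labels

model = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False                               # freeze pretrained layers
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)       # only this head is trained

transform = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# ImageFolder expects one subdirectory per class; "labeled_photos/" is a placeholder.
loader = torch.utils.data.DataLoader(
    datasets.ImageFolder("labeled_photos/", transform=transform),
    batch_size=32, shuffle=True)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```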
The next question: we have two ads and two identical photos, so which one is fake? The simplest option is the "first night" rule: whoever posted the photo first is right. Clearly this is not always true: for example, when a tenant changes, a landlord may reuse the same photos in a new ad, which may be posted after a fraudster has already stolen and published them. Another approach is machine learning: collect a sample of pairs of ads with the same photos but different parameters (price, description, posting time, etc.) and, having trained on this sample, identify fraudsters using all the factors at once.
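One way such a pairwise model could be set up, sketched with scikit-learn; all feature and column names here are hypothetical, since the talk does not list CIAN's actual features:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# One row per pair of ads that share a photo; "first_is_fraud" is the manually
# collected label. All column names are illustrative.
pairs = pd.read_csv("photo_pairs.csv")
features = ["price_diff", "posting_time_diff_hours",
            "description_similarity", "first_account_age_days"]

X_train, X_test, y_train, y_test = train_test_split(
    pairs[features], pairs["first_is_fraud"], test_size=0.2, random_state=42)

clf = GradientBoostingClassifier().fit(X_train, y_train)
print("holdout accuracy:", clf.score(X_test, y_test))
```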
As a result, we have a ready-made pipeline for recognizing fraudulent ads using photos.
DomKlik
The topic of using Data Science in real estate was continued by Alexey Grechishkin, head of Python development at DomKlik, a service for finding and buying real estate with a mortgage and a subsidiary of Sberbank. Alexey spoke about 3 main areas of the company's activity where machine learning is used:
- Optimization of the work schedule of employees in accordance with the flow of customers
- Transaction Forecasting
- Selection of ads for the showcase and moderation
First, the company needs to speed up the processing of customer requests. The standard is 30 minutes, while in fact the average waiting time is 4 hours. The reason is that the flow of customers is uneven, so the key task becomes planning the managers' schedule so that there are more of them during peak periods and fewer when there are almost no customers. Straight to the results:
Green time series: the actual number of applications per unit of time. Blue: the prediction on the training sample (not usable for evaluation). Red: the prediction on the test set. To obtain this accuracy, the following procedure was used: we take six months of data and shift the window back by a week, so that the most recent week, for which we already have the data, becomes the test set; then we repeat this procedure 8 times. As a result, the average coefficient of determination on the test sample is 98%, except for the last week, where incomplete and not-yet-processed data are common, so the R-squared there is lower, 92%.
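A sketch of that backtesting scheme in pandas terms; the file, column names, and the fit_and_forecast helper are placeholders (the helper stands in for the SARIMAX code shown below):

```python
import pandas as pd
from sklearn.metrics import r2_score

requests = pd.read_csv("requests.csv", parse_dates=["date"], index_col="date")["count"]

scores = []
for shift in range(8):                                   # 8 repetitions, as described above
    test_end = requests.index.max() - pd.Timedelta(weeks=shift)
    test_start = test_end - pd.Timedelta(weeks=1)
    train = requests[requests.index < test_start]
    test = requests[(requests.index >= test_start) & (requests.index <= test_end)]

    # Placeholder: fit the model on the training window, forecast the test week.
    forecast = fit_and_forecast(train, horizon=len(test))
    scores.append(r2_score(test, forecast))

print("mean R^2 on test weeks:", sum(scores) / len(scores))
```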
If we talk about the models used in the process, the main one is SARIMAX, because it takes into account both the seasonal component and exogenous variables (vacations, sick leave, etc.):
import statsmodels.api as sm

# The first model works on differenced data (d=1), the second on the levels themselves.
model = sm.tsa.statespace.SARIMAX(table_name[:], exog=Cal[:], order=(1, 1, 0),
                                  seasonal_order=(2, 1, 0, 7),
                                  enforce_stationarity=False).fit()
model2 = sm.tsa.statespace.SARIMAX(table_name[:], exog=Cal[:], order=(1, 0, 0),
                                   seasonal_order=(2, 0, 0, 7),
                                   enforce_stationarity=False).fit()

# Blend the two forecasts 2:1 for the next b periods.
forecast['forecast'] = (model.forecast(b, exog=Cal_Pred[:b]) * 2 / 3
                        + model2.forecast(b, exog=Cal_Pred[:b]) * 1 / 3)
Using two models to predict one time series comes from the fact that the first works with differences ("How much did the number of requests change this week compared with the previous one?"), while the second works directly with the previous values of the series. The predictions of the two models are then weighted in a 2:1 ratio (picked by hand, as is most often done in production) to get the final result.
Forecasting the conversion of a deal begins from the moment the client submits a mortgage application: we immediately start predicting whether the person will get to an actual purchase or drop out along the way. At the first stage of model development, we used only static factors: the quality of the manager's work, the history of their deals, the region of the deal, the client's age, and so on. The number of factors was limited and the models were unstable, so it was decided to add more informative parameters: whether the client called, came to our office, sent any documents. Thanks to this, we managed to increase the accuracy of forecasts by 30-40%, and we can now predict with 80% probability whether a purchase will happen on the first day after the application is filed. With each subsequent day the accuracy grows; in particular, at the last stages of document submission it is already 95-99% (an excellent result, given that failures often occur at the last stage). The model was built on xgboost (CatBoost was also tried, but it "did not take off").
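A hedged sketch of such a conversion model with xgboost; the feature names are illustrative, since the talk does not spell out DomKlik's exact feature set:

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

apps = pd.read_csv("mortgage_applications.csv")   # placeholder dataset
features = ["client_age", "region_code", "manager_deal_history",
            "n_calls", "visited_office", "documents_sent"]

X_train, X_test, y_train, y_test = train_test_split(
    apps[features], apps["converted"], test_size=0.2, random_state=42)

model = xgb.XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.05)
model.fit(X_train, y_train)

# Probability that the applicant reaches an actual purchase.
proba = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, proba))
```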
Finally, working with the showcase and moderating ads are also priorities for the company. Only ads from trusted agents are published on the DomKlik service; however, even they can submit duplicate photos, obscene language in the description, and so on, so it is important to identify and eliminate such cases. We also actively use speech-to-text technology: transcribing the conversation between the seller and the client, we analyze certain markers, such as whether a meeting was agreed on and whether anyone swore.
In addition, the company is trying to make apartment search easier for customers: thanks to image type recognition algorithms, the user can filter ads by photo, for example, searching only for apartments with photos of the floor plan or the courtyard.
Airbnb
Eugene Shapiro, a specialist at Airbnb based in San Francisco and a graduate of our Big Data Specialist program, concluded the conversation about the use of machine learning in real estate. Eugene spoke about the scheme for identifying and preventing fraud on the platform.

There are many types of fraud committed on the platform: account theft, phishing, fake pages and listings, payments with stolen credit cards, spam. Therefore, in order to identify fraudsters, we first need to understand what kind of action the user is performing, because the service interface allows the same action to be done in different ways. We collect various information (the client's actions, the "traces" they leave, various cookies, etc.) and, having classified the type of action performed, run machine learning models that estimate the probability that this action can be allowed (in-flow evaluation). If this probability is not high enough, we ask the user to provide some additional data to make sure they have no bad intentions (verification by phone or email). For example, if you logged into your account from Guyana, it is most likely not you (although we are not 100% sure, hence the verification).
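Purely for illustration, the in-flow decision could look roughly like this in Python; the extract_signals helper, the model, and the threshold are hypothetical, not Airbnb's actual code:

```python
def evaluate_in_flow(action, model, threshold=0.8):
    """Score a classified user action and decide whether extra verification is needed."""
    features = extract_signals(action)        # cookies, device, geo, history: placeholder helper
    p_allow = model.predict_proba([features])[0][1]
    if p_allow >= threshold:
        return "allow"
    # Not confident enough: ask the user to prove it is really them.
    return "require_verification"             # e.g. phone or email confirmation
```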
If for some reason the system misses an undesirable action, an out-of-flow evaluation is applied: ML models scan the data stores for fraudulent actions that have already been performed, which then must be "cleaned up" as quickly as possible. For example, if someone created 1,000 accounts with the same picture, we can identify and remove them en masse. Account suspension also applies here: if we are sure that an account has performed suspicious actions against other users, we block it.
Looking more closely at in-flow evaluation: all customer actions are evaluated by a rules engine named Kyoo (a phonetic spelling of Q, the judge from Star Trek), which collects data from various sources, evaluates events against a set of simple rules (for example, if you logged in from several IDs at once, something is wrong), and assigns each event a label: account stolen, payment with a stolen card, and so on. Kyoo is written in Scala and is similar to Facebook's Haxl.
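Kyoo itself is written in Scala, but the rule pattern is easy to illustrate; the rules and field names below are invented for the sketch:

```python
# Each rule looks at an event enriched with data from various sources
# and, if it fires, attaches a label to the event.
RULES = [
    (lambda e: len(e.get("login_device_ids", [])) > 3, "account_stolen"),
    (lambda e: e.get("card_seen_in_chargebacks", False), "stolen_card_payment"),
]


def label_event(event):
    """Return the labels of all simple rules that fire for this event."""
    return [label for condition, label in RULES if condition(event)]


print(label_event({"login_device_ids": ["a", "b", "c", "d"]}))  # ['account_stolen']
```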
Speaking of data sources, it is worth noting that simply querying different APIs yields data that is not very interesting (account status, listing), whereas more aggregated metrics usually matter for risk assessment: how frequently the user visits the site, how quickly pages are requested, how many payments we have seen on the client's credit cards. These signals are computed continuously and cannot be aggregated from the databases on the fly, so the best option is to precompute a number of such signals and put them into a key-value store, so that by the time an action is performed we already have a large amount of information about the user.
However, an architectural problem arises: to make a decision, we often need a signal not only over historical data but also for the current moment (more precisely, up to the end of the current day). Airbnb therefore uses a lambda architecture with two parts: the first is responsible for offline signals that can be computed in Hive, where arbitrarily complex calculations are fine as long as they fit into 24 hours; the second handles real-time events that go through Kafka into real-time aggregation.
As a result, we obtain a stable data processing pipeline that answers requests as efficiently as possible. For example: how many payments were made with this credit card over the last 7 days? In fact, this signal is a combination of what we know: the number of payments over the last 30 days (from the first part of the architecture) and over the current day (from the second part). This approach allows us both to train models and to apply them to real data.
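A toy sketch of combining the two parts of the architecture for that card-payments signal; the store layout and keys are invented for illustration:

```python
def payments_last_7_days(card_id, batch_store, realtime_store):
    """Combine precomputed (batch) daily counts with today's real-time count."""
    # Batch part: daily counts computed offline (e.g. in Hive) up to yesterday.
    daily_counts = batch_store.get(card_id, [])   # counts per day, most recent last
    batch_part = sum(daily_counts[-6:])           # previous 6 days
    # Real-time part: today's count aggregated from the Kafka stream.
    today_part = realtime_store.get(card_id, 0)
    return batch_part + today_part


# Example with plain dicts standing in for the key-value stores.
batch = {"card_42": [1, 0, 2, 0, 0, 1, 3]}
realtime = {"card_42": 2}
print(payments_last_7_days("card_42", batch, realtime))   # 8
```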
The partner of Data Science Week 2017 is MegaFon, and the info partner is Pressfeed.
Pressfeed is a way to get free publications about your company: a subscription service to journalists' requests for business representatives and PR specialists. A journalist posts a request, and you respond.
Sign up, and good luck with your work.