Introduction
The predictable yet long-awaited change of seasons is under way. Many of your friends are looking forward to summer and are busy updating their gear. The list of absolutely essential purchases exceeds every imaginable budget for the next ten years (you still have to factor in renting a freight train to deliver it all), so online classifieds come to the rescue. Hoping to save money, you pick out the things you no longer need, put them up for sale, and wait for the calls to come in... and there are none. What went wrong? It turns out that a discerning buyer wants to know not only that “the mower is in excellent condition”, but also the engine power, the direction of grass discharge, the shaft position, the hours of operation, and so on. Without being an expert in garden equipment, how could you have foreseen all that? So you start browsing other ads on the same topic, and meanwhile your other half has already booked a barge and two cargo planes to haul everything out to the country.
Using one section of a classifieds site as an example, we will build a predictive model that helps find out exactly what people want to learn from the description of your offer and gives a very rough estimate of the number of views your ad will get.
Here I try to describe the big picture; the details are in the code and data linked at the end of the post. The article makes the following assumptions:
- The number of views is inversely proportional to the time it takes to sell the item
- For other cities (the article covers only the capital) and other sections, the analysis can be done by analogy
Dataset description
Using the Python library urllib, 3879 records were collected from a popular classifieds site. Section: dogs; city: Moscow. When selecting ads, I tried to keep only non-commercial offers of dogs free to a good home, so the breed is not specifically mentioned. The fields are:
- description - the full text of the ad
- identificator - the ad's ID number on the site
- num_counts - the number of views since the ad was posted
- price - the asking price for the animal; volunteers usually put 100 rubles or leave the price blank
- start_date - the date the ad was posted
- title - the ad's title as it appears on the listing page
The first 5 entries:

Purpose of the study
Develop a model that predicts the number of views per day from the ad description, and determine the most significant words for this section.
Data preprocessing
The num_counts field contains the number of views since the publication date, start_date. Since each ad has been online for a different length of time, the number of views must be divided by the number of days between the publication date and the date the data was collected. This gives a rough estimate of the number of views per day, which is what we will predict. For text analysis, the bag-of-words model is used. So the plan is:
- Stemming, so that different forms of the same word are not treated as different features
- The date field (start_date) stores the date as a string, so it must be converted to a proper date type for analysis
- The description field is used as the feature, so the text must be converted to a bag-of-words representation with tf-idf weighting. Stop words (prepositions, particles, etc.) are removed from the text
- After several unsuccessful attempts to fit a regression between the document-term matrix and the average number of views, I decided to split the target variable into quantile intervals and treat the task as classification (hence the tf-idf). That is, the model predicts the interval that contains the ad's average daily views. The interval boundaries are computed on the training sample only, so a function is needed that applies the same boundaries to the test sample. The whole sample cannot be binned at once, because the test data would then indirectly take part in training
- The price field is the asking price for the animal. High prices indicate the sale of a purebred animal; we are interested in non-commercial ads, so we keep only the records where price < 500 rubles or no price is given
- Train/test split. The model is trained and its parameters are tuned by grid search with cross-validation on train, and the final quality is checked on test. The main metric is accuracy
After all the transformations, the output is the document-term matrix and the target variable mean_count, split into quantile intervals (I chose 5 intervals).
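The per-day target and the train-only binning from the plan above can be sketched in plain Python. This is a minimal illustration, not the code from the author's notebook: the helper names (`views_per_day`, `quantile_edges`, `to_bin`), the date format `%Y-%m-%d`, and all sample numbers are assumptions made for the example.

```python
from bisect import bisect_left
from datetime import datetime
from statistics import quantiles

DATE_FORMAT = "%Y-%m-%d"  # assumed format; the real start_date strings may differ


def views_per_day(num_counts, start_date, fetched_on):
    """Total views divided by the number of days the ad has been online."""
    start = datetime.strptime(start_date, DATE_FORMAT)
    end = datetime.strptime(fetched_on, DATE_FORMAT)
    days_online = max((end - start).days, 1)  # treat same-day ads as one day
    return num_counts / days_online


def quantile_edges(train_values, n_bins=5):
    """Inner cut points, computed from the training sample only."""
    return quantiles(train_values, n=n_bins, method="inclusive")


def to_bin(value, edges):
    """Index of the right-closed interval (edges[i-1], edges[i]] holding value."""
    return bisect_left(edges, value)


# Hypothetical per-day view counts for the training sample.
train_mean_count = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
edges = quantile_edges(train_mean_count)
train_classes = [to_bin(v, edges) for v in train_mean_count]

# The test sample reuses the training edges; values outside the training
# range fall into the first or the last interval.
test_mean_count = [0.5, 5.0, 100.0]
test_classes = [to_bin(v, edges) for v in test_mean_count]

# 120 views over the 30 days between 1 May and 31 May.
example = views_per_day(120, "2016-05-01", "2016-05-31")
print(example, edges, test_classes)
```

Because the cut points come from the training sample alone, the test sample never influences where the class boundaries lie, which is the leakage the plan warns about.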
Exploratory analysis
The number of views per day follows a power-law distribution; perhaps this section is simply not popular:

It is interesting to look at the scatterplot between the number of words and the number of views:

You can see that shorter ads get more views. I would suggest the following explanation: in long ads, the author often describes a model of how the potential owner and the pet will interact, for example:
If you like home peace, then Romush will lie quietly at your feet and will enjoy watching a film with you, which you will then certainly discuss together over a cup of hot chocolate with cheesecakes. And with him you will be very comfortable and warm in cold evenings. If you have children and your house looks like a “children's dreamland”, then Romush will be ahead of everyone to run with a shout of “Banzai”, thereby amusing the kids, who will simply squeak with delight from their new friend!
Since people are all different, such an ad can immediately put off those who picture life with a pet differently. I am not sure this is a good thing, because the described model is the volunteer's highly subjective view, and a reader may lose interest in the ad not because the dog is a poor match but for a non-objective reason: they tried on the wrong model. The second possible reason is the description of hard life in a shelter. Life there is certainly no picnic, but an average person who reads such a text may experience severe stress and unconsciously try to forget it like a traumatic memory (this is my subjective hypothesis).
Baseline for model
The target variable was divided into 5 intervals (that is, classes):
(13.599, 324] 454
[0.0888, 1.184] 454
(5.334, 13.599] 453
(2.436, 5.334] 453
(1.184, 2.436] 453
That is, there are 454 records where the target variable falls in the interval (13.599, 324], and so on. If we always predict any one particular interval, the share of correct answers will be approximately 0.2. We take this value as the baseline that we want to improve on.
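The baseline arithmetic can be checked with a few lines of Python. This is a small sketch using the class sizes listed above; the function name `majority_baseline_accuracy` and the integer class labels 0 to 4 are assumptions made for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent class."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Class sizes from the interval table: two classes of 454 and three of 453.
labels = [0] * 454 + [1] * 453 + [2] * 453 + [3] * 453 + [4] * 454
baseline = majority_baseline_accuracy(labels)
print(round(baseline, 3))
```

With nearly equal class sizes, any constant prediction gives roughly the same accuracy, about one in five.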
Model
After several experiments, I chose a random forest as the classifier. Its parameters were tuned via grid search with five-fold cross-validation. Training takes approximately 15-20 minutes on an Intel i7. The mean cross-validation accuracy was 0.386, almost twice that of the constant prediction. On a held-out sample that had not been used anywhere before, accuracy = 0.384. The tables below show that the classifier distinguishes the extreme intervals ([0.0888, 1.184] and (13.599, 324]) better than the adjacent ones:


Perhaps the quality of the model could be improved by adding photo features to the text features. To extract features from photos, one could try convolutional neural networks, for example AlexNet.
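The text classifier described in this section (tf-idf features, a random forest, grid search with five-fold cross-validation) could be wired together in scikit-learn roughly as follows. This is a sketch under stated assumptions rather than the author's notebook: the toy texts, the two classes, and the parameter grid are made up, and stemming and stop-word removal are left out for brevity.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for the ad descriptions and their view classes (made-up data).
texts = (["friendly puppy walks on a leash"] * 10
         + ["long sad story about shelter life"] * 10)
classes = [1] * 10 + [0] * 10

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),  # bag of words with tf-idf weights
    ("forest", RandomForestClassifier(random_state=0)),
])

# Assumed grid; the article does not list the parameters that were searched.
param_grid = {
    "forest__n_estimators": [10, 50],
    "forest__max_depth": [None, 5],
}

# Five folds and the accuracy metric follow the description in the text.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(texts, classes)

print(search.best_params_)
print(round(search.best_score_, 3))
predicted = search.predict(["puppy on a leash"])
```

Putting the vectorizer inside the pipeline means tf-idf statistics are refit on the training part of every fold, so the validation fold does not leak into the features.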
Word importance
Let's look at the top 50 words by importance for the classification:

The chart agrees with intuition: people want to know the animal's age and sex, whether the dog walks on a leash, whether it suits a family or a single person better, and how it gets along with other pets. We can conclude that this is the minimum information an ad should include.
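One way to obtain such a ranking is to pair each vocabulary word with the random forest's feature importance for its column. The sketch below assumes scikit-learn's `TfidfVectorizer` and `RandomForestClassifier`; the helper `top_words` and the toy data are hypothetical, and the article does not say exactly how its chart was produced.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer


def top_words(vectorizer, forest, n=50):
    """Return the n (word, importance) pairs with the highest importance."""
    importances = forest.feature_importances_
    # vocabulary_ maps each word to its column in the document-term matrix
    pairs = [(word, importances[idx])
             for word, idx in vectorizer.vocabulary_.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


# Made-up toy data standing in for the ad descriptions and view classes.
texts = ["young female dog walks on leash", "male puppy good with cats",
         "long story about hard shelter life", "sad tale of a cold winter"] * 5
classes = [1, 1, 0, 0] * 5

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(texts)
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(matrix, classes)

top = top_words(vectorizer, forest, n=5)
for word, score in top:
    print(f"{word}: {score:.3f}")
```

These importances only say how much a word helps separate the classes, not whether it pushes an ad toward more or fewer views.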
Sources
Dataset and IPython notebook
Conclusion
We have already seen that the number of views in the “animals as a gift” category is not high, and it is even lower for shelters than for private individuals. Perhaps this is due to a lack of public awareness and to various prejudices. Here are some facts:
- Ads are posted by volunteers, whose interest is to find the best possible conditions for their charges. They are not paid for the number of animals they place. If problems arise, you can return the animal. So a volunteer has no incentive to pass off a sick pet. If the animal needs special care, this is always discussed in advance, and you can count on all (reasonable) support from the volunteer
- Shelters monitor the epidemiological situation; otherwise, given the stress and the average-quality food, all the animals would have died long ago.
- Many shelter animals used to live in a home but got lost, ran away from their owners during a car accident or some other incident, or simply became unwanted and inconvenient. That is, these are not wild wolves
- Volunteers work with every animal you see in an ad at least once a week, often more: they walk it on a leash and teach it commands, so it is in constant contact with people
- You can also take part in this.
- There are cats in shelters too.
- There are small and medium sized dogs in shelters.
If you ever want to get a pet, be sure to check whether someone is already looking at you from a photo here:
Thanks
This analysis was carried out as the final project of the course “Machine Learning and Data Mining” at DPO HSE, so many thanks to our teachers for their patience and work, as well as to my supervisor.
PS Please report any inaccuracies and typos in a private message!
UPD. As part of a Kaggle competition, the user andraszsom published an analysis of how shelter outcomes (euthanasia, adoption into a family, etc.) depend on breed, age and other features: link.