
Classifying ads from social networks: looking for a better solution



I'll tell you how text classification helped me find an apartment, and why I abandoned regular expressions and neural networks in favor of a lexical analyzer.

About a year ago I needed to find an apartment to rent. Most ads from individuals are published on social networks, where they are written in free form and there are no search filters. Manually browsing publications in different communities is slow and ineffective.

At that time there were already several services that collected ads from social networks and published them on a single site, so you could see all the ads in one place. Unfortunately, they also had no filters by ad type or price. So after a while I decided to create my own service with the functionality I needed.

Text classification


First Attempt (RegExp)


At first I tried to solve the problem head-on with regular expressions.

Besides writing the regular expressions themselves, I also had to post-process the results: take into account the number of matches and their positions relative to each other. Splitting the text into sentences was a problem: it was impossible to separate one sentence from another, so the whole text had to be processed at once.
As the regular expressions and the result processing grew more complex, it became harder and harder to raise the percentage of correct answers on the test sample.

Regular expressions used in tests
[The original list of about ten regular expressions contained Cyrillic keywords that were lost in conversion; only skeletons such as '/(((^|\D)1\D{0,30}...))/u' survive, matching the digits 1-4 (the number of rooms) within short character windows.]


On the test set, this method gave 72.61% correct answers.
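The original patterns relied on Russian keywords that were lost above, but the idea is easy to illustrate. Here is a minimal sketch in Go (my reconstruction, not the original code; the English keywords are hypothetical stand-ins for the Russian ones): run one regex per ad type, score each by the number of matches and the position of the first match, and take the best-scoring type.

 package main

 import (
 	"fmt"
 	"regexp"
 )

 // One pattern per ad type; the keywords are English stand-ins
 // for the original Russian ones.
 var patterns = map[string]*regexp.Regexp{
 	"room":   regexp.MustCompile(`(?i)\broom for rent\b`),
 	"1-room": regexp.MustCompile(`(?i)(^|\D)1\D{0,30}room`),
 	"2-room": regexp.MustCompile(`(?i)(^|\D)2\D{0,30}room`),
 	"3-room": regexp.MustCompile(`(?i)(^|\D)3\D{0,30}room`),
 }

 // classify scores each type by match count and first-match position
 // and returns the best one.
 func classify(text string) string {
 	best, bestScore := "unknown", 0.0
 	for typ, re := range patterns {
 		locs := re.FindAllStringIndex(text, -1)
 		if locs == nil {
 			continue
 		}
 		// More matches and earlier first positions score higher.
 		score := float64(len(locs)) + 1.0/float64(1+locs[0][0])
 		if score > bestScore {
 			best, bestScore = typ, score
 		}
 	}
 	return best
 }

 func main() {
 	fmt.Println(classify("Renting out a 2 room apartment near the metro")) // 2-room
 }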

Second Attempt (Neural Networks)


Lately it has become fashionable to apply machine learning to everything. Once a network is trained, it is difficult or even impossible to say why it made a particular decision, but that does not stop neural networks from being used successfully for text classification. For the tests I used a multilayer perceptron trained with error backpropagation.

The following ready-made neural network libraries were used:

FANN, written in C
Brain, written in JavaScript

Texts of different lengths had to be converted into a fixed-size representation so they could be fed to a neural network with a constant number of inputs.

To do this, n-grams longer than 2 characters that appeared in more than 15% of the texts were extracted from the whole test sample. There were a little more than 200 of them.

N-gram example
[The example n-grams were Cyrillic substrings and were lost in conversion.]

To classify a single ad, the n-grams were searched for in its text, their positions were determined, and the resulting data was scaled into the range 0 to 1 and fed to the input of the neural network.
This method gave 77.13% correct answers on the test set (even though the tests were run on the same sample the network was trained on).
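Here is a sketch of how such a fixed-size input can be built (again my reconstruction in Go, using the 15% threshold and position encoding from the description above; the exact encoding the author used is not shown in the article):

 package main

 import (
 	"fmt"
 	"strings"
 )

 // frequentNGrams returns all n-grams of length n that occur in more than
 // minShare of the texts (0.15 for the 15% threshold from the article).
 func frequentNGrams(texts []string, n int, minShare float64) []string {
 	counts := map[string]int{}
 	for _, t := range texts {
 		seen := map[string]bool{}
 		r := []rune(t)
 		for i := 0; i+n <= len(r); i++ {
 			g := string(r[i : i+n])
 			if !seen[g] {
 				seen[g] = true
 				counts[g]++
 			}
 		}
 	}
 	var grams []string
 	for g, c := range counts {
 		if float64(c) > minShare*float64(len(texts)) {
 			grams = append(grams, g)
 		}
 	}
 	return grams
 }

 // features encodes a text as one value per n-gram in the range [0, 1]:
 // 0 if the n-gram is absent, closer to 1 the earlier it first occurs.
 func features(text string, grams []string) []float64 {
 	v := make([]float64, len(grams))
 	for j, g := range grams {
 		if idx := strings.Index(text, g); idx >= 0 {
 			v[j] = 1.0 - float64(idx)/float64(len(text))
 		}
 	}
 	return v
 }

 func main() {
 	corpus := []string{"rent a room", "rent a flat", "room for rent"}
 	grams := frequentNGrams(corpus, 3, 0.5)
 	fmt.Println(grams)
 	fmt.Println(features("flat for rent", grams))
 }

The vector has one slot per mined n-gram, so every ad maps to the same number of network inputs regardless of its length.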

I'm sure that with a training sample several times larger and with recurrent networks, much better results could be achieved.

Third Attempt (Parser)


At the same time, I began reading more articles about natural language processing and came across the wonderful Tomita parser from Yandex. Its main advantage over similar programs is that it works with the Russian language and has quite intelligible documentation. You can use regular expressions in its configuration, which is very useful, since I had already written some.

In essence, this is a much more advanced version of the regular-expression approach: far more powerful and convenient. Text processing problems did not go away here either. Text that users write on social networks often violates the grammatical and syntactic norms of the language, so the parser has difficulty with it: splitting the text into sentences, splitting sentences into lexemes, and converting words to their normal form.
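Because of that, the text has to be normalized before it reaches the parser. A minimal sketch of that kind of preprocessing in Go (my illustration of the general idea, not the actual rent-parser code):

 package main

 import (
 	"fmt"
 	"regexp"
 	"strings"
 )

 var (
 	spaces   = regexp.MustCompile(`\s+`)
 	sentence = regexp.MustCompile(`[.!?\n]+`)
 )

 // preprocess normalizes raw social-network text: it cuts the text into
 // rough "sentences" on punctuation and line breaks (users rarely
 // punctuate properly) and collapses runs of whitespace.
 func preprocess(raw string) []string {
 	var out []string
 	for _, s := range sentence.Split(raw, -1) {
 		s = strings.TrimSpace(spaces.ReplaceAllString(s, " "))
 		if s != "" {
 			out = append(out, s)
 		}
 	}
 	return out
 }

 func main() {
 	fmt.Println(preprocess("Renting a flat!!   50 m2\nCall +7 999 999 9999"))
 }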

Configuration example

 #encoding "utf8"
 #GRAMMAR_ROOT ROOT

 Rent -> Word<kwset=[rent, populate]>;
 Flat -> Word<kwset=[flat]> interp (+FactRent.Type="");
 AnyWordFlat -> AnyWord<kwset=~[rent, populate, studio, flat, room, neighbor, search, number, numeric]>;

 ROOT -> Rent AnyWordFlat* Flat { weight=1 };

Here the ROOT rule matches a rent-related keyword, then any number of words outside the excluded keyword sets, then a flat-related keyword.


All configurations can be found here. This method gave 93.40% correct answers on the test set. Besides classifying the text, it also extracts facts from it, such as the rental price, apartment area, nearest metro station, and telephone number.

Try the parser online.
Request (the POST body was a Russian ad text; the Cyrillic words were lost in conversion, leaving only the numbers: the area 50.4, the price 30, and the phone number):

 curl -X POST -d '  50.4 .  30   .  + 7 999 999 9999' 'http://api.socrent.ru/parse'

Answer:
 {"type":2,"phone":["9999999999"],"area":50.4,"price":30000} 

Ad types:
0 - room
1 - 1-room apartment
2 - 2-room apartment
3 - 3-room apartment
4 - apartment with 4+ rooms
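On the client side, the answer maps onto a small typed structure. A Go sketch (the field names follow the JSON keys shown above; the struct itself is my illustration, not part of the service):

 package main

 import (
 	"encoding/json"
 	"fmt"
 )

 // ParseResult mirrors the JSON answer of the /parse endpoint shown above.
 // Type follows the ad-type codes listed in the article (0 = room, 1-4 = rooms).
 type ParseResult struct {
 	Type  int      `json:"type"`
 	Phone []string `json:"phone"`
 	Area  float64  `json:"area"`
 	Price int      `json:"price"`
 }

 func main() {
 	raw := `{"type":2,"phone":["9999999999"],"area":50.4,"price":30000}`
 	var res ParseResult
 	if err := json.Unmarshal([]byte(raw), &res); err != nil {
 		panic(err)
 	}
 	fmt.Printf("%+v\n", res)
 }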


As a result, with a small test set and a need for high accuracy, it proved more effective to write the algorithms by hand.



Service development


In parallel with the text classification problem, I wrote several services to collect ads and present them in a convenient form.

github.com/mrsuh/rent-view
The service responsible for the display.
Written in NodeJS, using the doT.js templating engine and MongoDB.

github.com/mrsuh/rent-collector
The service responsible for collecting ads. Written in PHP, using the Symfony3 framework and MongoDB.
I wrote it with the intention of collecting data from various sources, but as it turned out, almost all ads are posted on the social network VKontakte. This social network has an excellent API, so it was not difficult to collect ads from the walls and discussions of public groups.
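For illustration, fetching posts from a community wall boils down to a single call to VK's wall.get method. A minimal Go sketch (my illustration, not the rent-collector code; the token and API version are assumptions you must supply for your own app, and VK's error responses are not handled):

 package collector

 import (
 	"encoding/json"
 	"fmt"
 	"net/http"
 	"net/url"
 )

 // wallGetResponse covers only the fields we need from VK's wall.get answer.
 type wallGetResponse struct {
 	Response struct {
 		Items []struct {
 			ID   int    `json:"id"`
 			Text string `json:"text"`
 		} `json:"items"`
 	} `json:"response"`
 }

 // fetchWall loads recent posts from a community wall via the VK API.
 // groupID is negative for communities.
 func fetchWall(groupID int, token, apiVersion string) ([]string, error) {
 	q := url.Values{
 		"owner_id":     {fmt.Sprint(groupID)},
 		"count":        {"100"},
 		"access_token": {token},
 		"v":            {apiVersion},
 	}
 	resp, err := http.Get("https://api.vk.com/method/wall.get?" + q.Encode())
 	if err != nil {
 		return nil, err
 	}
 	defer resp.Body.Close()

 	var data wallGetResponse
 	if err := json.NewDecoder(resp.Body).Decode(&data); err != nil {
 		return nil, err
 	}
 	texts := make([]string, 0, len(data.Response.Items))
 	for _, item := range data.Response.Items {
 		texts = append(texts, item.Text)
 	}
 	return texts, nil
 }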

github.com/mrsuh/rent-parser
The service responsible for classifying ads. Written in Golang and uses the Tomita parser. In essence, it is a wrapper around the parser, but it also performs preliminary text processing and post-processing of the parsing results.
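Since Tomita is a command-line tool, the wrapper essentially shells out to it. A minimal Go sketch (an illustration under assumptions, not the actual rent-parser code; it presumes a config that reads the document from stdin and writes the extracted facts to stdout):

 package parser

 import (
 	"bytes"
 	"os/exec"
 )

 // runTomita feeds one preprocessed ad text to the Tomita binary and
 // returns whatever the configured output format produces.
 func runTomita(text string) (string, error) {
 	cmd := exec.Command("./tomita-parser", "config.proto")
 	cmd.Stdin = bytes.NewBufferString(text)

 	var out bytes.Buffer
 	cmd.Stdout = &out
 	if err := cmd.Run(); err != nil {
 		return "", err
 	}
 	return out.String(), nil
 }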

For all services, CI is configured using Travis-CI and Ansible (I described the automatic deployment setup in a separate article).

Statistics


The service has been running for about two months for the city of St. Petersburg and has collected a little more than 8,000 ads in that time. Here are some interesting statistics on the ads for the entire period.

On average, 131.2 ads are added per day (more precisely, texts that were classified as ads).

The most active hour is 12 noon.

The most popular metro station is Devyatkino.


Conclusion: if you do not have a large sample on which to train a network, and you need high accuracy, it is best to use hand-written algorithms.

If anyone wants to tackle this problem themselves, a test set of 8,000 texts with their types is available here.

Source: https://habr.com/ru/post/328282/

