Sorting an unstructured data stream

In the previous article I described how we sort companies on YPAG.RU into sections using a neural network.
Many readers asked me to describe the algorithm, so here is a general approach to sorting data.

1. Analyze the incoming text and identify its keywords. There are many keyword-extraction algorithms; I used Zipf's laws, a subject on which, incidentally, I had to write a thesis project.

2. Once the keywords are identified, run a relevance-ranked search for them in the database of already structured documents.

3. The 20 most relevant documents are selected and a section rating is built from them. The most popular sections in this sample are then chosen. The threshold is configured individually; ours is more than 5.

4. What remains is the document's position within its section on YPAG.RU. It is calculated as follows: take the positions of the found documents that belong to the section and average them. If visitors show interest in the company, its position gradually rises.
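The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code that runs on YPAG.RU. All names here (STOP_WORDS, extract_keywords, search, pick_sections, initial_position, classify) and the document fields "keywords", "section" and "position" are hypothetical. It also makes several assumptions: plain frequency ranking with a stop-word filter stands in for the Zipf-based keyword extraction, relevance is just the number of shared keywords, and the threshold "more than 5" is read as "more than 5 of the top 20 documents belong to the section".

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "we", "is", "our"}


def extract_keywords(text, max_keywords=5):
    """Step 1 (simplified): rank words by frequency and keep the top ones.

    Assumption: this frequency ranking only approximates the Zipf-based
    extraction mentioned in the article.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(max_keywords)]


def search(keywords, corpus, top_n=20):
    """Step 2: return up to top_n documents sharing a keyword, best first.

    Assumption: relevance is the count of shared keywords.
    """
    query = set(keywords)
    scored = []
    for doc in corpus:
        score = len(query & set(doc["keywords"]))
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def pick_sections(hits, min_votes=5):
    """Step 3: keep sections that occur in more than min_votes of the hits."""
    votes = Counter(doc["section"] for doc in hits)
    return [section for section, n in votes.most_common() if n > min_votes]


def initial_position(hits, section):
    """Step 4: average position of the found documents in this section."""
    positions = [doc["position"] for doc in hits if doc["section"] == section]
    if not positions:
        return None
    return sum(positions) / len(positions)


def classify(text, corpus, top_n=20, min_votes=5):
    """Run steps 1-4 and map each chosen section to a starting position."""
    hits = search(extract_keywords(text), corpus, top_n)
    return {s: initial_position(hits, s) for s in pick_sections(hits, min_votes)}
```

Here corpus is a list of dictionaries such as {"keywords": ["pipes", "repair"], "section": "plumbing", "position": 3}. A text whose keywords match nothing in the corpus yields an empty result, which is a reasonable signal to send the document for manual review.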

This way you can structure your data efficiently. The error rate is 3-5%.
The main problems arise when the text is vaguely worded, for example "bulk purchases": it is not clear what is being bought, or how.

Source: https://habr.com/ru/post/93882/
