
Automatic summarization of articles in Russian

The topic of automatic text summarization (abstracting) was raised long ago, and many ways to implement it have been devised. Everyone wants to get the gist, but getting it usually means looking through a lot of material.

Ready-made libraries are hard to find, and those that exist are poorly configurable, unfinished and, most importantly, work only for English. I wanted to fix this shortcoming, and here is what came of it.
Over a couple of days I wrote several variants of the summarization algorithm, using Russian text analysis components available on the Internet, mainly AOT.

The main idea behind these approaches to summarization is to select the key sentences of the text, those that best convey the meaning of the whole text.
All three algorithms are modifications of LexRank.
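
To make the idea of picking key sentences concrete, here is a minimal sketch of a LexRank-style ranking. It is not the code behind the three algorithms described here. It assumes plain term-frequency cosine similarity (the original LexRank uses TF-IDF weights), and the names tokenize, cosine, lexrank and summarize are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    # Lowercased word tokens; \w also matches Cyrillic letters in Python 3.
    return re.findall(r"\w+", sentence.lower())


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank(sentences, damping=0.85, iterations=50):
    n = len(sentences)
    if n == 0:
        return []
    vectors = [Counter(tokenize(s)) for s in sentences]
    # Sentence similarity graph (no self-loops).
    graph = [[cosine(vectors[i], vectors[j]) if i != j else 0.0
              for j in range(n)] for i in range(n)]
    # Make each row sum to 1; a sentence similar to nothing links to all equally.
    for row in graph:
        total = sum(row)
        row[:] = [v / total for v in row] if total else [1.0 / n] * n
    # Power iteration, as in PageRank.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n
                  + damping * sum(scores[i] * graph[i][j] for i in range(n))
                  for j in range(n)]
    return scores


def summarize(sentences, top_k=2):
    # Keep the top_k highest-scoring sentences, in their original order.
    scores = lexrank(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(best)]
```

Calling summarize(sentences, top_k=3) on a list of sentences returns the three sentences that are most similar to the rest of the text. For Russian, a real implementation would likely reduce words to their base forms first (for example with AOT morphology), since inflected forms otherwise count as different words.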
In my case, summarization works along three lines:
1. sentences (the algorithm splits the text into sentences using some heuristics, so not every period is treated as a sentence end)
2. keywords: nouns (a POS tagger based on AOT morphology is used to extract them)
3. actions: object-action-subject triples (also using the POS tagger on AOT)
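
The sentence splitting mentioned in item 1 could look roughly like the sketch below. The actual heuristics are not described in this post, so everything here is an assumption: the abbreviation list, the rule for single-letter initials, and the name split_sentences are all hypothetical.

```python
import re

# Hypothetical list of abbreviations after which a period does not end a sentence.
ABBREVIATIONS = {"г", "ул", "им", "см", "стр", "рис", "тыс", "млн", "руб"}

# Candidate boundary: end punctuation, whitespace, then an uppercase letter or opening quote.
BOUNDARY = re.compile(r'([.!?]+)\s+(?=[A-ZА-ЯЁ«"])')


def split_sentences(text):
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        words = re.findall(r"\w+", text[start:match.start()])
        last_word = words[-1] if words else ""
        if match.group(1) == ".":
            # Skip periods after known abbreviations and single-letter initials.
            if last_word.lower() in ABBREVIATIONS:
                continue
            if len(last_word) == 1 and last_word.isalpha():
                continue
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

A known weakness of this kind of rule: when an abbreviation really does end a sentence (for example "... в 1799 г. Это ..."), the boundary is missed, so the list of abbreviations has to be chosen with care.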

Do you think any of the algorithms is good enough?
I plan to add:
1. an API,
2. summarization of RSS feeds,
3. summarization by time interval (by day, by week),
4. browser plugins that highlight the key sentences.

If anyone is willing to help build these things, write to me.

update1:
Added a JSON output format; to get it, add the parameter json=true.
update2:
Statistics collected via Google Forms so far (210 voters):
Algorithms 1-3: 77% (the first one: 50%)
All bad: 23%

Keywords proved useful for 70% of respondents.

Not bad at all, if you ask me :)

Source: https://habr.com/ru/post/118609/

