
Automatic summarization system for three languages

I want to tell you about a news-text summarization service I developed for English, Russian and German.

Automatic summarization systems are a rather niche topic and will mainly interest those involved in automatic language processing. Still, a well-built summarizer could be a useful assistant wherever one has to cope with information overload and quickly decide which information deserves a closer look.

What is the situation?


On the one hand, while searching for analogues I noticed an interesting thing: most of the articles, services, repositories, etc. that I found date back to 2012 at the latest. On Habr there is an article on automatic summarization published in 2011, and that was also the last year news summarization appeared in the track list of the TAC conference.

On the other hand, mobile applications that process news feeds and present the user with short digests on selected topics are gaining popularity. A striking example of this demand is the relatively recent (2013) acquisition of the startups Wavii and Summly by Google and Yahoo, respectively, as well as the various browser plug-ins that summarize web pages (Chrome, Mozilla).
A quick test of free online summarization services shows that most of them work in much the same way and produce equally mediocre (bad?) results, among which Autosummarizer perhaps stands out for the better.


Why another summarizer?


The initial goal of the project was to serve as a platform for learning programming in general and Python programming in particular. Since computational linguistics is close to my line of work, I chose summarization as the object of development; besides, I already had some ideas and materials on it.

If you walk through the services mentioned above, you can see that they mostly work with English texts (if you can get them to work at all). Another language can be chosen in MEAD, OTS, Essential-mining, Aylien, and Swesum. At the same time, the first has no web interface, the third requires registration after 10 test texts, and the fourth, although it lets you adjust settings in the demo, for some reason refuses to summarize anything.

Once I got something worthwhile working for English texts, I wanted to build a service that would handle Russian and German news articles, perform no worse than the services listed above, and also make it possible to compare the developed algorithm with methods that are popular today, such as TextRank, LexRank, LSA and others. Besides, it was a good opportunity to practice HTML, CSS and WSGI.

Where to look?


Project website: t-CONSPECTUS

How does it work?


t-CONSPECTUS is an extractive summarizer, i.e. it builds the summary from those sentences of the original article that received the greatest weight during analysis and therefore best convey the content.

The whole summarization process is carried out in four stages: text preprocessing, term weighting, sentence weighting, and extraction of the most significant sentences.

During preprocessing the text is split into paragraphs and sentences, the title is detected (it is needed for adjusting term weights), and tokenization and stemming are performed. For Russian and German, lemmatization is carried out instead: pymorphy2 lemmatizes Russian texts, while for German I had to write my own lemmatization function based on the CDG parser lexicon, because neither NLTK, nor Pattern, nor TextBlob German, nor FreeLing provided the necessary functionality, and the chosen hosting does not support Java, which ruled out Stanford NLP.
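Roughly, this stage could look like the sketch below. It is not the actual project code: NLTK handles sentence and word tokenization, Snowball stemming stands in for English, and pymorphy2 for Russian lemmatization; the custom German lemmatizer built on the CDG lexicon is only hinted at in a comment.

```python
# A rough preprocessing sketch (not the actual t-CONSPECTUS code): split the text
# into sentences, tokenize, then stem English tokens or lemmatize Russian ones.
import nltk
from nltk.stem.snowball import SnowballStemmer
import pymorphy2

stemmer = SnowballStemmer("english")
morph = pymorphy2.MorphAnalyzer()   # Russian lemmatizer

def preprocess(text, lang="en"):
    sentences = nltk.sent_tokenize(text)
    processed = []
    for sent in sentences:
        tokens = [t.lower() for t in nltk.word_tokenize(sent) if t.isalpha()]
        if lang == "ru":
            tokens = [morph.parse(t)[0].normal_form for t in tokens]
        else:
            # German would need a custom lemmatizer (e.g. one built on the CDG lexicon)
            tokens = [stemmer.stem(t) for t in tokens]
        processed.append(tokens)
    return sentences, processed
```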

At the term-weighting stage, keywords are determined using TF-IDF. A term gets an additional weighting factor (see the sketch after the list) if it:

  1. appears in the title;
  2. appears in the first or last sentence of a paragraph;
  3. appears in an exclamatory or interrogative sentence;
  4. is a proper name.
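The article does not give the exact boosting coefficients, so the sketch below is only illustrative: scikit-learn's TfidfVectorizer stands in for the project's own TF-IDF code, the boost value is invented, and only the first two bonuses are shown.

```python
# Illustrative term-weighting sketch: TF-IDF scores plus heuristic boosts.
# The boost value is made up for the example; the article does not specify it.
from sklearn.feature_extraction.text import TfidfVectorizer

def weight_terms(sentences, title, boost=1.5):
    vectorizer = TfidfVectorizer(lowercase=True)
    tfidf = vectorizer.fit_transform(sentences)          # one row per sentence
    scores = dict(zip(vectorizer.get_feature_names_out(),
                      tfidf.max(axis=0).toarray().ravel()))

    title_words = set(title.lower().split())
    # crude stand-in for "first/last sentence of each paragraph"
    first_last = set((sentences[0] + " " + sentences[-1]).lower().split())

    for term in scores:
        if term in title_words:
            scores[term] *= boost          # bonus: term appears in the title
        if term in first_last:
            scores[term] *= boost          # bonus: term appears in first/last sentences
        # !/? sentences and proper names would be boosted in a similar way
    return scores
```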

Sentences are weighted using the symmetric summarization method.

A detailed description is given in: Yatsko V.A. Symmetric summarization: theoretical foundations and methods // Scientific and Technical Information. Ser. 2. 2002. No. 5.

With this approach the weight of a sentence is defined as the number of links between that sentence and the sentences to its left and right. A link is a keyword shared by the sentence and another sentence; the sum of left and right links gives the sentence weight. One limitation: the text must consist of at least three sentences.
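A minimal sketch of this link-counting scheme, assuming each sentence has already been reduced to the set of its keywords:

```python
# Symmetric summarization sketch: a sentence's weight is the number of keywords
# it shares with the sentences to its left plus those shared with the sentences
# to its right.
def symmetric_weights(keyword_sets):
    weights = []
    for i, kws in enumerate(keyword_sets):
        left = sum(len(kws & other) for other in keyword_sets[:i])
        right = sum(len(kws & other) for other in keyword_sets[i + 1:])
        weights.append(left + right)
    return weights

# e.g. symmetric_weights([{"court", "ruling"}, {"ruling", "appeal"}, {"appeal"}])
# -> [1, 2, 1]
```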

In addition, the final weight of a sentence takes into account its position in the text (in news texts the first sentence is the most informative), the presence of proper names and numerals, and the sentence length; a penalty factor reduces the weight of overly long sentences.
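The article does not state the concrete coefficients, so the following adjustment step is purely hypothetical; all bonus and penalty values are invented for illustration.

```python
# Hypothetical final-weight adjustment: position bonus, proper-name/numeral bonus,
# and a length penalty. The concrete coefficients are not from the article.
def adjust_weight(base_weight, position, tokens, max_len=30):
    weight = float(base_weight)
    if position == 0:                      # lead sentence of a news text
        weight *= 1.5
    if any(t[:1].isupper() for t in tokens[1:]) or any(t.isdigit() for t in tokens):
        weight *= 1.2                      # proper names / numerals present
    if len(tokens) > max_len:              # penalize overly long sentences
        weight *= max_len / len(tokens)
    return weight
```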

The required number of significant sentences is then taken from the list sorted by descending weight, and the extracted sentences are arranged in the order in which they appeared in the original, to preserve at least some coherence. By default the summary size is 20% of the original length.
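Selection itself then reduces to taking the top-weighted sentences and restoring their original order; a sketch assuming the default 20% compression:

```python
# Pick the highest-weighted sentences, then output them in original text order
# so the summary stays at least minimally coherent.
def extract_summary(sentences, weights, ratio=0.2):
    n = max(1, int(len(sentences) * ratio))
    top = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]
```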

What is the quality of the summaries?


The traditional approach to assessing summary quality is comparison with a human-written reference summary. The ROUGE package is by far the most popular tool for such an assessment.

Unfortunately, such reference summaries are not easy to obtain, although the DUC conference, for example, provides the results of its past summarization competitions, including human abstracts, if you go through a number of bureaucratic procedures.

So I chose two fully automatic metrics, justified and described in section 3 here (pdf), which compare a summary with the original article: cosine similarity and the Jensen–Shannon divergence.

The Jensen–Shannon divergence shows how much information is lost if the original is replaced by the summary; the closer the value is to zero, the better the quality.

The cosine coefficient, a classic IR measure, shows how close the document vectors are to each other; I used tf-idf weights of words to build the vectors. The closer the value is to 1, the better the summary matches the original in keyword density.
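Both metrics are easy to reproduce. A sketch using scikit-learn's TF-IDF vectors and scipy's Jensen–Shannon implementation (not the evaluation scripts actually used; note that scipy returns the square root of the divergence, i.e. the distance):

```python
# Compare a summary with its source article: cosine similarity of tf-idf vectors
# and Jensen-Shannon distance between the two word distributions.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import jensenshannon

def evaluate(original, summary):
    tfidf = TfidfVectorizer().fit_transform([original, summary])
    cos = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

    counts = CountVectorizer().fit_transform([original, summary]).toarray().astype(float)
    p, q = counts[0] / counts[0].sum(), counts[1] / counts[1].sum()
    js = jensenshannon(p, q)   # scipy returns the JS *distance* (sqrt of the divergence)
    return cos, js
```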

The following systems were chosen for comparison: OTS, a TextRank implementation, and a baseline that extracts sentences at random (see the tables below).


For each language, impartial users selected 5 texts from the areas of popular science (popsci), environment (environ), politics (politics), social sphere (social), and information technology (IT). Summaries of 20% of the original length were evaluated.

Table 1. English (CS = cosine similarity, closer to 1 is better; JS = Jensen–Shannon divergence, closer to 0 is better):

| Topic | t-CONSP (CS) | OTS (CS) | TextRank (CS) | Random (CS) | t-CONSP (JS) | OTS (JS) | TextRank (JS) | Random (JS) |
|---|---|---|---|---|---|---|---|---|
| popsci | 0.7981 | 0.7727 | 0.8227 | 0.5147 | 0.5253 | 0.4254 | 0.3607 | 0.4983 |
| environ | 0.9342 | 0.9331 | 0.9402 | 0.7683 | 0.3742 | 0.3741 | 0.294 | 0.4767 |
| politics | 0.9574 | 0.9274 | 0.9394 | 0.5805 | 0.4325 | 0.4171 | 0.4125 | 0.5329 |
| social | 0.7346 | 0.6381 | 0.5575 | 0.1962 | 0.3754 | 0.4286 | 0.5516 | 0.8643 |
| IT | 0.8772 | 0.8761 | 0.9218 | 0.6957 | 0.3539 | 0.3425 | 0.3383 | 0.5285 |

Table 2. German (same column layout as Table 1):

| Topic | t-CONSP (CS) | OTS (CS) | TextRank (CS) | Random (CS) | t-CONSP (JS) | OTS (JS) | TextRank (JS) | Random (JS) |
|---|---|---|---|---|---|---|---|---|
| popsci | 0.6707 | 0.6581 | 0.6699 | 0.4949 | 0.5009 | 0.461 | 0.4535 | 0.5061 |
| environ | 0.7148 | 0.6749 | 0.7512 | 0.2258 | 0.4218 | 0.4817 | 0.4028 | 0.6401 |
| politics | 0.7392 | 0.6279 | 0.6915 | 0.4971 | 0.4435 | 0.4602 | 0.4103 | 0.499 |
| social | 0.638 | 0.5015 | 0.5696 | 0.6046 | 0.4687 | 0.4881 | 0.456 | 0.444 |
| IT | 0.4858 | 0.5265 | 0.6631 | 0.4391 | 0.5146 | 0.537 | 0.4269 | 0.485 |

Table 3. Russian (same column layout as Table 1):

| Topic | t-CONSP (CS) | OTS (CS) | TextRank (CS) | Random (CS) | t-CONSP (JS) | OTS (JS) | TextRank (JS) | Random (JS) |
|---|---|---|---|---|---|---|---|---|
| popsci | 0.6005 | 0.5223 | 0.5487 | 0.4789 | 0.4681 | 0.513 | 0.5144 | 0.5967 |
| environ | 0.8745 | 0.8100 | 0.8175 | 0.7911 | 0.382 | 0.4301 | 0.4015 | 0.459 |
| politics | 0.5917 | 0.5056 | 0.5428 | 0.4964 | 0.4164 | 0.4563 | 0.4661 | 0.477 |
| social | 0.6729 | 0.6239 | 0.5337 | 0.6025 | 0.3946 | 0.4555 | 0.4821 | 0.4765 |
| IT | 0.84 | 0.7982 | 0.8038 | 0.7185 | 0.5087 | 0.4461 | 0.4136 | 0.4926 |

The texts of the original articles and the resulting summaries can be found here.

Here and here you can download third-party packages for automatic assessment of summary quality.

What's next?


Going forward, I plan to gradually improve the algorithm: take synonyms into account when searching for an article's keywords, or use something like latent Dirichlet allocation (LDA) for this purpose; decide which parts of the text need special weighting (for example, numbered lists); try adding more languages, and so on.
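If LDA does end up being used for keyword or topic detection, a typical setup (sketched here with gensim, named only as one possible library) would look roughly like this:

```python
# Possible topic-modelling step for keyword selection, sketched with gensim.
from gensim import corpora, models

def topic_keywords(tokenized_docs, num_topics=5, topn=10):
    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    # return the top words of each topic as candidate keywords
    return [[w for w, _ in lda.show_topic(t, topn=topn)] for t in range(num_topics)]
```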

On the site itself, I plan to add quality indicators to the statistics, a visual comparison of the results of the "native" algorithm against third-party ones, and so on.

Thanks for your attention!

Source: https://habr.com/ru/post/271771/

