An article about automatic article summarization recently appeared on Habr. As it happens, I also work on automatic text analysis and have had some success with it.
I managed to get the algorithm to find repeated and similar texts. It also automatically determines how close a text is to particular topics and picks out of the total mass those texts that belong to the mainstream. In other words, the reader does not have to sift through all the information to understand the main points. As the volume of analyzed texts grows, everything that is poor in quality, uninteresting, obscene, irrelevant and so on is sifted out automatically.
The idea of the algorithm is that each text is split into chains, the chains are compared with one another, and special markers are selected from them; decisions are then made on the basis of those markers.
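To make the idea of comparing chains more concrete, here is a minimal sketch in Python. It is not the code running on the site: it assumes that a "chain" is a run of three consecutive words and that two texts are compared by the share of chains they have in common (Jaccard similarity). The function names, the chain length and the 0.5 threshold are all illustrative assumptions.

```python
import re


def chains(text, size=3):
    """Split a text into overlapping chains of `size` consecutive words.

    Assumption: a "chain" is a word n-gram; the article does not define it.
    """
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < size:
        # Very short text: treat the whole text as a single chain.
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a, b, size=3):
    """Jaccard similarity of the chain sets of two texts, from 0.0 to 1.0."""
    chains_a, chains_b = chains(a, size), chains(b, size)
    if not chains_a or not chains_b:
        return 0.0
    return len(chains_a & chains_b) / len(chains_a | chains_b)


def is_near_duplicate(a, b, threshold=0.5, size=3):
    """Hypothetical decision rule: texts sharing enough chains count as similar."""
    return similarity(a, b, size) >= threshold
```

For example, `similarity("the cat sat on the mat", "the cat sat on the rug")` gives 0.6: the two texts share three of their five distinct chains, so with the assumed threshold they would be treated as near-duplicates.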
The analysis is fully automatic, with no moderator or editor-in-chief. Because of this the algorithm is sometimes wrong and may put a text in the wrong section, although the more likely cause is that the original set of texts is usually grouped with even less care. The algorithm becomes more accurate over time as enough statistical information accumulates.
That is not all. The algorithm is able to understand humor. If a text stands out from the general mass and is full of incongruities, the algorithm picks it out and marks it as “Humor”. It is quite good at finding jokes, and if something it picked is not funny, the likely reason is that the algorithm has been running for only a few weeks. In other words, it has not yet had time to learn that something is no longer funny.
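The "stands out from the general mass" part can be illustrated with a simple outlier check. The sketch below is only an assumption about how such a check might look, not the method used on the site: it measures how much vocabulary each text shares with all the others and flags texts whose average similarity is well below the corpus average. It detects unusual texts, not humor as such; the names and the `k=1.5` cutoff are hypothetical.

```python
import re
from statistics import mean, pstdev


def word_set(text):
    """Lowercased set of the words in a text."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Share of common elements between two sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_outliers(texts, k=1.5):
    """Return indices of texts that resemble the rest of the corpus least.

    A text is flagged when its mean similarity to all other texts lies
    more than `k` standard deviations below the corpus-wide mean.
    """
    sets = [word_set(t) for t in texts]
    if len(sets) < 3:
        return []  # too few texts to speak of a "general mass"
    scores = []
    for i, current in enumerate(sets):
        others = [jaccard(current, other)
                  for j, other in enumerate(sets) if j != i]
        scores.append(mean(others))
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return []  # every text is equally typical
    return [i for i, score in enumerate(scores) if score < mu - k * sigma]
```

Given four near-identical news lines about a central bank rate and one unrelated sentence, the function returns only the index of the unrelated sentence. Deciding whether such a text is actually funny would need the accumulated statistics described above.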
New ideas can also be found automatically. For example, in the city of Kopeisk, internet access was installed in the maternity hospital so that fathers would no longer stand under the windows shouting to their wives and trying to glimpse their child's face from afar, but could see the baby over Skype instead. Or: residents of Yalta are advised to carry a whistle and pepper spray, because the season of thefts and robberies has opened in the city. And Poland is going to advertise its apples in Russia with EU money.
The ideas it finds are not always interesting, but this too should improve as "experience" accumulates. Those who wish can already pick a suitable idea from what has been found so far.
The algorithm is up and running, and you can see it at work on the website nfos.ru. The site collects news from several sources, analyzes it and publishes whatever it sees fit to publish. I can now boast that I keep up with all the main news without any effort. I wish the same for you.
For example, did you know that a case has been opened against Navalny on suspicion of corporate raiding? Or have you heard that a record number of Osama portraits have been sold in Pakistan?
I think the algorithm is suitable not only for analyzing texts, but also for analyzing images and other unstructured data, or data with no obvious structure: for example, for filtering out noise, for decoding, for analyzing algorithms from the results of their work, and so on. The algorithm is potentially suitable for predicting stock prices and exchange rates, but I am unlikely to get to any of this in the near future, as there is not enough time.
The entire analysis algorithm fits into 40 kilobytes of PHP code, plus about 70 kilobytes of code for the front end of the news site. You must agree that this is tiny for the functionality it delivers. Where the algorithm is truly voracious is storage space. Within a few weeks more than 1.5 GB of information had accumulated in the database, and this volume keeps growing.
The algorithm is practically insensitive to failures. If inaccurate, distorted, false or faulty information gets into the database at some point, it either does not affect further analysis at all, or its effect becomes insignificant over time.
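One way such robustness can arise from accumulated statistics is shown in the sketch below. It is an illustration under assumptions, not the site's actual storage scheme: it supposes that the database keeps, for each marker, the number of documents containing it, and that decisions depend on the marker's share of all documents. The class and method names are hypothetical.

```python
from collections import Counter


class MarkerStats:
    """Running document-frequency statistics for markers (illustrative only)."""

    def __init__(self):
        self.doc_count = 0
        self.doc_freq = Counter()

    def add(self, markers):
        """Record one document; each marker is counted once per document."""
        self.doc_count += 1
        self.doc_freq.update(set(markers))

    def share(self, marker):
        """Fraction of all recorded documents that contain the marker."""
        if self.doc_count == 0:
            return 0.0
        return self.doc_freq[marker] / self.doc_count


stats = MarkerStats()
stats.add(["garbled"])          # one faulty record enters the database
for _ in range(99):
    stats.add(["election"])     # ordinary documents keep arriving
print(stats.share("garbled"))   # the faulty marker's share is now 1 in 100
```

Under this assumption a single bad record starts with a share of 1.0 and drops to 0.01 after 99 normal documents, so its influence on any share-based decision shrinks as the corpus grows.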
Finally, I want to say that the analysis did not require powerful hosting. All the news from about 150 sources is analyzed on the cheap FirstVDS-Acceleration plan, which costs 249 rubles per month. CPU time is, of course, in short supply, but that forced me to optimize the algorithm, which I managed to do without any visible losses.