In the language of football: Big Data + linguistics for a World Cup widget

Most of us will be watching the World Cup. The experts may say that, as usual, our national team has little to hope for, but the beauty of world football will captivate even those who do not consider themselves fans. Have you ever thought about the beauty of rooting together? Imagine yourself in a huge global grandstand: hear what fans from different countries are saying and feeling, see the games through their eyes... Modern technologies for processing unstructured data turn this fan's dream into reality. Every minute, we fans create thousands of football tweets, Instagram photos and YouTube videos, and that is right now, before the World Cup has even started. Can you imagine what will happen during the matches? All that remains is to assemble the "World Fan Grandstand", which we will build from the materials at hand, together and quickly, below the cut.

The "World Grandstand", a real-time Big Data construction kit, is assembled on the Lego principle from several readily available parts that complement each other perfectly:

1. Content:
Every day, humanity (we think on a global scale too!) generates more than 1 billion public messages (tweets, posts, comments, photos, videos) in social media. Every social network, blog platform and other Internet service has its own rules, so we need an aggregator of public messages (the American Topsy and Gnip, the British DataSift or the Russian Brand Analytics).
2. An aggregator of the content we need:
We do not need all billion posts, only the ones about football. But in different languages. And with morphology, syntax, language detection, lemmatization and post-correction. And do not forget about real time! Shouting "Goool!" together with half the globe has to happen at the very moment the ball stretches the net, not in the morning news.

3. Machine translator:
For social media posts. Go ahead and laugh. The choice is classic: Google Translate or Translate.ru.

4. Team:
A programmer to wire the APIs together and a good front-end developer: you cannot do without the creators!

And here is the result of several days of work: widgets that you can both view and embed in your website or blog:
- For the Russian-speaking audience: http://br-analytics.ru/widget-generator-theme/wc2014ru
- For cosmopolitans and those who root for the Brazilians, Spaniards, English and other favorites: br-analytics.ru/widget-generator-theme/wc2014

Below, for those who are fans of more than just football, we give more technical detail.

Content aggregation

On the Russian market, the leading provider of social media data is the Brand Analytics (BA) system, which makes it easy to configure and receive a filtered stream of thematic data in real time, taking Russian morphology and syntax into account. Unlike DataSift, BA collects not only data from social networks but also messages from blogs, forums and news portals. BA has a public API for retrieving the filtered data.

The most painstaking and costly part of such systems is setting up the filtering: key phrases, minus words (exclusion terms) and authoritative sources. Real experts took part in this work: the staff of the popular sports portal Championat.com.
The system has a bot filter, so the widget receives messages only from real users, and the profanity that comes with overflowing emotions is cut out by special filters.
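
The filtering rules described above can be illustrated with a minimal sketch. It is not the Brand Analytics implementation or its API: the `Message` type, the example word lists and the rule that an authoritative source is accepted without a key phrase are all assumptions made for illustration. Plain substring matching also stands in for the morphology-aware matching that the real system performs.

```python
from dataclasses import dataclass

# Hypothetical message record; the field names are assumptions for this sketch.
@dataclass(frozen=True)
class Message:
    author: str
    source: str
    text: str
    is_bot: bool = False

# Illustrative rule sets, not the ones tuned by the Championat.com experts.
KEY_PHRASES = {"world cup", "goal", "penalty"}
MINUS_WORDS = {"american football", "fantasy"}
TRUSTED_SOURCES = {"championat.com"}

def matches_topic(msg: Message) -> bool:
    """Return True if the message should reach the widget."""
    if msg.is_bot:
        return False  # bot filter: only real users pass
    text = msg.text.lower()
    if any(word in text for word in MINUS_WORDS):
        return False  # minus words exclude off-topic messages
    if msg.source in TRUSTED_SOURCES:
        return True   # assumption: authoritative sources need no key phrase
    return any(phrase in text for phrase in KEY_PHRASES)

stream = [
    Message("fan1", "twitter", "What a GOAL by Neymar!"),
    Message("fan2", "twitter", "My fantasy league goal tally"),
    Message("spam", "twitter", "World Cup tickets cheap!!!", is_bot=True),
]
kept = [m.author for m in stream if matches_topic(m)]
```

Checking minus words before key phrases keeps messages such as fantasy-league chatter out even when they mention a goal. The order of the checks is a design choice of this sketch.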

Translation difficulties

Translate.ru was chosen as the translator: besides a simple API, by the time of the World Cup it had acquired a set of specialized linguistic modules and dictionaries that significantly improved translation quality. Four languages were chosen for machine translation, the most common in the context of the World Cup: Portuguese (that is, Brazilian), Spanish, English and Russian.
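
As a rough sketch of how one post could be fanned out into the four widget languages, consider the function below. It does not call the Translate.ru API. The `translate` argument is a hypothetical callable supplied by the caller, and the language codes and the function name `fan_out` are assumptions made for illustration.

```python
from typing import Callable, Dict

# The four widget languages named in the text, as assumed ISO 639-1 codes.
LANGUAGES = ("pt", "es", "en", "ru")

# Hypothetical translator signature: (text, source_lang, target_lang) -> text.
Translator = Callable[[str, str, str], str]

def fan_out(text: str, source_lang: str, translate: Translator) -> Dict[str, str]:
    """Return the post in every widget language, keyed by language code."""
    if source_lang not in LANGUAGES:
        raise ValueError(f"unsupported source language: {source_lang}")
    versions = {source_lang: text}  # the original is kept untranslated
    for target in LANGUAGES:
        if target != source_lang:
            versions[target] = translate(text, source_lang, target)
    return versions

# Stand-in translator that only tags the text, so the sketch runs offline.
def fake_translate(text: str, src: str, dst: str) -> str:
    return f"[{src}>{dst}] {text}"

versions = fan_out("Gol!", "pt", fake_translate)
```

Passing the translator in as a parameter keeps the fan-out logic independent of any particular translation service.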

Real-time processing

With the flow of new information growing daily, processing speed has become a pressing problem. At the recently concluded international conference on computational linguistics "Dialogue", several modern linguistic systems from well-known companies were presented. Measurements show that their preprocessing speed is still too low to work with a real data stream: the best systems reach tens of KB/s on a single processor, whereas in practice, working with the full stream requires speeds in the hundreds of KB/s.
The speed of our system is not ideal either, but today we can process up to 15 GB per day in a single stream (~200 kB/s). This speed is provided by an intelligent parallel computing system. A balancer of linguistic modules keeps the share of correctly processed messages high. For example, careful handling of homonymy means that computationally heavy algorithms are used only when they are really needed.
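
As a quick sanity check of the figures above, the daily volume can be converted into a sustained rate. This is a small sketch that assumes decimal units (1 GB = 10^9 bytes, 1 kB = 1000 bytes) and a load spread evenly over 24 hours. The function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

def daily_volume_to_rate_kb_s(gigabytes_per_day: float) -> float:
    """Convert a daily volume in GB into an average rate in kB/s (decimal units)."""
    bytes_per_day = gigabytes_per_day * 10**9
    return bytes_per_day / SECONDS_PER_DAY / 1000

rate = daily_volume_to_rate_kb_s(15)
print(f"{rate:.0f} kB/s")  # prints "174 kB/s"
```

The result, about 174 kB/s, is of the same order as the rounded ~200 kB/s quoted above and falls in the "hundreds of KB/s" range that a real stream requires.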

All that remains is to cheer properly for our team. Join in!

P.S. We are planning a series of publications on computational linguistics and text mining, covering technologies such as automatic sentiment detection, entity classification, lemmatization, homonymy resolution and more. If you are interested in one of these topics, or in other linguistic topics, write to us, and we will try to reveal in detail all the secrets of computerizing the "great and mighty" Russian language.

Source: https://habr.com/ru/post/225985/
