
Krovi: Big Data as a dream. Episode 9: Why IBM was forced to buy the "Alchemists" for $100 million

In previous episodes: Big Data is not just a lot of data. Big Data is a process with positive feedback. The "Obama button" as the embodiment of rtBD & A. The philosophy of Big Data development. In this episode we talk about linguistic analysis of high-speed streams of unstructured texts and social media messages, and present Eureka, our answer to the Alchemists.

The Internet, as society currently perceives it, is an interconnected set of messages: personal correspondence in messengers, links between media articles, blog discussions, game chats, thematic series on Habr, or, as the worldview of new generations has changed, links to search engine answers after typing the query "What to do today?"

If you look closely, two things lie at the very foundation: Connections and Topics. We are not going to talk about the analytics of “connections” (that is for the NSA, whose electronic surveillance powers were refused today even by the “all-powerful” US Senate). But Thematic Analytics (which recently got its name, Brand Analytics, in a press release from Facebook and DataSift, and which has existed in Russia for 3 years as a project name) and the various delicacies associated with it make a great topic (! :-)) for a new episode.

To keep this episode short, we present, in brief, the current "threat level" and, for those who want to dig deeper, links to specific cases that required new solutions and approaches:
- The volume of communication messages generated by humanity is approaching 20 billion a day; the main flow is non-public (various messengers, mail).

- The volume of public Russian-language messages in social media (social networks, Twitter, comments in the media, blogs, forums, photo and video hosting, review sites, etc.) is 1 billion per month. The volume of "classic" editorial, "literate" media reports is less than 1% of the total data flow (up to 10 million out of 1 billion).
Open real-time statistics on social media and media data streams are available at br-analytics.ru/statistics

- Processing 30-40 million messages per day (1,000 messages per second at peak) requires new data processing techniques and algorithms. Social media streams consist of unstructured, “illiterate” (non-classical media), loosely connected messages with many spelling and punctuation errors, often ambiguous and multilingual.
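As a rough sanity check, the figures in the list above can be related to each other with simple arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the 30-day month is an assumption not stated in the text.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Figures quoted in the text above.
DAILY_LOW = 30_000_000           # messages per day, lower bound
DAILY_HIGH = 40_000_000          # messages per day, upper bound
PEAK_PER_SECOND = 1_000          # messages per second at peak
MONTHLY_PUBLIC = 1_000_000_000   # public Russian-language messages per month
MONTHLY_EDITORIAL = 10_000_000   # "classic" editorial reports per month (upper bound)

DAYS_PER_MONTH = 30              # assumption: a 30-day month


def average_rate(messages_per_day: float) -> float:
    """Average messages per second for a given daily volume."""
    return messages_per_day / SECONDS_PER_DAY


def peak_to_average(peak_per_second: float, messages_per_day: float) -> float:
    """How many times the peak rate exceeds the average rate."""
    return peak_per_second / average_rate(messages_per_day)


def monthly_to_daily(messages_per_month: float, days: int = DAYS_PER_MONTH) -> float:
    """Average daily volume implied by a monthly volume."""
    return messages_per_month / days


if __name__ == "__main__":
    for daily in (DAILY_LOW, DAILY_HIGH):
        print(
            f"{daily:,} per day: average {average_rate(daily):.0f} msg/s, "
            f"peak is {peak_to_average(PEAK_PER_SECOND, daily):.1f}x the average"
        )
    print(f"1 billion per month is about {monthly_to_daily(MONTHLY_PUBLIC):,.0f} per day")
    print(f"editorial share: {MONTHLY_EDITORIAL / MONTHLY_PUBLIC:.0%}")
```

Under these assumptions the average load comes out at roughly 350 to 460 messages per second, so the quoted peak is about two to three times the average, and 1 billion messages per month is consistent with the 30-40 million per day figure.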

Tasks and problems that need to be addressed in the modern dynamic world (practical cases of previous years):

- The “All World” action (case dated October 1, 2013) is a task of the “Operational Sociology” class: real-time monitoring of the dynamically changing reaction, shaped by popular media personalities, of an engaged and large part of society; and identification of significant, unpredictable messages that drive active spread through society, so that the organizations involved in the discussion (in this case, TV channels and mobile operators) can respond quickly.

- “Direct Line with Putin” (case of April 25, 2013) is a task of the “Obama Button” class: real-time highlighting of previously unknown active topics and determining the sentiment of each topic.

- “Love and Hate” on the map of Russia, winter 2014-2015: a study of the emotional state of 35 million social media users in all regions of Russia.

- And just today: thematic widgets for websites, built for the Ministry of Culture's special "Night of Museums" project.

From the feeds (social networks, Instagram photos, YouTube videos):

We are waiting for you on the Night of Museums at the Lumière Center 2.0. We start at 20:00 with a tour of the exhibition “Soviet Photo” from… t.co/evIDYZVltl
twitter.com - The Lumiere Center - 1 min. ago

And yesterday we went to the Night of Museums))) It was very interesting
vk.com - Elena Ivanova - 2 min. ago

Who wants to go to the Night of Museums today ?? write me or call) I'll keep you company 89260860xxx
vk.com - Nadezhda Porodzinskaya - 3 min. ago

In an hour I'm leaving home for the Night of Museums) Whoever wants to come too - write)
vk.com - Daria Klimovich - 3 min. ago

... monologues, Lydia Masterkova about Vladimir Nemukhin and about herself. We are waiting for everyone, entrance ...
instagram.com - Moscow Museum Of Modern Art - 6 min. ago

“Night of museums” in St. Petersburg: quest in the Mikhailovsky Castle, St. Petersburg, May 17, 2015
youtube.com - Today's News - 3 hours ago


Solving problems of this class required completely new approaches and solutions. Over the past 10-20 years, IBM, SAP, Microsoft, Samsung and other giants have spent billions on technologies for processing "classic" texts (media, corporate documents, archival data).

But those billions and that work do not help with the new problems. Here, whoever makes decisions faster wins (see the "Big Game" episode, megamozg.ru/company/palitrumlab/blog/14154, about Apple and Twitter fighting over suppliers of unstructured Big Data). Continuing the Big Game, IBM wrote off its previously spent funds (unlike SAP, which has spent 2 years trying to solve the problems of Russian linguistics with its European centers) and in March acquired AlchemyAPI, which already has high-speed technologies for processing billions of texts in several Western languages.

By way of "advertising within the episode," or rather, "for those who have long been looking":

Our “answer to Chamberlain” (which we mentioned in the 6th episode) followed immediately: in May 2015 we spun off the new technologies into a separate, standalone public solution that third-party companies can use: Eureka Engine (http://EurekaEngine.ru), a high-load cloud solution and an industrial API for integration into the existing or in-development technology stacks of teams, companies and organizations.

Eureka already works for RIA Novosti and Samsung, Mail.ru and RosTourism, Atonomy and Brand Analytics, and for agencies and companies in different countries. If your task is to process large streams of unstructured data (building thematic storylines for newsrooms, routing a pile of incoming documents to the right departments, detecting the language of texts, identifying named entities, etc.) - welcome!

There is always a solution, right?

Source: https://habr.com/ru/post/258607/

