Krovi: Big Data as a Dream. Episode 6. BD (Bolt Data) - Fast Big Data

In previous episodes: Big Data is not just a lot of data. Big Data is a process with positive feedback. "The Obama button" as the embodiment of rtBD & A. The philosophy of Big Data development. In this episode we mention the new E-engineering for the first time: the realization of the dreams of IBM, Google and others.

Only the lazy (the screenwriters of our series included) have not yet offered an opinion on "What is Big Data?" Today, let's talk not about volumes but about the speed of data flows. The English word "bolt" has so many meanings that it is easy to read the two letters BD not as Big Data but as Bolt Data: a lightning strike, to dash off, to blurt out, to speak quickly and indistinctly.

The fashion of paying attention only to volume (Big) has already led to mass disappointment among ordinary people. Picture yet another representative of yet another portal speaking at yet another conference, say one with a resume database: "We have a really huge Big Data! 20 million resumes! Last month we moved to a new 8-64-192-core server with 4-8-32 TB of memory!"

We breathe evenly and picture Ancient Egypt: 20,000 slaves drag huge stone blocks and build the next, 105th, Pyramid of Cheops. Since the TASK defines the solution, rather than the SOLUTION inventing a task for itself, for the local Tutankhamun and his "Ancient Egyptian resume portal" such a volume of data (20 million cards) is a trifle.
Imagine the scene: scratching his fat belly in the morning, MantesumHeops-XXI steps onto the balcony and says: "Find me 5 new foot washers by evening; yesterday's had to be fed to the lions." He turns and leaves, and the work is in full swing: each of the 20,000 slaves drops his stone block, grabs 1,000 resumes and spends 20 seconds scanning each one, and by dinner the Chief Eunuch has 20-30 resumes shortlisted for interview. MantesumHeops-XXI and his hungry lions are satisfied, well fed and happy. And the slaves got a rest from dragging terabyte stones ("cores").
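The arithmetic behind the Egyptian picture is easy to check. Below is only a back-of-the-envelope sketch using the figures from the story (20,000 readers, 1,000 resumes each, 20 seconds per resume); the function name `manual_scan` is made up for illustration and assumes everyone works in parallel without breaks.

```python
def manual_scan(readers: int, items_per_reader: int, seconds_per_item: float):
    """Return (total items covered, wall-clock hours) for a fully parallel manual scan."""
    total_items = readers * items_per_reader
    # Everyone works at the same time, so elapsed time is one reader's workload.
    hours = items_per_reader * seconds_per_item / 3600
    return total_items, hours


total, hours = manual_scan(readers=20_000, items_per_reader=1_000, seconds_per_item=20)
print(f"{total:,} resumes scanned in about {hours:.1f} hours")
```

Under these assumptions the whole 20-million-card archive is covered in roughly five and a half hours, which is why the shortlist is ready by dinner.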

As you can see, the result is achieved on time and without any clever words. And whether or not someone calls this process Big Data is all the same papyrus to the ancient Egyptians. So when you see another cliché, relax and think of Ancient Egypt :-)

Today (this material was posted on Megamind on April 16th) is the next Direct Line with V. V. Putin. From a technological point of view, this task is much more interesting than the resume pyramid (we already discussed the "Obama button" in the previous episode), so for the younger scientific and technical generation, and for anyone interested in the new "Egyptians", it is a chance to discuss Bolt Data on a real example and to talk about linguistics.

Here is a graph of the reaction (recall one of the meanings of "bolt" above: to speak quickly and unintelligibly) of hundreds of thousands of Russian-speaking social media users: journalists, politicians, economists, moms, dads, grandmothers and grandchildren:

Can such a "stream of consciousness" be processed with the help of 20,000 ancient Egyptian slaves? It cannot. Only 2-3% of discussions and comments take place in widely public places (large groups on VK or FB, live text coverage by federal agencies and the media); the rest of the "voice of the people" sounds in personal accounts, addressed to friends. To watch each of a billion Twitter, Facebook or VKontakte accounts, there are not enough people on Earth.

These are the tasks we call rtBD & A - real-time Big Data & Analytics (roughly: real-time analytics of large-scale unstructured data). With "rt" everything is clear, and with BD (Big / Bolt Data) as well: a time limit has simply been introduced (radio engineering has a corresponding term). Let's open up the A, Analytics, a little. Let us set aside the problem of "listening" to millions and billions of public messages (we covered those systems in the previous episodes) and talk about the problem of "HEARING", and about the need to "understand" the language of birds, animals and people.

This is where the cool E-ngine system of modules comes in handy (the system's real name is different, of course, but until the public announcement we will stick with this one; for our series it makes no difference). On a "live stream" of data generated by millions of people, you need to:

- Determine the language of the message;
- Perform linguistic processing of the text;
- Determine that the text is about "Putin" and not about "putIN" (for those who don't know, that is the commercial fishing season);
- Classify the message (assign it to existing topics or propose a new one);
- Identify NER objects (named entities: surnames, settlements, plant names, etc.), and without relying on dictionaries (the Chelyabinsk meteorite, after all, was not in any dictionary or in Wikipedia before it fell);
- Determine the sentiment of the utterance (positive, neutral or negative), and what matters is object-level sentiment, not just sentiment "as it is usually done";
- and all sorts of other little things ...
- For dessert: the spelling and punctuation of our social media texts - well, you know yourself :-)
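The list above can be read as a chain of modules applied to every message. The sketch below is not the E-ngine; it is a toy illustration of the chaining idea only. Every stage is deliberately naive, and all names (`Message`, `TOPIC_KEYWORDS`, `process`, the word lists) are hypothetical and invented for this example.

```python
import re
from dataclasses import dataclass, field

# Hypothetical toy resources; a real system would not be keyword-based.
TOPIC_KEYWORDS = {"politics": {"putin", "president"}, "fishing": {"salmon", "catch"}}
POSITIVE = {"great", "good"}
NEGATIVE = {"bad", "awful"}


@dataclass
class Message:
    text: str
    language: str = "unknown"
    entities: list = field(default_factory=list)
    topics: list = field(default_factory=list)
    sentiment: str = "neutral"


def detect_language(msg: Message) -> Message:
    # Toy rule: a message is "cyrillic" if most of its letters are Cyrillic.
    letters = [c for c in msg.text if c.isalpha()]
    cyr = sum(1 for c in letters if "\u0400" <= c <= "\u04ff")
    msg.language = "cyrillic" if letters and cyr / len(letters) > 0.5 else "other"
    return msg


def find_entities(msg: Message) -> Message:
    # Toy dictionary-free NER: capitalised words that are not the first word.
    tokens = re.findall(r"\w+", msg.text)
    msg.entities = [t for i, t in enumerate(tokens) if i > 0 and t[0].isupper()]
    return msg


def classify(msg: Message) -> Message:
    # Match known topics by keyword overlap; otherwise flag a candidate new topic.
    words = {w.lower() for w in re.findall(r"\w+", msg.text)}
    msg.topics = [t for t, kw in TOPIC_KEYWORDS.items() if words & kw] or ["new-topic"]
    return msg


def score_sentiment(msg: Message) -> Message:
    # Message-level lexicon count only; object-level sentiment is out of scope here.
    words = [w.lower() for w in re.findall(r"\w+", msg.text)]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    msg.sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return msg


STAGES = [detect_language, find_entities, classify, score_sentiment]


def process(text: str) -> Message:
    msg = Message(text)
    for stage in STAGES:  # each module enriches the same message in turn
        msg = stage(msg)
    return msg


result = process("Great answer from Putin today")
print(result.language, result.entities, result.topics, result.sentiment)
```

The point of the design is only that every message passes through all stages in sequence, so the time budget per message is the sum of all module times.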

To sharpen the picture, a back-of-the-envelope estimate: during the 4 hours of the Direct Line, users of popular public social media (microblogs, social networks, news and comments, forums, blogs, video, reviews) generate about 8-10 million Russian-language (Cyrillic) messages (from our public real-time social media statistics). That is, for on-the-fly processing you have to handle up to 1,000 unstructured messages per second and thresh that whole flow with the E-modules.
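The estimate can be reproduced in a couple of lines. This is a rough sketch using only the figures quoted above (8-10 million messages in 4 hours); the function name `required_rate` is made up for illustration.

```python
def required_rate(messages: float, hours: float) -> float:
    """Average number of messages per second over the given period."""
    return messages / (hours * 3600)


low = required_rate(8_000_000, 4)
high = required_rate(10_000_000, 4)
print(f"average load: {low:.0f}-{high:.0f} messages per second")
```

The average comes out at roughly 560-700 messages per second; reading the "up to 1,000" figure as headroom for bursts above the average is our assumption, since the flow during a live broadcast is unlikely to be uniform.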

The "hospital average" message length on the Russian-language Internet is ~1 KB. You can estimate the required E-ngine speed yourself. For comparison, take the figures from presentations of the Compreno system (developed by our friends, the wonderful ABBYY team), a very powerful and excellent tool on which thousands of person-years of development have been spent: processing 1 KB of text takes 5-10 seconds, but the quality of the language processing is very high.
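One way to do that estimate is Little's law: the number of messages in flight equals the arrival rate multiplied by the processing time per message. The sketch below assumes the figures from the text (1,000 messages per second, ~1 KB each, 5-10 seconds per 1 KB for deep analysis) and assumes each worker handles one message at a time; the function name `workers_needed` is hypothetical.

```python
def workers_needed(rate_per_s: float, seconds_per_message: float) -> float:
    """Little's law: concurrent workers = arrival rate x time spent per message."""
    return rate_per_s * seconds_per_message


rate = 1_000  # messages per second, the target from the text
for latency in (5, 10):  # seconds per ~1 KB message for deep linguistic analysis
    workers = workers_needed(rate, latency)
    print(f"{latency} s per message -> {workers:,.0f} parallel workers")

# A single worker keeping up with the stream has this much time per message:
budget_ms = 1000 / rate
print(f"single-worker budget: {budget_ms:.0f} ms per message")
```

In other words, deep analysis at 5-10 seconds per message would need thousands of parallel workers to keep up, while a single worker would have about a millisecond per message. That gap is the "Bolt" in Bolt Data.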

So, a summary of this episode:
1. We don't fall for the already hackneyed, and sometimes even "killed", term Big Data. It is clearly headed for the fate of the proud 1990s term "Portal", which ended up in names everywhere, such as "Portal of the evening dance club in the village of Podosinoviki".
2. We squint appraisingly at the magnificent legs of the new PR woman chirping about "our petabytes" of data nobody needs. What is needed is the right data.
3. And on time.
4. The faster intelligent solutions, methods and algorithms run, the more valuable they are. Not every task can be handed to 20,000 ancient Egyptian slaves.

And between episodes, you can reflect at your leisure on the new path of "Big Blue": IBM sold its PC division to Lenovo, hooked up with Twitter, sent 10,000 employees to retrain as Data Scientists, and recently bought AlchemyAPI (a great E-ngine for several Western languages).

Against the background of the long-lived and "forever young" IBM (which throws out the old and switches swiftly to the new), consider the short life of the once great and ambitious Sun Microsystems (its wonderful servers and Java, by the way, are still very much alive), and now the fresh news that Nokia, once the Finnish world leader in mobile (its handset business was recently acquired by Microsoft), has decided to pocket the "unsinkable and eternal" Lucent / Alcatel, which even as a pair could not hold out against the Chinese.

Do not linger under beautiful signs, however "Big Data" they are called: these are just pretty, heavily promoted names. Keep moving: solve problems rather than memorize solutions. We wish you constant change and new roads - it is so interesting to find new solutions to new problems.

P.S. Does your company understand how to solve problems like the ones above in a "non-Egyptian way"? Do you feel you have the makings of a Data Scientist and understand how to "identify" a situation like the "Chelyabinsk meteorite" in 3 minutes rather than 3 hours (which is how long the press took to react)? Can you turn the detection of new techniques used by Twitter spam bots into an algorithm? Then you are on one of many paths, but definitely a right one, and you have a bright future.

In the next episodes: NoSQL or columnar DBMS; where the rumor that "data is running out" comes from; humanity as the world's garbage bin.

