At 1cloud, we often talk about technologies; for example, we recently wrote about machine learning and all-flash storage arrays. Today we decided to talk about Big Data. The most common definition of big data is the famous "3V" (Volume, Velocity and Variety), introduced by Gartner analyst Doug Laney in 2001.
Because of the name itself, the volume of data is sometimes treated as its most important characteristic, so many people wonder only about what size of data counts as "big". In this article we look at what really matters in big data besides size, how the concept emerged, why it is criticized, and in which areas it is used successfully.
/ Flickr / Joe Hall / CC-BY
If we talk about the size of Big Data, then, for example, David Kanter, president of Real World Technologies, believes that data can be called "big" once it no longer fits in a server's memory and weighs more than 3 terabytes. However, Gartner's official definition is much broader and covers more than volume, velocity, and variety of formats: big data is also defined as information assets that demand cost-effective, innovative forms of processing for better insight, informed decision making, and process automation.
That is why Gartner analyst Svetlana Sicular urges us to take the definition as a whole rather than dwell only on the three "V"s. Over time, by the way, the number of "V"s has grown, and today the characteristics of big data often also include Veracity, Validity, Volatility, and Variability.
A bit of history
But the history of big data begins much earlier. According to one Forbes author, the starting point can be considered 1944, when the American librarian Fremont Rider published The Scholar and the Future of the Research Library. In it he noted that the collections of American university libraries were doubling in size every 16 years, so that by 2040 the Yale University library would hold about 200 million volumes, requiring nearly 10,000 kilometers of shelving.
According to another view, awareness of the "too much data" problem came even earlier, with the 1880 census in the same America: processing and tabulating the results took 8 years, and forecasts suggested that processing the 1890 census would take even longer, so the results would not be ready before the next census began. The problem was solved by the tabulating machine invented by Herman Hollerith in 1881.
The term Big Data itself was first used (according to the Association for Computing Machinery digital library) in 1997 by Michael Cox and David Ellsworth at the 8th IEEE Visualization Conference. They described the "problem of big data" as data sets too large to fit in main memory, on local disk, or even on remote disk for visualization. And in 1998, John R. Mashey of SGI used the term Big Data in close to its current sense.
Although the problem of storing large amounts of data was recognized long ago and intensified with the rise of the Internet, the turning point came in 2003, a year in which more information was created than in all previous years combined. Around the same time, Google published its work on the Google File System and the MapReduce computational model, which formed the basis of Hadoop. Doug Cutting worked on this tool for several years as part of the Nutch project; in 2006 he joined Yahoo, and Hadoop became a separate, full-fledged project.
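To give a sense of the idea behind MapReduce, here is a minimal word-count sketch in Python: the map step emits (word, 1) pairs, and the reduce step sums the counts per key. This only imitates the model on a single machine; in a real Hadoop job both phases would run in parallel across a cluster, and the function names here are illustrative, not part of any actual API.

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce step: group pairs by key (word) and sum the counts."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

if __name__ == "__main__":
    docs = [
        "big data is not only about volume",
        "volume velocity and variety define big data",
    ]
    print(reduce_phase(map_phase(docs)))
    # e.g. {'big': 2, 'data': 2, 'volume': 2, ...}
```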
It is fair to say that big data made it possible to build search engines in the form we know them today; you can read more about this in the article by Robert X. Cringely or its translation on Habr. Back then, big data genuinely transformed the industry by making it possible to find the right pages quickly. Another important point in the history of Big Data is 2008, when Nature gave big data its modern definition: a set of special methods and tools for processing huge volumes of information and presenting the results in a form understandable to the user.
Big data or big deception?
There is a big problem with how big data is perceived today: because the technology has become so popular, it looks like a panacea that any self-respecting company must adopt. Moreover, for many people big data is synonymous with Hadoop, which leads some companies to assume that as soon as data is processed with this tool, it automatically becomes "big".
In fact, the choice of tool depends not so much on the size of the data (although that can matter) as on the specific task. A properly formulated problem may show that big data is not needed at all and that a simple analysis is far more efficient in terms of both time and money. This is why many experts criticize the Big Data phenomenon for the attention it attracts, pushing companies to follow the trend and adopt technologies that are far from necessary for everyone.
Another misplaced expectation is that big data is the key to absolutely all knowledge. In reality, to extract information you need to be able to ask the right questions. Bernard Marr, an expert in the field of big data, believes that most Big Data projects fail because companies cannot formulate a precise goal. Collecting data by itself no longer means anything: storing it has become cheaper than deleting it.
Some even believe that Big Data could fairly be called a big mistake or a big deception. A flurry of criticism hit big data after the widely publicized failure of Google Flu Trends, when the project missed the 2013 flu epidemic and overstated its scale by 140%. Researchers from Northeastern, Harvard, and Houston universities then criticized the tool, showing that over the previous two years its analysis had often produced incorrect results. One of the reasons was changes to Google search itself, which led to the collection of inconsistent data.
Analyzing big data often reveals connections between events that could not possibly have influenced each other. The number of spurious correlations grows with the amount of data being analyzed, and too much data can be as bad as too little. This does not mean that big data does not work; it means that alongside automated analysis you need scientists and specialists in the relevant narrow field who can determine which data and which analysis results have practical value and can actually be used for prediction.
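As a rough illustration of why spurious correlations multiply with data volume, here is a small simulation sketch (not from the article, and the thresholds are arbitrary): it generates completely independent random series and counts how many pairs nonetheless look "strongly" correlated. The count grows as more series are added, even though none of them are related.

```python
import numpy as np

def count_spurious_correlations(n_series, n_points=100, threshold=0.25, seed=42):
    """Generate independent random series and count how many pairs
    show a 'strong' sample correlation purely by chance."""
    rng = np.random.default_rng(seed)
    data = rng.normal(size=(n_series, n_points))   # rows = unrelated variables
    corr = np.corrcoef(data)                       # pairwise correlation matrix
    upper = np.triu_indices(n_series, k=1)         # count each pair once
    return int(np.sum(np.abs(corr[upper]) > threshold))

for n in (10, 100, 500):
    print(n, "series ->", count_spurious_correlations(n), "spurious 'strong' pairs")
```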
Big Data to the rescue
Certain problems exist in almost every field: incomplete or missing data, the absence of a single record standard, inaccuracies in the available information. Despite this, there are already many successful projects that genuinely work. We have already discussed some Big Data use cases in this article.
Today there are several large projects whose goal is to make roads safer. For example, the Tennessee Highway Patrol, together with IBM, has developed a crash prediction solution that uses data on previous accidents, arrests of drivers under the influence of alcohol or drugs, and scheduled events. And in Kentucky, a Hadoop-based analytics system was deployed that combines data from traffic sensors, social media posts, and the Waze navigation app, helping local authorities optimize snow removal costs and use de-icing agents more efficiently.
Deloitte experts are confident that by 2020 big data will completely change medicine: patients will know almost everything about their own health thanks to smart devices that collect all kinds of information, and they will take part in choosing the best available treatment, while pharmaceutical research will move to an entirely new level. With the help of big data and machine learning, it is possible to build a learning health system that, based on electronic medical records and treatment outcomes, predicts how a particular patient will respond to radiation therapy.
There is also successful experience with big data in HR. Xerox, for example, managed to reduce employee turnover by 20% thanks to Big Data: analysis showed that people without prior experience, with high social media activity, and with strong creative potential stay in the same job much longer. Cases like this lead experts to believe that big data can be used to build an employer brand, select candidates, compose interview questions, identify talented employees, and choose whom to promote.
Big data is also used in Russia. Yandex, for example, has launched a weather forecasting service that uses data from weather stations, radars, and satellites; there were even plans to use readings from barometers built into smartphones to improve forecast accuracy. In addition, many banks and the "big three" mobile operators work with big data. Initially they used such solutions only for internal purposes, but now, for example, MegaFon cooperates with the Moscow city government and Russian Railways. More about the VimpelCom (Beeline) case can be read on Habr.
Many companies have recognized the potential of data processing, but the real transition to big data comes down to how all this information can be used for the benefit of the business. Ruben Sigala, head of research at Caesars Entertainment, says in an interview with McKinsey that the main difficulty in working with big data is finding the right tool. Although awareness of the problem came long ago, and the tools have existed and improved over the years, the search for the perfect solution continues today; it is also a search for people, on whom the results of big data analysis can depend far more than on the tools.
P.S. What else we write about on the 1cloud IaaS provider blog: