
From experiment to product: Hadoop and Big Data

Today we discuss cloud infrastructure and the integration of Big Data capabilities into traditional IT systems. This review draws on notes by several industry experts.


/ photo xdxd_vs_xdxd CC

The term “Big Data” is relatively new: it was first used in the journal Nature in 2008. That issue (September 3) proposed using the term for a set of special methods and tools for processing huge volumes of information and presenting it in a form that users can understand.

Proponents of this approach even claim that the combination of powerful modern technologies and the equally “powerful” volumes of information available in the digital age promises to become a formidable tool for solving almost any problem. “We just need to collect and analyze the data.”

One such expert is Andrew Warfield, CTO of Coho Data and an associate professor of computer science at the University of British Columbia, whose study is based on interviews with representatives of Fortune 500 companies. He wanted to understand the practical side of using Big Data tools: how large companies integrate huge amounts of data with existing IT infrastructure. In particular, he was interested in potential “pain points” and the specifics of data storage.

Warfield's study found that companies run many small Big Data clusters. IT leaders called the emergence of such clusters "analytics sprawl". Setting up a cluster has become a relatively simple and routine part of the business thanks to technologies such as Cloudera CDH, Docker and Hortonworks.

Today, the challenge is to combine multiple analytic tools into one well-documented and easily manageable environment that can be extended with other useful tools. Examples include configuring ETL (extract, transform, load), that is, pulling data out of existing sources, and adding analytical tools (H2O, Naiad and others).
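To make the three ETL stages concrete, here is a minimal sketch in plain Python. The CSV export, the field names and the in-memory "warehouse" list are all assumptions made for illustration; none of them come from Warfield's study or from any particular ETL product.

```python
import csv
import io

# Hypothetical raw export from an existing system (assumed format).
RAW_EXPORT = """user_id,country,amount
1,us,10.50
2,DE,n/a
3,de,4.25
"""


def extract(text):
    """Extract: read rows from the source export as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: parse numbers, normalise country codes, drop malformed rows."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed, such as "n/a"
        clean.append({
            "user_id": int(row["user_id"]),
            "country": row["country"].upper(),
            "amount": amount,
        })
    return clean


def load(rows, target):
    """Load: append cleaned rows to the target store (a plain list here)."""
    target.extend(rows)
    return len(rows)


warehouse = []  # stand-in for the analytics store
loaded = load(transform(extract(RAW_EXPORT)), warehouse)
print(loaded, warehouse)
```

In a real deployment each stage would talk to an actual source system and an actual analytics store; the point of the sketch is only the separation into three steps that can be documented and managed independently.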

This infrastructure is deployed separately from traditional IT systems, and the data is copied from enterprise storage to HDFS. After processing, the results are copied from HDFS back to the main system. This separation wastes resources and increases the operating costs of the business.
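The round trip described above can be sketched as a list of shell commands. The snippet below only builds and prints the commands; it does not run them. All paths and the job jar name are hypothetical, and the `hadoop jar` line assumes a jar whose manifest names its main class and which takes an input and an output directory as arguments.

```python
def round_trip_plan(local_input, hdfs_dir, job_jar, local_output):
    """Return the copy-in, process, copy-out commands for one analytics run."""
    hdfs_in = f"{hdfs_dir}/input"
    hdfs_out = f"{hdfs_dir}/output"
    return [
        # 1. Copy data from enterprise storage into HDFS.
        ["hdfs", "dfs", "-put", local_input, hdfs_in],
        # 2. Run the processing job on the cluster (hypothetical jar).
        ["hadoop", "jar", job_jar, hdfs_in, hdfs_out],
        # 3. Copy the results from HDFS back to the main system.
        ["hdfs", "dfs", "-get", hdfs_out, local_output],
    ]


# Hypothetical paths, chosen only for illustration.
plan = round_trip_plan(
    "/mnt/enterprise/sales.csv",
    "/analytics/run1",
    "sales-report.jar",
    "/mnt/enterprise/reports/run1",
)
for command in plan:
    print(" ".join(command))

# Two of the three steps only move data and do no analysis.
copy_steps = sum(1 for command in plan if command[0] == "hdfs")
print(f"{copy_steps} of {len(plan)} steps are copies")
```

Seen this way, the overhead is easy to spot: every run pays for two full transfers in addition to the processing itself.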


/ photo by Philip Kromer CC

Big data is often perceived as an "experimental research project". This is not a negative description, but it explains the tendency to create separate, isolated infrastructure clusters for working with big data.

Michael Jordan, a distinguished professor at the University of California at Berkeley, says that big data may turn out to be yet another media hype that thousands of researchers around the world have fallen for. The current obsession with big data can lead to the uncritical use of conclusions drawn from data of questionable statistical strength.

Part of the problem may lie in the notorious "human factor": not every analyst can work effectively in this field. Ricardo Vladimiro of Miniclip believes that to really dive into the study of data, a person must be well versed in statistics and probability theory, and must be able to run experiments, test hypotheses and visualize data.

One of the main questions facing the IT teams of large companies today is how to turn big data into a reliable and reproducible “product” (not to mention an efficient and cost-effective one), just as storage, virtual machines, databases and other infrastructure services already are.

The difficulty does not lie in choosing a cloud platform for working with Big Data. The fluidity of the related technology stack and the predominance of an “artisanal” approach to building systems (with off-the-shelf products finished by hand) will not change in the near future. For the “science project” to turn into a viable and efficient business solution, basic IT infrastructure decisions will have to be rethought.

P.S. In our blog on Habr we try to share not only our own experience of running the 1cloud virtual infrastructure service, but also to cover related areas of knowledge. Do not forget to subscribe to updates, friends!

Source: https://habr.com/ru/post/259433/

