
Big Data Brain

In the data world there is probably no product understood as ambiguously as Hadoop. No other product is shrouded in so many myths, legends, and, most importantly, misunderstandings on the part of its users. No less mysterious and controversial is the term "Big Data", which you sometimes want to write in yellow font (thanks to the marketers) yet pronounce with special pathos. I would like to share my thoughts on these two concepts, Hadoop and Big Data, with the community, and perhaps kick off a small holy war.
Perhaps the article will offend some and amuse others, but I hope it will leave no one indifferent.


[Image: a demonstration of Hadoop to users]


Let's start with the origins.


The first half of the 2000s, Google: we've made a great tool, a hammer, and it drives nails well. The hammer consists of a handle and a head, but we are not going to share it with you.


2006, Doug Cutting: hey folks, I've built the same hammer Google did, and it really does drive nails well; by the way, I tried driving small screws with it, and you won't believe it...


2010, Paul, 30 years old: guys, the hammer works; what's more, it drives in bolts perfectly. Of course, you have to prepare a small hole first, but the tool is very promising.


2012, Paul, 32 years old: it turns out you can fell trees with a hammer. Of course, it takes a bit longer than with an axe, but damn it, it works! And we didn't pay a penny for any of this. We also want to build a small house with the hammer. Wish us luck.


2013, Doug: we have equipped the hammer with a laser sight, so now you can throw it, and the built-in knife lets you fell trees more efficiently. All free, all for the people.


2015, Dan, 25 years old: I mow the grass with a hammer... every day. It's a bit hard, but hell, I like it; I like working with my hands!


If you dig a little deeper, Google, and later Doug, built the tool (a far from ideal one, as Google itself admitted several years later) to solve a specific class of problems: building a search index.
The tool turned out quite good, but there is one problem. However, first things first.


At the beginning of 2012, an aggressive trend began: "the era of Big Data".


From that moment on, useless articles and even books began to appear, in the style of "How to become a Big Data company" or "Big Data decides everything". Not a single conference went by without musings on "how many terabytes Big Data starts at" and recurring stories about how "one company was almost on the verge of default, but it switched to Big Data and simply broke the market". All this idle chatter was fed by competent marketing from the companies selling support for all of it: sponsored hackathons, seminars, and much, much more.
As a result, a large number of people now hold a specific picture of the world in which traditional solutions are slow, expensive, and, at the very least, no longer fashionable.
Many years have passed, but I still see discussions and articles with headlines like "MapReduce: First Steps" or "Big Data: What Does It Really Mean?" on professional resources.


Hadoop as an indexing tool


So what is Hadoop? In general terms, it is the HDFS file system plus a set of data processing tools.
Since this is a technical blog, let me just leave this picture here:


[Image: Hadoop 2 components]


All of this is spread across a cluster of "cheap iron" and, according to the marketers, should in the blink of an eye shower you with the money that "Big Data" will bring.
Large Internet companies such as Yahoo appreciated Hadoop at the time as a means of processing large volumes of information. Using MapReduce, they could build search indexes on clusters of thousands of machines.
I must say, back then it really was a breakthrough: an open-source product could solve problems of this class, and all of it for free. Yahoo also bet on the possibility that in the future it would not have to grow its own specialists but could recruit ready-made ones.


I don't know when exactly the first monkey came down from the tree, picked up a stick, and started using MapReduce for data analytics, but the fact remains: MapReduce began to appear where it is absolutely unnecessary.


Hadoop MapReduce as a tool for analytics


If you have one large table, for example of user logs, then with some stretch MR can be used to count the number of rows or unique records in it (a sketch of such a job follows this list). But the framework has fundamental flaws:
Each MapReduce stage generates a heavy load on the disks, which slows down the job as a whole, and the intermediate results of each stage are thrown away.
Initializing the "workers" takes a relatively long time, which leads to large delays even on simple queries.
The number of "mappers" and "reducers" is fixed at runtime; resources are divided between these two groups of processes, and if, say, the mappers have already finished their work, their resources are not handed over to the reducers.
All of this works more or less tolerably on simple queries, but JOINs of large tables are extremely inefficient because of the load they put on the network.
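To make that concrete, here is a minimal sketch (my own illustration, not from the original article) of counting unique users in a log with Hadoop Streaming and two Python scripts; the file names and the tab-separated log format are assumptions:

```python
#!/usr/bin/env python
# mapper.py -- reads raw log lines from stdin
# Assumed line format: "timestamp<TAB>user_id<TAB>url" (illustrative only)
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2:
        # emit the user_id as the key; the value is irrelevant here
        print("%s\t1" % fields[1])
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop delivers input sorted by key, so a key change marks a new user.
# Assumes a single reducer; otherwise each reducer reports only its own partial count.
import sys

unique_users = 0
prev_key = None
for line in sys.stdin:
    key = line.split("\t", 1)[0]
    if key != prev_key:
        unique_users += 1
        prev_key = key

print("unique_users\t%d" % unique_users)
```

The job is then submitted through the hadoop-streaming jar, which spins up a container per task and writes intermediate output to disk: two scripts, a cluster, and minutes of latency for what is a one-line SELECT COUNT(DISTINCT ...) in any SQL engine.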
Despite this whole complex of problems, MapReduce earned great popularity in the field of data analysis. When newcomers begin their acquaintance with Hadoop, the first thing they see is MapReduce; "well, ok," they say, "it has to be learned." As an analytics tool it is practically useless, but marketing has played a cruel joke with MR: user interest not only does not fade, it keeps being fueled by newcomers (I am writing this article in June 2016).
To gauge business interest in the technology, I used HeadHunter.ru as the main job-search platform.
Here are the kinds of interesting vacancies you can find on HH.ru with the keyword MapReduce:
[Image: a MapReduce-related vacancy on HH.ru]


At the time of writing there were 30 such vacancies in Moscow alone, and from respected and successful firms at that. I'll say right away that I did not analyze these postings deeply, but the dynamics are still positive: about a year ago there were even more of them.
Of course, the people who posted a vacancy may simply have worded it badly, and HeadHunter is perhaps not the best tool for this kind of analysis, but I could not find a more suitable way to measure business interest.


Spark as a tool for analytics


Of course, smart people quickly realized that there was nothing to be gained with MR, and they invented Spark, which, by the way, also lives under the wing of the ASF. Spark is MR on steroids and, as its developers claim, is more than 100 times faster than MapReduce.
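As an illustration (again my own sketch, not from the article), the same unique-user count from the MapReduce example above fits into a few lines of PySpark; the HDFS path and the tab-separated log format are assumptions:

```python
from pyspark import SparkContext

sc = SparkContext(appName="unique-users")

# Assumed log format: "timestamp<TAB>user_id<TAB>url"; the path is hypothetical
lines = sc.textFile("hdfs:///logs/access/*")

unique_users = (lines
                .map(lambda line: line.split("\t")[1])  # extract user_id
                .distinct()
                .count())

print("unique users: %d" % unique_users)
```

Intermediate data stays in memory wherever possible, which is where most of the speed-up over disk-bound MapReduce comes from.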


[Image: a spherical Spark in a vacuum is faster than MapReduce]


Spark is good in that it lacks the MR disadvantages listed above.
But here we move to another level, and new disadvantages appear:
Hard-coding and heavyweight Java code turn simple queries into a mess that nobody will be able to read later; SQL support is weak.
There is no cost-based optimization, a problem you run into when joining tables.
Spark does not understand how the data is laid out in HDFS. Although it is an MPP-style system, when joining large tables the data being joined often sits on different nodes, which puts a load on the network.
Although Spark is a good piece of work overall, the labor market may well kill it: it is very hard to find expensive Java or Scala specialists willing to hand-code your analytics, especially if you are a no-name company (say that with special pathos if you work at one).
An interesting solution was also born alongside Spark, namely Spark Streaming, and perhaps it will turn out to be the truly long-lived one.
Spark is simple, reliable, and can be deployed without Hadoop.
We'll wait and see.
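For completeness, here is a minimal Spark Streaming sketch (my own illustration; the socket source, port, and log format are assumptions) that counts events per user over one-second micro-batches:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="events-per-user")
ssc = StreamingContext(sc, 1)  # 1-second micro-batches

# Hypothetical source: "timestamp<TAB>user_id<TAB>url" lines arriving on a TCP socket
lines = ssc.socketTextStream("localhost", 9999)

counts = (lines
          .map(lambda line: (line.split("\t")[1], 1))  # (user_id, 1)
          .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's counts to the driver log

ssc.start()
ssc.awaitTermination()
```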
Spark job postings look slightly better than the MapReduce ones: they are more mature and, it seems, were written, give or take, by people who understand the subject.
[Image: a Spark-related vacancy on HH.ru]


There are 56 such postings.




And now, a few myths about Hadoop and Big Data


Myth 1. Hadoop is free
Nowadays we use a lot of open-source products and rarely stop to think about why we are not paying for them. Of course, free cheese is found only in a mousetrap, and in the end you do have to pay, especially for Hadoop.
Hadoop and everything connected with it is actively marketed under the banners of freedom, peace, and fraternity. In reality, though, few would risk running their own Hadoop build: the product is rather raw and still poorly understood by many.
A company will have to hire expensive specialists, and even they will solve problems more slowly and with more effort. In the end, instead of solving data processing tasks, employees will spend their time patching holes in raw software and building workarounds.
Of course, this does not apply to other, mature open-source products such as MySQL, Postgres, and so on, which are actively used in production systems; but even there many companies pay for professional support.
Before deciding that you need a free product, check how free it really is. It may well turn out that yesterday's students on a modern combine harvester will bring in the grain from your fields no less successfully than a group of expensive Java coders armed with free hammers.
OK, let's say Hadoop is not free, but at least it runs on cheap hardware! Wrong again. Although Hadoop can run on cheap hardware, fast and reliable work still requires proper servers; it won't fly on desktops. To work properly, Hadoop needs hardware of the same class as any other analytical MPP system. According to Cloudera's recommendations, depending on the tasks, you need:


  1. 2 CPUs with 4, 8, or 16 cores
  2. 12-24 JBOD drives
  3. 64-512 GB of RAM
  4. A 10 Gbit network

Note that there is no RAID here, but Hadoop's redundancy at the software level (replication) requires roughly the same number of extra disks.


Myth 2. Hadoop is for processing unstructured information.
Another equally remarkable myth tells us that "Hadoop is needed for processing unstructured information," and that this unstructured information is Big Data :-). But first let's figure out what unstructured information actually is.
Tables are clearly structured information; that much is indisputable.
JSON, XML, and YAML are called semi-structured information, but these formats do have a structure, just not as explicit as that of tables.
Another hot topic is logs, which, in the opinion of Big Data popularizers, have no structure.



In fact, the structure is there: logs are perfectly well written into tables and processed without any MapReduce.
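As a rough illustration (my own sketch; the access-log-style line below is an assumption, not data from the article), one regular expression is enough to turn a "raw" log line into named table columns:

```python
import re

# An access-log-style line; the format is assumed for illustration
line = '203.0.113.7 - - [21/Jun/2016:10:15:32 +0300] "GET /index.html HTTP/1.1" 200 4523'

LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

row = LOG_RE.match(line).groupdict()
# row is now a flat record: {'ip': '203.0.113.7', 'ts': '21/Jun/2016:10:15:32 +0300',
#                            'method': 'GET', 'path': '/index.html', 'status': '200', 'bytes': '4523'}
print(row["ip"], row["status"], row["bytes"])
```

Once every line is a row of named fields, it can be loaded into any table-oriented store; nothing about the data itself demands MapReduce.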


Twitter:
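A tweet is a favorite example of "unstructured" data, yet it arrives as JSON with perfectly usable fields. Here is a minimal sketch of flattening a tweet-like record into a table-ready row (the field names are illustrative, not the exact Twitter API schema):

```python
import json

# A tweet-like JSON record; field names are illustrative, not the real Twitter API schema
raw = """{
  "id": 745001,
  "created_at": "2016-06-21T10:15:32Z",
  "text": "Hadoop is a hammer",
  "user": {"id": 42, "screen_name": "paul", "followers_count": 300},
  "entities": {"hashtags": ["hadoop"]}
}"""

tweet = json.loads(raw)

# Flatten the nested structure into one flat row
row = {
    "tweet_id":   tweet["id"],
    "created_at": tweet["created_at"],
    "text":       tweet["text"],
    "user_id":    tweet["user"]["id"],
    "user_name":  tweet["user"]["screen_name"],
    "followers":  tweet["user"]["followers_count"],
    "hashtags":   ",".join(tweet["entities"]["hashtags"]),
}
print(row)
```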



In fact, almost all of the data that can be useful to us has structure. It may be scattered and inconvenient to process, but it is there.
Even data such as video or audio can be represented as a structure that can be spread across a large number of servers and processed.


Video files:



Most likely, where you work there is no truly unstructured information. Your data may be scattered and "dirty," but it still has some kind of structure, and if so, those are the real problems, and they should be solved first.
Of course, there is information that cannot be effectively spread across a large cluster, for example genetic data or a huge archive of arbitrary files, but such cases are extremely rare and of little interest to "business intelligence"; such problems are solved by other means and at a completely different level.
If you know of genuinely unstructured information sources that cannot simply be processed on a distributed cluster, please write about them in the comments.


Myth 3. Any problem can be solved with Big Data technology.
Another interesting term imposed on society is "Big Data technology." Of course, there is no coherent definition of what Big Data is, and even less of what "Big Data technologies" are.
It is generally accepted that everything related to Hadoop is "Big Data technology".


But Hadoop and everything connected with it is a well-disguised, tidy, super-functional Swiss-army hammer-knife. You can fell trees with it, mow the grass, drive in bolts; it copes with every task. But as soon as it comes to solving one specific task, especially when it has to be done efficiently, such a Swiss hammer-knife will only complicate your life.


Impala, Drill, Kudu: new players


Of course, people even smarter than everyone else looked at this whole mess and decided to build their own amusement park.
Three beasts, Impala, Drill, and Kudu, appeared at roughly the same time and not so long ago.
These are MPP engines on top of HDFS, just like Spark and MR, but the difference between them is as big as between a full meal and a snack. These products also live under the wing of the highly respected ASF. In principle, all three projects can already be used, even though they are still at the so-called "incubation" stage.
Incidentally, Impala and Kudu are under Cloudera's wing, while Drill came out of the company Dremio.
Out of this whole menagerie I would single out Apache Kudu as the most interesting tool, presented with a clear and mature roadmap.
The benefits of Kudu are as follows:
Kudu understands how the data is stored and how to lay it out correctly in order to optimize future queries; data distribution is controlled by a directive.
SQL only, no hard-coding.
Among the obvious flaws is the absence of a cost-based optimizer, but that is curable, and perhaps future releases will show Kudu in all its glory. All three of these products are, give or take, about the same, so let's look at the architecture using Apache Impala as an example:
[Image: Apache Impala architecture]


As we can see, there are DBMS instances, the Impala daemons, each working with the data on its own node. When a client connects to one of the nodes, that node becomes the coordinator. At a high level the architecture is quite similar to Vertica or Teradata. The main task when working with such systems is to make sure the data is "smeared" across the cluster correctly so that it can be queried efficiently later.
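From the client's point of view it looks like any other SQL database. Here is a minimal sketch using the impyla Python client (my own illustration; the host, port, and table names are assumptions): you connect to any Impala daemon, it acts as the coordinator, and you just run SQL:

```python
from impala.dbapi import connect

# Any impalad in the cluster can serve as the coordinator for this session
# (the host, port, and the table below are hypothetical)
conn = connect(host="impala-node-01.example.com", port=21050)
cur = conn.cursor()

# A plain SQL aggregation; the planner spreads the work across the nodes
cur.execute("""
    SELECT user_id, COUNT(*) AS hits
    FROM access_logs
    GROUP BY user_id
    ORDER BY hits DESC
    LIMIT 10
""")

for user_id, hits in cur.fetchall():
    print(user_id, hits)
```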
For all their merits, the developers promote these systems as "federated": take a Kudu table, join it to a flat file, mix in some Postgres, and season it with MySQL. In other words, we can work with heterogeneous sources as if they were ordinary tables, and with non-relational structures (JSON) as tables too. But this approach has its price: the optimizer knows nothing about the statistics of external sources, and such external tables become a bottleneck during query execution, since in essence they are read in a single thread.
Another important point is the reliance on HDFS. In such an architecture HDFS turns into a useless appendage that only complicates the system: an extra layer of abstraction with its own overhead. HDFS may also end up deployed on top of inefficient or badly configured file systems, which can lead to fragmentation of data files and loss of performance.
Of course, HDFS can be used as a dumping ground for anything and everything, with both the necessary and the unnecessary thrown into it. Lately this approach has been called a "Data Lake", but don't forget that unprepared data will be harder to analyze later. Followers of this approach argue that the data may never need to be analyzed at all, so there is no point wasting time preparing it. In general, it is up to you to decide which way to go.
There are no job postings yet, and no visible interest from companies in Kudu-like products, which is a pity.


A little marketing


You have probably noticed a clear trend: this whole circus in the field of data analytics is drifting towards traditional analytical MPP systems (Teradata, Vertica, GPDB, and so on).
All analytical MPP systems are developing in the same direction; the two groups simply approach it from different sides.
The first group follows the path of "sharding" a traditional SQL DBMS.
The second group grows out of the MR and HDFS pedigree.


[Image: users are interested in Hadoop]


The avalanche-like growth of Hadoop is, of course, due to very competent marketing by the companies selling these solutions.
They managed to plant in people's minds the idea that Hadoop is free, simple, fast, and easy, and, in general... that there is no god but Hadoop.
The pressure was so strong that even Teradata could not hold out and, instead of shaping the market itself, began selling Hadoop-based solutions and hiring specialists. Not to mention the other market players, who unanimously gave birth to products called "AnyDumbSoft Big Data Edition", in most cases just wrapping the standard HDFS connectors.
Even Oracle succumbed to the trend, releasing the "Big Data Appliance" and "GoldenGate for Big Data". The first is simply off-the-shelf hardware with a "gold-plated" CDH from Cloudera, while the second just adds Java connectors for Kafka (a message broker), HBase, and the rest of the zoo. Any user could put this together on their own.
[Image: a person sick with Big Data]


Unfortunately, this is the trend, the mainstream, and it will sweep away even a stable company that dares to go against the current. By the way, I too am taking a certain risk of having tomatoes thrown at me by raising this topic.


Apache HAWQ (Pivotal HDB).


Pivotal went the furthest of all. They took the traditional Greenplum and stretched it over HDFS. The entire data-handling engine is still Postgres; only the data files themselves are stored in HDFS. There is not much practical sense in this.
What you get is the same Greenplum with more complicated administration, but it is sold and advertised to you as Hadoop.
Apache HAWQ is very similar to Apache Kudu.


CDH (Cloudera's Distribution including Apache Hadoop)


Cloudera was one of the first companies to start monetizing Hadoop, and it is where Doug Cutting, who invented Hadoop, works.
Unlike other players, Cloudera does not adapt to the market; it shapes the market itself. Competent PR and marketing allowed it to grab a rather tasty chunk of the market: its client list now contains more than 100 large and well-known companies.
Unlike other similar companies, Cloudera does not just sell a zoo of ready-made components; it actively participates in their development.
Price-wise, CDH comes out somewhat cheaper than Vertica or Greenplum.
But despite the large number of success stories on the Cloudera website, there is one small problem: Kudu and Impala are still rather raw, incubation-stage products. Even once they mature, these systems will have a long way to go to reach the functionality of Vertica or at least Greenplum, and that is not a matter of a year or two; for now, CDH can be left to the hipsters.
Credit must also be given to Cloudera's marketers, who managed to shake up the market.


The future of Hadoop


Let me be bold and imagine what will happen to the Hadoop stack in five years.
MapReduce will be used only for a very limited set of tasks; the project will most likely be cut out of the general stack or simply forgotten.
The first CDH distributions will appear that partially abandon HDFS. In this scheme, table files will be stored on a regular file system, while a small dump remains for storing raw data.
You can draw an analogy with the Flex Zone in Vertica: a dump into which you can throw anything and process it later as needed, or forget about it.
In fact, having such a dumping ground is not just convenient; we will simply be forced to have one. Disk space grows disproportionately fast compared to processor performance. As the number of nodes in a cluster grows, we increase the amount of disk space (more than strictly necessary) for performance reasons. As a result, there will always be a large amount of unallocated disk space, in which it is convenient to store data that will be accessed either very rarely or never.


The zoo named Hadoop is unlikely to justify the trust its users have placed in it, but I hope it will not leave the market.
If only for the sake of competition.


[Image: will Hadoop have problems in 5 years?]


What will happen to Spark? Perhaps many will use it as an engine for distributed preprocessing and near-real-time data preparation (Spark Streaming), but other players are also active in this niche (Storm, ETL vendors).


The future of Vertica and Greenplum


Vertica will polish its integration with HDFS and extend its functionality; it will most likely not go open source, since the product currently sells very well.
Greenplum will build its own analogue of the Flex Zone, either by merging code with HAWQ or by becoming the non-HDFS part of HAWQ; either way, we will lose one of the two.
New players in the analytical MPP market are probably not to be expected. The opening of the Greenplum source code puts the usefulness of DBMSs such as Postgres-XL, to say the least, into question.
We are unlikely to see fundamental changes in the architecture of these products; the changes will be about improving existing functionality.


The future of Postgres-XL and the like


Postgres-XL could be a great MPP tool for analyzing large volumes of data if it stepped a little away from everything Postgres handed down to it. Unfortunately, this DBMS cannot work with column-store tables, it has no decent syntax for managing partitions, and it carries the standard Postgres optimizer with all the consequences.
Greenplum, for example, has a cost-based optimizer tuned for analytical queries; without such a thing, the life of an analyst or developer becomes very complicated.
But it is not worth writing off such a wonderful product either: Postgres keeps developing, parallel query execution appeared in 9.6, and perhaps craftsmen will eventually bolt a column store and GPORCA onto Postgres-XL.


The future of Teradata, Netezza, SAP and the like


In any case, the market for analytical systems will keep growing, and in any case there will be customers for these products. I don't know whether these solutions will be sold on golf courses or at conferences titled "Big Data: the technology of the future".
Most likely, though, these players will have to move away from the current hardware-plus-software business model and look towards software-only products.
They won't manage to jump onto the "Big Data" ghost train, but they don't need to: the train is imaginary, and they partly invented it themselves.


The future of Redshift, BigQuery, and cloud analytics services


At first glance, cloud services look very, very attractive: there is no need to bother with buying hardware and licenses, and it is assumed that, if you wish, you can easily abandon the service or move to another one.
On the other hand, analytics is a long-term project, and it is very, very difficult to build an analytical warehouse while abstracting away from a specific technology. So in the future it will be hard to move from one cloud storage to another without serious consequences.
These players will certainly have clients, but very specific ones: start-ups and small companies.


Summary: I have not touched on the large number of products from the ASF menagerie that are sold under the Big Data sauce (Storm, Sqoop, etc.), since there is little interest in them both on my part and from the market as a whole. I will therefore welcome any comments about these products.
I also did not touch on the topic of clickstream analytics, which is gaining momentum; I hope to cover it in future articles.


Second summary: when choosing solutions for data processing and analysis, it is hard not to be led along by the "creators" of the market. The dust has still not settled, and we will keep running into companies selling "happiness" and into products positioned as a "universal medicine" for the Big Data brain.
I have tried to show where Hadoop, and indeed the entire data processing industry, is heading. I have tried to dispel several myths around Big Data and to imagine in which direction the whole area will develop. Whether I succeeded, we will find out in a few years.
In the end, the market is developing and becoming more accessible to the consumer; new products appear, and old technologies are reborn.



Source: https://habr.com/ru/post/303802/

