
What is special about Cloudera and how to set it up

The market for distributed computing and big data is, according to statistics, growing by 18-19% per year, so the question of choosing software for these purposes remains relevant. In this post we will start with why distributed computing is needed, discuss the choice of software in more detail, talk about using Hadoop with Cloudera, and finally cover the choice of hardware and the various ways it affects performance.


Why does an ordinary business need distributed computing? The answer is simple and complicated at the same time. It is simple because in most cases we perform relatively straightforward calculations per unit of information; it is complicated because there is a huge amount of such information. As a result, terabytes of data have to be processed in thousands of parallel streams. The usage scenarios are therefore fairly universal: these calculations can be applied wherever a large number of metrics has to be computed over an even larger array of data.

One recent example: the Dodo Pizza chain determined, based on an analysis of customer orders, that when choosing a pizza with arbitrary toppings, users usually operate with only six basic sets of ingredients plus a couple of random ones. The chain adjusted its purchasing accordingly. In addition, it was able to recommend complementary products offered at the ordering stage more accurately, which increased profits.
Another example: an analysis of product items allowed H&M to reduce the assortment in individual stores by 40% while maintaining the level of sales. This was achieved by eliminating poorly selling items, with seasonality taken into account in the calculations.

Tool selection


The industry standard for this kind of computing is Hadoop. Why? Because Hadoop is an excellent, well-documented framework (Habr alone offers many detailed articles on the topic), accompanied by a whole set of utilities and libraries. You can feed it huge sets of both structured and unstructured data, and the system itself distributes them across the available computing resources. Moreover, those resources can be added or taken offline at any time: that same horizontal scalability in action.

In 2017, the influential consulting company Gartner concluded that Hadoop would soon become obsolete. The reason is fairly banal: analysts believe that companies will migrate en masse to the cloud, where they can pay only for the computing power they actually use. The second factor supposedly capable of "burying" Hadoop is speed: alternatives such as Apache Spark or Google Cloud Dataflow are faster than MapReduce, which underlies Hadoop.

Hadoop rests on several pillars, the most notable of which are MapReduce (a technology for distributing data and calculations between servers) and the HDFS file system. HDFS is designed specifically to store information distributed across cluster nodes: each block of a fixed size can be placed on several nodes, and replication makes the system resilient to failures of individual nodes. Instead of a file allocation table, a dedicated server called the NameNode is used.
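To make the block and replication story concrete, here is a minimal sketch, assuming a machine with a configured Hadoop client; the local file name and the /user/demo path are hypothetical. It uploads a file (HDFS splits it into fixed-size blocks itself), raises its replication factor, and then asks the NameNode where every block replica actually lives:

import subprocess

SAMPLE = "/user/demo/events.csv"  # hypothetical HDFS path

# Upload a local file; HDFS splits it into fixed-size blocks automatically.
subprocess.run(["hdfs", "dfs", "-put", "-f", "events.csv", SAMPLE], check=True)

# Raise the replication factor to 3 and wait until the extra copies are written.
subprocess.run(["hdfs", "dfs", "-setrep", "-w", "3", SAMPLE], check=True)

# Ask the NameNode which DataNodes hold each block and its replicas.
report = subprocess.run(
    ["hdfs", "fsck", SAMPLE, "-files", "-blocks", "-locations"],
    check=True, capture_output=True, text=True,
)
print(report.stdout)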

The illustration below shows how MapReduce works: in the first stage the data are split by a certain key, in the second they are distributed across the computing nodes, and in the third the calculation itself takes place.
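As a minimal code sketch of these stages, here is the classic word count written for Hadoop Streaming (the file names mapper.py and reducer.py are our own): the mapper implements the "split" step by emitting (word, 1) pairs, Hadoop itself distributes and sorts them by key, and the reducer performs the final calculation.

# mapper.py -- the "map" stage: emit a (word, 1) pair for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py -- the "reduce" stage: Hadoop delivers the pairs sorted by key,
# so identical words arrive together and can be summed in a single pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

Both scripts are submitted with the hadoop-streaming jar shipped with the distribution (its exact path varies), roughly like this: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <in> -output <out>.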


Initially, MapReduce was created by Google for the needs of its search engine. Then MapReduce was open-sourced and Apache took over the project, while Google gradually migrated to other solutions. An interesting nuance: at the moment Google has a project called Google Cloud Dataflow, positioned as the next step after Hadoop and as its fast replacement.

On closer inspection, it turns out that Google Cloud Dataflow is built on top of Apache Beam, and Apache Beam in turn includes a runner for the well-documented Apache Spark framework, so the two solutions offer roughly the same execution speed. Spark, moreover, works fine on top of the HDFS file system, which allows it to be deployed on Hadoop servers.
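For comparison, here is a minimal PySpark sketch that runs the same word count directly over a file in HDFS (the path and application name are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("hdfs-wordcount").getOrCreate()

# Read the text straight from HDFS; Spark schedules tasks next to the blocks.
lines = spark.read.text("hdfs:///user/demo/events.csv")

# The same map/reduce logic as before, expressed through the DataFrame API.
words = lines.select(explode(split(col("value"), r"\s+")).alias("word"))
counts = words.groupBy("word").count().orderBy(col("count").desc())
counts.show(20)

spark.stop()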

Add to this the volume of documentation and ready-made solutions available for Hadoop and Spark compared with Google Cloud Dataflow, and the choice of tool becomes obvious. Moreover, engineers can decide for themselves whether to run their code on Hadoop (MapReduce) or Spark, based on the task, their experience and their qualifications.

Cloud or on-premises server


The trend toward a wholesale move to the cloud has even given rise to the curious term Hadoop-as-a-service. In that scenario, the administration of the connected servers becomes very important, because, alas, despite its popularity, vanilla Hadoop is quite a complicated tool to configure: a lot has to be done by hand, such as configuring servers individually, monitoring their performance and carefully tuning many parameters. In short, it is work for enthusiasts, with a big chance of messing something up or missing something.

Therefore, various distributions that come with convenient deployment and administration tools have become very popular. One of the most popular distributions that supports Spark and simplifies everything is Cloudera. It has both paid and free editions, and in the latter all the core functionality is available with no limit on the number of nodes.



During setup, Cloudera Manager connects to your servers via SSH. An interesting point: during installation, it is better to choose installation via so-called parcels: special packages, each of which contains all the necessary components configured to work with each other. In essence, this is an improved version of a package manager.

After installation, we get the cluster management console, where you can see telemetry for the clusters and the installed services, add or remove resources, and edit the cluster configuration.
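The same console is also exposed through the Cloudera Manager REST API, which is convenient for automation. A minimal sketch, where the host name and the default admin/admin credentials on port 7180 are assumptions for illustration:

import requests

CM = "http://cm-host.example.com:7180"   # hypothetical Cloudera Manager host
AUTH = ("admin", "admin")                # default credentials; change in production

# Ask the server which API version it speaks, then list clusters and hosts.
version = requests.get(f"{CM}/api/version", auth=AUTH).text
clusters = requests.get(f"{CM}/api/{version}/clusters", auth=AUTH).json()
hosts = requests.get(f"{CM}/api/{version}/hosts", auth=AUTH).json()

for cluster in clusters["items"]:
    print("cluster:", cluster["name"])
print("hosts:", len(hosts["items"]))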



As a result, you are looking at the cockpit of the rocket that will carry you into the bright future of BigData. But before we say "let's go," let's look under the hood.

Hardware requirements


On its website, Cloudera mentions various possible configurations. The general principles on which they are built are shown in the illustration:


MapReduce can spoil this optimistic picture. If you look again at the diagram from the previous section, it becomes clear that in almost all cases a MapReduce job can hit a bottleneck when reading data from disk or from the network. This is also noted on the Cloudera blog. As a result, I/O speed is very important for any fast calculations, including those run through Spark, which is often used for real-time processing. Therefore, when using Hadoop, it is very important that balanced and fast machines end up in the cluster, which, to put it mildly, is not always guaranteed in a cloud infrastructure.
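A crude way to check whether a particular cluster is I/O-bound is to time a full scan of a large dataset and compare the effective throughput with what the disks and network should deliver. A sketch in PySpark, assuming a hypothetical path and an input size you substitute yourself:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("io-probe").getOrCreate()

path = "hdfs:///user/demo/big_dataset"   # hypothetical dataset
size_bytes = 1 * 1024 ** 4               # assume ~1 TB of input; substitute the real size

start = time.time()
rows = spark.read.text(path).count()     # forces every HDFS block to be read once
elapsed = time.time() - start

print(f"{rows} rows in {elapsed:.0f} s, "
      f"~{size_bytes / elapsed / 1e9:.1f} GB/s aggregate read throughput")
spark.stop()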

Balanced load distribution is achieved through OpenStack virtualization on servers with powerful multi-core CPUs. Data nodes are allocated their own processor resources and dedicated disks. In our Atos Codex Data Lake Engine solution, broad virtualization is achieved, which is why we gain both in performance (the impact of the network infrastructure is minimized) and in TCO (extra physical servers are eliminated).


When BullSequana S200 servers are used, we get a very uniform load, free of some of these bottlenecks. The minimum configuration includes 3 BullSequana S200 servers, each with two JBODs; additional S200s, each containing four data nodes, can optionally be connected. Here is an example of the load in the TeraGen test:



Tests with different data volumes and replication factors show the same results in terms of load distribution between cluster nodes. Below is a graph of the distribution of disk access from the performance tests.



The calculations are based on the minimum configuration of 3 BullSequana S200 servers. It includes 9 data nodes and 3 master nodes, as well as reserved virtual machines in case protection based on OpenStack virtualization is deployed. TeraSort result: 23.1 minutes with a block size of 512 MB, a replication factor of three and encryption enabled.
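For reference, a TeraGen/TeraSort run of this kind can be reproduced roughly as follows; the examples-jar path follows a typical Cloudera parcel layout, the row count (~1 TB of data) is illustrative, and HDFS encryption, which was enabled in the quoted run, is configured separately:

import subprocess

# Path is distribution-specific; this one matches a typical CDH parcel layout.
JAR = "/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar"

# Generate 10^10 rows of 100 bytes (~1 TB) with 512 MB blocks and 3 replicas.
subprocess.run([
    "hadoop", "jar", JAR, "teragen",
    "-D", "dfs.blocksize=536870912",
    "-D", "dfs.replication=3",
    "10000000000", "/benchmarks/teragen",
], check=True)

# Sort the generated data; the wall-clock time of this job is the quoted figure.
subprocess.run([
    "hadoop", "jar", JAR, "terasort",
    "/benchmarks/teragen", "/benchmarks/terasort",
], check=True)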

How can the system be expanded? Various types of extensions are available for the Data Lake Engine:




The Atos Codex Data Lake Engine bundle includes both the servers themselves and pre-installed software: a licensed Cloudera kit, Hadoop itself, OpenStack with virtual machines based on the Red Hat Enterprise Linux kernel, and data replication and backup systems (including those using backup nodes and Cloudera BDR, Backup and Disaster Recovery). Atos Codex Data Lake Engine was the first virtualization solution to be certified by Cloudera.

If you are interested in the details, we will be happy to answer your questions in the comments.

Source: https://habr.com/ru/post/451772/

