Big data usually refers to a set of approaches, tools, and methods for processing structured and unstructured data of enormous volume and considerable diversity. The purpose of this processing is to obtain practically useful results.
Data streams can come from different sources; the data is heterogeneous and arrives in various formats: text, documents, images, video, and more. To extract useful information from such data, the software and hardware platform is of decisive importance.
"Standard" platform for big data: system architecture
Developers of big data solutions usually seek to combine the ability to process data of different kinds. Often the role of a full-featured big data platform is played by traditional platforms adapted to new conditions; in other cases, vendors offer enterprises specialized integrated solutions.
A common approach is to use the Hadoop platform. It works on the principle of moving computation closer to where the data is stored: processing is usually performed on large server clusters built from standard hardware. Combining Hadoop with standard servers provides the foundation for a cost-effective, high-performance analytic platform for running applications in parallel.
A typical big data platform is a cluster of identical nodes, usually standard dual-processor servers, each with its own attached storage. In terms of economy, dual-processor servers are the best option for most Apache Hadoop workloads: they are generally more efficient in distributed computing environments than multiprocessor platforms. However, they do not always provide sufficient performance, while in other situations their power is redundant. Some workloads, such as simple data sorting, do not require the power of Intel Xeon processors; such light workloads are more rationally run on microservers. Other tasks, on the contrary, demand significant computational power and the use of "accelerators". Moreover, in such architectures the failure of a single node usually requires considerable time to redistribute data across the system.
Reconfiguring big data clusters for different tasks takes significant resources and time, and it has already become a headache for IT departments in some Russian companies that actively use big data technologies. HPE set out to solve this problem with an elegant, effective, and flexible solution.
The reference architecture of the big data platform
The performance-optimized HPE Big Data Reference Architecture (BDRA) is designed to create flexible, high-performance, rapidly deployable Hadoop-based solutions. It consists of compute nodes combined with storage resources. Unlike the architecture described above, BDRA takes a more flexible approach, with workload optimization and a modern network architecture. The result is a platform for consolidating, storing, and processing big data that is scalable and easy to deploy and use.
As noted above, general-purpose servers are usually used as the nodes of a Hadoop cluster. But what if compact HPE Moonshot servers are used for computation, while storage is assigned to devices with a large number of hard disks, such as the HPE Apollo 4500 or HPE Apollo 4200? What if data is stored not on local media but on the drives of external devices connected to the compute nodes by a high-speed Ethernet network?
This is exactly what the HPE developers did. In addition to the obvious benefits of cost savings and simpler system management, a significant increase in read/write performance was achieved, confirmed by various tests.
In collaboration with Cloudera and Hortonworks, joint solutions were created based on the HPE Big Data Architecture.
Load optimization
Most modern Hadoop systems use the Hadoop Distributed File System (HDFS), which offers high access speed and low latency. In 2012, Hadoop YARN (Yet Another Resource Negotiator) improved the management and utilization of cluster resources. In YARN, so-called containers (allocations of RAM, CPU, and network bandwidth) specify the resources available to an application. With this approach, the computing resources of individual nodes are "horizontally" divided among applications.
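The container idea can be illustrated with a toy model: a node advertises its capacity, and the scheduler grants containers only while resources remain. This is a hedged sketch for illustration, not the actual YARN API; the class and method names are assumptions.

```python
# Toy model of YARN-style container allocation (illustrative only):
# a node has a fixed capacity, and each container request is granted
# only if the remaining resources can satisfy it.

class Node:
    def __init__(self, memory_mb, vcores):
        self.free_memory = memory_mb
        self.free_vcores = vcores
        self.containers = []

    def try_allocate(self, memory_mb, vcores):
        """Grant a container if the node still has enough free resources."""
        if memory_mb <= self.free_memory and vcores <= self.free_vcores:
            self.free_memory -= memory_mb
            self.free_vcores -= vcores
            self.containers.append((memory_mb, vcores))
            return True
        return False

node = Node(memory_mb=8192, vcores=4)
assert node.try_allocate(2048, 1)      # fits
assert node.try_allocate(4096, 2)      # fits
assert not node.try_allocate(4096, 2)  # only 2048 MB left, so refused
```

In real YARN the ResourceManager performs this bookkeeping cluster-wide, but the principle of carving a node's resources into per-application containers is the same.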
YARN is a key Hadoop tool; it can be described as a distributed operating system for big data applications. The introduction of labels in YARN makes it possible to group compute nodes and direct tasks to a specific group, and thereby optimize the load.
HPE proposed "vertical resource sharing": tasks can be sent to nodes optimized for a particular load. For example, Hadoop MapReduce tasks go to universal nodes, Hive to energy-efficient nodes with low-voltage processors, and Storm to nodes with accelerators. Tasks are assigned tags so they can be distributed automatically.
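The tag-based routing described above can be sketched as a simple lookup: each task carries a tag, and the dispatcher sends it only to the node group whose label matches. The node names, labels, and task names below are hypothetical examples, not real YARN configuration.

```python
# Hypothetical sketch of "vertical resource sharing" via node labels:
# tasks are tagged, and each tag maps to a group of workload-optimized
# nodes. This mirrors the idea of YARN node labels, not its actual API.

NODES = {
    "general":     ["node01", "node02"],  # universal nodes (MapReduce)
    "low-power":   ["node03"],            # energy-efficient, low-voltage CPUs (Hive)
    "accelerated": ["node04"],            # nodes with accelerators (Storm)
}

TASK_LABELS = {
    "mapreduce-sort": "general",
    "hive-query":     "low-power",
    "storm-topology": "accelerated",
}

def dispatch(task):
    """Return the node group a tagged task should run on."""
    label = TASK_LABELS[task]
    return NODES[label]

assert dispatch("hive-query") == ["node03"]
assert dispatch("storm-topology") == ["node04"]
```

In a real deployment this mapping would live in the scheduler configuration (e.g., YARN's capacity scheduler with node labels), not in application code.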
Instead of a SAN, Hadoop uses the concept of software-defined storage (SDS) with a distributed file system on standard storage nodes. HDFS is usually used for working with files, and Ceph for working with objects. In addition, when the load requires it, Hadoop can now support tiering: the automatic distribution of data across storage levels such as SSD, HDD, RAM, or archival object storage.
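A minimal sketch of such tiering logic: data is placed on a tier according to how often it is accessed. The thresholds and tier names here are assumptions chosen for illustration; real HDFS tiering works through configured storage policies, not hard-coded access counts.

```python
# Illustrative tiering policy: route data to a storage level by access
# frequency. Hot data lives in RAM/SSD, cold data goes to archive.
# Thresholds are arbitrary example values.

def choose_tier(accesses_per_day):
    if accesses_per_day >= 100:
        return "RAM"        # hottest data, cached in memory
    if accesses_per_day >= 10:
        return "SSD"
    if accesses_per_day >= 1:
        return "HDD"
    return "archive"        # cold data goes to archival object storage

assert choose_tier(500) == "RAM"
assert choose_tier(50) == "SSD"
assert choose_tier(3) == "HDD"
assert choose_tier(0) == "archive"
```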
The asymmetric Big Data Reference Architecture includes heterogeneous compute and storage nodes. Compute nodes can be low-cost energy-efficient modules, modules with graphics accelerators (GPUs) or programmable logic (FPGAs), or modules with increased RAM. Storage nodes can use SSDs, hard drives, or archive systems. Optimizing nodes for the workload accelerates the execution of various big data applications. What are the benefits of this Hadoop cluster architecture?
- A cluster can be delivered as an integrated solution whose computing resources are united by a built-in network infrastructure;
- The built-in network fabric accommodates increased east-west traffic; as a result, the throughput of the entire cluster grows and switching becomes more intelligent;
- The use of workload-optimized nodes with specialized CPUs and GPUs, as well as "server-on-chip" (SoC) designs, increases performance, compute density, and energy efficiency;
- Hyper-scaling features. For example, a single HPE Moonshot chassis can act as a cluster of 45 nodes;
- The ability to build clusters with heterogeneous compute nodes, which is in demand for deep learning and neural network tasks.
For processing intensive computational loads, such as Apache Spark, high-performance computing (HPC) nodes can be included in the same rack.
The HPE solution is an asymmetric Hadoop cluster with optimized nodes. Thus, HPE abandoned the traditional paradigm and demonstrated that independent compute and storage components in a Hadoop cluster make it possible to build very fast asymmetric systems. The solution can be further improved through workload optimization and tools such as YARN, tiering, and the assignment of tasks to specialized nodes.
HPE BDRA schematic diagram: the 42U rack combines compute modules with integrated switches, storage modules, and control modules.
The HPE BDRA reference architecture makes it possible to consolidate disparate data pools into a single pool that can be worked with using Hadoop, Vertica, Spark, and other tools. The ability to adapt flexibly to future loads is built into the architecture itself. In this converged asymmetric cluster, storage resources are divided into tiers. A SAN is not used; it is replaced by direct-attached storage (DAS). Workloads and storage resources are assigned to nodes optimized for the respective tasks. The interconnect is standard Ethernet carrying native Hadoop protocols for exchanging compute and storage resources, such as HDFS and HBase.
As a concept, HPE BDRA significantly improves the price/performance ratio and compute density compared with the traditional Hadoop architecture. Thanks to modern Ethernet fabrics, no bottlenecks form when data is exchanged between servers and the storage subsystem. Testing shows that read performance in HPE BDRA is 30% higher than in an average Hadoop cluster.
Solution components
HPE BDRA is based on the following HPE technologies:
HPE Apollo 4200 Gen9 storage node.
Storage nodes: HPE Apollo 4200 Gen9 servers form a single HDFS store. As an option, the HPE Apollo 4510 System can be used: a high-performance, high-density storage system that holds up to 544 TB of data and is recommended for backup or archive storage.
HPE Moonshot System chassis with HPE ProLiant m710p Server Cartridge "cartridges".
Compute nodes: the high-density HPE Moonshot System handles computational tasks and load optimization. HPE ProLiant m710p Server Cartridge or HPE ProLiant XL170r Gen9 servers can serve as compute nodes.
HPE BDRA components: compute nodes, storage nodes, and a high-speed network.
Configuration flexibility and scaling
BDRA's compute and storage nodes are connected by a high-speed network, yielding an asymmetric architecture in which each tier can be scaled individually. The ratio of processors to storage resources is not fixed: because these resources are not bound to each other, you gain the many advantages of a converged architecture. For example, they can be scaled independently simply by adding the appropriate nodes to the system. HPE's testing shows that performance scales almost linearly with load. In addition, a configuration with a particular ratio of resources can be chosen to suit the type of load.
Independent scaling of compute and storage resources: the configuration can be chosen for "hot" (compute-heavy) or "cold" loads; in the first case the share of compute nodes is increased, in the second, the storage resources.
In HPE BDRA, YARN-compatible tools such as HBase and Apache Spark can use HDFS storage directly. Others, such as SAP HANA, require appropriate connectors to access the data.
HPE BDRA has been tested with the Moonshot 1500 chassis and the latest Moonshot server cartridges; this configuration delivers high compute density. The Moonshot 1500 chassis with ProLiant m710p server cartridges connects to external switches with eight Direct Attach Copper (DAC) cables at 40GbE each.
HPE FlexFabric 5930 Switch.
HPE BDRA uses HPE FlexFabric 5930 top-of-rack (ToR) switches, configured through the HPE Intelligent Resilient Framework (IRF), as its network switches. An optional HPE 5900 Switch connects to the HPE Integrated Lights-Out (HPE iLO) management ports via 1GbE.
HPE Moonshot server chassis.
The Moonshot System chassis contains two 45-port 10GbE switches serving the internal network; each connects to the external infrastructure by four 40GbE uplinks. The HPE Apollo 4200 System, HPE SL4540, and Apollo 4510 System connect to the high-performance ToR switches through a pair of 40GbE ports.
It is also important that in BDRA it is possible, if necessary, to scale the required level - computational or storage - without additional costs and redistribution of data. Data consolidation in BDRA allows you to avoid storing unnecessary copies and minimize data movement.
Open standards
HPE BDRA supports a variety of data management tools, including Hadoop, Spark, HBase, Impala, Hive, Pig, R and Storm. Thanks to centralized storage and the use of tools such as YARN tags, this solution provides access to data (direct or through connectors) and is the right platform for current and future enterprise applications.
One of the key benefits of HPE BDRA is its reliance on open standards and open-source implementations. This allows other vendors to work with the HPE solution, for example by contributing their own designs to optimize workloads in HPE BDRA. Mellanox, for instance, created a hardware accelerator for its network card; this technology is integrated into a Moonshot cartridge.
What is the result?
Let us list once again the advantages of the HPE BDRA solution. The most obvious ones are density and price / performance ratio. Others include:
• Elasticity. The HPE BDRA architecture is designed for maximum flexibility. Tasks can be assigned to compute nodes flexibly, without redistributing data, and there is no need to maintain a fixed ratio of storage to compute resources. The system can be grown and scaled. YARN-compatible loads get direct access to big data via HDFS, while others can access the same data through appropriate connectors.
• Data consolidation. The HPE BDRA architecture is based on HDFS. And HDFS has sufficient performance and capacity to serve as a single source of data in any organization.
• Load optimization. Working with big data involves a set of management tools. After selecting the appropriate tool, the task can be run on the node best suited to the given load.
• Improved storage capacity management. Computing nodes can be assigned dynamically, on the fly, and managing a single repository reduces costs.
• Fast results. Working with big data typically requires several management tools. In HPE BDRA, data is not fragmented: it is consolidated into a single "data lake", and the tools can access the same data through YARN or a connector. As a result, more time is spent on analysis and less on data delivery, so results come faster.
HPE BDRA is a reference architecture, but customers are offered implementation documents (Bills of Materials, BOMs) for specific solutions. Such a system can be deployed by the customer independently or with the help of HPE Technical Services or authorized partners. HPE BDRA is customizable: component configurations are tailored to customer needs.