
Hadoop was created to run on commodity hardware connected by a low-speed network. But as Hadoop clusters have grown, organizations have outgrown that capacity. To solve the problem, they have turned to specialized solutions with room to grow, such as solid-state drives and InfiniBand networks.
InfiniBand was introduced in 2000 as a network protocol faster than TCP/IP, the protocol traditionally used on Ethernet networks. Through Remote Direct Memory Access (RDMA), InfiniBand lets one machine read from or write to the memory of a remote computer directly, bypassing the operating system and the delays it can introduce.
An InfiniBand QDR (Quad Data Rate) port, currently the most widely used type, provides 40 Gbit/s of bandwidth. That is four times the bandwidth of standard 10 Gigabit Ethernet (10GbE). Port aggregation raises speeds further (this applies to Ethernet as well).
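As a rough illustration of the bandwidth figures above, the sketch below estimates how long a single HDFS block would take to cross each link at its nominal line rate. The 128 MiB block size (the Hadoop 2 default) and the helper names are assumptions made for the example; encoding and protocol overhead are ignored, so real transfers will be slower.

```python
# Nominal line rates in gigabits per second, taken from the text above.
LINK_GBPS = {
    "10GbE": 10,
    "InfiniBand QDR": 40,
}

# Assumed payload: one HDFS block of 128 MiB (the Hadoop 2 default).
BLOCK_BYTES = 128 * 1024 * 1024


def transfer_seconds(num_bytes: int, gbps: float) -> float:
    """Ideal transfer time at the nominal line rate, ignoring all overhead."""
    bits = num_bytes * 8
    return bits / (gbps * 1e9)


if __name__ == "__main__":
    for name, gbps in LINK_GBPS.items():
        ms = transfer_seconds(BLOCK_BYTES, gbps) * 1000
        print(f"{name}: {ms:.1f} ms per block")
```

At nominal rates the ratio between the two links is exactly the 4x quoted above; what an application actually sees depends on the software stack sitting on top of the link.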
Ethernet initially dominated as the choice for the vast majority of enterprise networks. Meanwhile, InfiniBand gradually took root in high-performance computing, where its very high speed and low latency gave it an advantage in large parallel clusters. More than 50% of the supercomputers on last year's TOP500 list use InfiniBand. The protocol is also popular with high-speed exchanges, financial services firms and other large users. The most commonly used InfiniBand equipment is made by Mellanox and Intel.
But InfiniBand has not been as successful on Hadoop as it has been among supercomputers. There are several reasons for this. In most enterprise markets, solutions have been built around Ethernet. The perception of InfiniBand as something exotic and very expensive has also played a role (and that perception is not always accurate).
It is no secret that InfiniBand on Hadoop is far from mainstream. Even so, every Hadoop distribution is supported by Hewlett-Packard, IBM and Dell, and these vendors also support InfiniBand in their respective deployments. Among the prebuilt implementations, which about 20% of Hadoop integrators use, both Oracle and Teradata support InfiniBand.
Why choose InfiniBand
There are a few interesting points to consider when choosing InfiniBand over 10GbE. One person with an inside view of many aspects of InfiniBand on Hadoop is Dhabaleswar K. (DK) Panda, a professor of computer science and engineering at Ohio State University and head of the Network-Based Computing Research Group.
Figure: High-Performance Big Data (HiBD) architecture, with InfiniBand libraries for the Hadoop Distributed File System (HDFS).
Panda leads the HiBD project at Ohio State University, where the team develops and maintains libraries for Hadoop versions 1 and 2 (HDFS and MapReduce). The libraries support native RDMA, which InfiniBand uses to exchange data. They currently support Apache Hadoop and Hortonworks, with a plugin for Cloudera. The researchers have also written code to support InfiniBand in Memcached, the in-memory caching system, and their work extends to libraries for Apache Spark and HBase.
Panda, who has been researching supercomputer interconnects for 25 years and has worked with InfiniBand since it first appeared, confirms that InfiniBand is not very common in Hadoop environments, but he expects this to change in the near future.
“Supercomputing has made the technological leap, but the enterprise world lags somewhat behind,” Panda told Datanami. “From the enterprise point of view, they are catching up. So we need to wait one to two years to see InfiniBand in wider, so to speak mainstream, use.”
Since the HiBD project released its first InfiniBand library several years ago, the package has been downloaded more than 11,000 times. According to the group's website, it is used by more than 120 organizations around the world.
He also noted that the common thread across all InfiniBand deployments is the drive for maximum scalability and performance while avoiding I/O bottlenecks. “Traditionally, [Hadoop] was developed on Ethernet, but even if you have 10GbE, especially with large data sets, you will hit a bottleneck. This is exactly the infrastructure where the benefits of our design are obvious, so you can really scale your applications as flexibly as possible and get maximum performance and scalability from them,” said the researcher.
A common misconception in the Hadoop community is that InfiniBand is too expensive and too “good” for clusters built from low-cost commodity hardware. This is true for small clusters. But for larger clusters, InfiniBand is more cost-effective than Ethernet.
“If you go to very large cluster systems, InfiniBand FDR is much more efficient and more cost-effective than 10GbE,” said the professor. “If you have a cluster of 4 or 16 nodes, you will not see the difference, but if you have 1000 nodes, 2000 or 4000 nodes, you will see a significant difference in cost.”
“As in car racing, where the pace is held back by the slowest cars, a Hadoop cluster may not run fast because of its slower components,” said Panda. “You can have very good equipment, but if you have weak tires, you will not get all the benefits of the technology. We see that I/O and the network need to be balanced for the best possible performance.”
Hadoop caveats
While network speed plays a major role in Hadoop performance, there are other, less obvious factors. As usual, the devil is in the details.
In July of this year, Microsoft and the Barcelona Supercomputing Center launched the Aloja project to characterize the performance of the Hadoop platform. The project has identified more than 80 tunable Hadoop parameters that affect performance. These include factors related to physical hardware, such as memory size, storage type and network speed, as well as software factors: the number of mappers and reducers, the HDFS block size and the virtual machine size.
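To get a feel for how quickly such a tuning space grows, here is a minimal sketch that enumerates every combination of just four parameters of the kind listed above. The parameter labels and candidate values are hypothetical, chosen only for illustration; they are not actual Hadoop setting names and do not come from the Aloja project.

```python
from itertools import product

# Hypothetical candidate values for a handful of tunables named in the text.
# The keys are illustrative labels, not actual Hadoop or Aloja setting names.
PARAMETER_GRID = {
    "network": ["1GbE", "10GbE", "InfiniBand"],
    "storage": ["SATA", "SSD"],
    "hdfs_block_size_mb": [64, 128, 256],
    "mappers_per_node": [4, 8, 16],
}


def enumerate_configs(grid):
    """Yield one dict per combination of parameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


configs = list(enumerate_configs(PARAMETER_GRID))
print(len(configs))  # 3 * 2 * 3 * 3 = 54 combinations
```

Even four parameters with two or three values each yield 54 benchmark runs, which suggests why a space of more than 80 parameters cannot be explored exhaustively.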
The project's researchers found that simply adding InfiniBand to a cluster does not noticeably affect Apache Hadoop performance as measured by benchmarks. However, combining InfiniBand with SSDs yields a 3.5x performance increase compared with SATA disks and Gigabit Ethernet. By comparison, simply adding SSDs on a Gigabit Ethernet network increases performance only 2x.
This echoes Professor Panda's view: “What happens if you use SSDs? Your I/O speed will increase, but it also means that you must have a high-performance network. Moving from 1 to 10 Gigabit Ethernet brings obvious benefits, but with InfiniBand you gain more, because the technology is designed from the ground up to let you push more load through your network. This means that you get better solutions, such as RDMA, which work better on networks of this type.”
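The balance argument can be sketched as a simple bottleneck model in which a node moves data no faster than the slower of its storage and its network. The throughput figures below are hypothetical round numbers picked for illustration, not Aloja measurements, and the model ignores CPU, caching and protocol overhead.

```python
# Hypothetical sustained throughput in MB/s; illustrative values only.
STORAGE_MBPS = {"SATA": 150, "SSD": 500}
NETWORK_MBPS = {"1GbE": 125, "10GbE": 1250, "InfiniBand QDR": 4000}


def effective_mbps(storage: str, network: str) -> int:
    """A pipeline runs at the speed of its slowest stage."""
    return min(STORAGE_MBPS[storage], NETWORK_MBPS[network])


def bottleneck(storage: str, network: str) -> str:
    """Name the stage that limits the pipeline."""
    if STORAGE_MBPS[storage] <= NETWORK_MBPS[network]:
        return "storage"
    return "network"


for storage in STORAGE_MBPS:
    for network in NETWORK_MBPS:
        print(f"{storage:>4} + {network:<14} -> "
              f"{effective_mbps(storage, network):>4} MB/s "
              f"(limited by {bottleneck(storage, network)})")
```

In this toy model an SSD on Gigabit Ethernet is limited by the network, while a SATA disk on InfiniBand is limited by storage. Real measurements are less clear-cut than a pure minimum, but the model captures why upgrading only one side of the pair leaves performance on the table.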
Not everyone falls for the “charm” of InfiniBand. Eric Sammer, CTO and co-founder of Rocana, argues in favor of 10GbE in a post on Quora.
Sammer believes that InfiniBand greatly exceeds what ordinary users need: “The fact is that as soon as you move up to the IP compatibility layer, there is significant overhead. For a number of reasons, my rather pessimistic estimate is that actual throughput ends up at around 25 Gb for IP over a 4X QDR 40 Gb port.” (In fairness, the libraries Panda develops in HiBD support InfiniBand natively, which eliminates this overhead.)
Figure: Comparison of Hadoop performance on 10GbE, on IP over InfiniBand, and with the HiBD library on native InfiniBand QDR.
Sammer says he deploys Hadoop on 10GbE (possibly over twisted pair). “The ubiquity of Ethernet is a fact that cannot be dismissed, and with platforms such as Hadoop, I tend to say that it is better to rely on developments that improve networking and simplify data transfer (Cloudera Impala, the Tez changes in Apache Hive, etc.). I also firmly believe that the same cached data can be reused when transferring in more packets, which optimizes moving my data from the data center to the recipient on the other end.”
RoCE Application
InfiniBand also competes with a newer technology that claims to deliver the same benefits as InfiniBand, but over Ethernet. It is called RDMA over Converged Ethernet (RoCE), and it shows higher throughput and lower latency than traditional Ethernet.
Professor Panda's research team also develops libraries that support RoCE switches and network adapters for Hadoop and for Memcached. Mellanox supports both RoCE and InfiniBand in its devices.
Whatever interconnect users choose, Professor Panda advises thinking carefully and weighing the pros and cons of the various options: “The question is what kind of networking and management is required. If an organization is comfortable having a good sysadmin who understands Ethernet, the RoCE option is preferable for them. But if an organization has solid InfiniBand expertise, it need not agonize over the choice between InfiniBand and RoCE.”
Data volumes are growing, and companies need to analyze that data more quickly. This forces them to build new clusters, large and fast, with SSDs and multi-core processors. It is becoming clear that large data-intensive organizations need to adopt the RDMA approach, whether InfiniBand or RoCE.