The Apache Software Foundation has announced Hadoop 3.0, a new release of its open framework for developing and running distributed applications. It is the first major release since Hadoop 2 arrived in 2013. Below we take a closer look at some of the new features of Hadoop 3.0 and at what the next versions will bring.
Photo by Chris Feser / CC

What's new
Erasure Coding for HDFS
Erasure coding is a data protection method that until now has mostly been used in object stores. With it, Hadoop no longer stores three full copies of each block on different nodes of the cluster. Instead, data is striped together with parity information, much like RAID 5 or 6. With the Reed-Solomon (6,3) scheme, for example, six data blocks are stored alongside three parity blocks, so the data survives the loss of any three blocks at a cost of 1.5x raw storage rather than 3x. Thanks to this more efficient redundancy, storage overhead drops by about 50%.
Vinod Kumar Vavilapalli, one of the release's developers and CTO of Hortonworks, says that erasure coding was added in response to the growth of the data companies keep in Big Data clusters. The new feature makes storage easier to manage and lets the capacity of a Hadoop cluster be used to the fullest.
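To make this concrete, here is a minimal sketch of enabling erasure coding on a directory through the HDFS client API (the directory /cold-data is a hypothetical example; RS-6-3-1024k is the default Reed-Solomon policy shipped with Hadoop 3.0):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ErasureCodingExample {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS and the rest of the cluster settings
        // from core-site.xml / hdfs-site.xml on the classpath.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            Path dir = new Path("/cold-data"); // hypothetical directory
            // Files written under this directory will be striped as
            // 6 data + 3 parity blocks instead of 3 full replicas.
            dfs.setErasureCodingPolicy(dir, "RS-6-3-1024k");
            System.out.println("Policy in effect: "
                    + dfs.getErasureCodingPolicy(dir).getName());
        }
    }
}
```

The same operation is also available from the command line via the new hdfs ec subcommand, so existing applications need no changes to adopt the feature.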
YARN Federation
According to Vavilapalli, YARN originally scaled to only about 10,000 nodes. To push past that limit, developers at Microsoft contributed YARN Federation, which lets YARN span 40,000 or even 100,000 nodes.
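Federation is enabled through cluster configuration rather than application code. As a rough sketch, the key switch is the yarn.federation.enabled property from the YARN Federation documentation, set programmatically here only to keep the example self-contained:

```java
import org.apache.hadoop.conf.Configuration;

public class FederationSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // In a real deployment this lives in yarn-site.xml on the
        // ResourceManagers, NodeManagers, and Routers of every subcluster.
        conf.setBoolean("yarn.federation.enabled", true);
        System.out.println("yarn.federation.enabled = "
                + conf.getBoolean("yarn.federation.enabled", false));
    }
}
```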
New Resource Types
Hadoop 3.0 adds a new framework to YARN for handling resource types beyond memory and CPU. The update also gives finer-grained control over disks in Big Data clusters. Support for GPUs and FPGAs will be added in the future.
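As an illustration, an application can ask YARN for a container that carries a custom resource alongside memory and vcores. This is only a sketch: the resource name example.com/widget is hypothetical, and any such type must first be declared via resource-types.xml before the request will be accepted:

```java
import org.apache.hadoop.yarn.api.records.Resource;

public class CustomResourceSketch {
    public static void main(String[] args) {
        // 4096 MB of memory and 4 vcores, exactly as before...
        Resource capability = Resource.newInstance(4096, 4);
        // ...plus a custom resource type. The name is hypothetical and
        // must be declared in resource-types.xml on the cluster first,
        // otherwise this call throws ResourceNotFoundException.
        capability.setResourceValue("example.com/widget", 2);
        System.out.println("Requested: " + capability);
    }
}
```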
Java 8
Hadoop 2 runs on Java Development Kit 7. Apache Hadoop 3.0 makes the move to JDK 8. Alongside Hadoop 3.0, JDK 8 support was also announced for HBase 2.0, Hive 3.0, and Phoenix 3.0.
Roadmap: Hadoop 3.1 and 3.2
The developers have shared which features will appear in future versions of the framework. Here are the main ones.
Hadoop 3.1
- GPU support. This will let customers tackle machine learning and deep learning tasks.
- Docker container support. This will allow non-Big-Data workloads to run on Hadoop; see the sketch after this list.
- YARN Services. It will become possible to run long-lived workloads, such as Kafka data streams, on YARN.
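For the Docker bullet above, here is a sketch of how a YARN application could opt one of its containers into the Docker runtime. The environment variable names follow the Hadoop Docker support documentation; the image and command are placeholders:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

public class DockerContainerSketch {
    public static ContainerLaunchContext dockerizedContext() {
        Map<String, String> env = new HashMap<>();
        // Ask the NodeManager to launch this container with the
        // Docker runtime instead of the default process runtime.
        env.put("YARN_CONTAINER_RUNTIME_TYPE", "docker");
        // Placeholder image; anything the NodeManagers can pull works.
        env.put("YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", "openjdk:8");
        return ContainerLaunchContext.newInstance(
                null,                       // no local resources
                env,
                Arrays.asList("sleep 100"), // placeholder command
                null, null, null);          // no service data, tokens, ACLs
    }
}
```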
Hadoop 3.2
- FPGA support. Some tasks run better on an FPGA than on a GPU, as Microsoft has already recognized by using FPGAs to accelerate deep learning. You can read more about the practical use of FPGAs here.
- Ozone. Vavilapalli explains that HDFS is tuned for storing large files in a write-once, read-many pattern. It is poorly suited to huge numbers of small files, such as photos or videos, in part because every file's metadata has to fit in the NameNode's memory. The Ozone object store is intended to solve this problem.
The developers plan to release Hadoop 3.1 and Hadoop 3.2 about three months apart.