
Increase data processing speed with data locality in Hadoop

Author: Andrey Lazarev

One of the main bottlenecks when processing large amounts of data is the network traffic that passes through switches. Fortunately, executing the map code on the node where the data resides makes this problem much less serious. This technique, called data locality, is one of the key advantages of the Hadoop MapReduce model.

In this article, we will look at the requirements for data locality, how a virtualized OpenStack environment affects the topology of a Hadoop cluster, and how to ensure data locality when using Hadoop with Savanna.

Hadoop data locality requirements


To take full advantage of data locality, you need to make sure that the architecture of your system meets a number of conditions.

First, the cluster must have an appropriate topology. Hadoop map code must be able to read its data "locally". Some popular solutions, such as network-attached storage (NAS) and storage area networks (SANs), will always generate network traffic, so strictly speaking you might not consider them "local" at all; in reality, it depends on your point of view. Depending on your situation, you can define "local" as "within one data center" or "within one rack".

Secondly, the Hadoop cluster must be aware of the topology of the nodes where tasks are executed. TaskTracker nodes perform the map tasks, so the Hadoop scheduler needs network topology information to assign tasks properly.
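In Hadoop 1.x, the scheduler learns the topology through a user-supplied rack-awareness script (configured via the `topology.script.file.name` property in core-site.xml). A minimal sketch of such a script is shown below; the IP addresses and rack paths are illustrative assumptions, not values from the original article:

```python
#!/usr/bin/env python
# Hypothetical Hadoop rack-awareness script (a sketch). Hadoop invokes the
# script named by topology.script.file.name with one or more node IPs or
# hostnames as arguments and expects one topology path per argument on stdout.
import sys

# Assumed example mapping; in practice this would be generated from
# your data-center inventory.
RACK_MAP = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}

DEFAULT_RACK = "/dc1/default-rack"


def resolve(node):
    """Return the topology path for a node, falling back to a default rack."""
    return RACK_MAP.get(node, DEFAULT_RACK)


if __name__ == "__main__":
    for node in sys.argv[1:]:
        print(resolve(node))
```

With this in place, the JobTracker can prefer TaskTrackers whose topology path matches the path of the data blocks being read.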

Last but not least, the Hadoop cluster must know where the data resides. This can be a bit more complicated because Hadoop supports a variety of storage systems. For example, HDFS supports data locality out of the box, while other drivers (for example, the Swift driver) require an extension in order to pass topology information to Hadoop.

Hadoop cluster topology in virtualized infrastructure


Hadoop typically uses a 3-level network topology. Originally these three levels were the data center, the rack, and the node, although since the cross-data-center case is uncommon, the top level is often used to describe top-level switches instead.

[Figure: 3-level Hadoop network topology]

This topology maps well onto a traditional bare-metal Hadoop deployment, but a virtualized environment is hard to fit into these three levels, because they leave no place for the hypervisor. Under certain conditions, two virtual machines running on the same host machine can communicate much faster than if they were running on separate hosts, since no physical network is involved. Therefore, starting with version 1.2.0, Hadoop supports a 4-level topology.

[Figure: 4-level Hadoop network topology with a node-group level]

This new level, called the "node group," corresponds to the hypervisor that hosts the virtual machines. Several Hadoop nodes, each in its own virtual machine, can live on a single host machine controlled by one hypervisor, which lets them communicate without sending data over the network.
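As of Hadoop 1.2, the node-group-aware topology is switched on through configuration. The property names below come from the Hadoop Virtualization Extensions work and should be checked against your distribution; this is a sketch, not a verified configuration:

```xml
<!-- core-site.xml: use the node-group-aware topology implementation -->
<property>
  <name>net.topology.impl</name>
  <value>org.apache.hadoop.net.NetworkTopologyWithNodeGroup</value>
</property>
<property>
  <name>net.topology.nodegroup.aware</name>
  <value>true</value>
</property>

<!-- hdfs-site.xml: place block replicas with node groups in mind -->
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.namenode.BlockPlacementPolicyWithNodeGroup</value>
</property>
```

With the 4-level topology enabled, topology paths gain an extra node-group segment, e.g. `/dc1/rack1/nodegroup1/node1` instead of `/dc1/rack1/node1`.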

Locality of data in Savanna and Hadoop


So, knowing all of the above, how do you put it into practice? One option is to use Savanna to provision your Hadoop clusters.

Savanna is an OpenStack project that lets you deploy Hadoop clusters on top of the OpenStack platform and run jobs on them. The latest release, Savanna 0.3, added data locality support for the vanilla plugin (for the Hortonworks plugin, data locality support is planned for the upcoming Icehouse release). With this enhancement, Savanna can push the cluster topology configuration into Hadoop and thus ensure data locality. Savanna supports both 3- and 4-level network topologies. For the 4-level topology, Savanna uses the host ID of the compute node as the identifier of the Hadoop node group. (Note: be careful not to confuse Hadoop node groups with Savanna node groups; they serve different purposes.)

Savanna can also provide data locality for Swift input streams in Hadoop, but this requires adding a dedicated Swift driver to Hadoop, since the vanilla plugin uses Hadoop 1.2.1, which has no native Swift support. The Swift driver was developed by the Savanna project team and has already been partially integrated into Hadoop 2.4.0. The plan is to integrate it fully into the 2.x branch and then backport it to the older 1.x line.

Using data locality with Swift involves enabling the data locality feature in Savanna and then specifying both the Compute topology and the Swift topology. The video below shows how to launch a Hadoop cluster with Savanna and configure data locality on it:
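In Savanna 0.3 this is driven by a few configuration options. The option and file names below are a sketch based on the Savanna documentation and may differ in your release; the rack paths are illustrative assumptions:

```ini
# savanna.conf (sketch)
enable_data_locality=true
compute_topology_file=/etc/savanna/compute.topology
swift_topology_file=/etc/savanna/swift.topology
```

Each topology file is expected to contain one "node topology-path" pair per line, e.g. `compute1 /rack1` in compute.topology and `swiftnode1 /rack1` in swift.topology, so that Savanna can correlate compute hosts with Swift storage nodes.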

http://youtu.be/XayHkbmjK9g

Conclusion


Data processing in virtualized Hadoop environments is probably the next step in the evolution of big data. As clusters grow, optimizing resource consumption becomes critical. Techniques such as data locality can significantly reduce network utilization and let you run large distributed clusters without losing the benefits of smaller, more local ones. This makes the scalability of a Hadoop cluster almost unlimited.

Original article in English.

Source: https://habr.com/ru/post/219177/
