File System and Hadoop: The Twitter Experience (Part 2)

Our guiding principle is that IaaS should be simple and understandable even for people who have never worked in IT. That is why we continuously optimize all of our systems and write about what we have achieved in our blog on Habr.


Today we continue our brief review of the Twitter engineering team's notes on building a file system for working with Hadoop clusters.

/ photo Mercado Viagens / CC

A related development by Twitter's engineers is the Nfly project. Its purpose is to provide a single ViewFs reference path for multiple clusters.

At its core, this resembles a virtual file system that looks uniform and makes it easier to work with different data centers. In addition to a single logical path, it allows a service to read data from the cluster /DC2/C when /DC1/C is unavailable.
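The fallback read described above can be illustrated with a minimal Python sketch. It is not Nfly code: the function `read_with_failover`, the in-memory `_FAKE_STORE`, and the file name `data.txt` are hypothetical and serve only to show the idea of trying one physical location after another behind a single logical read.

```python
from typing import Callable, Sequence


def read_with_failover(replicas: Sequence[str],
                       read: Callable[[str], bytes]) -> bytes:
    """Return the data from the first replica that can be read."""
    errors = []
    for path in replicas:
        try:
            return read(path)
        except OSError as exc:
            # Any I/O error is treated as "this replica is unavailable".
            errors.append(f"{path}: {exc}")
    raise OSError("all replicas failed: " + "; ".join(errors))


# Hypothetical in-memory stand-in for two data centers; /DC1/C is "down".
_FAKE_STORE = {"/DC2/C/data.txt": b"payload"}


def fake_read(path: str) -> bytes:
    """Simulated reader that fails for any path not present in the store."""
    if path not in _FAKE_STORE:
        raise OSError("unavailable")
    return _FAKE_STORE[path]


result = read_with_failover(["/DC1/C/data.txt", "/DC2/C/data.txt"], fake_read)
```

Here the caller asks for one file and does not need to know which data center actually served it.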

The engineers use ChRootedFileSystem. This approach replaces the root of a path, so that a path such as hdfs://dc1-A-user/user/lohit can be used as a base, which makes it possible to work with several file systems located in different data centers.
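The root substitution can be sketched roughly as follows. This is an illustration of the idea in Python, not the Hadoop class itself: `ChRootedPaths` and `full_path` are hypothetical names, and the sketch only rewrites path strings without touching any real file system.

```python
import posixpath


class ChRootedPaths:
    """Maps paths inside a chroot onto a fixed base URI (illustrative only)."""

    def __init__(self, root_uri: str):
        # Drop a trailing slash so joined paths never contain "//".
        self.root = root_uri.rstrip("/")

    def full_path(self, path: str) -> str:
        # Normalize against "/" so that ".." cannot climb above the new root.
        normalized = posixpath.normpath("/" + path.lstrip("/"))
        if normalized == "/":
            return self.root
        return self.root + normalized


dc1 = ChRootedPaths("hdfs://dc1-A-user/user/lohit")
resolved = dc1.full_path("/logs/part-0")
```

With one such object per data center, the same relative path can be resolved against each physical file system.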

A link now represents a set of file systems that are called synchronously through a ChRootedFileSystem wrapper. This is where Nfly comes in: it defines one logical path for several physical ones.

In this case, Nfly creates temporary files that are then stamped with an mtime, and the whole transaction is checked for errors. Replicas can be sorted by timestamp. This approach consumes additional computational resources but simplifies working with the system. At this stage, though, everything still looks rather complicated: you need to deal with the ViewFs settings and use Nfly.
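The write pattern can be approximated with a small Python sketch that uses local directories as stand-ins for clusters. This is an assumption-heavy illustration, not the Nfly implementation: the functions `replicated_write` and `read_most_recent`, the `.nfly_tmp_` prefix, and the choice to read the newest replica are all hypothetical.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional, Sequence


def replicated_write(roots: Sequence[Path], rel_path: str, data: bytes,
                     mtime: Optional[float] = None) -> float:
    """Stage data as temp files under every root, stamp mtime, then rename."""
    stamp = time.time() if mtime is None else mtime
    staged = []
    try:
        for root in roots:
            dest = root / rel_path
            dest.parent.mkdir(parents=True, exist_ok=True)
            fd, tmp_name = tempfile.mkstemp(dir=dest.parent, prefix=".nfly_tmp_")
            staged.append((Path(tmp_name), dest))
            with os.fdopen(fd, "wb") as handle:
                handle.write(data)
            # Give every replica the same modification time.
            os.utime(tmp_name, (stamp, stamp))
    except OSError:
        # Any failure aborts the whole operation and removes staged files.
        for tmp, _ in staged:
            tmp.unlink(missing_ok=True)
        raise
    for tmp, dest in staged:
        os.replace(tmp, dest)  # rename keeps the mtime set above
    return stamp


def read_most_recent(roots: Sequence[Path], rel_path: str) -> bytes:
    """Read the replica with the newest mtime among the roots that have it."""
    candidates = [root / rel_path for root in roots if (root / rel_path).is_file()]
    if not candidates:
        raise FileNotFoundError(rel_path)
    newest = max(candidates, key=lambda p: p.stat().st_mtime)
    return newest.read_bytes()


with tempfile.TemporaryDirectory() as base:
    dc1, dc2 = Path(base, "dc1"), Path(base, "dc2")
    replicated_write([dc1, dc2], "user/lohit/a.txt", b"v1", mtime=1000.0)
    # Simulate dc1 missing a later update.
    replicated_write([dc2], "user/lohit/a.txt", b"v2", mtime=2000.0)
    latest = read_most_recent([dc1, dc2], "user/lohit/a.txt")
```

Staging into temporary files means a reader never sees a half-written replica, and the shared timestamp gives a simple way to order copies that have drifted apart.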

From a data-handling point of view, the engineers get more flexible tools that make the processes easier to understand. For example, to get cluster X for a specific user in all data centers, you only need to specify -Dfs.nfly.mount=X.

Further work on optimizing this system aims to reduce the number of mount tables and to use the Merge FileSystem to partition the namespace without additional settings.


Source: https://habr.com/ru/post/268129/
