Our guiding principle is that IaaS should be simple and understandable even for people with no IT background. That is why we keep optimizing all of our systems and write about what we have achieved in our blog on Habr.
Today we decided to look at Western experience and briefly review a post by Twitter's engineering team in which they describe their approach to working with the file system for Hadoop clusters.
/ photo Mercado Viagens / CC
The authors describe working with ViewFs, a client-side Hadoop file system, on the clusters that Twitter runs. The job of this system is to present a single namespace that covers all data centers and clusters. Configuring ViewFs is complicated by the sheer volume of data the service handles (more than 300 petabytes). For this reason the company decided to develop its own version of the file system, TwitterViewFs.
To work with a data set of this size, federation of multiple namespaces is used. This approach makes scaling possible, but it is not easy to apply, because each namespace in the federation has its own URI. As a result, a ViewFs mount table sits behind a single logical URI.
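The idea of a mount table behind one logical namespace can be illustrated with a minimal Python sketch. It is not Twitter's or Hadoop's implementation: the table contents, host names and the `resolve` function are hypothetical, and choosing the longest matching prefix is simply a defensive choice for this example.

```python
from typing import Dict

# Hypothetical mount table: logical prefix -> physical namespace URI.
# The host names are placeholders, not real endpoints.
MOUNT_TABLE: Dict[str, str] = {
    "/user": "hdfs://namenode-a.example.com:8020/user",
    "/logs": "hdfs://namenode-b.example.com:8020/logs",
}


def resolve(path: str, table: Dict[str, str] = MOUNT_TABLE) -> str:
    """Map a logical path to a physical URI via the longest matching mount point."""
    best = None
    for prefix in table:
        # Match whole path components only: "/logs" must not match "/logsx".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no mount point covers {path!r}")
    # Keep the part of the path below the mount point unchanged.
    return table[best] + path[len(best):]
```

The client only ever sees paths such as `/user/laurent/debug.log`; which namespace actually serves them is decided by the table.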
To work with this design, a wrapper around Hadoop is used, which picks the appropriate version based on the mapping in the configuration directory. Twitter's previous solution relied on redirects to the right NameNode, but that approach did not cover all the use cases a production system needs. For that reason, all cluster configurations are now merged on the client side.
In Twitter's case, one would otherwise have to work with full-length URIs, which include a substantial amount of authorization information. The merged configuration removes the need for a huge number of different URIs.
The service's engineers decided to build a universal scheme: a path such as /path/file in cluster C1 of data center DC1 is mounted in every cluster as /DC1/C1/path/file. This means the full URI never has to be spelled out.
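The naming convention can be sketched as a pair of small Python helpers. This is only an illustration of the /DC/C/path scheme described above; the names `Location`, `to_global` and `from_global` are hypothetical and do not come from TwitterViewFs.

```python
from typing import NamedTuple


class Location(NamedTuple):
    datacenter: str
    cluster: str
    path: str  # path inside the cluster, always starting with "/"


def to_global(datacenter: str, cluster: str, path: str) -> str:
    """Build the global path for a cluster-local absolute path."""
    if not path.startswith("/"):
        raise ValueError("cluster-local path must be absolute")
    return f"/{datacenter}/{cluster}{path}"


def from_global(global_path: str) -> Location:
    """Split a global path back into data center, cluster and local path."""
    parts = global_path.split("/", 3)  # ['', DC, C, rest]
    if len(parts) < 3 or parts[0] != "" or not parts[1] or not parts[2]:
        raise ValueError(f"not a /DC/C/path style path: {global_path!r}")
    rest = "/" + parts[3] if len(parts) == 4 else "/"
    return Location(parts[1], parts[2], rest)
```

For example, `to_global("DC1", "C1", "/path/file")` gives `/DC1/C1/path/file`, and `from_global` reverses the mapping.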
To maintain this global view of all data centers, they use a special algorithm that builds the configuration automatically when the file system is initialized. The TwitterViewFs namespace is constructed in a few steps.
Once the TwitterViewFs namespace is defined, the URIs of the HA name services have to be resolved. To do this, the HDFS client configuration is combined from all hdfs-site.xml files in the corresponding directories. As a result, every cluster is reachable through the agreed /DC/C/path convention, regardless of which Hadoop cluster the client belongs to.
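Combining client configuration from several hdfs-site.xml files might look roughly like the Python sketch below. It assumes the standard Hadoop `<configuration><property><name>/<value>` layout and the standard `dfs.nameservices` property; the merge policy (union of name services, last file wins for everything else), the function names and the sample name services are assumptions made for illustration, not Twitter's actual algorithm.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable


def parse_site_xml(text: str) -> Dict[str, str]:
    """Read name/value pairs from the text of a Hadoop *-site.xml file."""
    root = ET.fromstring(text)
    props: Dict[str, str] = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            props[name.strip()] = value.strip()
    return props


def merge_configs(texts: Iterable[str]) -> Dict[str, str]:
    """Merge several configurations into one client-side view."""
    merged: Dict[str, str] = {}
    for text in texts:
        for name, value in parse_site_xml(text).items():
            if name == "dfs.nameservices" and name in merged:
                # Union of the comma-separated name service lists, order preserved.
                seen = merged[name].split(",")
                for ns in value.split(","):
                    if ns not in seen:
                        seen.append(ns)
                merged[name] = ",".join(seen)
            else:
                # Assumed policy: for any other key the later file wins.
                merged[name] = value
    return merged


# Hypothetical configurations for two clusters.
SITE_A = (
    "<configuration>"
    "<property><name>dfs.nameservices</name><value>dc1-c1</value></property>"
    "<property><name>dfs.ha.namenodes.dc1-c1</name><value>nn1,nn2</value></property>"
    "</configuration>"
)
SITE_B = (
    "<configuration>"
    "<property><name>dfs.nameservices</name><value>dc2-c1</value></property>"
    "<property><name>dfs.ha.namenodes.dc2-c1</name><value>nn1,nn2</value></property>"
    "</configuration>"
)
```

Calling `merge_configs([SITE_A, SITE_B])` yields a single dictionary in which both name services are known to the client.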
For both the local file system and HDFS, it is enough to remember a few simple, familiar commands. For example, adding these ViewFs links to the TwitterViewFs configuration:
/local/user/->file:/home/
/local/tmp->file:/${hadoop.tmp.dir}
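With these links in place, a file can be copied from HDFS to the local file system with an ordinary command: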
hadoop fs -cp /user/laurent/debug.log /local/user/laurent/
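As a rough illustration, link lines in the `link->target` form shown above could be turned into stock ViewFs properties of the form `fs.viewfs.mounttable.<table>.link.<path>`. The Python sketch below assumes that correspondence; the function name, the `default` table name and the trailing-slash normalization are choices made for this example, and how TwitterViewFs actually stores its links is not described in the post.

```python
from typing import Dict, Iterable


def links_to_properties(lines: Iterable[str], table: str = "default") -> Dict[str, str]:
    """Convert 'link->target' lines into ViewFs mount table properties."""
    props: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        link, sep, target = line.partition("->")
        link, target = link.strip(), target.strip()
        if not sep or not link or not target:
            raise ValueError(f"malformed link line: {raw!r}")
        # Drop a trailing slash from the mount point, but keep a bare "/".
        link = link.rstrip("/") or "/"
        props[f"fs.viewfs.mounttable.{table}.link.{link}"] = target
    return props


LINKS = [
    "/local/user/->file:/home/",
    "/local/tmp->file:/${hadoop.tmp.dir}",
]
```

Here `links_to_properties(LINKS)` produces one property per link, keyed by the mount point under `/local`.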