Current trends in web application development and the exponential growth of the data such applications process have created a need for file systems that provide high performance, scalability, reliability, and availability. Search industry giants such as Google and Yahoo could not stand aside from this problem.
The specifics of Google's applications and computing infrastructure, built on a huge number of inexpensive servers with their inherent constant failures, led to the development of its own proprietary distributed file system, the Google File System (GFS). The system targets automatic recovery from failures, high fault tolerance, and high throughput for streaming data access. It is designed for large volumes of data, which implies large stored files, so GFS is optimized for the corresponding operations. In particular, to simplify the implementation and increase efficiency, GFS does not implement a standard POSIX interface.
The open-source answer to GFS was the Hadoop project with its Hadoop Distributed File System (HDFS). The project is actively supported and developed by Yahoo (a team of 18 people). Let us compare the terms used in the two systems, establish their correspondence, and then focus on HDFS:
| | HDFS | GFS |
| --- | --- | --- |
| Main server | NameNode | Master |
| Slave servers | DataNode servers | Chunk servers |
| Append and Snapshot operations | - | + |
| Automatic recovery after main server failure | - | + |
| Implementation language | Java | C++ |
HDFS is the distributed file system used in the Hadoop project. An HDFS cluster primarily consists of a NameNode server and DataNode servers, which store the data itself. The NameNode server manages the file system namespace and client access to the data. To offload the NameNode, data transfer takes place only between the client and a DataNode server.

Secondary NameNode:
The main NameNode server records all transactions that change file system metadata in a log file called the EditLog. At startup, the NameNode reads the HDFS image (stored in the FsImage file) and applies to it all the changes accumulated in the EditLog. It then writes a new image with the changes applied and starts working with a clean log file. The NameNode performs this work only at startup; afterwards such checkpoint operations are delegated to the secondary NameNode server. The FsImage and EditLog are ultimately stored on the main server.
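The checkpoint cycle described above can be sketched in a few lines. This is a toy model, not Hadoop code: the function names and the representation of the image and log are invented for illustration.

```python
# Toy sketch of the FsImage/EditLog checkpoint: the namespace image is a set
# of paths, the EditLog a list of (operation, path) transactions.

def apply_op(image, op):
    """Apply one metadata transaction to the namespace image."""
    kind, path = op
    if kind == "create":
        image.add(path)
    elif kind == "delete":
        image.discard(path)
    return image

def checkpoint(fs_image, edit_log):
    """Replay the EditLog onto the FsImage, then start with a clean log."""
    for op in edit_log:
        apply_op(fs_image, op)
    return fs_image, []  # updated image, empty EditLog

image, log = {"/a"}, [("create", "/b"), ("delete", "/a")]
image, log = checkpoint(image, log)
print(sorted(image), log)  # ['/b'] []
```

In real HDFS the secondary NameNode performs this merge periodically so the EditLog does not grow without bound between restarts.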
Replication mechanism:

When the NameNode server detects the failure of one of the DataNode servers (its heartbeat messages stop arriving), the data replication mechanism starts:
- selection of new DataNode servers for new replicas
- balancing data placement on DataNode servers
Similar actions are performed if replicas become corrupted or if the number of replicas required for a block increases.
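The decision step that triggers re-replication can be illustrated with a small sketch; this is not NameNode code, and all block and node names are made up.

```python
# Find blocks whose live replica count fell below the replication target
# after a DataNode stopped sending heartbeats.
def blocks_to_replicate(blocks, live_nodes, target=3):
    """Return (block, missing_replicas) pairs that need new replicas."""
    under = []
    for block, replicas in blocks.items():
        alive = [n for n in replicas if n in live_nodes]
        if len(alive) < target:
            under.append((block, target - len(alive)))
    return under

blocks = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn1", "dn4", "dn5"}}
live = {"dn1", "dn2", "dn3", "dn5"}        # dn4 missed its heartbeats
print(blocks_to_replicate(blocks, live))   # [('blk_2', 1)]
```

The real NameNode then chooses target DataNode servers for the missing replicas and rebalances placement, as the list above describes.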
Replica placement strategy:
Data is stored as a sequence of fixed-size blocks. Copies of blocks (replicas) are stored on several servers, three by default. They are placed as follows:
- the first replica is located on the local node
- the second replica on another node in the same rack
- the third replica on an arbitrary node of another rack
- other replicas are placed in an arbitrary way.
When reading data, the client selects the DataNode server holding a replica that is closest to it.
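The default placement policy above can be sketched as a small function; this is a hedged illustration, not the actual placement code, and the rack and node names are invented.

```python
import random

# Sketch of the default policy: first replica local, second in the same
# rack, third in a remote rack.
def place_replicas(local, nodes_by_rack, count=3):
    """Pick replica targets for a block written from `local`."""
    local_rack = next(r for r, ns in nodes_by_rack.items() if local in ns)
    targets = [local]                                   # 1st: local node
    same_rack = [n for n in nodes_by_rack[local_rack] if n != local]
    if same_rack:
        targets.append(random.choice(same_rack))        # 2nd: same rack
    remote = [n for r, ns in nodes_by_rack.items()
              if r != local_rack for n in ns]
    if remote:
        targets.append(random.choice(remote))           # 3rd: another rack
    return targets[:count]

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
targets = place_replicas("dn1", racks)
print(targets)  # e.g. ['dn1', 'dn2', 'dn3']
```

The mix of racks keeps data available if a whole rack fails, while the same-rack second replica keeps write traffic mostly within one switch.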
Data integrity:
The relaxed data integrity model implemented in the file system does not guarantee that replicas are identical. HDFS therefore shifts integrity checking to the clients. When creating a file, the client computes checksums every 512 bytes, which are subsequently stored on the DataNode server. When reading a file, the client fetches both the data and the checksums; if they do not match, another replica is accessed.
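A minimal sketch of the client-side scheme, using CRC32 as a stand-in for the real checksum format (the 512-byte chunk size is from the text; everything else is illustrative):

```python
import zlib

CHUNK = 512  # the client checksums the stream every 512 bytes

def checksums(data):
    """One CRC32 per 512-byte chunk of the file."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data, sums):
    """On read, recompute and compare; a mismatch means try another replica."""
    return checksums(data) == sums

payload = b"x" * 1300                        # spans three 512-byte chunks
sums = checksums(payload)
assert verify(payload, sums)
corrupted = payload[:600] + b"?" + payload[601:]
assert not verify(corrupted, sums)           # fall back to another replica
```

Per-chunk checksums let the client pinpoint a corrupted region and re-read just that part from a different DataNode, rather than rejecting the whole file.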
Writing data:
When writing data to HDFS, an approach is used that achieves high throughput. The application writes in streaming mode, while the HDFS client caches the written data in a temporary local file. When the file accumulates enough data for one HDFS block, the client contacts the NameNode server, which registers a new file, allocates a block, and returns to the client a list of DataNode servers for storing the block replicas. The client begins sending the block data from the temporary file to the first DataNode server in the list. That server saves the data to disk and forwards it to the next DataNode server in the list. In this way the data is transferred in pipeline mode and replicated on the required number of servers. When the write is complete, the client notifies the NameNode server, which records the file creation transaction, after which the file becomes available in the system.
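The pipelined replication step can be modeled in a few lines. This is a toy simulation of the forwarding chain only, with an unrealistically tiny block size; real HDFS blocks are tens of megabytes, and the server names are invented.

```python
# Toy model of the pipeline: the client streams a block to the first
# DataNode, which stores it and forwards it to the next one in the list.
BLOCK_SIZE = 4  # tiny for the sketch; real HDFS blocks are far larger

def pipeline_write(data, datanodes):
    """Each server 'stores' every block, then passes it down the chain."""
    stores = {dn: bytearray() for dn in datanodes}
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        for dn in datanodes:          # forwarding chain: dn1 -> dn2 -> dn3
            stores[dn] += block
    return stores

replicas = pipeline_write(b"hello world!", ["dn1", "dn2", "dn3"])
assert all(bytes(v) == b"hello world!" for v in replicas.values())
```

The point of the pipeline is that the client uploads each block once; the DataNode servers fan the data out among themselves, which is what yields the high aggregate write bandwidth described above.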
Deleting data:
To preserve data integrity (and allow an operation to be rolled back), deletion in the file system follows a specific procedure. The file is first moved to a specially designated /trash directory, and only after a set time has elapsed is it physically deleted:
- the file is removed from the HDFS namespace
- the blocks associated with its data are released
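The two-step deletion can be sketched as follows; this is an illustrative model, not the real HDFS code path, and the interval here is in seconds purely for the example (in HDFS it is a configurable, much longer period).

```python
# Sketch of /trash deletion: step 1 parks the file, step 2 expires it.
TRASH_INTERVAL = 0.1  # seconds here; configurable and much longer in HDFS

def delete(namespace, trash, path, now):
    """Step 1: drop the path from the namespace and park it in /trash."""
    namespace.discard(path)
    trash["/trash" + path] = now  # remember when it was deleted

def expunge(trash, blocks, now):
    """Step 2: after the interval, remove the entry and release its blocks."""
    for tpath, deleted_at in list(trash.items()):
        if now - deleted_at >= TRASH_INTERVAL:
            del trash[tpath]
            blocks.pop(tpath[len("/trash"):], None)  # free the block list

ns = {"/data/file"}
trash = {}
blocks = {"/data/file": ["blk_1", "blk_2"]}
delete(ns, trash, "/data/file", now=0.0)
assert "/data/file" not in ns and "/trash/data/file" in trash
expunge(trash, blocks, now=1.0)
assert trash == {} and "/data/file" not in blocks
```

Until the interval expires, the file could be restored simply by moving it back out of /trash, which is what makes the rollback mentioned above possible.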
Current shortcomings:
- no automatic restart of the main server after its failure (this functionality is implemented in GFS)
- no append operation (expected in version 0.19.0) and no snapshot operation (both are implemented in GFS)
You can read about what is coming in future versions of HDFS in the project wiki on the Apache Foundation website. Additional information and the opinions of people working with Hadoop can be found in the blogs of companies actively using this technology: Yahoo, A9, Facebook, Last.fm, Laboratory.
Sources:
- Dhruba B. Hadoop Distributed File System, 2007
- Tom W. A Tour of Apache Hadoop
- Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System
- Sukhoroslov O.V. New technologies for distributed storage and processing of large data arrays
This article is introductory; its goal is to immerse the reader in the atmosphere of these developments. Given positive feedback and/or interest from readers, we will prepare a number of further related articles:
- Installing Hadoop Core + HBase on Windows (+ a PHP class that implements interaction with HBase via the REST API)
- A translation of the article "MapReduce: Simplified Data Processing on Large Clusters"