When discussing how to work with big data, the topics that come up most often are analytics and the problems of organizing the computation itself. My colleagues and I had the chance to work on a task of a different kind: accelerating data access and balancing the load on the storage system. Below I will describe how we dealt with it. We put together our own “recipe” from already existing “ingredients”: a piece of hardware and a software tool. First I will explain how the task of speeding up access arose, then look at the hardware and the software tool, and in conclusion discuss two problems we had to face along the way.
Let's start with a description of the problem.
The environment we had to optimize uses horizontally scalable network storage for data storage. If these words are unfamiliar, don't worry, I'll explain everything now :)
A horizontally scalable data storage system (“scale-out NAS”) is a cluster system consisting of a set of nodes interconnected by a high-speed internal network. All the nodes are individually accessible to the user over an external network, for example via the Internet.

The diagram shows only three nodes; in reality there can be many more. That is the beauty of scale-out systems: as soon as you need additional disk space or performance, you simply add new nodes to the cluster.
Above I said that each cluster node is accessible separately. This means that a separate network connection (or even several) can be established with each node. However, no matter which node the user connects to, they see a single file system.
In the data center, the scale-out storage looks like this (the cluster nodes are packed in stylish racks).
In our case, the system shown in the picture was used for data storage: Isilon from EMC. It was chosen for its almost unlimited scalability: a single cluster can provide up to 30 petabytes of disk space, and from the outside all of that space is visible as a single file system.
The problem we had to solve was related to the specific way Isilon was used. In the environment we were optimizing, data is accessed through a data management system. I will not go into detail here, because that is a separate big topic; I will only describe the consequences of this approach. Moreover, I will simplify the overall picture considerably to concentrate only on the things that matter most for what follows.
A simplified picture of data access in our environment is as follows:

Multiple clients access a data management system running on a dedicated server. Clients do not read or write data on Isilon directly; this is done only through the management system, which can potentially perform some manipulation on the data, for example encrypt it.
In the diagram, the management system server communicates with only one node of the storage system. And that is exactly what we had: the flow of numerous client requests went to a single storage node. As a result, the load on the cluster could be very unbalanced whenever the other nodes were not loaded by other servers or clients.
Isilon, generally speaking, provides excellent facilities for automatic load balancing. For example, when a server tries to establish a connection with Isilon, it will be served by the node that is least loaded at that moment. Of course, for such balancing to work, Isilon has to be configured and used accordingly.
However, automatic load balancing on the storage side is possible only at the level of network connections. For example, if a large number of “greedy” connections pile up on some cluster node, the storage system can scatter them across less loaded nodes. But against a single heavily loaded connection, the storage system is powerless.
Now a few words about what that single highly loaded connection we had to relieve actually was. It was simply an NFS mount. If you are not familiar with NFS, look under the spoiler.
NFS: In Unix there is the concept of a virtual file system (VFS), a generic interface for accessing data through which specific file systems are reached. In effect, the file systems of various devices are grafted into the local file system and appear to the user as part of it. Behind this interface, at the lower level, can sit the file system of a floppy disk or a remote file system accessed over the network. One example of such a remote file system is NFS.
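To illustrate that a mounted NFS export looks like an ordinary part of the local tree, here is a minimal sketch. The mount point /mnt/nfs_share and the file name are hypothetical, and the mount itself is assumed to have been set up by an administrator beforehand.

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Hypothetical mount point: we assume an NFS export has already been
    // mounted here with the usual system tools.
    const std::string path = "/mnt/nfs_share/example.txt";

    // To the program, the remote file is indistinguishable from a local one:
    // it is opened and read through the same VFS interface as any other file.
    std::ifstream in(path);
    if (!in) {
        std::cerr << "cannot open " << path << '\n';
        return 1;
    }

    std::string line;
    while (std::getline(in, line)) {
        std::cout << line << '\n';
    }
    return 0;
}
```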
Now that the problem is clear, it's time to tell how we solved it.
As I said, we were helped by a piece of hardware and a software solution designed for working with big data. The hardware is the same Isilon. And we were very lucky that two and a half years earlier one interesting capability had been added to it; without it, dealing with load balancing would have been much harder. The capability in question is support for the HDFS protocol, and the second ingredient of our recipe is based on it.
If you are not familiar with this abbreviation or with the technical side of the question, take a look under the spoiler.
HDFS: HDFS is a distributed file system that is part of Hadoop, a platform for developing and running distributed programs. Hadoop is now widely used for big data analytics.
A classic Hadoop-based computing setup is a cluster consisting of compute nodes and data nodes. Compute nodes perform distributed computations, loading data from and storing data on the data nodes. Both kinds of nodes are logical components of the cluster rather than physical ones. For example, one compute node and several data nodes can be deployed on a single physical server, although the most typical situation is two nodes, one of each kind, running on the same physical machine.
Compute nodes communicate with data nodes precisely via the HDFS protocol. The intermediary in this communication is the HDFS file system directory, represented in the cluster by a node of yet another type: the name node. Setting aside irrelevant caveats, we can assume there is exactly one directory node in the cluster.
Data nodes store blocks of data. The directory, among other things, keeps track of how the blocks belonging to each file are distributed across the data nodes.
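Roughly, the directory keeps a mapping from files to their blocks and from blocks to data nodes. Below is a minimal sketch of that bookkeeping; the types are purely illustrative and are not taken from the real name node code.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Illustrative simplification of the directory's bookkeeping.
struct BlockLocation {
    std::uint64_t block_id;                 // which block of the file
    std::vector<std::string> data_nodes;    // data nodes holding copies of it
};

// Maps a file path to the ordered list of its blocks and their locations.
using DirectoryIndex = std::map<std::string, std::vector<BlockLocation>>;
```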
The process of placing a file in HDFS, seen from the client's side, looks roughly like this (a code sketch of the same flow follows the list):
- the HDFS client asks the directory to create a file;
- if everything is fine, the directory reports that the file has been created;
- when the client is ready to write the next block of the file, it again turns to the directory and asks for the address of the data node to which the block should be sent;
- the directory returns the corresponding address;
- the client sends the block to that address;
- the successful write is acknowledged;
- when the client has sent all the blocks, it can ask for the file to be closed.
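Here is a toy sketch of that flow. It is a simulation only: Directory stands in for the name node, the function names are invented for the sake of the example, and nothing here is a real HDFS client API.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Toy model of the client-side write flow from the list above.
struct Directory {                             // stands in for the name node
    bool create_file(const std::string& path) {
        std::cout << "directory: created " << path << '\n';
        return true;                           // steps 1-2: file entry created
    }
    std::string next_block_target(const std::string& path) {
        // steps 3-4: the directory picks a data node for the next block
        return "datanode-1:50010";             // address is illustrative
    }
    void close_file(const std::string& path) {
        std::cout << "directory: closed " << path << '\n';   // step 7
    }
};

bool send_block(const std::string& data_node, const std::vector<char>& block) {
    // steps 5-6: stream the block to the chosen data node, wait for the ack
    std::cout << "sent " << block.size() << " bytes to " << data_node << '\n';
    return true;
}

int main() {
    Directory dir;
    const std::string path = "/demo/file.bin";
    std::vector<std::vector<char>> blocks(3, std::vector<char>(128 * 1024));

    if (!dir.create_file(path)) return 1;
    for (const auto& block : blocks) {
        const std::string target = dir.next_block_target(path);
        if (!send_block(target, block)) return 1;
    }
    dir.close_file(path);
    return 0;
}
```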
Of course, the HDFS interface was not added to Isilon so that my colleagues and I could use it to balance the load on the storage system. If you are wondering what HDFS was implemented in Isilon for, look under the next spoiler.
Native HDFS support: The HDFS interface was added to Isilon so that the storage system could be used with Hadoop directly. What does that give? See the diagrams below. The first shows one of the typical ways of organizing a Hadoop cluster (not all the node types that exist in a cluster are depicted).
A worker node is a server that combines the roles of computation and data storage. A data node is a server that only stores data. Next to all the servers, “fat” disks are drawn; they hold the data that HDFS manages.
Why is a storage system shown in the picture? It is the production data store. It holds the files that flow into the production environment during day-to-day business operations. Usually these files reach the storage system over some widely adopted protocol, for example NFS. If we want to analyze them, we have to copy the files (do the staging) into the Hadoop cluster. When we are talking about many terabytes, staging can take many hours.
The second picture shows what changes when a storage system with HDFS support is used in the environment. The big disks disappear from the servers. In addition, the servers whose only job was providing access to data are removed from the cluster. All disk resources are now consolidated in a single storage system, and there is no need to do staging: analytical computations can now run directly on the production copies of the files.
Data can still be written to the storage system over NFS and read back over HDFS. If for some reason computations cannot be run on the production copies of the files, the data can be copied within the same storage system. I will not list all the benefits of this approach; there are many, and a lot has already been written about them in English-language blogs and news feeds.
Instead, I will say a few words about what working with Isilon's HDFS interface looks like from the client side. Each cluster node can act as an HDFS data node (unless forbidden by the settings). More interestingly, and unlike in “real” HDFS, each node can also play the role of the directory (name node). Keep in mind that under the hood Isilon's HDFS has almost nothing in common with the Hadoop implementation of HDFS: HDFS is reproduced in Isilon only at the interface level. All the internals are original and very efficient. For example, data protection relies on Isilon's own space-efficient and fast technologies instead of the replication along a chain of data nodes implemented in “standard” HDFS.
Now let's see how HDFS helped us cope with load balancing on Isilon. Let's return to the example of writing a file to HDFS that we walked through above in the spoiler. What do we get in the case of Isilon?
To add another block to a file, the client must contact the directory to find out the address of the data node that will receive the block. With Isilon, any cluster node can be addressed as the directory, either directly via the node's address or through a special service that balances connections. The address that the directory returns corresponds to the node that is least loaded at that moment. It turns out that when sending blocks over HDFS, you always hand them to the freest nodes. In other words, you automatically get very fine, granular balancing: at the level of individual elementary operations rather than at the level of a mount, as with NFS (a small sketch of this effect follows).
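A minimal simulation of why this evens the load out: if the directory answers every per-block request with the currently least-loaded node, consecutive blocks naturally spread across the cluster. The node names and the load model here are made up purely for illustration.

```cpp
#include <algorithm>
#include <iostream>
#include <map>
#include <string>

int main() {
    // Pretend cluster: three nodes with some pre-existing load.
    std::map<std::string, int> load = {
        {"node-1", 40}, {"node-2", 10}, {"node-3", 25}};
    std::map<std::string, int> blocks_per_node;

    for (int block = 0; block < 100; ++block) {
        // The directory answers each request with the least-loaded node.
        auto target = std::min_element(
            load.begin(), load.end(),
            [](const auto& a, const auto& b) { return a.second < b.second; });
        blocks_per_node[target->first]++;
        target->second++;   // sending a block adds a little load to that node
    }

    // With a single NFS mount all 100 blocks would hit one node; here the
    // counts end up spread across the cluster.
    for (const auto& [node, count] : blocks_per_node) {
        std::cout << node << ": " << count << " blocks\n";
    }
    return 0;
}
```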
Having noticed this, we decided to use HDFS as a standalone interface. “Standalone” here means that the interface is used in isolation from Hadoop. This may be the first example of its kind; at least, so far I have not heard of HDFS being used separately from Hadoop and the products around it.
As a result, we “bolted” HDFS onto our data management system. Most of the problems we had to solve along the way were on the side of the management system itself. I will not discuss them here, because that is another large topic, tied to the specifics of a particular system. But I will describe two smaller issues that come with using HDFS as a standalone file system.
The first problem is that HDFS is not packaged as a separate product; it is distributed as part of Hadoop. Consequently there is no “HDFS standard” or “HDFS specification”. In practice, HDFS exists as the reference implementation from Apache. So if you want to know implementation details (for example, the policy for acquiring and releasing leases), you are left with reverse engineering, reading the source code, or finding people who have already done this before you.
The second problem is finding a low-level library for HDFS.
After a superficial web search it may seem that there are many such libraries. In reality, however, there is one reference library, the Apache Java one. Most of the other libraries for C++, C, Python and other languages are just wrappers around that Java library.
We could not take the Java library for our C++ project, even with an appropriate wrapper. First, dragging a Java runtime onto the server, on top of the data management system and our small HDFS module, would have been an impermissible luxury. Second, there are complaints on the internet about the performance of the Java library.
So if we had not found a ready-made C++ library for HDFS, we would have had to write our own, and that means extra time for reverse engineering. Fortunately, we found a library.
Last year (and perhaps even earlier) the first native libraries for HDFS began to appear. At the moment I know of two, for C and for Python: Hadoofus and Snakebite. Perhaps something else has appeared since; I have not repeated the search in a long time.
For our project we took Hadoofus. In all the time we have used it, we found only two bugs. The first, a simple one, meant that the library would not build with a C++ compiler. The second is nastier: a deadlock under multi-threaded use. It manifested itself very rarely, which complicated the analysis of the problem. To date, both bugs have been fixed, although we are still working on fully verifying that the deadlock is gone.
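Until we have convinced ourselves that the deadlock is fully gone, one defensive measure we can take is to serialize all calls into the library behind a single mutex. Below is a minimal sketch of that idea; RawHdfsClient is only a placeholder for the real library handle (it is not a Hadoofus API), and the price of this approach is giving up concurrency on the HDFS path.

```cpp
#include <mutex>
#include <string>
#include <vector>

// Placeholder for the real low-level HDFS client handle; the two stub methods
// only mark where the actual library calls would go.
struct RawHdfsClient {
    std::vector<char> read(const std::string& path) { return {}; }
    void write(const std::string& path, const std::vector<char>& data) {}
};

// Serializes every call into the underlying client, so the library is never
// entered from two threads at once.
class SerializedHdfsClient {
public:
    std::vector<char> read(const std::string& path) {
        std::lock_guard<std::mutex> lock(mutex_);
        return raw_.read(path);
    }
    void write(const std::string& path, const std::vector<char>& data) {
        std::lock_guard<std::mutex> lock(mutex_);
        raw_.write(path, data);
    }

private:
    std::mutex mutex_;
    RawHdfsClient raw_;
};

int main() {
    SerializedHdfsClient client;
    client.write("/demo/a.bin", std::vector<char>(16, 0));
    std::vector<char> data = client.read("/demo/a.bin");
    (void)data;
    return 0;
}
```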
We did not have to solve any other problems associated with the use of HDFS.
In general, it should be noted that writing an HDFS client for Isilon is simpler than writing a client for “standard” HDFS. Undoubtedly, any “standard” HDFS client will work seamlessly with Isilon. The reverse does not have to be true. If you are writing an HDFS client exclusively for Isilon, the task is simplified.
Consider an example. Suppose you need to read a block of data from HDFS. To do this, the client contacts the directory and asks which data nodes the block can be taken from. In general, in response to such a request the directory returns the addresses of not one node but several, each holding a copy of the block. If the client fails to get a response from the first node in the list (for example, the node has gone down), it turns to the second, the third, and so on, until it finds a node that responds.
With Isilon you do not need to think about such scenarios. Isilon always returns the address of a single node that will serve you. This does not mean that Isilon nodes cannot go down; after all, you can always take a node out with an axe. However, if Isilon loses a node for whatever reason, it simply transfers that node's address to another, surviving node in the cluster. So the fault-tolerance scenario is largely built into the hardware, and you do not need to implement it in full in the software.
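The difference can be sketched in code. Both functions below are illustrative stubs rather than a real client: the generic case has to walk a list of replica locations, while the Isilon case can simply trust the single address it was given, because the cluster moves that address to a surviving node on failure.

```cpp
#include <iostream>
#include <optional>
#include <string>
#include <vector>

// Stub: a real client would speak the HDFS data-transfer protocol here.
std::optional<std::vector<char>> fetch_block_from(const std::string& node) {
    std::cout << "fetching block from " << node << '\n';
    return std::vector<char>(128);   // pretend the node answered
}

// "Standard" HDFS: the directory returns several replica locations, and the
// client falls back to the next one if a node does not answer.
std::optional<std::vector<char>> read_block_generic(
        const std::vector<std::string>& replicas) {
    for (const auto& node : replicas) {
        if (auto data = fetch_block_from(node)) return data;
    }
    return std::nullopt;   // every replica failed
}

// Isilon: the directory returns exactly one address; failover is the
// cluster's job, so no fallback loop is needed on the client side.
std::optional<std::vector<char>> read_block_isilon(const std::string& node) {
    return fetch_block_from(node);
}

int main() {
    read_block_generic({"datanode-1", "datanode-2", "datanode-3"});
    read_block_isilon("isilon-node");
    return 0;
}
```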
With that, the story of our “recipe” can be considered complete. It only remains to add a few words about the results.
The amortized performance gain, compared with working over NFS, is about 25% in our environment. This figure was obtained by comparing the system with itself: in both cases performance was measured on the same hardware and the same software, and the only thing that differed was the file system access module.
If we consider only read operations, the 25% gain is also observed when transferring each individual file. For writes, we can only speak of an amortized gain: writing each individual file is slower than over NFS. There are two reasons for this:
- HDFS does not support multi-threaded writing of a file;
- our data management system has peculiarities that, because of the HDFS limitation above, prevent fast writing of an individual file.
If files were organized more optimally in the data management system, a 25% gain could be expected for writing an individual file as well.
I will note that the slowdown in writing each individual file did not upset us much, because what matters most to us is throughput at peak loads. Besides, in environments like ours, reading data is a much more frequent operation than writing.
In conclusion, here is an illustration that gives an idea of how the load on Isilon changes when HDFS is used as the interface.
The screenshot shows the cluster load while transferring a 2 GB file in both directions (the file was uploaded and downloaded 14 times in a row). The tall blue peak on the left corresponds to working over NFS: reads and writes go through a single mount, and in that case one cluster node takes on the entire load. The multi-colored low peaks on the right correspond to working over HDFS. You can see that the load is now “smeared” across all the nodes of the cluster (three of them). That is about all. May everything always work quickly and reliably for you!