Today many companies deal with huge amounts of data. No, I am not talking about Big Data patterns here, but simply about the fact that a dozen or so terabytes on the servers of a single company surprises no one anymore. And many go further - hundreds of terabytes, petabytes, dozens of petabytes... Of course, it is great when your data and processing tasks fit the MapReduce ideology, but far more often all this data is either “just files”, or virtual machine volumes, or data that is already structured and sharded. In such cases the company arrives at the idea that it needs to deploy a storage system.
Systems like OpenStack also add to the popularity of storage today - it is nice to manage your servers without worrying that a disk has failed in one of them or that one of the racks has lost power, and without worrying that the hardware of the one Most Important Server is outdated and that upgrading it means degrading your services to the minimum level. Of course, such situations may be the result of a design error, but let's be honest - we are all capable of making such errors.
As a result, the company faces a difficult choice: either build a storage system on its own from open-source software (Ceph, MooseFS, HDFS - there is plenty to choose from with minimal integration costs, but you have to spend time on design and deployment), or buy a ready-made proprietary storage system and spend time and effort on integrating it (with the risk that the system will eventually hit the limit of its capacity or performance).
But what if we take Ceph as a basis (it is hard to come up with a data storage task it cannot handle), enlist the support of a Ceph vendor (for example, Inktank, the company that created it), take modern servers with a large number of SAS disks, write a web management interface, and add extra features for efficient deployment and monitoring... It sounds tempting, but difficult for an average company, especially if it is not an IT company.
Fortunately, Fujitsu has already taken care of all of this with the ETERNUS CD10000 - the first enterprise storage system based on Inktank Ceph Enterprise - which we will introduce today.
The ETERNUS CD10000 itself is a constructor made of modules. The modules are x86 servers with Linux, Ceph Enterprise, and Fujitsu's own developments installed. This design lets you get the required storage capacity now and expand it gradually later. There are two types of modules - data modules and metadata modules (or rather, management nodes).
Storage servers are currently available in three models:
Basic (12.6 TB in one module, 1 SSD for cache, 2U)
Performance (34.2 TB in one module, 2 SSDs for cache, 4U)
Capacity (252.6 TB in one module, 1 SSD for cache, 6U)
Basic and Performance nodes are equipped with 2.5-inch SAS drives, while a Capacity module can hold up to 14 SAS drives and 60 SATA drives at the same time. The storage nodes communicate with each other over InfiniBand - this covers replication, recovery of lost copies of blocks, and the exchange of other service information. At any time you can install additional storage servers and thereby expand the total capacity - Ceph Enterprise will redistribute the load across the nodes and disks. Up to 224 data servers can be installed in total. Today that is about 56 petabytes, but disk sizes keep growing, and the software itself is in theory limited to an exabyte per storage cloud. A nice property here is that new-generation servers can be added to an ETERNUS alongside servers of previous generations (and they can work together). Obsolete storage nodes can eventually simply be unplugged from the outlet - Ceph will replicate the missing data to the remaining nodes without additional intervention.
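For reference, the quoted maximum is simple arithmetic over the figures above (decimal terabytes assumed):

```python
# Quick sanity check of the quoted maximum raw capacity:
# 224 data nodes, each a Capacity module with 252.6 TB on board.
nodes = 224
tb_per_node = 252.6
total_pb = nodes * tb_per_node / 1000   # decimal units, TB -> PB
print(f"{total_pb:.1f} PB")             # ~56.6 PB, i.e. "about 56 petabytes"
```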
Management nodes store the logs and events that occur in the storage. It is recommended to install two such servers, but the system can keep working even if no management node is available.
The CD10000 has a web-based interface that lets you perform most storage operations and view the status of individual nodes or of the storage as a whole. The classic CLI, familiar to administrators who have worked with Ceph directly, has not gone anywhere either. People should have no problems “communicating” with this system.
Now about how the ETERNUS can “talk” to other servers. Starting with the hardware: each storage server is connected to the regular network with 10-gigabit interfaces - to be precise, dual-port PRIMERGY 10Gb Modular LAN Adapter cards (with an Intel 82599 chip inside). They are unlikely to become a bottleneck.
At the software level, practically any fantasy a user of such a product might have is also catered for. There are 4 interfaces available to storage clients:
librados (for direct interaction with the storage from applications written in C/C++/Java/Python/PHP/Ruby via ready-made libraries; a short usage sketch follows right after this list)
Ceph Object Gateway (RGW - here you will find a REST API compatible with Amazon S3 and Swift)
RBD (Ceph Block Device - block devices that can be attached to virtual machines and servers)
CephFS (POSIX-compatible network file system, with drivers for FUSE)
Upon a separate customer request, an additional interface from the standard set can be made available in their installation.
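To give a feel for the librados path, here is a minimal Python sketch - not a definitive recipe for the CD10000 itself; the pool name, object name, and config path are placeholders, and it assumes the python-rados bindings and a reachable cluster:

```python
# Minimal librados sketch: connect, write an object, read it back.
# Assumes the python-rados bindings and a reachable Ceph cluster;
# the pool name, object name and config path are placeholders.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('demo-pool')        # hypothetical pool name
    try:
        ioctx.write_full('hello-object', b'Hello, ETERNUS CD10000!')
        data = ioctx.read('hello-object')          # returns bytes
        print(data.decode())
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```

The Object Gateway path is even simpler from the client's point of view: since its REST API is S3- and Swift-compatible, any S3-compatible SDK can simply be pointed at it.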
The heart, brain, and soul of the Fujitsu ETERNUS CD10000 is Ceph Object Storage (RADOS) - it handles load distribution across nodes and disks, block replication, recovery of lost replicas, and re-clustering of the storage; in short, everything related to performance and reliability. Classic RAID arrays are not used here at all. It is hard to imagine how long a rebuild of a single array of dozens of 6 TB disks would take - and how often it would happen.
And what if there are several thousand disks? RADOS handles disk failure faster - it does not need to re-read the surface of every block of an array (including empty ones, as, say, mdadm would). It only needs to make additional copies of the blocks that were stored on the disk removed from the storage. A failed storage node is handled the same way - RADOS finds the blocks whose number of replicas no longer matches the storage settings. Of course, replicas of the same block are never stored on the same node. For each data set, the block size, the number of copies (replicas) of each block, and the type of media those replicas should live on (slow, fast, or very fast) are defined at the software level. For data sets whose main requirement is economy rather than access speed, you can create a storage tier reminiscent of RAID 6 - keeping several full copies of the data may be too expensive even in a multi-petabyte system.
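As a hedged sketch of what “the number of replicas is defined at the software level” means in plain Ceph terms: the replica count is a per-pool setting, and the corresponding monitor commands can be issued through the python-rados binding. The pool name, placement-group count, and replica count below are illustrative, and on a CD10000 you would normally do this from the web interface or the CLI rather than from code:

```python
# Hedged sketch: per-pool replication policy via Ceph monitor commands,
# issued through the python-rados binding. Pool name, pg_num and the
# replica count are illustrative values, not a recommendation.
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    commands = (
        # Create a pool...
        {"prefix": "osd pool create", "pool": "fast-pool", "pg_num": 64},
        # ...and ask for three copies of every object stored in it.
        {"prefix": "osd pool set", "pool": "fast-pool", "var": "size", "val": "3"},
    )
    for cmd in commands:
        ret, outbuf, outs = cluster.mon_command(json.dumps(cmd), b'')
        print(ret, outs)
finally:
    cluster.shutdown()
```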
Inside RADOS, the CRUSH (Controlled Replication Under Scalable Hashing) algorithm is used: a hierarchical tree is built from equipment at different levels (racks, servers, disks) containing information about available capacity, location, and disk availability. Based on this tree, RADOS decides where to store the required number of copies of each block. By the way, the system administrator can edit this tree manually. The same algorithm also removes the need for a single registry of where to look for a block - any “iron” participant in RADOS can answer a request for any data block, which spares us yet another single point of failure in the storage.
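CRUSH itself is considerably more involved, but its key property - every participant can compute, from the object name and a shared map of the hierarchy alone, where the replicas must live, with no central lookup table - can be illustrated with a deliberately simplified Python sketch. This is essentially rendezvous hashing over a toy two-level tree, not the real algorithm:

```python
# Deliberately simplified, CRUSH-like placement: given only the object
# name and a shared description of the hierarchy (racks -> hosts), every
# client computes the same replica placement with no central lookup.
# The tree and the selection rule are toy stand-ins, not the real CRUSH.
import hashlib

HIERARCHY = {                       # rack -> hosts (the shared cluster "map")
    "rack1": ["node01", "node02"],
    "rack2": ["node03", "node04"],
    "rack3": ["node05", "node06"],
}

def _score(obj_name: str, item: str) -> int:
    """Deterministic pseudo-random weight of an item for this object."""
    digest = hashlib.sha256(f"{obj_name}/{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def place(obj_name: str, replicas: int = 3) -> list:
    """Pick one host from each of the highest-scoring racks, so no two
    replicas of the same object ever share a rack (or a host)."""
    racks = sorted(HIERARCHY, key=lambda r: _score(obj_name, r), reverse=True)[:replicas]
    return [max(HIERARCHY[r], key=lambda h: _score(obj_name, h)) for r in racks]

print(place("volume-42/block-007"))   # same answer on every client
```

The point of the sketch is only that placement is a pure function of the object name and the shared map, so no single node ever holds the only knowledge of where data lives.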
A nice feature worth noting is the ability of the Fujitsu ETERNUS CD10000 to work across several data centers. True, the speed of light cannot be fooled - the cluster sites should not be placed more than 80 kilometers of fiber apart (which, incidentally, still allows a cluster to span two different cities of the Moscow region), otherwise the storage may misbehave due to high RTT. In this case the system runs in a split-site configuration, but it still remains a single storage system with a single data set inside.
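The 80-kilometer figure is easy to translate into latency: light in optical fiber travels at roughly 200,000 km/s, so a back-of-the-envelope estimate (ignoring equipment and protocol overhead) looks like this:

```python
# Rough propagation-delay estimate for an 80 km fiber span.
# Light in fiber travels at roughly 200,000 km/s (about 2/3 of c);
# switches, transponders and the protocol stack add latency on top.
distance_km = 80
speed_km_per_ms = 200_000 / 1000          # ~200 km per millisecond
one_way_ms = distance_km / speed_km_per_ms
print(f"one-way ~{one_way_ms:.2f} ms, RTT ~{2 * one_way_ms:.2f} ms")
# one-way ~0.40 ms, RTT ~0.80 ms per fiber span
```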
As a result, we have a storage system that integrates easily into any infrastructure, is fault-tolerant and reasonably reliable, is built on high-quality Fujitsu hardware, scales easily to different data volumes and performance requirements, is free of performance bottlenecks, and comes with enterprise-product status and technical support from a global company with rich experience.