
NetApp StorageGRID object storage

In this article I depart from my usual topic of FAS storage systems and turn to object storage in NetApp StorageGRID Webscale systems. In short, object storage is the third type of storage alongside NAS and SAN. Every file consists of data and metadata (owner, permissions, modification time and so on); object storage separates these parts and stores them as key/value pairs. This approach makes decentralized, distributed storage possible at huge scale, with transparent data migration, replication, and transparent switching of end users between the nodes of an object cluster.

In a broad sense, object storage can be implemented either at the device level (a hard disk), using specialized SCSI commands (Object-based Storage Device Commands), or at the level of the access protocol of a storage system that consists of many disks (which, in turn, do not have to be object devices themselves). In both cases Ethernet is used for connectivity and IP for data transfer. An example of object storage at the device level is the hard drives of the Seagate Kinetic Open Storage platform. Examples of object storage in the cloud are Microsoft Azure Blob Storage and Amazon S3. In this article I focus on object storage systems that can be deployed on your own site and, if necessary, connected to the cloud. The object protocols S3, Swift and CDMI have become widely popular; all of them are layered on top of HTTP.
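The split between payload and metadata described above can be illustrated with a minimal in-memory sketch. `ToyObjectStore`, its method names and the sample key are hypothetical and exist only for illustration; this is not the StorageGRID API or any real S3 client.

```python
import hashlib
import time
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes                                    # the payload itself
    metadata: dict = field(default_factory=dict)   # owner, mtime, etc.


class ToyObjectStore:
    """Hypothetical in-memory store: a flat key -> object mapping."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, **metadata):
        # Metadata travels with the object, not with a directory entry.
        metadata.setdefault("modified", time.time())
        metadata["sha256"] = hashlib.sha256(data).hexdigest()
        self._objects[key] = StoredObject(data, metadata)

    def get(self, key):
        return self._objects[key].data

    def head(self, key):
        # Metadata can be read without touching the payload.
        return dict(self._objects[key].metadata)


store = ToyObjectStore()
store.put("scans/patient-42/2016-03.dcm", b"...image bytes...", owner="radiology")
print(store.head("scans/patient-42/2016-03.dcm")["owner"])
```

The key looks like a path, but the namespace is flat: there are no directories, only keys, which is what makes it easy to spread objects across nodes and sites.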



History


Initially, StorageGRID was developed for healthcare, where storing millions of large and small objects required a specialized solution. Large medical equipment manufacturers such as Siemens and AGFA, as well as other major PACS systems, support sending objects directly to StorageGRID. This made possible scenarios that file storage could not previously handle, for example when a doctor needs a patient's data for the last 10 years even though the patient has moved from Minnesota to Los Angeles. StorageGRID is still in high demand in healthcare, but it can also be used in cloud solutions that store a wide variety of data.

The NetApp StorageGRID family has two members:
  1. Software only: NetApp StorageGRID Webscale
  2. The NetApp StorageGRID appliance based on E-Series: the SG5660.

Both of these options can coexist in the same cluster.

The appliance option is a dual-controller system in which one controller is the Storage node and the other is the Compute node. That is, there are two controllers in the chassis, but this is not a high-availability system by itself: at least two such SG5660 systems (four controllers) are recommended for fault tolerance. For the hardware solution, in addition to the SG5600 appliances you must also have at least two servers to host the Gateway and Admin nodes.
The software option can be delivered as a virtual appliance for ESXi or as a Docker image based on Debian Linux. In this variant you can use a single regular E-Series system with two controllers, high availability and the standard SANtricity OS, plus at least two servers on top of it running the Storage, Gateway and Admin nodes. The software version needs more server capacity because the Storage node, the most resource-hungry one, has to run on the server.


The growth of unstructured data is steadily gaining momentum. Examples of such data generators are the emerging IoT market, photo and video cameras with unprecedented resolution and frame quality, medical equipment and other devices. StorageGRID was developed from scratch to address the following use cases:

Web data repositories

Data archives

Media repositories


StorageGRID main features


StorageGRID lets you manage geo-distributed unstructured data. A single control panel and management policies that span every site where cluster nodes are located allow StorageGRID to move data to where it is needed. Tape libraries and RESTful HTTP-based protocols such as CDMI, S3 and Swift are supported, which lets the system integrate with cloud providers. Data can move seamlessly between all tiers: local storage, cloud and tape libraries.
[Figure: StorageGRID web GUI]



The advantages of the StorageGRID platform include:



Erasure coding


Almost all object storage systems can store multiple copies of a single object (replication), duplicating data on different nodes and sites and thereby providing fault tolerance. Erasure Coding (EC) is a mechanism similar to RAID, but it works at the level of an individual object, which is split into several fragments, rather than at the level of whole hard drives. EC provides fault tolerance while consuming significantly less storage space than replication.
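To make the RAID analogy concrete, here is a deliberately simplified sketch: an object is split into fragments and a single XOR parity fragment is added, so any one lost fragment can be rebuilt. This is an assumption made for illustration only. Production erasure coding typically uses codes such as Reed-Solomon that tolerate several simultaneous losses, and the article does not describe StorageGRID's internal algorithm.

```python
def split_into_fragments(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal-size fragments, zero-padding the tail."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(k)]


def xor_parity(fragments: list[bytes]) -> bytes:
    """XOR all fragments together byte by byte."""
    parity = bytearray(len(fragments[0]))
    for frag in fragments:
        for i, byte in enumerate(frag):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single lost fragment from the survivors and the parity."""
    return xor_parity(surviving + [parity])


obj = b"object payload stored on the grid"
fragments = split_into_fragments(obj, k=4)
parity = xor_parity(fragments)

# Simulate losing one fragment (for example, a failed node) and rebuild it.
lost = 2
survivors = fragments[:lost] + fragments[lost + 1:]
rebuilt = recover_missing(survivors, parity)
print(rebuilt == fragments[lost])
```

With four data fragments and one parity fragment the object occupies 1.25 times its size, whereas keeping two full copies would occupy 2 times its size.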


Geo-EC

Geo-distributed Erasure Coding is EC in which the fragments of an object that make up such a “RAID group” can reside on systems located in different parts of the world. Storing two or three full copies of the data at different sites achieves very high availability, but it generates a corresponding amount of traffic and consumed space. Geo-distributed Erasure Coding comes to the rescue here: fault tolerance and availability are preserved while the consumed space is significantly reduced. The following EC schemes are available:


Erasure Coding saves disk space on the one hand, but on the other it adds overhead for calculating checksums and reconstructing the object. With Geo-EC, reading an object takes even longer, because the read is performed from two sites. That is, EC has to be used wisely, and I explain how in the ILM section below.
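The space trade-off between replication and EC comes down to simple arithmetic, sketched below. The 6+3 layout and the 100 TB figure are illustrative assumptions, not schemes or numbers taken from the article.

```python
def replication_overhead(copies: int) -> float:
    """Raw capacity consumed per unit of user data with full copies."""
    return float(copies)


def ec_overhead(data_fragments: int, parity_fragments: int) -> float:
    """Raw capacity consumed per unit of user data with a k+m EC layout."""
    return (data_fragments + parity_fragments) / data_fragments


user_data_tb = 100  # hypothetical amount of user data

replicated_tb = user_data_tb * replication_overhead(3)   # three full copies
erasure_coded_tb = user_data_tb * ec_overhead(6, 3)      # hypothetical 6+3 layout

print(f"3 copies: {replicated_tb:.0f} TB raw")
print(f"6+3 EC:   {erasure_coded_tb:.0f} TB raw")
```

Under these assumptions EC needs half the raw capacity of three-way replication, at the cost of the extra computation and read latency mentioned above.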

Hierarchical EC

StorageGRID lets you distribute data according to durability and fault-tolerance policies. Hierarchical Erasure Coding lets local EC and Geo-EC be applied automatically based on these policies. Hierarchical EC is well suited to installations with at least three sites, where it protects against the failure of an entire site.


DDP - local EC

Dynamic Disk Pools (DDP, used by StorageGRID Webscale as local EC) is a feature of the NetApp E-Series hardware. It is a kind of RAID and, like regular RAID groups, is created on a single local system. DDP avoids a performance loss when one or several disks fail locally (otherwise objects would have to be pulled from other nodes or sites), and it also saves power and network (WAN/LAN) traffic, because data access and recovery are performed locally. This functionality complements Geo-EC well.


Information Lifecycle Management


ILM in StorageGRID makes the use of disk space flexible and far more efficient through data lifecycle policies. For example, you can set up a policy so that if an object was written or accessed at least once within the last 30 days, X copies of it are kept on several different sites. If the object has not been accessed for more than 30 days, the extra copies are deleted and the object is erasure coded; in that case the longer read time is not much of a problem. And if the object has not been accessed for a year, it is sent to the cloud or to tape. It is important to note that this example works at the granularity of each individual object, not of a large dataset such as a LUN or a file share (in a SAN or NAS, respectively). If the price of resources changes, the policy is re-evaluated and the data is redistributed accordingly.
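The example policy above can be expressed as a small per-object decision function. This is only a sketch of the reasoning: the function name, the placement labels and the default of three copies (the article's "X") are assumptions, and real ILM rules are configured in StorageGRID rather than written as code like this.

```python
from datetime import datetime, timedelta

HOT_WINDOW = timedelta(days=30)       # recently written or read
ARCHIVE_AFTER = timedelta(days=365)   # untouched for a year


def choose_placement(last_access: datetime, now: datetime, copies: int = 3) -> str:
    """Pick a storage placement for ONE object based on its idle time."""
    idle = now - last_access
    if idle <= HOT_WINDOW:
        return f"replicate x{copies} across sites"
    if idle <= ARCHIVE_AFTER:
        return "erasure-code"
    return "cloud-or-tape"


now = datetime(2016, 3, 1)
for days_idle in (5, 90, 400):
    placement = choose_placement(now - timedelta(days=days_idle), now)
    print(f"idle {days_idle:>3} days -> {placement}")
```

Because the function takes a single object's access time, the decision is made per object, which is the granularity the paragraph above emphasizes.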


Durability


Durability has two parts: data integrity and data availability.
Data integrity is ensured by hash checksums that are verified when data is written, read and migrated, as well as periodically in the background. Damaged objects are transparently recreated from copies. The geo-distributed Erasure Coding mechanism saves the space needed to store copies of the data.
Data availability is ensured by a fault-tolerant architecture and by support for non-disruptive operations, including software updates and hardware upgrades of the platform. Load is distributed both during normal operation and in case of a failure. NetApp AutoSupport can automatically notify support so that problems are resolved proactively. Erasure coding at the node level improves the availability of each node and reduces recovery time, performance impact and network activity (available only on the E-Series platform with Dynamic Disk Pools).
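The integrity mechanism described above (verify a hash, then recreate damaged objects from a good copy) might look roughly like the following sketch. SHA-256, the function names and the site names are assumptions chosen for illustration; the article does not say which hash or procedure StorageGRID uses.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_and_repair(copies: dict[str, bytes], expected: str) -> list[str]:
    """Replace corrupted copies in place; return the names of repaired nodes."""
    healthy = next((d for d in copies.values() if digest(d) == expected), None)
    if healthy is None:
        raise ValueError("no intact copy left to repair from")
    repaired = []
    for node, data in copies.items():
        if digest(data) != expected:
            copies[node] = healthy   # recreate the damaged copy
            repaired.append(node)
    return repaired


original = b"patient record"
expected = digest(original)          # stored when the object is written

# One copy has silently rotted on disk.
copies = {"site-a": original, "site-b": b"patient rec0rd", "site-c": original}
repaired = verify_and_repair(copies, expected)
print(repaired)
```

Running the same check periodically, and not only on reads, is what catches silent corruption in objects that nobody has accessed for a long time.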

NAS


NAS functionality with the CIFS/NFS protocols can be implemented using a NAS bridge. This requires no changes to the existing infrastructure and gives end users standard file access. StorageGRID, in turn, can use lifecycle policies and metadata (for example, the file's creation or last modification time) to move this data transparently across storage tiers. Licenses for the NAS bridges are included with StorageGRID and do not need to be purchased separately. Integration with Active Directory and LDAP is supported. On the front end, the NAS bridge provides access via CIFS/NFS and looks like the most ordinary NAS.
On the back end, the same NAS bridge connects via an object protocol and converts files into objects, which are then stored and processed like any other objects.

Security


End-to-end encryption of each object and Secure Multi-Tenancy are supported, as are authentication and security mechanisms for S3 and CDMI. LDAP/AD integration is supported for authenticating users within a single tenant.

Production-Ready


This is a very important point: when the customer has no army of programmers and administrators, the solution has to be reliable. StorageGRID technology is more than 14 years old (the first installation was in 2001), and over that time it has accumulated a large number of integrations with other well-known products for backup, archiving, file synchronization, collaboration and so on.



StorageGRID Licensing Policy


The product is licensed per terabyte, regardless of the number and type of nodes. Hardware and software implementations of StorageGRID can coexist in the same cluster. All available functionality is included in the base package:


Conclusions


StorageGRID is a product for applications that support RESTful HTTP. It suits large and small objects, high throughput and transactional workloads, and it moves data automatically and transparently across storage tiers. Geo-clustering delivers very high resiliency and data availability, hiding the failure of entire sites. EC saves a significant amount of space by using a RAID-like architecture. StorageGRID supports multiple storage tiers and has one of the most advanced data lifecycle management mechanisms. ILM automatically moves data when the price of a particular storage tier changes, which allows resources to be used more efficiently and lets you respond flexibly to changes in storage cost (for example, in the cloud or in tape libraries). StorageGRID is an established, mature product with a long list of third-party software integrations, which simplifies support and integration into existing infrastructure. Object encryption and LDAP/AD authentication help protect against data theft.

StorageGRID can fully replace Amazon S3: it lets you lease storage to multiple companies and acts as a private cloud for those who cannot place data in a public cloud. It can also complement AWS S3, using it as a storage tier, and it has a mechanism for calculating storage costs for StorageGRID tenants.

Links to Habr articles published later may appear here.
Please report errors in the text by private message.
Comments, additions and questions about the article, on the other hand, are welcome in the comments.

Source: https://habr.com/ru/post/279743/

