In this article I depart from my usual topic of FAS storage systems and turn to object storage in NetApp StorageGRID Webscale systems. In short, object storage is the third storage type, alongside NAS and SAN. Every file consists of data and metadata (owner, permissions, modification time and so on); object storage separates these parts and stores them as key/value pairs. This approach opens the way to decentralized, distributed data storage on a huge scale, with transparent data migration, replication and transparent switching of end users between the nodes of an object cluster. In a broad sense, object storage can be implemented either at the device level (a hard disk), using specialized SCSI commands (Object-based Storage Device Commands), or at the level of the access protocol of a storage system that consists of several disks (which, in turn, do not have to be object devices themselves). In both cases Ethernet is used for connectivity and IP for data transfer. An example of object storage at the device level is the hard drives of the Seagate Kinetic Open Storage platform. Examples of object storage in the cloud are Microsoft Azure BLOB and Amazon S3. In this article I will focus on object storage systems that can be deployed on your own site and, if necessary, connected to the cloud. The object protocols S3, Swift and CDMI have gained wide popularity; all of them are layered on top of HTTP.
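To make the key/value idea more concrete, here is a minimal in-memory sketch. It is not how StorageGRID or any real object store is implemented; the class name `ToyObjectStore`, its methods and the metadata fields are all hypothetical and chosen only for illustration.

```python
import hashlib
import time
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes                                    # the payload itself
    metadata: dict = field(default_factory=dict)   # kept separately from the payload


class ToyObjectStore:
    """Hypothetical in-memory stand-in for an object store: a flat key -> object map."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, **user_metadata):
        # System metadata is derived from the payload, user metadata is passed in.
        metadata = {
            "size": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
            "modified": time.time(),
            **user_metadata,
        }
        self._objects[key] = StoredObject(data, metadata)

    def get(self, key):
        return self._objects[key].data

    def head(self, key):
        # Metadata can be read without touching the payload.
        return dict(self._objects[key].metadata)


store = ToyObjectStore()
store.put("scans/patient-42/ct-001.dcm", b"...image bytes...", owner="radiology")
print(store.head("scans/patient-42/ct-001.dcm")["owner"])  # radiology
```

Note that the key looks like a path but is just a string: there is no directory tree, which is part of what makes such a namespace easy to spread across many nodes.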

History
StorageGRID was originally developed for the healthcare sector, where storing millions of large and small objects required a specialized solution. Large medical equipment manufacturers such as Siemens and AGFA, as well as other large PACS systems, support sending objects directly to StorageGRID. This approach enabled scenarios that were not previously possible with file storage, for example when a doctor needs a patient's data for the last 10 years although the patient has moved from Minnesota to Los Angeles. StorageGRID is still in high demand in the healthcare industry, but it can also be used in cloud solutions for storing a wide variety of data.
The NetApp StorageGRID family has two members:
- Software only: NetApp StorageGRID Webscale
- NetApp StorageGRID appliance based on E-Series: SG5660.
Both of these options can coexist in the same cluster.

The appliance option is a dual-controller system in which one controller is the Storage node and the other is the Compute node. That is, there are two controllers in the chassis, but the system is not highly available by itself: at least 2 such SG5660 systems (i.e. 4 controllers) are recommended for fault tolerance. In addition to the SG5600 hardware, you also need at least 2 servers to host the Gateway and Admin nodes.
The software option can be supplied as an ESXi appliance or as a Docker image based on Debian Linux. In this variant you can use a single regular E-Series system with two controllers, High Availability and the standard SANtricity OS, and on top of it at least two servers running the Storage, Gateway and Admin nodes. The software version needs more server capacity, because the Storage node (the most resource-demanding one) has to run on the server.

The growth of unstructured data is steadily gaining momentum. Sources of such data include the emerging IoT market, photo and video cameras with unprecedented resolution and frame quality, medical equipment and other devices. StorageGRID was developed from scratch to cover all of these use cases:
Web data repositories:
- Small objects with extremely high transaction loads
- Storage of billions of objects
Data archives:
- Large objects, low transaction load
- Long-term storage that is not demanding on response time
Media repositories:
- Globally distributed, large objects
- StorageGRID is great for healthcare.
- StorageGRID is also very suitable for Video on Demand (VoD) tasks.
- Streaming data access, high bandwidth
StorageGrid main features
StorageGRID lets you manage geographically distributed unstructured data. A single control panel and management policies that span all sites hosting cluster nodes let StorageGRID move data to where it is needed. Tape libraries and RESTful HTTP-based protocols such as CDMI, S3 and Swift are supported, which allows the system to be integrated with cloud providers. Data can move seamlessly between all tiers: local storage, cloud and tape libraries.

The advantages of the StorageGrid platform include:
- Support for all the most popular object protocols
- Scalability to 100 billion objects (375 million per node) and 70 PiB of data (soft limit)
- Distribution: up to 16 sites
- The ability to use tape libraries as a tier for storing data archives
- When policies change, the data life cycle will be automatically adjusted to match the changes.
- One of the most advanced sets of data lifecycle policy (ILM) settings: automatic distribution of data across local tiers (SSD, SATA, SAS, tape, Geo-EC), public clouds (such as AWS S3) and customer sites. Data can be distributed based on the cost of storing it, the required level of protection, performance, availability, network cost and the durability of the stored data.

Erasure coding
Almost all object storage systems can store multiple copies of a single object (replication), duplicating data on different nodes and sites and thus ensuring fault tolerance. Erasure Coding (EC) is a mechanism similar to RAID, but it works at the level of an object, which is split into several parts, rather than at the level of entire hard drives. EC requires significantly less storage space than replication while still providing fault tolerance.
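The principle can be shown with the simplest possible erasure code: two data fragments plus one XOR parity fragment, similar in spirit to RAID 5. This is only an illustrative sketch; the article does not say which code StorageGRID uses internally, and the function names below are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally long byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_2_plus_1(obj: bytes):
    """Split obj into two equal data fragments and one XOR parity fragment."""
    if len(obj) % 2:
        obj += b"\x00"  # pad to even length; a real system would record the original size
    half = len(obj) // 2
    d0, d1 = obj[:half], obj[half:]
    return [d0, d1, xor_bytes(d0, d1)]


def recover(fragments):
    """Rebuild the single missing fragment (marked None) from the other two."""
    missing = fragments.index(None)
    a, b = (f for f in fragments if f is not None)
    repaired = list(fragments)
    repaired[missing] = xor_bytes(a, b)  # XOR of any two fragments yields the third
    return repaired


fragments = encode_2_plus_1(b"object payload!!")
fragments[1] = None                      # simulate the loss of one fragment
restored = recover(fragments)
print(restored[0] + restored[1])
```

Three fragments of half the object size each take 1.5 times the original space, whereas two full copies would take 2 times and still only tolerate a single loss.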

Geo-EC
Geo-Distributed Erasure Coding is EC in which the parts of an object that make up such a “RAID group” can reside on systems located in different parts of the world. Storing two or three full copies of the data across sites achieves incredible levels of availability, but it generates a corresponding amount of traffic and consumes a corresponding amount of space. This is where geo-distributed Erasure Coding comes to the rescue: fault tolerance and availability are not impaired, while the space consumed is significantly reduced. The following EC schemes are available:
- 2 + 1 for three sites
- 4 + 2 for three sites
- 6 + 3 for three sites
- 9 + 3 for four sites
- 8 + 2 for five sites.
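The space savings of the schemes listed above can be estimated with simple arithmetic. The sketch below assumes that a "k + m" scheme stores k data and m parity fragments of equal size, that any k fragments are enough to rebuild the object, and that fragments are spread evenly across the sites. These assumptions are mine, not statements from the product documentation.

```python
import math

# (data fragments, parity fragments, sites) for the schemes listed above
SCHEMES = [(2, 1, 3), (4, 2, 3), (6, 3, 3), (9, 3, 4), (8, 2, 5)]


def storage_overhead(k: int, m: int) -> float:
    """Raw space consumed per unit of user data."""
    return (k + m) / k


def survives_site_loss(k: int, m: int, sites: int) -> bool:
    """True if losing the fullest site still leaves at least k fragments.

    Assumes fragments are spread as evenly as possible across sites.
    """
    per_site = math.ceil((k + m) / sites)
    return per_site <= m


for k, m, sites in SCHEMES:
    print(f"{k}+{m} on {sites} sites: "
          f"{storage_overhead(k, m):.2f}x raw space, "
          f"survives loss of one site: {survives_site_loss(k, m, sites)}")
```

For comparison, keeping two or three full copies costs 2.0x or 3.0x raw space, which is the saving the Geo-EC description refers to.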

Erasure Coding saves disk space on the one hand, but on the other hand adds overhead for calculating checksums and for reconstructing the object. In the case of Geo-EC, the response time when reading an object increases further, because the read is performed from two sites. In other words, EC needs to be used wisely, and later, when discussing ILM, I will explain how.
Hierarchical EC
StorageGrid allows you to distribute data based on their durability and fault tolerance policies. Hierarchical Erasure Coding allows local EC and Geo-EC to be automatically executed based on these policies. Hierarchical EC is well suited for installations with at least 3 sites to protect against the failure of the entire site.

DDP - local EC
Dynamic Disk Pools (used by StorageGRID Webscale as local EC) is a feature of the NetApp E-Series hardware: a kind of RAID that, like regular RAID groups, is created on a single local system. DDP avoids a loss of performance when one or several disks fail locally (otherwise objects would be pulled from other nodes or sites), and it also saves electricity and network (WAN / LAN) traffic, because data access and recovery are performed locally. This functionality complements Geo-EC well.

ILM in StorageGRID systems enables flexible and much more efficient use of disk space through data lifecycle policies. For example, you can set up a policy so that if an object was written or accessed at least once within the last 30 days, X copies of it are stored on several different sites. If the object has not been accessed for more than 30 days, the extra copies are deleted and the object is erasure-coded; in this case the increased read time is not much of a problem. And if the object has not been accessed for 1 year, it is sent to the cloud or to tape. It is important to note that this example is granular at the level of each individual object, not at the level of a large dataset such as a LUN or a file share (in a SAN or NAS, respectively). If the price of resources changes, the policy is re-evaluated and the data is redistributed accordingly.
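The 30-day / 1-year example can be written down as a small per-object rule. This is only a sketch of the decision logic described in the text; real ILM rules are configured in the product rather than coded like this, and the function name and the returned labels are hypothetical.

```python
from datetime import datetime, timedelta


def placement_for(last_access: datetime, now: datetime) -> str:
    """Pick a storage strategy for one object based on how long it has been idle."""
    idle = now - last_access
    if idle <= timedelta(days=30):
        return "replicate"       # X full copies on several sites, fast reads
    if idle <= timedelta(days=365):
        return "erasure-code"    # fewer bytes stored, slower reads are acceptable
    return "archive"             # move to cloud or tape


now = datetime(2016, 1, 1)
for days_idle in (3, 90, 400):
    print(days_idle, placement_for(now - timedelta(days=days_idle), now))
```

The rule takes a single object's last access time as input, which mirrors the point that the policy is evaluated per object and not per LUN or share.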

Durability
Durability can be divided into two parts: data integrity and data availability.
Data integrity is ensured by checking hash sums when data is written, read and migrated, and during periodic verification. Damaged objects are transparently recreated from copies. The geo-distributed Erasure Coding mechanism saves the space that would otherwise be used for storing copies of data.
Data availability is ensured by a fault-tolerant architecture and by support for non-disruptive operations, software updates and hardware upgrades. Load is distributed both during normal operation and in case of failure. NetApp AutoSupport can automatically notify support so that problems are resolved proactively. Erasure coding at the node level improves the availability of each node, recovery time, performance impact and network activity (available only on the E-Series platform with Dynamic Disk Pools).
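The integrity check described above, a hash verified on read with damaged copies recreated from good ones, can be sketched as follows. The use of SHA-256 and the function names are assumptions made for the example; the article does not specify which hash the product uses or how repair is triggered.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Hash recorded when the object is written (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def read_verified(copies: list, expected: str) -> bytes:
    """Return an intact copy and overwrite damaged copies with it."""
    good = next((c for c in copies if fingerprint(c) == expected), None)
    if good is None:
        raise IOError("no intact copy left")
    for i, copy in enumerate(copies):
        if fingerprint(copy) != expected:
            copies[i] = good     # transparently recreate the damaged copy
    return good


payload = b"patient record"
expected = fingerprint(payload)
copies = [b"patient rec0rd", payload]    # the first copy has silently rotted
print(read_verified(copies, expected) == payload, copies[0] == payload)
```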
NAS
NAS functionality with the CIFS / NFS protocols can be implemented using a NAS bridge. This does not require modifying the existing infrastructure and gives end users standard file access. StorageGRID, in turn, can use lifecycle policies to move this data transparently across storage tiers based on metadata (for example, the last modification time or the file creation time). Licenses for the file bridges are included in the StorageGRID distribution and do not need to be purchased separately. Integration with Active Directory and LDAP is supported. On the front end, the NAS bridge provides access via CIFS / NFS and looks like an ordinary NAS. On the back end, the same NAS bridge connects via an object protocol and converts files to objects, which are then stored and processed as ordinary objects.
Security
Support for end-to-end encryption of each object and Secure Multi-Tenancy. Support for authentication and security mechanisms for S3 and CDMI. LDAP / AD integration for user authentication within one Tenant is supported.
Production-Ready
This is a very important point: when the customer has no army of programmers and administrators, the solution must be reliable. StorageGRID technology is more than 14 years old (the first installation was in 2001), and over that time a large number of integrations have been built with other well-known products for backup, archiving, file synchronization, collaboration and so on.
- NTP: Hierarchical storage management service, Software Object Storage & Cloud Connector (File Vacuum)
- Ctera: File sync and share, collaboration
- Stealth: Microsoft SQL / Exchange / SharePoint integration
- PoINT: Hierarchical Storage Management Service
- Commvault: Backup and archive
- Citrix Sharefile: File sync and share, collaboration
- Egnyte: File sync and share, collaboration
- SoftNAS: General purpose NFS and CIFS gateway
- NetApp AltaVault (SteelStore)
- Symantec Enterprise Vault with NetApp StorageGRID Adapter
- Amazon S3
- Amazon CloudFront
- OpenStack Swift with white box
- Inktank Ceph with Calamari
- Swift API
- OpenStack Glance Integration: Leverage StorageGRID Webscale as Glance image repository via S3 and Swift
- NetApp OpenStack Cinder driver
- OpenStack Kilo
- OpenStack Heat orchestration
- Other.
StorageGrid Licensing Policy
The product is licensed per terabyte, regardless of the number and type of nodes. The hardware and software implementations of StorageGRID can coexist in the same cluster. All functionality is included in the base package. How the terabytes are counted depends on the implementation:
- With the hardware implementation, the product is licensed by the number of raw (RAW) terabytes
- With a software license (without StorageGRID hardware), the product is licensed by the amount of usable space, and a coefficient of x1.25 is applied.
Conclusions
StorageGRID is a product for applications that support RESTful HTTP. It is suitable for large and small objects, high bandwidth and high transaction loads, and for automatic, transparent movement of data across storage tiers. Geo-clustering achieves very high resiliency and data availability, hiding the failure of entire sites. EC technology saves a significant amount of space by using a RAID-like architecture. StorageGRID supports multiple storage tiers and has one of the most advanced data lifecycle management mechanisms. ILM automatically moves data when the price of a particular storage tier changes, which allows more efficient use of resources and a flexible response to changes in the cost of data storage (for example, in the cloud or in tape libraries). StorageGRID is an established, mature product with a broad list of third-party software integrations, which simplifies support and integration into existing infrastructure. Object encryption and LDAP / AD authentication help protect against data theft. StorageGRID can be a complete replacement for Amazon S3, allowing you to lease storage to multiple companies and acting as a private cloud for those who cannot place data in a public cloud. It can also complement AWS S3, using it as a storage tier, and it has a mechanism for calculating storage costs for StorageGRID tenants.
This article may contain links to Habr articles that will be published later.
Please report errors in the text by private message.
Comments, additions and questions about the article, on the other hand, are welcome in the comments.