
Object storage for the OpenStack cloud: a comparison of Swift and Ceph

Author: Dmitry Ukov



Overview





Many people confuse object storage with block storage, for example storage based on iSCSI or Fibre Channel (Storage Area Network, SAN), although in fact there are many differences between them. While a SAN exposes only block devices to the system (a good example of a device name is /dev/sdb on Linux), object storage can be accessed only through a specialized client application (for example, the box.com client application).


Object storage is an important part of cloud infrastructure. Its main uses are storing virtual machine images and storing user files (for example, backups of various kinds, documents, pictures). The main advantage of object storage is its very low implementation cost compared to enterprise-level storage, while still providing scalability and data redundancy. There are two common implementations of object storage, and in this article we compare the two for which OpenStack provides an interface.



OpenStack Swift





Swift Network Architecture





OpenStack Object Storage (Swift) provides redundant, scalable distributed object storage that uses clusters of standardized servers. “Distribution” means that each piece of data is replicated across a cluster of storage nodes. The number of replicas can be configured, but it should be at least three for commercial infrastructures.



Access to objects in Swift is via the REST interface. These objects can be stored, retrieved or updated on demand. Object storage can be easily distributed across a large number of servers.



The access path to each object consists of three elements:



/account/container/object



The object name uniquely identifies the object. Accounts and containers provide a way to group objects. Nesting of accounts and containers is not supported.
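For illustration, here is a minimal sketch of writing and reading an object over this REST interface with Python's requests library. The storage URL (which encodes the account) and the token are placeholders for the values returned by your authentication service.

```python
import requests

# Placeholder values: in a real deployment the storage URL and the token
# come from the authentication service (e.g. Keystone).
storage_url = "http://swift.example.com:8080/v1/AUTH_demo"   # .../account
token = "AUTH_tk_example"

container = "photos"
obj = "vacation.jpg"
url = f"{storage_url}/{container}/{obj}"                     # /account/container/object

# Upload (create or overwrite) the object.
resp = requests.put(url, data=b"...image bytes...",
                    headers={"X-Auth-Token": token})
print(resp.status_code)                                      # 201 Created on success

# Download it again.
resp = requests.get(url, headers={"X-Auth-Token": token})
print(len(resp.content), "bytes received")
```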



Swift consists of several components, including account servers, container servers, and object servers, which handle storage, replication, and the management of containers and accounts. In addition, another machine, called the proxy server, exposes the Swift API to users and transfers objects to and from clients on request.



Account servers provide container listings for a given account. Container servers provide listings of objects in a given container. Object servers simply store or return the object itself, given its full path.



Rings





Since user data is distributed across a set of machines, it is important to keep track of exactly where it resides. Swift achieves this with internal data structures called "rings". The rings are present on every node of a Swift cluster (both storage and proxy nodes). In this way Swift avoids the problem of many distributed file systems that rely on a centralized metadata server, where that metadata store becomes a bottleneck for metadata lookups. A ring does not need to be updated to store or delete an individual object, since the rings reflect cluster membership rather than a central map of the data. This benefits I/O operations and significantly reduces access time.



There are separate rings for account databases, container databases, and individual objects, but all rings work in the same way. In short, for a given account, container, or object name, the ring returns information about its physical location on the storage nodes. Technically, this is done using consistent hashing. A detailed explanation of the ring algorithm can be found in our blog and at this link.
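As a rough illustration of the idea (not the actual Swift implementation), the following sketch maps an object path to a partition with an MD5 hash and then to its replica devices through a static table:

```python
import hashlib

# A highly simplified illustration of the ring idea: the MD5 hash of the
# object path selects a partition, and a static table maps each partition to
# the devices holding its replicas. The real ring is built by
# swift-ring-builder and also accounts for zones, weights, and rebalancing.

PART_POWER = 4                       # 2**4 = 16 partitions (tiny, for illustration)
REPLICAS = 3
DEVICES = ["node1/sdb", "node2/sdb", "node3/sdb", "node4/sdb", "node5/sdb"]

# Static partition -> devices table (illustrative layout only).
part2dev = {
    part: [DEVICES[(part + r) % len(DEVICES)] for r in range(REPLICAS)]
    for part in range(2 ** PART_POWER)
}

def get_nodes(account, container, obj):
    path = f"/{account}/{container}/{obj}"
    digest = hashlib.md5(path.encode()).digest()
    # Take the top PART_POWER bits of the hash as the partition number.
    partition = int.from_bytes(digest[:4], "big") >> (32 - PART_POWER)
    return partition, part2dev[partition]

print(get_nodes("AUTH_demo", "photos", "vacation.jpg"))
```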







Proxy server





The proxy server exposes the public API and serves requests to storage entities. For each request, the proxy server looks up the location of the account, container, or object using the ring and routes the request accordingly. Objects are streamed between the proxy server and the client directly, without buffering (to put it more precisely: even though the name contains "proxy", the proxy server does not perform "proxying" in the usual sense, as an HTTP proxy does).



Object Processing Server





This is a simple BLOB (binary large object) store where objects can be stored, retrieved, and deleted. Objects are stored as binary files on the storage nodes, with metadata kept in the file's extended attributes (xattrs). The object server's file system therefore has to support xattrs on files.



Each object is stored under a path derived from the hash of the object name and the timestamp of the operation. The last write always wins (including in distributed scenarios, which creates the need for globally synchronized clocks) and guarantees that the latest version of the object is served. A deletion is also treated as a version of the file (a 0-byte file ending in ".ts", which stands for tombstone). This ensures that deleted files are replicated correctly and that old versions of files do not reappear after a failure.
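The following toy sketch illustrates this layout; it is not the actual Swift code. The directory name is derived from the hash of the object name, the file name is the timestamp of the operation, metadata goes into xattrs, and a delete drops a 0-byte ".ts" tombstone. The device path is a placeholder.

```python
import hashlib
import os
import time

DEVICE_ROOT = "/srv/node/sdb"        # placeholder; point at any writable directory

def object_dir(account, container, obj):
    # Directory derived from the hash of the full object path.
    path_hash = hashlib.md5(f"/{account}/{container}/{obj}".encode()).hexdigest()
    return os.path.join(DEVICE_ROOT, "objects", path_hash)

def put_object(account, container, obj, data, metadata):
    dirname = object_dir(account, container, obj)
    os.makedirs(dirname, exist_ok=True)
    # File name is the timestamp of the operation; the newest file wins.
    filename = os.path.join(dirname, "%.5f.data" % time.time())
    with open(filename, "wb") as f:
        f.write(data)
    for key, value in metadata.items():
        # Requires a file system with xattr support (e.g. XFS, ext4) on Linux.
        os.setxattr(filename, f"user.swift.{key}", value.encode())
    return filename

def delete_object(account, container, obj):
    dirname = object_dir(account, container, obj)
    os.makedirs(dirname, exist_ok=True)
    # A newer 0-byte ".ts" tombstone outweighs any older .data file during replication.
    open(os.path.join(dirname, "%.5f.ts" % time.time()), "wb").close()
```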



Container handling server





The container server handles listings of objects. It does not know where the objects are located, only which objects belong to a particular container. The listings are stored as sqlite3 database files and are replicated across the cluster just like objects. Statistics are also tracked, including the total number of objects and the amount of storage used by the container.



A dedicated process, swift-container-updater, continuously checks the container databases on the node it runs on and updates the account database whenever the container data changes. It uses the ring to find which account needs to be updated.
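As a toy illustration of the concept (the schema below is invented for this example and is not Swift's actual schema), a container listing can be modeled as a small sqlite3 database:

```python
import sqlite3

# One database file per container: the object listing plus statistics.
db = sqlite3.connect("photos_container.db")
db.execute("""CREATE TABLE IF NOT EXISTS objects
              (name TEXT PRIMARY KEY, size INTEGER, etag TEXT, created_at TEXT)""")
db.execute("INSERT OR REPLACE INTO objects VALUES (?, ?, ?, ?)",
           ("vacation.jpg", 1048576, "d41d8cd98f00b204e9800998ecf8427e",
            "2013-04-01T12:00:00"))
db.commit()

# Listing a container is just a query against the database file.
print(db.execute("SELECT name FROM objects ORDER BY name").fetchall())

# Container statistics such as object count and bytes used are derived the same way.
print(db.execute("SELECT COUNT(*), SUM(size) FROM objects").fetchone())
```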



Account Processing Server





It works like the container server, but it maintains listings of containers rather than objects.



Properties and Functions





- Replication: the number of object copies can be configured manually.



- Object upload is a synchronous process: the proxy server returns the HTTP code "201 Created" only after more than half of the replicas have been written (see the sketch after this list).



- Integration with the OpenStack authentication service (Keystone): accounts are mapped to tenants.



- Data validation: the MD5 checksum of the object on the file system is compared with the metadata stored in xattrs.



- Container synchronization: containers can be synchronized across multiple data centers.



- Handoff mechanism: an additional node can be used to hold a replica in case of failure.



- Objects larger than 5 GB must be split into segments: the segments are stored as separate objects and can be read in parallel.
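Here is a minimal sketch of the quorum rule mentioned in the upload item above; the numbers are purely illustrative.

```python
# The proxy reports "201 Created" only when more than half of the replica
# writes succeeded; otherwise the upload is considered failed.

def upload_status(replica_results):
    """replica_results: list of booleans, one per replica write attempt."""
    successes = sum(replica_results)
    quorum = len(replica_results) // 2 + 1
    return "201 Created" if successes >= quorum else "503 Service Unavailable"

print(upload_status([True, True, False]))   # 2 of 3 replicas written -> 201 Created
print(upload_status([True, False, False]))  # 1 of 3 replicas written -> 503
```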



Ceph





Ceph is a distributed network storage system with distributed metadata management and POSIX semantics. The Ceph object store can be accessed with various clients, including a dedicated command-line tool, FUSE, and Amazon S3 clients (through a compatibility layer called the "S3 Gateway"). Ceph is highly modular: different sets of features are provided by different components, which can be mixed and matched. In particular, for an object store accessed through the S3 API it is enough to run three components: the object server, the monitor server, and the RADOS gateway.







Monitor server





ceph-mon is a lightweight daemon that provides consensus for distributed decision making in a Ceph cluster. It is also the initial point of contact for new clients and hands out information about the cluster topology. There are usually three ceph-mon daemons, running on three different physical machines isolated from each other, for example in different racks or rows.



Object Processing Server





The actual data placed in Ceph is stored on top of a cluster storage engine called RADOS, which is deployed on a set of storage nodes.



ceph-osd is the storage daemon that runs on every storage node (object server) in a Ceph cluster. ceph-osd connects to ceph-mon to join the cluster. Its main purpose is to serve read/write and other requests from clients. It also communicates with other ceph-osd daemons to replicate data. The data model at this level is fairly simple: there are multiple named pools, and within each pool there are named objects in a flat namespace (no directories). Each object has data and metadata. The object data is a single, potentially large, sequence of bytes; the metadata is an unordered set of key-value pairs. The Ceph file system, for example, uses the metadata to store the file owner and similar information. Underneath, the ceph-osd daemon stores the data on a local file system. Btrfs is recommended, but any POSIX file system with extended attributes will do.
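For illustration, here is a minimal sketch of talking to RADOS directly, assuming the python-rados bindings, a reachable cluster described by /etc/ceph/ceph.conf, and an existing pool named "data" (a placeholder name):

```python
import rados

# Connect to the cluster described by the local Ceph configuration file.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

ioctx = cluster.open_ioctx("data")            # named pool
try:
    # Objects live in a flat namespace inside the pool: data plus metadata.
    ioctx.write_full("greeting", b"hello rados")          # object data
    ioctx.set_xattr("greeting", "owner", b"dmitry")       # key-value metadata
    print(ioctx.read("greeting"))
    print(ioctx.get_xattr("greeting", "owner"))
finally:
    ioctx.close()
    cluster.shutdown()
```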



CRUSH algorithm





While Swift uses rings (a mapping from ranges of MD5 checksums to sets of storage nodes) for consistent data placement and lookup, Ceph uses an algorithm called CRUSH. In short, CRUSH computes the physical location of data in Ceph from the object name, the cluster map, and the CRUSH rules. CRUSH describes the storage cluster as a hierarchy that reflects its physical organization, which ensures that data is replicated across distinct physical hardware. In addition, CRUSH allows data placement to be governed by policy, which lets it react to changes in cluster membership.
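The sketch below is a toy stand-in for this idea, not the real CRUSH algorithm: it uses rendezvous (highest-random-weight) hashing to show how replica locations can be computed from the object name and a description of the cluster instead of being looked up in a table.

```python
import hashlib

# NOT the real CRUSH: a minimal illustration that placement can be *computed*
# from the object name and the cluster description, so every client derives
# the same result without consulting a central lookup table.

OSDS = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]
REPLICAS = 3

def place(obj_name, osds=OSDS, replicas=REPLICAS):
    # Score every OSD for this object and take the top scorers. Adding or
    # removing an OSD only moves the objects whose top scorers changed.
    def score(osd):
        return hashlib.md5(f"{obj_name}:{osd}".encode()).hexdigest()
    return sorted(osds, key=score, reverse=True)[:replicas]

print(place("vacation.jpg"))
```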



RADOS Gateway





radosgw is a FastCGI service that provides a RESTful HTTP API for storing objects and metadata in a Ceph cluster.
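For illustration, here is a minimal sketch of using the S3 API against radosgw with the boto3 library; the endpoint and credentials are placeholders for whatever your gateway is configured with.

```python
import boto3

# radosgw speaks (a subset of) the Amazon S3 API, so a standard S3 client
# works against it once pointed at the gateway's endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url="http://radosgw.example.com:7480",   # placeholder endpoint
    aws_access_key_id="ACCESS_KEY",                    # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="photos")
s3.put_object(Bucket="photos", Key="vacation.jpg", Body=b"...image bytes...")
print(s3.get_object(Bucket="photos", Key="vacation.jpg")["Body"].read())
```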



Properties and Functions



- Partial or full read and write operations



- Snapshots



- Atomic transactions for operations such as append, truncate, and clone range



- Key-value mapping at the object level



- Object replica management



- Aggregation of objects into groups (placement groups) and mapping of groups to a series of OSDs



- Authentication using shared secret keys: both the client and the monitoring cluster have a copy of the client's secret key

- Compatibility with the S3/Swift API



Feature Overview





| Feature | Swift | Ceph |
|---|---|---|
| Replication | Yes | Yes |
| Maximum object size | 5 GB (larger objects are segmented) | Unlimited |
| Multi-DC installation (distribution between data centers) | Yes (replication only at the container level, but a scheme for full inter-data-center replication has been proposed) | No (requires asynchronous eventual-consistency replication, which Ceph does not yet support) |
| OpenStack integration | Yes | Partial (no Keystone support) |
| Replica management | No | Yes |
| Write algorithm | Synchronous | Synchronous |
| Amazon S3-compatible API | Yes | Yes |
| Data placement method | Rings (static mapping structure) | CRUSH (algorithm) |




Sources





Official Swift documentation - The source for the description of the data structure.

Swift Ring source code on Github - Base source code for Swift Ring and RingBuilder classes.

Blog of Chmouel Boudjnah - Useful tips for using Swift.

Official Ceph documentation - The main source for data structure descriptions.



Original article in English

Source: https://habr.com/ru/post/176195/


