We are glad to announce that we have rewritten our cloud storage API. Everything now works much more stably and faster thanks to the new platform, Hummingbird, which is essentially a Go implementation of several OpenStack Swift components. In this article we will describe how we deployed Hummingbird and what problems it helped us solve.
The OpenStack Swift object storage model
The object storage model of OpenStack Swift consists of several interconnected components:
- a proxy server, which receives requests from the end user, performs service requests to the other storage components, and finally builds and sends the response;
- account and container servers. I group these two services together because they work in a very similar way: each of them stores metadata, as well as lists of containers (the account server) or files (the container server), in separate SQLite databases and serves this data on request;
- object servers, which actually store the user files (with file-level metadata kept in extended attributes). This is the lowest level of abstraction: there are only objects (plain files) written to particular partitions.
Each of these components is a separate daemon. All daemons are tied into a cluster using the so-called ring: a hash table that determines where objects are located within the cluster. The ring is created by a separate process (the ring builder) and distributed to all nodes of the cluster. It defines the number of object replicas in the cluster for fault tolerance (it is usually recommended to keep three copies of each object on different servers), the number of partitions (Swift's internal structure), and the distribution of devices across zones and regions. The ring also provides a list of so-called handoff devices, to which data is written if the primary devices are unavailable.
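To make the idea more concrete, here is a minimal sketch of a ring-style lookup in Go; it illustrates the principle rather than Swift's actual ring code, and the partition count and device assignments are made up:

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

// A toy ring: partPower defines the number of partitions (2^partPower),
// and part2devs maps every partition to the devices holding its replicas.
type Ring struct {
	partPower uint
	part2devs [][]string // partition -> device names, one per replica
}

// Partition hashes "account/container/object" and keeps the top partPower
// bits of the hash, which is how a ring-style lookup picks a partition.
func (r *Ring) Partition(account, container, object string) uint64 {
	h := md5.Sum([]byte(account + "/" + container + "/" + object))
	return binary.BigEndian.Uint64(h[:8]) >> (64 - r.partPower)
}

// Devices returns the devices responsible for the object's partition.
func (r *Ring) Devices(account, container, object string) []string {
	return r.part2devs[r.Partition(account, container, object)]
}

func main() {
	// 2 partitions, 3 replicas each: purely illustrative numbers.
	ring := &Ring{
		partPower: 1,
		part2devs: [][]string{
			{"node1/sda", "node2/sdb", "node3/sdc"},
			{"node4/sda", "node5/sdb", "node6/sdc"},
		},
	}
	fmt.Println(ring.Devices("AUTH_demo", "photos", "cat.jpg"))
}
```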
Let us look at each of the storage components in more detail.
Proxy
Initially, we used the standard swift-proxy. Later, as the load grew and our own code expanded, we moved everything to gevent and gunicorn, and then replaced gunicorn with uwsgi, since the latter behaves better under heavy load. None of these solutions was particularly effective: the latency added by the proxy was quite high, and more and more servers were needed to handle authenticated traffic, because Python itself is slow. As a result, all of this traffic had to be handled by 12 machines (now all traffic, both public and private, is handled by only 3 servers).
After all these stopgap measures, I rewrote the proxy server in Go. The prototype from the Hummingbird project was taken as a basis; on top of it, I added middleware implementing all of our user-facing functionality: authorization, quotas, layers for static sites, symbolic links, large segmented objects (dynamic and static), additional domains, versioning, and so on. We also implemented separate endpoints for some of our special features: domain configuration, SSL certificates, and sub-users. We use justinas/alice (https://github.com/justinas/alice) to build middleware chains, and gorilla/context to store per-request variables.
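As an illustration, a chain like ours can be assembled roughly as follows (a simplified sketch, not our production code; the middleware names and the "account" context key are hypothetical):

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/gorilla/context"
	"github.com/justinas/alice"
)

// authMiddleware is a placeholder for an authorization layer: it stores
// a (hypothetical) account ID in the request context and passes control on.
func authMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		context.Set(r, "account", "AUTH_demo")
		next.ServeHTTP(w, r)
	})
}

// quotaMiddleware is a placeholder for a quota-checking layer.
func quotaMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// quota checks would go here
		next.ServeHTTP(w, r)
	})
}

// handler is the layer that finally says "this request is for me".
func handler(w http.ResponseWriter, r *http.Request) {
	account := context.Get(r, "account")
	fmt.Fprintf(w, "request handled for %v\n", account)
}

func main() {
	chain := alice.New(authMiddleware, quotaMiddleware).ThenFunc(handler)
	http.Handle("/", context.ClearHandler(chain))
	http.ListenAndServe(":8080", nil)
}
```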
To send requests to the OpenStack Swift services, the directclient component is used; it has full access to all storage components. Among other things, we actively cache account-, container-, and object-level metadata. This metadata is added to the request context and is needed to decide how the request should be processed further. To avoid making too many service requests to the storage, we keep this data in memcache. So the proxy server receives a request, builds its context, and passes it through the various layers (middleware), one of which eventually says: "This request is for me!" That layer processes the request and returns the response to the user.
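A rough sketch of this kind of metadata caching, assuming a memcache client such as github.com/bradfitz/gomemcache; the key format, the AccountMeta fields, and the fetch callback are hypothetical:

```go
package proxycache

import (
	"encoding/json"

	"github.com/bradfitz/gomemcache/memcache"
)

// AccountMeta is a hypothetical subset of account-level metadata
// that the proxy needs in order to route and authorize a request.
type AccountMeta struct {
	BytesQuota int64             `json:"bytes_quota"`
	TempURLKey string            `json:"temp_url_key"`
	Headers    map[string]string `json:"headers"`
}

var mc = memcache.New("127.0.0.1:11211")

// getAccountMeta returns cached metadata if present; otherwise it asks the
// account server directly (fetchFromAccountServer is a placeholder) and
// stores the result in memcache for later requests.
func getAccountMeta(account string, fetchFromAccountServer func(string) (*AccountMeta, error)) (*AccountMeta, error) {
	key := "meta/account/" + account
	if item, err := mc.Get(key); err == nil {
		var meta AccountMeta
		if json.Unmarshal(item.Value, &meta) == nil {
			return &meta, nil
		}
	}
	meta, err := fetchFromAccountServer(account)
	if err != nil {
		return nil, err
	}
	raw, _ := json.Marshal(meta)
	// Cache for five minutes (an arbitrary illustrative TTL).
	mc.Set(&memcache.Item{Key: key, Value: raw, Expiration: 300})
	return meta, nil
}
```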
All unauthenticated requests to the storage first pass through a caching proxy; we chose Apache Traffic Server for this role.
Since standard caching policies keep an object in the cache for quite a long time (otherwise the cache would be useless), we wrote a separate daemon to invalidate it. It receives PUT request events from the proxy and purges the cache entries for every name of the changed object (each object in the storage has at least two names, userId.selcdn.ru and userId.selcdn.com, and users can also attach their own domains to containers, which must be purged as well).
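Such an invalidation daemon can be sketched roughly as follows, assuming the caching proxy is configured to accept PURGE requests (Apache Traffic Server can be set up this way); the event structure and addresses are hypothetical:

```go
package purger

import (
	"log"
	"net/http"
)

// PutEvent is a hypothetical notification sent by the proxy after a PUT:
// the object path and every domain under which the object is reachable.
type PutEvent struct {
	Path    string   // e.g. "/container/object"
	Domains []string // e.g. {"userId.selcdn.ru", "userId.selcdn.com", "cdn.example.com"}
}

// purge sends a PURGE request to the caching proxy for a single URL.
func purge(client *http.Client, cacheAddr, domain, path string) error {
	req, err := http.NewRequest("PURGE", "http://"+cacheAddr+path, nil)
	if err != nil {
		return err
	}
	req.Host = domain // the cache keys entries by the original host name
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}

// run consumes PUT events and purges every name of the changed object.
func run(events <-chan PutEvent, cacheAddr string) {
	client := &http.Client{}
	for ev := range events {
		for _, domain := range ev.Domains {
			if err := purge(client, cacheAddr, domain, ev.Path); err != nil {
				log.Printf("purge %s%s failed: %v", domain, ev.Path, err)
			}
		}
	}
}
```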
Accounts and containers
The account layer of the storage works like this: a separate SQLite database is created for each user, storing a set of global metadata (account quotas, TempURL keys, and so on) as well as the list of that user's containers. Users do not have billions of containers, so these databases stay small and replicate quickly.
In general, accounts just work and cause no problems, for which all progressive humanity loves them.
With containers, the situation is different. Technically, they are exactly the same SQLite databases, but instead of a tiny list of one user's containers, each database holds the metadata of a single container and the list of files in it. Some of our customers store up to 100 million files in one container. Queries against such large SQLite databases are slow, and replication (given the presence of pending asynchronous updates) becomes much more complicated.
Of course, the SQLite storage for containers could be replaced with something else, but with what? We built test versions of account and container servers based on MongoDB and Cassandra, but any solution tied to a centralized database can hardly be called a success in terms of horizontal scaling. Over time there are more and more clients and files, so keeping data in many small databases is preferable to one huge database with billions of records.
Automatic container sharding would also be very welcome: being able to split huge containers into several SQLite databases would be great.
You can read more about sharding here. As you can see, the work is still in progress.
Another feature directly related to the container server is object expiration. Using the X-Delete-At or X-Delete-After headers, you can set a time after which an object will be deleted from the storage (a client-side example is shown a little further below). These headers can be passed both when creating an object (a PUT request) and when changing its metadata (a POST request). However, the current implementation of this feature is not as good as we would like. The thing is, it was originally implemented so as to change the existing OpenStack Swift infrastructure as little as possible, and the simplest path was chosen: the addresses of all objects with a limited lifetime are placed into a special ".expiring_objects" account, which is periodically scanned by a separate daemon called object-expirer. This gave us two additional problems:
- The first concerns the service account itself. When we send a PUT/POST request for an object with one of these headers, containers whose names are Unix timestamps are created in this account. Inside them, pseudo-objects are created whose names consist of a timestamp and the full path to the corresponding real object. This is done by a special function that talks to the container server and creates a database record; the object server is not involved at all. As a result, heavy use of object expiration significantly increases the load on the container servers.
- The second concerns the object-expirer daemon. It periodically walks through a huge list of pseudo-objects, checks the timestamps, and sends delete requests for expired files. Its main drawback is that it is extremely slow. Because of this, it often happens that an object has already been deleted but still shows up in the container listing, since the corresponding record in the container database has not yet been removed.
In our practice, having more than 200 million files in the deletion queue is a typical situation, and object-expirer simply cannot keep up. So we had to write our own daemon in Go.
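For reference, here is what enabling expiration looks like from the client side (a sketch using Go's standard HTTP client; the endpoint, account, and token are placeholders):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"strconv"
	"time"
)

func main() {
	body := bytes.NewBufferString("hello")

	// Upload an object that should disappear automatically in one hour.
	req, _ := http.NewRequest("PUT",
		"https://storage.example.com/v1/AUTH_demo/container/object.txt", body)
	req.Header.Set("X-Auth-Token", "<token>")

	// Absolute deadline as a Unix timestamp...
	deadline := time.Now().Add(time.Hour).Unix()
	req.Header.Set("X-Delete-At", strconv.FormatInt(deadline, 10))
	// ...or, equivalently, a relative lifetime in seconds:
	// req.Header.Set("X-Delete-After", "3600")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```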
There is a solution that has been under discussion for quite some time and, I hope, will be implemented soon. Its essence is this: additional fields will be added to the container database schema so that the container replicator can remove records of expired files, while the object auditor deletes the files of the expired objects themselves. This will get rid of container listings that mention already deleted objects and will make it possible to drop the object-expirer entirely, which by now is an atavism anyway.
Objects
The object layer is the simplest part of OpenStack Swift. At this level there are only the objects' bytes (files) and a limited set of operations on them (write, read, delete). Objects are handled by a standard set of daemons:
- the object server (object-server), which accepts requests from the proxy server and writes objects to, or reads them from, the real file system;
- the object replicator (object-replicator), which implements replication logic on top of rsync;
- the object auditor (object-auditor), which checks the integrity of objects and their attributes and quarantines damaged objects so that the replicator can restore a correct copy from another source;
- the updater (object-updater), which performs deferred updates of the account and container databases. Such operations appear, for example, because of timeouts caused by locked SQLite databases.
Looks pretty simple, right? Yet this layer has several significant problems that travel from release to release:
- slow (very slow) replication of objects over rsync. In a small cluster you can live with this, but once you reach a couple of billion objects everything looks very sad. Rsync implies a push model of replication: the object-replicator service scans the files on its node and tries to push them to every other node where they should be stored. By now the process is not quite that simple anymore; in particular, per-partition hashes are used to speed it up. You can read more about all this here.
- periodic problems with the object server, which can easily get blocked when an I/O operation hangs on one of the disks. This shows up as sharp spikes in response time. Why does it happen? In a large cluster it is not always possible to reach the ideal state in which absolutely all servers are up, all disks are mounted, and all file systems are available. Disks in object servers periodically "freeze" (regular HDDs with a low IOPS limit are used for objects), and in OpenStack Swift this often makes the entire object node temporarily unavailable, since the standard object server has no mechanism for isolating operations on a single disk. Among other things, this leads to a large number of timeouts.
Fortunately, an alternative has recently appeared that solves all of the problems described above quite effectively: the very Hummingbird we mentioned earlier. The next section discusses it in more detail.
Hummingbird as an attempt to solve Swift's problems
Not long ago, Rackspace started reworking OpenStack Swift and rewriting it in Go. The object layer is now almost ready for use, including the server, the replicator, and the auditor. There is no support for storage policies yet, but we do not use them in our storage. Of the object-related daemons, only the updater (object-updater) is missing.
Hummingbird currently lives in a feature branch of the official OpenStack repository. The project is actively developed and may soon be merged into master; you can take part in the development at github.com/openstack/swift/tree/feature/hummingbird.
How is Hummingbird better than Swift?
First, the replication logic has changed in Hummingbird. The replicator walks over all files in the local file system and asks all the other nodes: "Do you need this?" (the REPCONN method is used to simplify routing such requests). If the answers show that a newer version of the file exists somewhere, the local file is deleted. If the file is missing somewhere, the missing copy is created. If the file has already been deleted and there is a tombstone with a newer timestamp somewhere, the local file is deleted immediately.
Here we should clarify what a tombstone file is: an empty file that takes the object's place when the object is deleted.
Why are these files needed? In a large distributed storage we cannot guarantee that a DELETE request will immediately remove absolutely all copies of an object: we report successful deletion to the user after receiving n responses, where n is the quorum (with 3 copies, two responses are enough). This is intentional, since some devices may be unavailable for various reasons (for example, scheduled maintenance), and the copies on those devices obviously will not be deleted.
Moreover, once such a device comes back to the cluster, the stale copy would become visible again. That is why, on deletion, the object is replaced with an empty file carrying the current timestamp in its name. If, while polling the servers, the replicator finds two tombstone files with newer timestamps and one copy of the file with an older last-modified time, it means the object has been deleted and the remaining copy must be deleted as well.
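The decision logic can be sketched like this (a simplified illustration, not Hummingbird's actual code; the types and the Decide function are hypothetical): for each local file the replicator compares timestamps reported by the other replicas and decides whether to keep, push, or delete its copy.

```go
package replication

// ReplicaState is a hypothetical summary of what a remote node reports
// about its copy of an object: the timestamp encoded in the file name and
// whether that copy is a tombstone (i.e. the object was deleted).
type ReplicaState struct {
	Timestamp int64
	Tombstone bool
}

// Action describes what the replicator should do with its local copy.
type Action int

const (
	Keep        Action = iota // local copy is up to date everywhere
	Push                      // some replica is missing or older: send ours
	DeleteLocal               // a newer version or tombstone exists elsewhere
)

// Decide compares the local copy against the replies from the other nodes;
// tombstones take part in the comparison through their timestamps.
func Decide(local ReplicaState, remotes []ReplicaState) Action {
	action := Keep
	for _, r := range remotes {
		switch {
		case r.Timestamp > local.Timestamp:
			// A newer version, or a newer tombstone, exists elsewhere:
			// the local file must be removed.
			return DeleteLocal
		case r.Timestamp < local.Timestamp:
			// The remote copy is missing or stale: our copy should be pushed.
			action = Push
		}
	}
	return action
}
```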
The same goes for the object server: disks are isolated from one another, and semaphores limit the number of concurrent connections to each disk. If a disk "freezes" for whatever reason, or its request queue overflows, the request can simply be sent to another node.
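The idea of per-disk isolation can be illustrated with a buffered channel used as a counting semaphore (a toy sketch, not Hummingbird's implementation; the limits and names are made up):

```go
package diskiso

import (
	"errors"
	"time"
)

// ErrDiskBusy tells the caller to retry the request on another replica.
var ErrDiskBusy = errors.New("disk is overloaded, try another node")

// DiskLimiter keeps one counting semaphore per disk so that a single
// hung or slow disk cannot exhaust the whole object server.
type DiskLimiter struct {
	sems map[string]chan struct{}
}

// NewDiskLimiter allows at most maxPerDisk in-flight operations per disk.
func NewDiskLimiter(disks []string, maxPerDisk int) *DiskLimiter {
	sems := make(map[string]chan struct{}, len(disks))
	for _, d := range disks {
		sems[d] = make(chan struct{}, maxPerDisk)
	}
	return &DiskLimiter{sems: sems}
}

// Do runs op against the given disk, but gives up quickly if the disk
// already has too many operations in flight, so the caller can fail over.
func (l *DiskLimiter) Do(disk string, wait time.Duration, op func() error) error {
	sem, ok := l.sems[disk]
	if !ok {
		return errors.New("unknown disk: " + disk)
	}
	select {
	case sem <- struct{}{}: // acquired a slot on this disk
		defer func() { <-sem }()
		return op()
	case <-time.After(wait): // the disk is saturated or hung
		return ErrDiskBusy
	}
}
```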
The Hummingbird project is progressing well; let us hope it will soon be officially included in OpenStack.
We have moved the entire cloud storage cluster to Hummingbird. As a result, the average response time of the object servers has dropped and there are far fewer errors. As noted above, we use our own proxy server based on the Hummingbird prototype, and the object layer has been replaced with the set of Hummingbird daemons.
From standard OpenStack Swift we use only the daemons of the account and container layers, plus the updater daemon (object-updater).
Conclusion
In this article we talked about Hummingbird and the problems we managed to solve with its help. If you have questions, we will be happy to answer them in the comments.
Thanks to the move to the new platform, we were able to significantly expand the capabilities of the storage API. New features include user management, domains, SSL certificates, and much more. In the near future we will publish documentation for the updated API in the control panel and describe all the new features in detail in a separate article.