
MemcacheDB and MemcacheQ are key components of a high-performance infrastructure

Today we will talk about two components for building a high-performance, scalable architecture on top of the memcached server: the MemcacheDB distributed key-value store and the MemcacheQ message queue system.







First, let us consider what is available for building distributed storage for a web application. The first thing that comes to mind is database clustering, which is now supported by all common systems, along with various replication technologies. For example, MySQL, the most popular DBMS for web projects, supports both replication and clustering. You can also fall back on the file system and store data there, or turn to a distributed solution such as Apache Hadoop. But this is often too heavyweight; the real requirements are usually much simpler: store and operate on plain key-value pairs. Looked at seriously, that functionality covers the needs of 90% of web applications. And if we add the ability to work with the data very quickly, to store it across multiple servers, and to persist it in a failure-resistant way, we get a very attractive platform.






Memcached has long been known as a caching server used on many high-load projects, including Wikipedia and LiveJournal. It lets you cache arbitrary data in memory and operate on it very quickly, but only the simplest operations are supported, so it is clearly not a full database. Using RAM as the storage medium is ideal for caching, but when reliability or fault tolerance is needed, that burden shifts to the server itself and its hardware.
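To make the caching role concrete, here is a minimal cache-aside sketch. It assumes the python-memcached client and a memcached instance on the default port 11211; load_user_from_db() is a hypothetical stand-in for a real database query.

```python
import memcache

# Assumes the python-memcached client and a memcached instance on the
# default port 11211; load_user_from_db() is a hypothetical stand-in
# for an expensive database lookup.
mc = memcache.Client(['127.0.0.1:11211'])

def load_user_from_db(user_id):
    # Placeholder for a real database query.
    return {'id': user_id, 'name': 'example'}

def get_user(user_id):
    key = 'user:%d' % user_id
    user = mc.get(key)               # fast path: served from RAM
    if user is None:                 # cache miss: fall back to the database
        user = load_user_from_db(user_id)
        mc.set(key, user, time=300)  # keep it cached for 5 minutes
    return user
```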



MemcacheDB was developed precisely to solve these issues: to combine the speed, simple interface, and principles of memcached with the reliability of conventional databases. It is a distributed store of key-value pairs that is compatible with the memcached API, which means any client can work transparently with both the cache and the persistent store without noticing the difference. Unlike a cache, however, MemcacheDB data is stored on disk: the embedded Berkeley DB database is used as the backend, and its facilities, in particular transactions and replication, provide efficient and reliable storage.
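Because MemcacheDB speaks the memcached protocol, an ordinary memcached client can be pointed at it without any code changes. A small sketch, assuming the python-memcached client and MemcacheDB's conventional default port 21201:

```python
import memcache

# A stock memcached client talks to MemcacheDB unchanged, because the
# wire protocol is the same. Port 21201 is the conventional MemcacheDB
# default; adjust it to your own configuration.
db = memcache.Client(['127.0.0.1:21201'])

db.set('session:42', {'user_id': 7, 'theme': 'dark'})
print(db.get('session:42'))   # persisted in Berkeley DB, not just RAM
db.delete('session:42')
```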







In terms of data access speed, MemcacheDB is on the same level as memcached and comparable to specialized databases such as CouchDB; in numbers this amounts to tens of thousands of write and read operations per second (there is a published benchmark, as well as a comparison with CouchDB). The developers themselves warn that MemcacheDB is not a cache, so it should not be used everywhere as a replacement for memcached itself; it simply implements a different storage strategy behind a memcached-compatible interface.



The data operations are the simplest possible, writing, reading, updating, and deleting, yet this is often enough for tasks where we are used to reaching for a regular database. Moving part of those operations to a specialized solution significantly unloads the main database, leaving it for work that genuinely requires serious data-handling facilities. For example, increment/decrement commands are supported, which makes it possible to implement various counters and statistics without touching the database at all, while the system serves thousands of clients in real time.
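As an illustration of such a counter, here is a sketch using the python-memcached client against a MemcacheDB instance assumed to be on 127.0.0.1:21201; the key name and the initialisation logic are my own choices, not anything prescribed by the project:

```python
import memcache

# Counter sketch against MemcacheDB (assumed to listen on 127.0.0.1:21201).
# incr is atomic on the server, so many clients can update the same
# counter without a read-modify-write race.
db = memcache.Client(['127.0.0.1:21201'])

def hit(page):
    key = 'hits:%s' % page
    if db.incr(key) is None:   # incr fails if the key does not exist yet
        db.set(key, 1)         # first hit: initialise the counter
                               # (tiny init race, acceptable for a sketch)
    return db.get(key)

hit('/index')
hit('/index')
print(db.get('hits:/index'))   # -> 2
```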







MemcacheDB is easy to deploy: compile and install it from source, set up the database (it requires no administration), and that is it. You only need to configure client access parameters, a port and a handful of settings that affect performance, such as the size of the data buffer, the directory where the database is stored, and the size of the in-memory cache. Do not assume that every read comes from disk: although the system keeps the database as a file on disk, it of course also uses caching, which lets it rival the original memcached in speed while still providing reliable storage.
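As a rough illustration of that configuration surface, the sketch below launches a MemcacheDB instance from Python. The flag names are taken from memory of the memcachedb 1.2.x documentation and the paths are placeholders, so verify everything against `memcachedb -h` on your build:

```python
import subprocess

# Illustrative launch of a MemcacheDB instance; flag names and paths are
# assumptions, check them against `memcachedb -h` before relying on this.
subprocess.check_call([
    'memcachedb',
    '-p', '21201',                  # TCP port clients connect to
    '-u', 'memcachedb',             # drop privileges to this user
    '-H', '/var/lib/memcachedb',    # directory holding the Berkeley DB files
    '-m', '1024',                   # in-memory Berkeley DB cache, in megabytes
    '-N',                           # trade some durability for write speed
    '-d',                           # run as a daemon
])
```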



The most interesting feature of MemcacheDB is its ability to run on multiple servers, using replication to exchange data and keep the databases synchronized. Depending on your needs, MemcacheDB can use several replication strategies, favouring either data integrity or speed. The main distributed topology supported by the system is one master server and several slave servers that are used only for reading data.
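Since the slaves are read-only, the application itself typically routes writes to the master and spreads reads across the replicas. A client-side sketch of that routing, with placeholder host addresses and the python-memcached client assumed:

```python
import random
import memcache

# Client-side routing sketch for a MemcacheDB master/slave group.
# The host:port values are placeholders; MemcacheDB replicates the data
# itself, the application only has to send writes to the master and may
# spread reads across the read-only slaves.
master = memcache.Client(['10.0.0.1:21201'])
slaves = [memcache.Client([addr]) for addr in
          ('10.0.0.2:21201', '10.0.0.3:21201')]

def put(key, value):
    return master.set(key, value)           # all writes go to the master

def get(key):
    value = random.choice(slaves).get(key)  # reads are load-balanced
    if value is None:                       # replication lag or a miss:
        value = master.get(key)             # fall back to the master
    return value
```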



In the case of multiple servers, the system can use the following replication strategies:









For now, all servers in a group must use the same replication strategy. I am not sure this limitation matters for complex systems with many groups: in the end, you can configure things so that a request is first distributed to one server group using one strategy, and the data is then spread further by a different algorithm, with the root server not even aware of it.



Logging and backup of the database files themselves, including hot backup, are also supported, but that is a separate conversation about the specifics of installation and use.



So we have the ability to organize something like our own Amazon S3: distributed, fast, and reliable, with a simple and clear universal API. There are many applications for such a system; in virtually every high-load project, part of the logic can be moved out of the database into this kind of storage, gaining fault tolerance and relieving the main database of a large number of simple queries.



The second project, also based on the memcached code, is MemcacheQ. This is a message queuing system with an even simpler API: it supports only two commands, write and read. A message queue is a named queue into which messages can be written; later, any client that knows the queue's name can take the messages out of it. The maximum message size is 64 KB, and the data itself is stored in Berkeley DB, which means the same data integrity, replication, and other features are available. Such a system can be used to build communication between users within a project, mail systems, chats, and the like, wherever this functionality is needed together with speed and reliability.
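A minimal producer/consumer sketch, assuming the python-memcached client and a MemcacheQ instance on its conventional port 22201 (each `set` on a queue name appends a message, each `get` removes and returns the oldest one):

```python
import memcache

# MemcacheQ sketch: the queue is addressed by its name, messages must
# stay under the 64 KB limit. Port 22201 is the conventional MemcacheQ
# default; adjust it to your own configuration.
mq = memcache.Client(['127.0.0.1:22201'])

mq.set('mail_queue', 'welcome user 7')   # producer: enqueue a message
mq.set('mail_queue', 'welcome user 8')

msg = mq.get('mail_queue')               # consumer: dequeue the oldest
while msg is not None:
    print('processing:', msg)
    msg = mq.get('mail_queue')           # None once the queue is empty
```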



These two projects, MemcacheDB and MemcacheQ, are quite simple in their external interface and seemingly limited in capability, yet they let you build very powerful, heavily loaded projects on top of them if you take them into account at the design stage. For many projects this makes it possible to abandon, or at least reduce the load on, expensive resources such as the database, while gaining resilience and flexibility.

Source: https://habr.com/ru/post/41464/


