
The campaign against RDBMS, or distributed storage projects (key-value stores)

Do you create projects often? Then you probably use a database in all of them, most likely MySQL (or, for some, PostgreSQL). What is interesting is that, judging by experience and by published descriptions of various architectures, the key features of a database are not needed everywhere in a project. In many cases the database serves as nothing more than ordinary data storage. Caching systems, for example, usually do not use a database at all; caching exists precisely to avoid unnecessary queries. And what is used for caching most often? Memcached. And what is that? It is a storage system built on a hash table. In general terms, it is simply a store of key-value pairs that supports only basic operations: write, read, delete, and check for presence. Yes, there are no filters, no selections, no sorting; at most there is a tag system for fetching all related records in one request. And in many cases this functionality is quite enough.
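To make the four basic operations concrete, here is a minimal sketch of such a store in Python. It is only an in-memory illustration: the class name `KeyValueStore` and its method names are assumptions made for this example, not the API of Memcached or of any other system mentioned here.

```python
from typing import Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        # A plain dict plays the role of the hash table.
        self._data: Dict[str, bytes] = {}

    def set(self, key: str, value: bytes) -> None:
        """Write: store the value, overwriting any previous one."""
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        """Read: return the value, or None if the key is absent."""
        return self._data.get(key)

    def delete(self, key: str) -> bool:
        """Delete: return True if the key existed and was removed."""
        if key in self._data:
            del self._data[key]
            return True
        return False

    def exists(self, key: str) -> bool:
        """Presence check: no value is transferred."""
        return key in self._data
```

Note what is missing: there is no way to ask for "all values where X", which is exactly the trade-off described above.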

I am not a fanatic at all, and in real projects the best choice is usually a combination of a regular relational database and a specialized data store. More advanced systems, which store not just key-value pairs but also additional meta-information about an object, already come close to database capabilities. They are sometimes called document-oriented databases (or stores), because the unit of information they work with is a document and its associated data.

The second criterion, or feature, is distribution. For a DBMS this is often quite difficult or requires third-party tools. Key-value stores are built on a DHT (Distributed Hash Table) and are designed for distributed operation from the start, which gives them scalability and resilience to the failure of individual nodes. Some systems get this from their runtime environment (for example, when the store runs on the Erlang VM), others use ready-made tools for distributed operation (for example, JGroups for Java systems), and others rely on their own solutions, as Memcached does.
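One common way to spread keys across nodes is consistent hashing. The sketch below is a simplified illustration of that idea, not the actual algorithm of any system named in this article; the class `HashRing`, the node names, and the `replicas` parameter are all assumptions made for the example.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _hash(value: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Simplified consistent-hash ring (hypothetical, for illustration)."""

    def __init__(self, nodes: Iterable[str], replicas: int = 100) -> None:
        # Each node is placed on the ring many times ("virtual nodes")
        # so that keys are spread more evenly.
        ring: List[Tuple[int, str]] = []
        for node in nodes:
            for i in range(replicas):
                ring.append((_hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._points = [point for point, _ in ring]

    def node_for(self, key: str) -> str:
        """Return the node that owns the key: the first ring point after its hash."""
        if not self._ring:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The useful property is that when a node disappears, only the keys it owned move elsewhere; keys owned by the surviving nodes stay where they were, which is what makes adding or losing nodes cheap.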
Also important is that such systems are fully ready to run in a cloud environment. It is no accident that Amazon runs stores of this kind (S3 and SimpleDB). Google's well-known BigTable is also, for the most part, just a system for storing and processing key/value pairs. Because the API is simple, even trivial (the internals not always so, although they are still simpler than those of a standard SQL database), these solutions scale well for both reads and writes, including dynamically, without interrupting operation. So if you have a cluster, or will have one, take a closer look at such solutions. One caveat is worth mentioning: very often such systems keep data only in memory. If persistent storage is required, back-end systems are used, including storage in a conventional relational database, although this can impose restrictions on the data and its parameters and also slows things down.
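As a rough sketch of the "in-memory store with a persistent back end" arrangement, the example below keeps values in a dict and writes every change through to a relational table. Using SQLite from the Python standard library is an assumption made purely for the example, as are the class name `PersistentStore` and the table name `kv`; real systems use their own back ends.

```python
import sqlite3
from typing import Dict, Optional


class PersistentStore:
    """In-memory key-value store with write-through to SQLite (illustrative only)."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self._cache: Dict[str, bytes] = {}
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB NOT NULL)"
        )

    def set(self, key: str, value: bytes) -> None:
        # Write to the back end first, then update the in-memory copy.
        self._db.execute(
            "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value)
        )
        self._db.commit()
        self._cache[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # Fast path: serve from memory.
        if key in self._cache:
            return self._cache[key]
        # Slow path: fall back to the relational back end.
        row = self._db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        if row is None:
            return None
        value = bytes(row[0])
        self._cache[key] = value
        return value
```

Every write now costs a database round trip, which illustrates the slowdown mentioned above, and the fixed column types show how a back end can restrict what may be stored.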

Where can this be applied? Wherever you need to store a large (almost unlimited) amount of data that can be split into separate, independent blocks. These can be individual articles, photos, videos or other large binary objects, log entries, user profiles, or session data (by the way, we previously announced our experimental open source project, a Java session server for distributed storage of PHP application sessions; the commercial Zend Platform has a similar solution). In most cases everything comes down either to a blob of binary data or to a text string holding data or code in serialized form. Such data can either be processed further by the application or sent straight to the client. This is what the Nginx module does: it looks in Memcached and, if the requested content is there, serves it directly, without calling your script at all. Right now, for example, I am designing a chat server in which a distributed cache will be the main data store (a Java system using a cache with replication via JGroups), which is essentially the same key-value storage.
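The "look in the cache first, call the script only on a miss" flow can be sketched in a few lines. This is not the Nginx module itself: a plain dict stands in for Memcached, and the names `serve` and `render` are hypothetical. In this sketch the fallback also fills the cache, which in a real setup would normally be the application's job.

```python
from typing import Callable, Dict


def serve(path: str, cache: Dict[str, bytes], render: Callable[[str], bytes]) -> bytes:
    """Return the response body for a path, preferring the cache.

    `cache` stands in for a key-value store such as Memcached;
    `render` stands in for the (expensive) application script.
    """
    body = cache.get(path)
    if body is not None:
        # Cache hit: the application script is never called.
        return body
    # Cache miss: generate the content and remember it for next time.
    body = render(path)
    cache[path] = body
    return body


if __name__ == "__main__":
    cache: Dict[str, bytes] = {}
    render = lambda path: f"<h1>{path}</h1>".encode("utf-8")
    serve("/articles/1", cache, render)  # miss: calls render
    serve("/articles/1", cache, render)  # hit: served from the cache
```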

Okay, enough theory. Let's see which storage systems are available (open source, of course).

Several more systems did not make the list, for example Hadoop HBase, Cassandra, Hypertable, Dynomite, Kai, and Ringo.

It is interesting to note that such systems are mostly built either on specialized languages and platforms (Erlang is almost without competition here) or on established platforms such as Java that have already become classic and mainstream. Only in rare cases are they based on custom C/C++ code.

Are you developing a high-performance system, not necessarily for the web? Do you need a specialized data store that is as simple as possible to set up and that scales with ease, without stopping work for even a second? Do you have a lot of data, all of it simple and reducible to strings, serialized structures, or binary blocks? Do you need data storage that is reliable, distributed, and fault tolerant? If the answer to at least one of these questions is "yes", you should look at a couple of the projects from the list at the very least. They may well help your project withstand the load and keep growing with confidence.

P.S. The original article that prompted me to write this has a good comparison table of these systems.

Source: https://habr.com/ru/post/55077/

