Introduction
Colleagues, when developing applications we run into the need for flexible storage of information (updating it, searching through it, and so on) every single day. The class of products that solves this range of tasks is, as we all know, databases. But what does that word actually mean to us? For many, "database" is firmly associated with MySQL, tables, and SQL queries, and up to a certain point that is perfectly fine. Indeed, relational databases give you a lot: since the data is strongly interlinked, you do not have to look after its integrity yourself. With a simple subquery you can select the number of comments for each blog post; with a JOIN you can easily build complex linked queries and fetch data about several entities at once.
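To make that concrete, here is a minimal sketch using Python's built-in sqlite3; the schema and data are made up purely for illustration:

```python
import sqlite3

# Toy schema: blog posts and their comments (names are illustrative only).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE posts    (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT);
    INSERT INTO posts VALUES (1, 'Hello'), (2, 'World');
    INSERT INTO comments VALUES (1, 1, 'first!'), (2, 1, 'nice'), (3, 2, 'meh');
""")

# A simple correlated subquery: the number of comments for every post.
print(db.execute("""
    SELECT p.title,
           (SELECT COUNT(*) FROM comments c WHERE c.post_id = p.id) AS n_comments
    FROM posts p
""").fetchall())   # [('Hello', 2), ('World', 1)]

# A JOIN pulls data about several entities at once.
print(db.execute("""
    SELECT p.title, c.body
    FROM posts p JOIN comments c ON c.post_id = p.id
""").fetchall())
```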
We scaled and scaled, but it would not scale
However, when you suddenly realize that one server is no longer enough and you want to spread the database across several physical machines, the first thing you are offered is master-slave replication: writes go to one machine, reads come from several. But with a sufficiently large volume of changes the master will soon be doing nothing but shipping its log, and you will have to resort to a sophisticated configuration in which each node has exactly one master and may have several slaves. This scheme is hard to implement and contains a single point of failure (that is, when one server fails, all of its subordinates quickly turn into dinosaurs). And all these tricks only scale reads. If one server is no longer enough for writes, you have to spread the load by keeping different tables on different servers, but then the connectedness is lost. Or you split a table into several parts stored in different places, according to some law (for example, by ID), which carries the charms of JOIN to the grave. The further we try to scale relational databases, the more convenience we pay for it. With master-master, for example, we pay with auto-increments.
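For illustration, a minimal sketch of such a "law": routing records to shards by ID modulo the number of servers. The server list is hypothetical.

```python
# Sharding "by a given law": each record lives on the shard chosen by its ID.
# The shard list and naming are hypothetical.
SHARDS = ["db1.example.com", "db2.example.com", "db3.example.com"]

def shard_for(record_id: int) -> str:
    """Pick the server that owns this ID (simple modulo law)."""
    return SHARDS[record_id % len(SHARDS)]

print(shard_for(42))   # 42 % 3 == 0 -> db1.example.com
print(shard_for(7))    # 7 % 3 == 1 -> db2.example.com

# A JOIN across records living on different shards can no longer be expressed
# as a single SQL query; that is exactly the convenience we pay with.
```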
Will MemcacheDB and Redis save us?
Key-value solutions of this kind have been around for quite some time. In my opinion, MemcacheDB is a hack parasitizing on the good name of a wonderful product: the data in it is completely unrelated, and all we can do is operate on a value by knowing its key. I spent a lot of time writing tools that make working with a key-value database like MemcacheDB bearable, and they even work, but I came to the conclusion that simple everyday tasks come out so crooked that you involuntarily start thinking back towards the relational world. Take object lifetime (TTL): you simply cannot go and delete all sessions older than a month. It does not deliver.
The authors of Redis went a little further in their perverted fantasies and added atomic lists and sets as values, but did not advance much in making the programmer's life easier.
Moreover, you still have to maintain the key cache by hand, i.e. save the needed keys in Memcached and evict them yourself, which creates plenty of synchronization problems. Atomicity of operations is also absent: there is a race between fetching an object and writing it to the cache, and CAS demotivates us with its performance.
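For reference, this is roughly what that manual bookkeeping looks like: a minimal cache-aside sketch with the python-memcached client, where load_from_database is a hypothetical stand-in for the real storage lookup and the server address is illustrative.

```python
import json
import time
import memcache   # python-memcached client

mc = memcache.Client(["127.0.0.1:11211"])   # address is illustrative

def load_from_database(user_id):
    """Hypothetical stand-in for the real primary-storage lookup."""
    return {"id": user_id, "name": "user%d" % user_id, "loaded_at": time.time()}

def get_user(user_id):
    """Cache-aside by hand: exactly the bookkeeping the application has to carry."""
    key = "user:%d" % user_id
    cached = mc.get(key)
    if cached is not None:
        return json.loads(cached)
    obj = load_from_database(user_id)
    # Race window: another process can update the object between the load above
    # and the set below, leaving a stale copy in the cache. Closing the window
    # with gets()/cas() costs extra round trips, the CAS penalty mentioned above.
    mc.set(key, json.dumps(obj))
    return obj
```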
What if we want a product catalog like Yandex.Market?
Suppose we want to build a catalog that can be searched by parameters such as price, rating, the camera's zoom ratio, or the number of gears on a bicycle, and suppose we use a relational database like MySQL. What do we do? Either we create a goods table that holds the zoom ratio, the bicycle's gear count, and, say, the bristle stiffness of a toothbrush all at once (in which case we hit the column limit of the table, or at the very least lose all convenience), or we create a table like (good_id, key, value) and build terrible JOINs for filtering and searching, not to even mention scaling.
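To see just how terrible those JOINs get, a small sqlite3 sketch of the (good_id, key, value) layout; every extra search parameter costs another self-JOIN (all names and numbers are made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE goods  (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE params (good_id INTEGER, key TEXT, value REAL);
    INSERT INTO goods VALUES (1, 'Camera'), (2, 'Bicycle');
    INSERT INTO params VALUES (1, 'price', 300), (1, 'zoom', 10),
                              (2, 'price', 500), (2, 'gears', 21);
""")

# One self-JOIN per search parameter: goods cheaper than 400 with zoom >= 8.
print(db.execute("""
    SELECT g.title
    FROM goods g
    JOIN params p1 ON p1.good_id = g.id AND p1.key = 'price' AND p1.value < 400
    JOIN params p2 ON p2.good_id = g.id AND p2.key = 'zoom'  AND p2.value >= 8
""").fetchall())   # [('Camera',)]
```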
You could also bolt on Sphinx or something even worse, but that would most likely look like hammering nails with a laptop.
Relational databases have a serious birth defect here.
MongoDB, you say?
MongoDB is a blisteringly fast document-oriented database. Ideologically it is a kind of symbiosis between the familiar relational database and key-value storage, and, in my opinion, a very successful one. On the one hand, it allows very fast operations on an object when you know its identifier; on the other, it provides a powerful tool for complex interactions.
A collection is a named set of objects; each object belongs to exactly one collection.
An object is a set of properties, including the unique identifier _id.
A property is a combination of a name with a corresponding type and value.
Property types: string, integer, floating-point number, array, object, binary string, byte, character, date, boolean, null.
Selection operations (count, group, MapReduce, ...), inserts, updates, and deletes are supported. There are no links between objects; an object can only embed other objects in its properties. Both unique and compound indexes are supported, and indexes can be built on the properties of nested objects.
Replication is supported (and in fact assumed), and fail-over is implemented.
MapReduce and sharding are implemented.
Since objects can have an arbitrary set of properties, for the catalog it is enough to create a goods collection and put the objects into it, and the search will use an index.
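A minimal sketch of such a catalog with the pymongo driver; the collection name, field names, and values are purely illustrative:

```python
import pymongo

client = pymongo.MongoClient("localhost", 27017)   # connection details are illustrative
goods = client["shop"]["goods"]

# Objects in one collection may have completely different sets of properties.
goods.insert_many([
    {"title": "Camera",     "price": 300, "rating": 4.5, "zoom": 10},
    {"title": "Bicycle",    "price": 500, "rating": 4.8, "gears": 21},
    {"title": "Toothbrush", "price": 3,   "rating": 3.9,
     "bristles": {"stiffness": "soft"}},
])

# Indexes, including one on a nested property, so searches do not scan everything.
goods.create_index([("price", pymongo.ASCENDING), ("rating", pymongo.DESCENDING)])
goods.create_index("bristles.stiffness")

# Search by parameters: goods under 400 with a zoom ratio of at least 8.
for doc in goods.find({"price": {"$lt": 400}, "zoom": {"$gte": 8}}):
    print(doc["title"], doc["price"])

print(goods.count_documents({"price": {"$lt": 400}}))   # how many cheap goods
```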
MongoDB's speed is most pronounced on inserts: they are very fast indeed. It is also nice that the storage format and the wire format for objects are one and the same, so to select an object the server only needs to find its position via the index and return a chunk of a file of a given length to the client; there is no abstraction layer over the storage engine.
The unique identifier is not an auto-increment field but a 12-byte unique number generated on the client. First, this removes the problem of replica synchronization: you can insert independently on two different machines and there will be no conflict. Second, there is no nonsense with integer overflow, and after re-creating the database search engines will not end up pointing old links at new articles.
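A quick illustration with the bson package that ships with pymongo:

```python
from bson import ObjectId   # ships with pymongo

# The 12-byte identifier is generated on the client, not by the server,
# so two machines can insert concurrently without coordinating a counter.
oid = ObjectId()
print(oid)                   # 24 hex characters = 12 bytes
print(oid.binary)            # the raw 12 bytes
print(oid.generation_time)   # a timestamp is embedded in the first 4 bytes
```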
MongoDB + Memcached? Better than Bonnie and Clyde!
We have more or less sorted out the database; now let's think about how to cache all of this, because we want to serve hot content at Memcached speed!
First, caching of the objects themselves. As many have guessed, or learned from experience, in most cases selecting an object by its ID is the most frequent operation: fetching a user object, fetching this very post from the Habrahabr database (amen), and so on.
As we found out above, it is unwise to push this work onto the application, since we would have to grow a whole garden of distributed locks. Let's go the other way: write an asynchronous application that connects to MongoDB, pretends to be a slave, receives the change log, and flushes the changes to Memcached (if the object carries a key in its _key property). Since the application is asynchronous this happens quickly, and no race occurs. It is easy to write; my implementation is here. Moreover, I also added sending the changes to an event server, so I only have to modify an object from anywhere (even from the console) for it to be immediately pushed to all clients subscribed to it.
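The linked implementation does the real work; purely to show the idea, here is a rough sketch of tailing the replica-set oplog (local.oplog.rs) with a tailable cursor and pushing documents that carry a _key property into Memcached. The oplog handling is simplified (updates, for instance, may carry only a partial document), so treat it as an outline, not as the linked tool:

```python
import json
import memcache
import pymongo

mc = memcache.Client(["127.0.0.1:11211"])
client = pymongo.MongoClient()            # must be a replica-set member to have an oplog
oplog = client.local["oplog.rs"]

# A tailable cursor behaves like `tail -f` on the change log.
cursor = oplog.find(cursor_type=pymongo.CursorType.TAILABLE_AWAIT)
for entry in cursor:
    if entry["op"] not in ("i", "u"):     # inserts and updates only
        continue
    doc = entry.get("o", {})
    key = doc.get("_key")                 # the article's convention: cache only
    if key:                               # objects that carry a _key property
        mc.set(key, json.dumps(doc, default=str))
```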
Query caching and cache invalidation should work a little differently. There are several approaches:
- Cache for a fixed time. If the query is the same (or nearly so), the cache can be refreshed no more than, say, once per second, which significantly reduces the load (a minimal sketch of this approach follows the list).
- Delete cache entries at the application's request. A rather tedious approach, but it has a right to exist.
- Use a lock service so that the same work is never done twice. Definitely a good thing.
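A minimal sketch of the first approach, a query result refreshed at most once per second, in plain Python:

```python
import time

_cache = {}   # query key -> (expires_at, result)

def cached_query(key, compute, ttl=1.0):
    """Recompute a heavy query result at most once per `ttl` seconds."""
    now = time.time()
    hit = _cache.get(key)
    if hit is not None and hit[0] > now:
        return hit[1]
    result = compute()
    _cache[key] = (now + ttl, result)
    return result

# However often the page is requested, the expensive query runs at most once a second,
# e.g. cached_query("goods:cheap", lambda: goods.count_documents({"price": {"$lt": 400}}))
```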
Drawbacks you will face
- MongoDB is a rather young product, and there are bugs in it ("Segmentation fault, core dumped" does happen), new features keep appearing, and so on. I use it in production, but with caution. An undoubted advantage, however, is the fast pace of development (the project is written not only by volunteers but also by a company of full-time developers), so you can count on quick bug fixes, help with your problems, and implementation of your ideas (provided, of course, they are good). Commercial support is also available.
- Overhead for storing property names.
- By default, the maximum object size is 4 megabytes.
- On 32-bit machines, the maximum size of a single database is 2 gigabytes.
Finita la comedia!
Project page: MongoDB.
A benchmark I found: MongoDB vs MySQL.
This article may well become the start of a MongoDB series. Next time I will cover sharding and MapReduce in more detail and give a living example.
Friends, thank you for your attention! I will be glad to see constructive comments.
The continuation has been published: "MongoDB - make good coffee".