All good startups either die quickly or grow to the point where they need to scale. We will model such a startup: first it is all about features, then about performance. We will improve performance with MongoDB, a popular NoSQL data store. It is easy to get started with MongoDB, and many problems have out-of-the-box solutions. However, as the load grows, you hit pitfalls that nobody warned you about beforehand... until today!

The modeling is done by Sergey Zagursky, who is responsible for backend infrastructure in general, and for MongoDB in particular, at Joom. He also worked on the server side of the MMORPG Skyforge. As Sergey describes himself, he is "a professional at collecting bumps with his own forehead and stepping on rakes". Under the microscope is a project that uses an accumulation strategy to manage technical debt. In this text version of the HighLoad++ talk, we move in chronological order from the appearance of each problem to its solution with MongoDB.
First difficulties
We are simulating a startup that learns the hard way. In the first stage of its life, features get shipped and, unexpectedly, users arrive. Our small MongoDB server comes under a load we never dreamed of. But we are in the cloud, we are a startup! We do the simplest things possible: we look at the queries - oh, here the whole collection is read for each user - so we build indexes, add hardware, and put a cache here and there.
That's it - we live on!
If problems can be solved by such simple means, they should be solved that way.
But the further path of a successful startup is a slow, painful postponement of the moment of horizontal scaling. I will try to give advice on how to survive this period, reach the point of scaling, and avoid stepping on rakes.
Slow writes
This is one of the problems you may run into. What do you do if you hit it and the methods above do not help? The answer lies in MongoDB's default durability guarantee mode. In a nutshell, it works like this:
- We come to the primary replica and say: "Write!".
- The primary replica writes the data.
- The secondary replicas then read the change from the primary and report back: "We have written it!".
Once the majority of secondary replicas have done this, the request is considered complete and control returns to the driver in the application. Such guarantees let you be sure that once control is back in the application, the write will survive even if MongoDB crashes, barring some truly terrible disaster.
Fortunately, MongoDB lets you relax the durability guarantee for each individual request.
For important requests we keep the default maximum durability guarantees, and for some requests we relax them.
Query classes
The first layer of guarantees we can drop is waiting for the write to be acknowledged by the majority of replicas. This saves latency but does not add throughput. Sometimes latency is exactly what you need, especially if the cluster is somewhat overloaded and the secondary replicas are not keeping up as fast as we would like.
{w:1, j:true}
If we write with these guarantees, then at the moment control returns to the application we no longer know whether the write will survive a failure. But usually it does.
The next guarantee, which affects both throughput and latency, is disabling acknowledgement of the write in the journal. The journal is written in any case; it is one of MongoDB's fundamental mechanisms. By turning off journal acknowledgement we skip two things: we do not fsync the journal and we do not wait for that fsync to finish. This can noticeably save disk resources and give a multiple increase in throughput simply by changing the durability guarantee.
{w:1, j:false}
The "toughest" reduction of the durability guarantees is disabling acknowledgements altogether. We only get confirmation that the request reached the primary replica. This saves latency but does not add throughput.
{w:0, j:false}
We also stop receiving other feedback, for example that a write failed because of a unique key conflict.
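As an illustration, here is a minimal mongo shell sketch of the same write issued with the guarantee levels described above (the orders collection and its document are hypothetical):
// Maximum durability: wait for the majority of replicas and for the journal fsync.
db.orders.insertOne({ userId: 42, total: 99 },
                    { writeConcern: { w: "majority", j: true } });
// Do not wait for the majority, but still wait for the primary's journal.
db.orders.insertOne({ userId: 42, total: 99 },
                    { writeConcern: { w: 1, j: true } });
// Wait only for the primary's acknowledgement, no journal fsync.
db.orders.insertOne({ userId: 42, total: 99 },
                    { writeConcern: { w: 1, j: false } });
// Fire and forget: no acknowledgement at all; errors such as unique key
// conflicts are not reported back.
db.orders.insertOne({ userId: 42, total: 99 },
                    { writeConcern: { w: 0, j: false } });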
Which operations is this applicable to?
Let me describe how this is applied in the Joom setup. Besides the user-facing load, where we do not relax durability at all, there is a load best described as background batch processing: updates, rating recalculation, collection of analytics data.
These background operations can run for hours, but they are designed so that if something breaks, for example the backend crashes, they do not lose all of their work and instead resume from a point in the recent past. For such tasks relaxing the durability guarantee is useful, especially since an fsync of the journal, like any other operation, also increases latency for reads.
Read scaling
The next problem is insufficient read throughput. Recall that our cluster contains not only a primary replica but also secondaries, which we can read from. Let's do that.
You can read from them, but there are nuances. Secondary replicas return slightly stale data, lagging by 0.5-1 seconds. In most cases this is fine, but the behavior of a secondary differs from the behavior of the primary. A secondary runs the oplog-apply process, which the primary does not, and that process is not designed for low latency - the MongoDB developers simply do not optimize for it. Under certain conditions, applying the oplog from primary to secondary can introduce delays of up to 10 s.
Secondary replicas are therefore unsuitable for user-facing requests - the user experience marches straight into the trash.
On non-sharded clusters these spikes are less noticeable, but they are still there. Sharded clusters are hit harder, because oplog application suffers particularly from deletions, and deletion is part of the balancer's job. The balancer happily, with relish, removes documents by the tens of thousands in a short period of time.
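For background reads that can tolerate slightly stale data, the read preference can be set per connection or per query in the mongo shell; a small sketch (the collection name is hypothetical):
// Route all reads on this shell connection to secondaries when possible.
db.getMongo().setReadPref("secondaryPreferred");
// Or send a single analytical query to a secondary explicitly.
db.orders.find({ status: "delivered" }).readPref("secondary");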
Number of connections
The next factor to think about is the limit on the number of connections to MongoDB instances. By default there is no limit beyond OS resources - you can keep connecting for as long as the OS allows. However, the more concurrent requests there are, the slower they execute, and performance degrades nonlinearly. So when a spike of requests arrives, it is better to serve 80% of them than to fail to serve 100%. The number of connections should be limited directly in MongoDB.
But there are pitfalls that can cause trouble here. In particular, the connection pool on the MongoDB side is shared between user connections and service intra-cluster connections. If the application "eats up" all the connections in this pool, the integrity of the cluster can break.
We learned this when we were about to rebuild an index. We needed to drop the uniqueness constraint from an index, and in MongoDB you cannot build a second index alongside an existing one that differs only by uniqueness, so the procedure went through several stages. We wanted to:
- build a similar temporary index without uniqueness;
- delete the unique index;
- build a non-unique index in place of the deleted one;
- remove the temporary index.
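A minimal sketch of that plan in the mongo shell, with a hypothetical items collection and a unique index on externalId; the temporary index includes _id only so that it is not identical to the existing one:
// 1. Build a similar temporary index without uniqueness.
db.items.createIndex({ externalId: 1, _id: 1 }, { name: "externalId_tmp" });
// 2. Delete the unique index.
db.items.dropIndex("externalId_1");
// 3. Build the non-unique index in place of the deleted one.
db.items.createIndex({ externalId: 1 }, { name: "externalId_1" });
// 4. Remove the temporary index.
db.items.dropIndex("externalId_tmp");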
While the temporary index was still being built on the secondaries, we started deleting the unique index. At that point the secondaries declared a lock: some metadata got locked, and all majority writes stalled - they hung in the connection pool, waiting for confirmation that the write had gone through. All reads on the secondaries also stopped, because the global lock was held.
The cluster in this interesting state also lost connectivity. It would come back occasionally, and when two nodes managed to reach each other they tried to hold an election in this state, which they could not complete because of the global lock.
Moral of the story: the number of connections must be monitored.
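A hard limit is set at mongod startup via net.maxIncomingConnections in the configuration; current usage can be watched from the shell roughly like this (the 80% threshold is an arbitrary assumption for illustration):
// How many connections this mongod has now and how many are still available.
var conn = db.serverStatus().connections;
printjson(conn);  // { current: ..., available: ..., totalCreated: ... }
// Crude alert condition: warn when most of the allowed connections are taken.
if (conn.current > 0.8 * (conn.current + conn.available)) {
    print("WARNING: connection limit almost exhausted");
}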
There is one well-known MongoDB pitfall that people still step on so often that I decided to walk through it briefly.
Do not lose documents
If you send MongoDB a query that iterates over an index, the query may fail to return all documents matching the condition, and in completely unexpected cases. The reason is that while we are walking the index from its beginning, a document that was near the end can be updated and move into the part of the index we have already passed, so we skip it. This is caused purely by index mutability. For reliable iteration, use indexes on immutable fields and there will be no problems.
MongoDB also has its own opinion about which index to use. This is solved simply: with $hint we force MongoDB to use the index we specify.
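A small sketch: iterating a hypothetical users collection by an immutable field (createdAt is set once at insert time) and forcing that index with hint():
// Index on a field that never changes after the document is created.
db.users.createIndex({ createdAt: 1 });
// Force MongoDB to use exactly this index for the iteration.
db.users.find({ status: "active" })
        .hint({ createdAt: 1 })
        .sort({ createdAt: 1 })
        .forEach(function (doc) {
            // process the document
        });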
Sizes of collections
Our startup keeps growing, there is a lot of data, but we don't want to add disks - we have already added them three times in the last month. Let's look at what we actually store and at the size of our documents. How do you figure out where in a collection the size can be reduced? By two metrics:
- by the size of specific documents, which you can check with Object.bsonsize(document);
- by the average document size in the collection: db.c.stats().avgObjectSize.
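Both checks from the mongo shell, assuming a hypothetical products collection:
// Size of one specific document, in bytes.
var doc = db.products.findOne({ _id: "some-id" });
print(Object.bsonsize(doc));
// Average document size across the whole collection, in bytes.
print(db.products.stats().avgObjectSize);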
How can you affect the size of a document?
I have a few general answers to this question. The first: long field names increase the size of the document. Field names are stored in every document, so if a document has a long field name, the length of that name is added to the size of every single document. If you have a collection with a huge number of small documents with just a few fields, give the fields short names: "a", "b", "cd" - two letters at most.
On disk this is compensated by compression, but everything is stored in the cache as is.
The second tip: sometimes a field with low cardinality can be moved into the collection name. For example, such a field could be the language. If we have a collection of translations into Russian, English and French plus a field storing the language, the value of that field can be moved into the collection name. This reduces the size of the documents and can also reduce the number and size of indexes - a solid saving! It is not always possible to pull off, because sometimes there are indexes and queries that stop working once the collection is split into several collections.
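A sketch of the idea with a hypothetical translations collection: instead of storing the language in every document, the language goes into the collection name.
// Before: one collection, every document carries a "lang" field (and its index).
db.translations.find({ lang: "en", key: "checkout.title" });
// After: one collection per language, the field and its index disappear.
db.translations_en.find({ key: "checkout.title" });
db.translations_ru.find({ key: "checkout.title" });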
The last tip on document size: use the _id field. If your data has a natural unique key, put it directly into _id. Even if the key is composite, use a compound _id - it is indexed anyway. There is just one small pitfall: if your marshaller sometimes changes the order of fields, then _id values with the same fields in a different order are treated as different _id values by MongoDB's unique index. In some cases this can happen in Go.
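A sketch of that field-order pitfall with a hypothetical events collection: the same logical key serialized with its fields in a different order is a different key for the unique _id index.
// Both inserts succeed: the unique _id index treats these as different values.
db.events.insertOne({ _id: { userId: 1, day: "2019-06-01" } });
db.events.insertOne({ _id: { day: "2019-06-01", userId: 1 } });
db.events.countDocuments({});  // 2 - logically the "same" key stored twice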
Index sizes
An index stores a copy of the fields it contains, so its size is made up of the data being indexed. If you index large fields, be prepared for the index to be large too.
The second thing that greatly inflates indexes: an array field in an index multiplies the other fields of the document within that index. Be careful with large arrays in documents: either do not index anything else together with the array, or experiment with the order in which the fields are listed in the index.
The order of fields matters, especially if one of the indexed fields is an array. If the fields differ in cardinality - the number of possible values in one field is very different from the number in another - it makes sense to order them by increasing cardinality. You can easily save 50% of the index size just by swapping fields of different cardinality, and reordering fields can sometimes yield an even bigger reduction.
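The effect of field order can be compared directly; a sketch on a hypothetical purchases collection where lang has about 100 distinct values and userId has millions:
// Two candidate orderings of the same compound index.
db.purchases.createIndex({ lang: 1, userId: 1 }, { name: "lang_userId" });
db.purchases.createIndex({ userId: 1, lang: 1 }, { name: "userId_lang" });
// Compare their sizes and keep the smaller one.
printjson(db.purchases.stats().indexSizes);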
Sometimes, when a field contains a large value, we do not need to compare it as greater or less, only to check it for exact equality. In that case an index on a field with heavy contents can be replaced with an index on a hash of that field. The index then stores copies of the hash instead of copies of the field.
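A sketch with a hypothetical pages collection, where url can be long and is only ever queried by exact match:
// A hashed index stores hashes of the value instead of the value itself.
db.pages.createIndex({ url: "hashed" });
// Equality queries can use the hashed index; range queries cannot.
db.pages.find({ url: "https://example.com/some/very/long/path" });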
Deleting documents
I have already mentioned that deleting documents is an unpleasant operation, and it is better not to delete them at all if you can avoid it. When designing the data schema, try to arrange things so that you either delete as few individual documents as possible or can drop entire collections. Dropping a collection is a cheap operation, while deleting thousands of individual documents is hard work.
If you still end up needing to delete a lot of documents, be sure to throttle the deletion, otherwise the mass delete will hurt read latency in unpleasant ways. This is especially bad for latency on secondaries.
It is worth building in a "knob" for tuning the throttling - it is very hard to pick the right level on the first try. We have been through this so many times that we usually guess the throttling right only on the third or fourth attempt. Provide a way to tweak it from the start.
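A minimal throttled-deletion sketch in the mongo shell; the batch size and pause are the knobs, and the numbers here are arbitrary assumptions:
// Delete matching documents in small batches, pausing between batches
// so that reads (especially on secondaries) are not starved.
var batchSize = 500;   // knob: documents per batch
var pauseMs   = 200;   // knob: pause between batches
while (true) {
    var ids = db.logs.find({ expired: true }, { _id: 1 })
                     .limit(batchSize)
                     .toArray()
                     .map(function (d) { return d._id; });
    if (ids.length === 0) break;
    db.logs.deleteMany({ _id: { $in: ids } });
    sleep(pauseMs);  // shell helper: pause in milliseconds
}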
If you need to delete more than 30% of a large collection, copy the live documents into a new collection instead and drop the old collection entirely. There are nuances, of course, since the load has to be switched from the old collection to the new one, but do it this way if you can.
Another way to delete documents is a TTL index: an index on a field holding a timestamp that marks the document's expiration time. When that time comes, MongoDB deletes the document automatically.
The TTL index is convenient, but its implementation has no throttling: MongoDB does not bother to pace these deletions. If a million documents come due for deletion at the same moment, for several minutes you will have an unusable cluster that does nothing but delete. To prevent this, add some randomness: smear the TTL values as much as your business logic and your tolerance for latency side effects allow. Smearing the TTL is mandatory if your business logic naturally concentrates expirations at a single point in time.
Sharding
We tried to put this moment off, but it has come: we have to scale horizontally after all. For MongoDB, that means sharding.
If you are in doubt about whether you need sharding, you do not need it.
Sharding complicates life for developers and for ops in many ways; within the company we call it the "sharding tax". When we shard a collection, its per-node performance drops: MongoDB has to maintain an extra index for sharding, and you have to pass additional parameters in queries so that they can be executed efficiently.
Some things simply do not work well with sharding. For example, it is a bad idea to use queries with skip, especially if you have many documents. You issue the command "skip 100,000 documents", and MongoDB goes: "the first, the second, the third... the hundred-thousandth - and from here on we return results to the user."
In a non-sharded collection MongoDB performs this operation internally. In a sharded one, all 100,000 documents are actually read and sent to the sharding proxy, mongos, which then filters on its side and discards the first 100,000. An unpleasant feature to keep in mind.
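Where pagination over a sharded collection is needed, a common workaround is keyset pagination: page by a range on an indexed key instead of skip, so that each page is an ordinary indexed query. A sketch with a hypothetical items collection:
// skip-based pagination: mongos receives and then discards the skipped documents.
db.items.find().sort({ _id: 1 }).skip(100000).limit(100);
// Range-based pagination: remember the last _id of the previous page and
// continue from there; only the requested page is read and transferred.
var lastId = ObjectId("5d0000000000000000000000");  // taken from the previous page
db.items.find({ _id: { $gt: lastId } }).sort({ _id: 1 }).limit(100);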
With sharding, the code inevitably becomes more complex: in many places you have to carry the shard key around. That is not always convenient, and not always even possible. Some requests turn into broadcasts or multicasts, which does not help scalability either. Be careful when choosing the key you shard on.
In sharded collections the count operation breaks: it starts returning numbers larger than reality, sometimes off by a factor of 2. The reason lies in the balancing process, when documents are poured from one shard to another. While documents have already been transferred to the destination shard but not yet deleted on the source, count still counts them. The MongoDB developers do not call this a bug - it is a feature. I do not know whether they will ever fix it.
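If an exact number is needed on a sharded collection, an aggregation-based count goes through the query path instead of the chunk metadata; a sketch (the collection name is hypothetical):
// count() on a sharded collection may include documents that are mid-migration.
db.items.count();
// An aggregation-based count avoids the metadata shortcut
// (countDocuments() wraps such an aggregation).
db.items.countDocuments({});
db.items.aggregate([{ $count: "n" }]);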
A sharded cluster is much harder to administer. Your ops people will stop greeting you, because taking a backup becomes radically more complicated. With sharding, the need to automate infrastructure flashes like a fire alarm - something you could have gotten away without earlier.
How sharding works in MongoDB
There is a collection, and we want to spread it across shards somehow. To do this, MongoDB splits the collection into chunks by the shard key, trying to cut the shard key space into roughly equal pieces. Then the balancer kicks in and diligently spreads these chunks across the shards of the cluster. The balancer does not care how much these chunks weigh or how many documents they contain, since balancing proceeds chunk by chunk.
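In the mongo shell, sharding a collection looks roughly like this (database, collection and key are hypothetical):
// Enable sharding for the database, then shard the collection by a key.
sh.enableSharding("shop");
sh.shardCollection("shop.orders", { _id: "hashed" });   // or a range key such as { userId: 1 }
// Inspect how chunks are distributed across shards.
sh.status();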
Sharding key
So you have decided to shard after all? The first question, then, is how to choose the shard key. A good key has several properties: high cardinality, immutability, and a good fit with your frequent queries.
The natural choice of shard key is the primary key, the _id field. If _id is suitable for sharding, it is best to shard on it directly. It is an excellent choice: it has good cardinality and it is immutable; how well it fits your frequent queries depends on your business specifics. Start from your own situation.
Here is an example of an unfortunate shard key. I have already mentioned the translations collection; it has a field storing the language. Say the collection supports 100 languages and we shard by language. That is bad: the cardinality is only 100 possible values, which is low. But that is not the worst part - perhaps that cardinality would be enough. Worse, as soon as we shard by language, we immediately discover that we have three times more English-speaking users than all the rest. The unlucky shard that holds English receives three times more requests than all the others combined.
So keep in mind that sometimes the shard key naturally causes an uneven load distribution.
Balancing
We come to sharding when we have no choice: our MongoDB cluster is creaking, grinding its disks and CPU - everything it has. Where do we go? Nowhere, so we heroically shard our five collections. We shard, we run, and suddenly discover that balancing is not free.
Balancing goes through several stages. The balancer picks the chunks and the shards they will move from and to. The work then proceeds in two phases: first the documents are copied from the source to the target, and then the copied documents are deleted from the source.
Our source shard is overloaded - all the collections live on it - but the first phase of the operation is relatively easy for it. The second phase, deletion, is quite unpleasant, because it flattens a shard that is already suffering under load.
The problem is made worse by the fact that if we are balancing many chunks, thousands for example, then with the default settings all of these chunks are copied first, and only then does the deleter come along and start removing them en masse. At that point you can no longer influence the procedure; all you can do is watch what happens.
So if you are about to shard an already overloaded cluster, plan ahead, because balancing takes time. It is best to spend that time not during prime time but in periods of low load. The balancer can be switched off: you can manage balancing manually, turning the balancer off during prime time and back on when the load has dropped enough to allow it.
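Balancer control from the mongo shell: it can be stopped and started manually, or given an active window so that it only runs off-peak (the window below is just an example):
// Manual control: stop the balancer before peak hours, start it afterwards.
sh.stopBalancer();
sh.startBalancer();
// Or restrict balancing to a nightly window.
var configDB = db.getSiblingDB("config");
configDB.settings.update(
    { _id: "balancer" },
    { $set: { activeWindow: { start: "23:00", stop: "06:00" } } },
    { upsert: true }
);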
If your cloud still allows vertical scaling, it is better to beef up the hardware of the source shard in advance so that all these side effects are somewhat reduced.
You need to prepare for sharding carefully.