MongoDB for growth

Greetings fighters invisible backend!

You have already read MongoDB reviews. Probably, there were excellent online courses at university.mongodb.com . Of course, you already have a promising prototype project using MongoDB.

What can we expect from MongoDB at this stage?

Cheaper storage - reading from slave replicas saves the iops master, no RAID is required, the failure of one disk is not fatal.
Increasing the speed of development - you can allow more carelessness in the design of data structures, because we can easily fix everything on a running application.
We increase the responsiveness of the application - regardless of the development, it is easy to increase the number of leading replicas or the number of shards to compensate for the increased load on the application.
We increase the reliability of the application - regardless of the development, we remove a single point of failure.

And now, you are ready to get involved in a fight - to release the project to the public.

What is the article about?

This is an attempt to summarize your development experience with MongoDB. My first practical acquaintance with this DBMS took place in 2010 in the form of a prototype project. After completing the first 10gen online courses, I confidently applied MongoDB in parallel with SQL, but only on auxiliary and research projects.

Since 2013, I have been working in Smartcat as a leading developer. All new projects we started only with the use of MongoDB. We do most of the development in C #, but there are projects in other languages. In all projects I either designed the data scheme myself, or actively advised colleagues. Throughout the life of the project, I watched the nature of the use of the DBMS. In 2015, the main project of the Smartcat website was smoothly transferred from MSSQL to MongoDB.

This is a description of what needs to be considered in the design so that the project is flexible in scaling.
You can get into a situation where there is a successfully working prototype project. While there are few users, all optimizations are postponed. But, the service "spins up", users have already arrived, and the work can not be stopped. And scaling is impossible without deep refactoring of the database data scheme, and simple queries that worked yesterday, today, put the system.

My main area of development is a C # backend, so examples will be in this vein.

So, let's try to describe the stages of project development in terms of using MongoDB, from a single server to a shard cluster and the reasons that force us to move on to the next, more complex configuration of database servers.

Standalone

Of course, for the 24/7 project, we do not consider a single server. But if this is an internal project or an office solution, then it will be fine. It is only necessary to plan the possibility of stopping the service to remove the backup or to provide for software export / import of all the necessary data. This configuration of special issues in the administration does not cause, but the reliability does not shine.

Basically, single servers are used for development. This can be a common server - in this case, it is convenient to invite a colleague to develop an application on a single database. This can be a locally running service.

Most importantly, we expect that the software solutions in the client code will work identically in terms of replicates. And also, even though it feels like a distant prospect, in a shard cluster.

Replication

Replication is a necessary condition for the continuous work of a combat project. We absolutely count on:

Automatic or manual recovery after the failure of individual servers-copies.
The possibility of "freezing" the state of the database for research or backup removal.
Moving from one data center to another without shutting down the site.

When working replicas we must constantly take care to:

Do not lose network connectivity and leading server.
Slave is not far behind.
Do not lose sync due to opLog exhaustion.

Maybe later we will be able to read something from the slave servers to offload our only writing server.

The composition of the servers replicates and the majority of waiting when recording

In deployment schemes, we denote servers as: P - primary, S - slave, H - hidden, A - arbiter.
Waiting to record most of the replicas hereinafter will be - w: majority .

On non-critical projects (this is if we can afford manual recovery) the following configurations can be used:

PS - blocking the slave, you can make a backup, but at the same time hang all the expectations w: majority .
PH - hidden replica blocking does not interfere with the application, but it cannot be used for reading.
PSH - there are backups, w: majority too, but we are sensitive to server shutdown.

If cluster auto-recovery is required, then the minimum number of servers is 4 in the PSSH configuration. This is the optimal configuration for us now. Backups are regularly removed from the hidden replica, and stopping the service on any machine does not threaten the application. Sometimes it takes a long time to decommission one of the servers. You can stop backups at this time and include a hidden replica in the balancing, or you can add another server to the replica.

By the way, configuration with the arbiter should be used very carefully. The arbitrator increases the total number of servers, but does not write data. It turns out that waiting for w: majority requires a large proportion of servers. So, on the PSA configuration, in case of failure of P or S, the ability to write will be, but without w: majority .

In general, when planning a configuration, you should list all combinations of server failures in order to understand which of them you can compensate automatically. And for manual recovery, you should immediately prepare the instructions.

OpLog Exhaustion

This is a log of write operations to the database. Each slave server replicates it to itself from the master, and the write operations are played on the local copy of the data.

The first problem is that opLog has a limited size and is overwritten cyclically. The main danger is that with intensive recording the opLog history will cover a short time. For any playback delay, for example, as a result of CPU overload or network latency or disk recording, the host server may lose the last recording. In this case, it will stop the synchronization and go into the RECOVERING mode. With intensive recording, there is a very high risk of several slave servers leaving this mode. As a result, you can lose the quorum for selecting a master server, and the cluster will stop accepting data for recording. Cluster recovery is a manual procedure.

We call it - "destroyed" replicas.

The second problem is that the volume of replication traffic can become an obstacle for the separation of servers into different data centers.

We monitor the size of the opLog in seconds, the difference between the time of the first and last records. For example, the minimum size of opLog - 10Ks - allows you to make a backup and not lose server synchronization. If the average opLog size is 100Ks, then this is enough to have time to react when it drops sharply.

When this happens:

Simultaneous raid users with adding content or editing.
Error in business logic, when we start to update a lot of documents.

The saddest thing is that such errors are difficult to detect when profiling queries. It is necessary to analyze opLog directly.

Examples of errors in the backend code:

Flush stream entries in GridFS

When exporting translation results, we stream files to a zip archive, and it is also streamed to GridFS. The file chunk size was increased to 2 Mb, and the recorded files usually had a size of 200–900 Kb.

After each file was written to the archive, the Stream.Flush () method in the library we used was called. And in MongoCSharpDriver v1.11 there is an interesting feature of working with the stream, which we did not immediately find out: when calling the Stream.Flush () method, the current recorded chunk of the file was updated in the database.

For example, when writing a file in 6 Mb, we expect that opLog will get 3 inserts of chunks of 2 Mb each, i.e. we assumed that the data in OpLog would be slightly larger than the file size.

The trouble came with the archive, in which there were many files of 3-5 Kb.

To simplify the calculations, we will assume that we are recording one file of 2 Mb in size (that is, we expect to write to one 2 megabyte chunk). The Flush () method will be called every 4 Kb, and for the entire streaming time it will be called 512 times. Moreover, the size of the update each time will increase from 4 Kb to 2 Mb linear. This is an arithmetic progression with a delta of 4 Kb.
The sum of the first 512 items $(4Kb + 2Mb) / 2 * 521 = 513 Mb$
Instead of 2 Mb we lose in opLog 513 Mb - unexpectedly ...

The first solution is to cache the file in memory before writing to the database, but now we are using the latest version of the driver v2.4.4. There this problem is solved - the Flush () method does not update the data in the database.

Update Nested Array

Nested arrays in MongoDB can greatly simplify development. But they should be used very carefully. Here is another example from practice.

In our data scheme, the document describes the task of exporting the file list. The file list is described by a nested array, the element of which contains a link to the file and the readiness status. Task processing looks like this:

read the document
look for the next file in the nested array and process it,
update the readiness status of the document.

At the development stage, the number of documents was limited to manually putting check marks in the UI. Usually it was about 5, a maximum of 10 files.

Particularly stubborn users could "click" page in 50 elements. Therefore, the developer simplified the algorithm - the entire array was updated immediately. The record size is approximately 300 bytes.

10 updates - $300 * 10 * 10 = 29 Kb$

It seems a little. But, as you can see, the volume consumption of opLog became quadratically dependent on the number of files. A little later, thanks to the development of the convenience of UI, it became possible to choose the export of all files attached to the project.

Sequential export of several projects with 1000 files each - and the falling opLog was caught by forced shutdown of the task handler.

The fix is trivial - updating the array element by its sequence number.

Massive updates are irrelevant if the server is in standalone mode. If the stored data is temporary, then it can even be a regular database operation mode.
If development is underway for use under replication conditions, then in testing it is necessary to use a cluster with replication enabled. And the tests themselves should contain examples with extreme values of the embedded data, size and number of files processed.

Rollback when changing masters

Rollback - is the abolition of part of the write operations on the replica, if they are not confirmed by the majority. This can happen after she, being a master, was turned off, and then she started working again.

Changing the wizard may be unintentional: in case of server restart, emergency shutdown of the service or network failure. Or it can be planned: when updating a service or OS, changing the size of opLog, moving, resync, testing a new server configuration or a database version.

As a result, rollback can lose some of the recorded data, this is a feature of MongoDB replication. For those documents (changes) that cannot be lost (for example, notes about payment for services) you should include w: majority . In this case, after a successful wait, the cancellation of such an operation cannot occur.

Default settings expect to write only to the leading server. We assume a typical execution time of such an operation in 0.01 s. If we expect w: majority , then the typical time increases to 0.06 s, and only if the majority of the replicas are in the same data center.

Fortunately, for most operations there is no need to expect w: majority . Some data can be obtained when re-calculating (for example, exporting a file), some of them are irrelevant in case of failure (for example, the last user work time).

How not to arrange rollback within server maintenance:

If you need to move the wizard from a specific server, then rs.stepDown () will do .
If you need to make the master a specific server, then it needs to raise the priority.

If there is a suspicion that the data was canceled through rollback, then you can search for them in the rollback directory on the servers.

Create index

If we do not want to block the database when building a new index, we build it in the background (option background: true). But in terms of replication, the construction of any index will be launched on each slave c blocking playback of the opLog - there will be a jump in the lag, and if the recording intensity increases, it is not far to the collapse. Also, if the index is built for a long time, the waiting w: majority can be interrupted.

Before building the index, measure its creation time on similar data to estimate the downtime on the battle cluster.

There is a way to build a large index without blocking the slave.

On each slave server, perform a sequence of actions:
1. We switch the server to a single mode.
2. Run the construction of the desired index and wait for the end.
3. We return the server to the replicas and wait for its synchronization.
Start building the desired index on the master server.

Checking the passage of resync

Resync is copying all the replica data and rebuilding the indexes. This may occur when a new database server is entered into a replica or, if you wish, you can defragment free space.

The size of the database of a popular, loaded service may increase over time. Accordingly, the time taken to copy data and build indexes, and the amount of memory for building indexes, is increasing.

It is easy to miss the moment when you will not be able to easily attach another replica to a running cluster. And this is an important procedure in the event of server loss, upgrade, or moving to another data center. If this happens, you can try to copy the service data directory or remove the server disk image.

I usually think that resync should take no more than $2/3$ opLog size.

By the way, there is a way to increase the size of opLog without data loss - Change oplog size

Backup

The process of copying data takes time, and to get their snapshot (snapshot), we use the db.fsyncLock () command on the hidden replica to stop recording to all collections. After that, using the mongodump utility, copy all the data into the backup storage.

In the process of removing the backup, the state of the data is increasingly lagging behind the master. Of course, you need to make sure that the time taken to remove the backup is no longer than the duration of the opLog.

There are many more ways to make a backup: a copy of the disk image, a copy of the database directory, a snapshot of the virtual machine.

Mongodump is the slowest way, but it has advantages:

Does not depend on the virtualization system.
It does not depend on the version of the database (restoring old backups may require the deployment of the previous version of the database).
You can restore selected collections or even some documents, and as practice has shown, fatal errors in code or during administration rarely affect more than one collection.

Reading from slave servers

A copy of the data, albeit a bit outdated, itself asks for it to be read. This helps unload the processor and disk on the primary server, but unfortunately, not all data can be read from the slave servers.

The average time lag of replicas in the local network is one and a half seconds. If monitoring for lagging is set, then you can do, for example:

Calculation of statistics for the last day.
Archiving irrelevant data.
Reading "long" recorded files from the GridFS.

But the only recording server is still waiting for our next step.

Sharding

Sharding is a standard way to scale MongoDB horizontally. In my opinion, it is rather difficult to determine at what point a transition from one replicates to a shard cluster can be justified.

The administrative overhead and the cost of additional servers or virtualwoks are not easy to submit to the project management. Instead of four servers, we must immediately go to eight, plus we need at least three servers for the configuration replicaset.

As an alternative to the shard cluster, for a long time it will be possible to gradually increase the number of CPU cores, the size of RAM and the size of disks, as well as switch to RAID or replace HDD with SSD.

Unfortunately, there is a limit to increasing the size of a single server. For example, we rested on the speed of reading from disks. At this time, our database was located in Azure. The servers used a 512 Gb SSD with a maximum iops 2300. Of course, a simpler step is to upgrade the subscription to 1Tb with a maximum iops 5000, but further database growth made it impossible to remove and restore the backup in a reasonable time.

The fact is that the time taken to remove the 450 Gb backup came close to 6 hours. Further growth of the database led to the fact that it would have to increase the size of the opLog. He was at that time in 32 Gb. In addition, the average opLog duration was 12 hours (yes, this is a modest data stream of only 750 Kb / s), and any unexpected load, such as importing a very large file, caused the server from which the backup was taken to lose synchronization , and he had to be taken to resync for 12 hours with an obscure result.

On the other hand, the introduction of the shard cluster reduced the backup time, so the size of the opLog could also be reduced. The project has ceased to exceed iops. Well, from the point of view of development and administration, migration to a shard cluster occurs once, and you can add new shards repeatedly and as needed.

Adding shards is a simple way to compensate for the speed limits of the disk subsystem, to shorten the backup time and restore the database. In the future, it will be possible to reduce the size of the opLog, the amount of RAM and CPU cores on each replica.
It would be very good to achieve uniform load sharing for shards. With the growth of the project, this makes it possible without additional development to keep the responsiveness of the database and administration at the required level. Unfortunately, the uniformity of the load will strongly depend on the data scheme.

When you turn on sharding, we have 2 main groups of problems.

Sharding collection is not possible if:
- the update of the fields included in the sharding key is required;
- There are several unique keys on the collection;
- under the result of the findAndModify query, the data on different shards falls.
Unbalanced load on shards, if:
- partially ordered identifiers are used;
- weak sharding key selectivity;
- queries without sharding key values.

Objectid

This is a globally unique identifier (similar to the GUID), and it is partially ordered. In practice, this means that if there is an intensive addition of records to the collection, they will fall mainly on one shard. Worse yet, fresh documents are usually the most requested - we also get a serious imbalance on the CPU and opLog.

There is a sharding method using a hash, but it applies only to a single field.
In this case, search queries and updates without specifying an identifier will load all shards at once.
Those. the benefit of sharding by hash will be only if you work with the document only by its identifier.

Gridfs

Data scheme for storing unmodifiable files. It seems that this is the first candidate for sharding, but, alas, not everything is so simple.

In GridFS, chunks (“pieces” of the downloaded file) have the following scheme:

{ "_id" : <ObjectId>, "files_id" : <TFileId>, "n" : <Int32>, "data" : <binary data> }

The MongoDB driver usually tries to create a unique index by {files_id: 1, n: 1}

There are 2 ways to separate chunks:

Only by file ID is an example of weak sharding key selectivity.
This ensures that the file will not be shared between shards, which means that it will be easier to restore it from the backup. If the file exceeds the maximum chunk size, it will become unmovable. The risk of unbalance in the size of the data on the shard increases.
By file identifier and chunk sequence number.
The maximum selectivity of the sharding key is up to one document (file chunk) per sharding chunk.

If the file ID is standard (TFileId is ObjectId), then we have a difficult choice:

Adding an index on a hash and sharding on a hash from files_id leads to the risk of getting unmovable files and unbalance in place.
Sharding by {files_id: 1, n: 1} threatens to unbalance the opLog and iops disk, and if there is an intensive addition of files, then it is better not to include sharding at all.

So it makes sense to first change the ID to GUID - this will save you from many problems in the future.

Key uniqueness

You cannot turn on sharding on a collection that has several indexes with uniqueness.
If, however, several fields are required for checking for uniqueness — for example, an email and a phone number associated with one user — then this case will have to be resettled into a separate collection.

Schemes of work may be different:

Sequential inserts.
Example:
collection of users (index with uniqueness by email)
```
 { _id: <ObjectId>, //  email: <string>, phone: <string> } 
```
phones collection (phone index with uniqueness)
```
 { _id: <ObjectId>, //  phone: <string> } 
```
First, the insertion into phones is done, and if successful, the insertion into users.
Delayed setting.
In the scheme from the previous case - the phone is always stored separately, and the user is allowed to add it after registration.

Select sharding key on an existing collection

When designing a collection data scheme, we may not know (ignore) how we will shard it. But there are no regular ways to change the sharding key, so it is especially important not to miss the key selection.

The selection algorithm may be as follows:

Once again we recall the limitations .
We add others to the existing indexes with all possible (and meaningful) combinations of fields.
We list all candidates for keys sharding. For each valid index, you can select a key by its any prefix. For example, for the index {a: 1, b: 1, c: 1} you can select the prefixes {a: 1} and {a: 1, b: 1} or the full key {a: 1, b: 1, c: 1} .
Important! If the field is in the index, but the shard key is built by a prefix without this field, then this field remains changeable.
For each candidate we evaluate:
- Selectivity, is there a limit on the growth of the minimum chunk.
- The ability to add specific search key values to each search query.
- The ability to exclude the update of the field that fell into the key of sharding.
- The share of requests that are not isolated on a single shard.
Select the key with the maximum share of isolated requests.

Sharding on the prefix index with uniqueness

An example of key selection, when I had to stop the choice on a less selective sharding key.

Entities in our Smartcat project:

Account (Account) - the owner of the documents.
Document (Document) - a document for translation (identifier globally unique).
Segment - proposal in the document.

Segment Fields:

accountId - account identifier.
idInAccount - unique identifier within the account.
documentId - document identifier.
order - the sequence number of the segment in the document.
other...

Indices:

{documentId: 1, idInAccount: 1}
{documentId: 1, order: 1}
{accountId: 1, idInAccount: 1}, {unique: true}

Most requests include a documentId. So, we will choose from the first two indices, and this excludes the accountId sharding. Then we will remove the uniqueness from the third index. Each document has a globally unique identifier and cannot belong to two accounts. Therefore, the first index can be made unique, and the second can be used to remove uniqueness.

In our code, idInAccount is known only in one search query, i.e. When choosing a full sharding key, document segments can be divided into different chunks, and search queries can refer to multiple shards.

Consider the version of sharding by {documentId: 1} - this is valid because it is a prefix with uniqueness. We estimate the maximum volume of segments of a single document - this is approximately 50 Mb. So, the document is perfectly included in the size of a single chard sharding (we have 64 Mb).

Result:

We build an index with uniqueness on {documentId: 1, idInAccount: 1}
We remove uniqueness with {accountId: 1, idInAccount: 1}
Enable sharding by {documentId: 1}

Deleting a database or collections

On the combat service, this usually does not happen. The database or collection is created, indexes are added, and sharding is turned on. And then - long operation.

With test stands more interesting. As a rule, removing the database is a regular procedure. With the advent of the shard cluster as part of the test bench, we began to detect a heterogeneous collection of collections depending on the router used (mongos). It turned out that this is a known problem .

In short, deleting the database or collection requires resetting the cache of each router and additional cleaning in the shard cluster configuration.

But for test stands, you can do easier:

Delete all collections in the database. Her lead shard does not change.
Run scripts to add indexes.
Run sharding enable scripts - this is a distributed transaction across all routers. It guarantees the cleaning of old data in all caches.

Summary

Many decisions when choosing a data scheme for sharding or replication may appear to be a re-complication during the initial development of a project or prototyping.

But if you go for simplification to the detriment of scaling, you should clearly reserve time for refactoring the data schema.

Checklist for planning a data schema and index collection:

Save replication traffic - minimize document update size.
Remember about rollback - business logic should support a break in write operations, as well as a sudden shutdown of the server.
We do not interfere with sharding - on large collections we do not have more than one index with uniqueness, but it is better not to include uniqueness at all if it is not critical for business logic.
We isolate search queries with a single shard - most search queries must include sharding key values.
Balancing shards - carefully select the type of identifier, the best of them is the GUID.

That's all for now!
Do not lose the cluster, fighters invisible backend!

And special thanks to the colleague nameless_one for editing and advice!

Source: https://habr.com/ru/post/335698/

All Articles