
Why we chose MongoDB

This article came into being after reading “Why you should never use MongoDB”. Below is the story of how we gradually abandoned MySQL and came to use MongoDB as our main data store.



It all started sometime in 2008, when it was decided to write a second version of our site. For some time we had wanted to create a multilingual database of games, game-industry companies, characters, and so on, since the existing solution seemed outdated.
As a first step, the requirements were formulated:

Requirements


These are mainly database requirements. The significant ones were:

Requirement 1. Multilingual fields


Each entry can have one or more multilingual fields containing both relatively short titles and long descriptions. We considered several options:

Option 1. Two tables for the entire database


    create table name (
        id int not null primary key,
        `table` varchar(32) not null,
        field varchar(32) not null,
        object_id int not null,
        name varchar(255) not null,
        lang varchar(2) not null
    );

    create table description (
        id int not null primary key,
        `table` varchar(32) not null,
        field varchar(32) not null,
        object_id int not null,
        description text,
        lang varchar(2) not null
    );

Accordingly, if a game (table game) has a multilingual name, an alternate name alt_name, and a description desc, then in addition to the game itself you get three more entries per language.

An example of entries in the name table:
     id | table | field    | object_id | name      | lang
    ----|-------|----------|-----------|-----------|-----
     1  | game  | name     | $game_id  | $name     | en
     2  | game  | alt_name | $game_id  | $alt_name | en


An example entry in the description table:
     id | table | field | object_id | description | lang
    ----|-------|-------|-----------|-------------|-----
     1  | game  | desc  | $game_id  | $desc       | en


These two tables could be combined into one by using the text type for storing names as well, but I did not like that solution; why, I no longer remember.

Option 2. Each main table gets its own multilingual table


For the same games table you get roughly the following:
    create table game_i18n (
        id int not null primary key,
        game_id int not null,
        name varchar(255) not null,
        alt_name varchar(255) not null,
        description text,
        lang varchar(2) not null
    );

Record example:
     id | game_id  | name     | alt_name     | desc     | lang
    ----|----------|----------|--------------|----------|-----
     1  | $game_id | $name    | $alt_name    | $desc    | en
     2  | $game_id | $name_ru | $alt_name_ru | $desc_ru | ru


Option 3. Store multilingual fields as a JSON array in a separate field of the main table


    create table game (
        id int not null primary key,
        ...,
        i18n text
    );

Record example:
     id | i18n
    ----|--------------------------------------------------------------------------------------------------------------
     1  | {'name': [{'lang': 'en', 'value': $name}, {'lang': 'ru', 'value': $name_ru}], 'alt_name': [...], 'desc': [...]}

The third option is the most flexible, but we abandoned it almost immediately: we needed sorting and filtering by name, so we would still have to build something like option 1 or 2 on top of it. Besides, if you need to, say, delete the English name in several games at once, that is hard to do with SQL alone (see the sketch below).
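For comparison, under option 1 the same bulk deletion is a single statement. A minimal sketch, assuming the name table above and placeholder game ids:

    -- Option 1: dropping the English names of several games at once
    -- is one DELETE (the ids are placeholders).
    DELETE FROM name
    WHERE `table` = 'game'
      AND field = 'name'
      AND lang = 'en'
      AND object_id IN (1, 2, 3);

Under option 3 the same operation means reading each row, editing the JSON in application code, and writing it back.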

As a result, we settled on option 2. The argument against the first option was that multilingual fields could acquire additional fields of their own. For example, in games you need to be able to mark one of the names as the primary one; other objects may have their own sets of additional fields, and it is quite possible you will need to filter or sort by them too. Choosing the first option, I did not want to end up with the following in a couple of years:
    create table name (
        id int not null primary key,
        `table` varchar(32) not null,
        field varchar(32) not null,
        object_id int not null,
        name varchar(255) not null,
        lang varchar(2) not null,
        field1 int,
        field2 varchar(32),
        ...,
        field9 datetime
    );

And then remember in the code what field3 means for the companies table. Besides, it felt uncomfortable, when creating the next table with five entries, to dump its translations into a table with a million entries. However, the latter problem appeared in all its glory here:

Requirement 2. Links


We wanted to be able to link any object from any table to any other object while preserving the direction of the link.
The options are roughly the same as for the first requirement:

Option 1. One table for all links of the base


    create table link (
        id int not null primary key,
        table1 varchar(32) not null,
        object1_id int not null,
        table2 varchar(32) not null,
        object2_id int not null
    );

Sample entries:
     id | table1 | object1_id | table2 | object2_id
    ----|--------|------------|--------|------------
     1  | game   | $game_id   | genre  | $genre_id
     2  | game   | $game_id   | game   | $game2_id
     3  | game   | $game2_id  | game   | $game_id

Entries #2 and #3 implement a bidirectional link; entry #1 is a unidirectional link from game to genre.
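Typical queries against such a table stay simple. A sketch, with 42 standing in for $game_id:

    -- All outgoing links of one game (direction: table1 -> table2).
    SELECT table2, object2_id
    FROM link
    WHERE table1 = 'game' AND object1_id = 42;

    -- Only the genres the game links to.
    SELECT object2_id
    FROM link
    WHERE table1 = 'game' AND object1_id = 42 AND table2 = 'genre';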

Option 2. Each link type gets its own table


For example, the link between similar games would come out like this:
    create table similar_games_link (
        id int not null primary key,
        game1_id int not null,
        game2_id int not null
    );

and so on for each link type.

Option 3. Store the links inside the object itself, in text form


    create table game (
        id int not null primary key,
        ...,
        links text
    );

Example:
     id | links
    ----|----------------------------------
     1  | #game:$game2_id#genre:$genre_id#

Then you can search for links like this:
    select id from game where links like '%#game:$game2_id#%';


Option 4. Same as option 3, but storing JSON


Example:
     id | links
    ----|-----------------------------------------------------------------------------------------
     1  | [{'table': 'game', 'object_id': $game2_id}, {'table': 'genre', 'object_id': $genre_id}]


Options 5 and 6. Each link type gets its own field


A variation of options 3 and 4, but the links are spread across different fields.
    create table game (
        id int not null primary key,
        ...,
        similar text,
        genres text
    );

As a result, we decided to store everything in one table (convenient), and in some cases additionally duplicate link information in fields of the object itself (options 3 through 6). I did not want to spawn a separate table for every link type, and the third option could help with that duplication. Of course, querying links like '%#game:$game2_id#%' is terrible, but I would have survived it. We abandoned this option because deleting records turned into a non-trivial task (see the sketch below). The fourth and sixth options on their own are generally useless.
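To illustrate the deletion problem of option 3: removing a deleted game from every links string means string surgery instead of a plain DELETE. A sketch, with 42 as a placeholder id:

    -- Strip the deleted game 42 out of every links string.
    UPDATE game
    SET links = REPLACE(links, '#game:42#', '#')
    WHERE links LIKE '%#game:42#%';
    -- Even this leaves edge cases (e.g. a string reduced to a lone '#'),
    -- which is exactly what made deletion non-trivial.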

Requirement 3. Objects with a different set of fields in the same table


For example, you want to store news, articles, and videos in one table because, first, they need to be shown on the site in a common feed in chronological order, and second, these record types have a lot in common (name, creation/modification date, text). But there are also differences between the posts; let's call them metadata. For example, reviews (one of the subtypes of articles) carry scores by which it would be nice to sort; a video has the resolution of the original file, its duration, whether gameplay is shown, and so on. How a post is displayed on the site also changes depending on its type.

The possible solutions for storing metadata are the same as above. One table is created whose fields are common to all record types. Then there are several options. You can keep all the metadata directly in the post text (or in a separate text field) using special tags, as Wikipedia does, and scatter it into linked auxiliary tables on save. You can create auxiliary tables for each post type up front and save the metadata there (we chose this option, especially since different post types got different editing forms in the admin panel anyway). Or you can store the metadata as JSON or in some other serialized form (with the same problems: serialized data is hard to modify with SQL, plus sorting/filtering suffers). A sketch of the chosen layout follows.
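A minimal sketch of the chosen variant, with hypothetical table and column names:

    -- Fields common to news, articles and videos live in one table.
    create table post (
        id int not null primary key,
        type varchar(16) not null,     -- 'news' | 'article' | 'video'
        name varchar(255) not null,
        created datetime not null,
        updated datetime not null,
        body text
    );

    -- One auxiliary table per subtype; sorting reviews by score
    -- stays an ordinary SQL query.
    create table review_meta (
        post_id int not null primary key,
        score decimal(3,1) not null
    );

    create table video_meta (
        post_id int not null primary key,
        width int not null,
        height int not null,
        duration int not null,         -- seconds
        has_gameplay tinyint(1) not null
    );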

Requirement 4. Complex objects


A game can be released on different platforms and can have different editions on each platform. A platform release and an edition share a set of fields with the game itself. An example of such a field is the name: the name of an edition for a specific platform may differ from the canonical name of the game. A platform release and an edition also have fields the game itself lacks: for the platform release it is the platform itself; an edition has its own release date. How do you store all this? As three separate tables? By analogy with how objects with different field sets were handled in requirement 3? Or store the game as a single record, with all platforms and editions as JSON in a separate field of that record? And how do you edit such joy - three different forms? Pac-Man alone has 27 platforms and more than 30 editions; editing such a monster can turn into torture. And how do you display it? To show an edition you have to load its platform release and the game, because the edition may have no name of its own: then you look at the game's name on that platform, and if that is missing too, at the name of the game itself. Copying the same title into every edition is not great either (a fallback query is sketched below).
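The name fallback just described (edition name, else the game's name on that platform, else the canonical name) maps naturally onto COALESCE. A sketch with hypothetical table names:

    -- Display name of an edition: its own name, otherwise the platform
    -- release's name, otherwise the canonical game name.
    SELECT COALESCE(e.name, p.name, g.name) AS display_name
    FROM edition e
    JOIN platform_release p ON p.id = e.release_id
    JOIN game g ON g.id = p.game_id
    WHERE e.id = 42;   -- placeholder edition id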

Tentatively, I settled on almost the same approach as in requirement 3 - one table for games. But since there were only three record types, it was decided to store the differing fields in the same table and not spawn metadata tables.

MySQL


Having settled on the requirements and preliminary solutions for them, we began developing the admin panel. And then (as always) the problems started. For example, a company may have a name and a description. We create a single company_i18n table with name and description fields. So far, so good. In the edit form:

HTML form (schematic; the Russian values are shown as placeholders):

    Names:
        en: Bad Pixel   [x]
        ru: $name_ru    [x]
    Descriptions:
        ru: $desc_ru    [x]

Its own set of names in different languages and its own set of descriptions are specified separately, but that is not a problem in itself: names and descriptions are saved per language, and one record per language is created in company_i18n.

Entries in company_i18n after saving:
     id | lang | name      | description
    ----|------|-----------|-------------
     1  | en   | Bad Pixel | NULL
     2  | ru   | $name_ru  | $desc_ru

Then it turned out that there can be several descriptions in one language, while the name must be strictly one per language, and we arrive at something like:
     id | lang | name      | description
    ----|------|-----------|-------------
     1  | en   | Bad Pixel | NULL
     2  | ru   | $name_ru  | $desc_ru_1
     3  | ru   | NULL      | $desc_ru_2

This already does not look great, especially if you need to delete $desc_ru_2 using SQL: you have to check whether the name field is filled, and if it is, update the record, and if not, delete it (sketched below). Then a flag for the company's primary name appears, then a field for legal names, of which there may be several in the same language (for different periods of time), and the understanding comes that names and descriptions will apparently have to be stored in different tables after all.
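What deleting a single description turns into, sketched against the company_i18n table above (row id 3 holds $desc_ru_2):

    -- If the row also carries a name, only clear the description...
    UPDATE company_i18n SET description = NULL
    WHERE id = 3 AND name IS NOT NULL;

    -- ...and only delete the row if it carries nothing else.
    DELETE FROM company_i18n
    WHERE id = 3 AND name IS NULL;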

For links between objects, a link weight appeared almost immediately. That was not a problem; the problems began when different link types acquired different sets of additional fields. For example, genres can be linked to other genres as their subgenres, and that link must be bidirectional; among tags, some tags are characters from another, "universe" tag; and in games, the same company can be both developer and publisher. Although this can be solved by adding ever more fields to the link table (sketched below), it would be more correct to create separate auxiliary tables for individual link types.
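The "just add fields" route looks roughly like this (illustrative column names; each new link type drags in columns that stay NULL for all other types):

    ALTER TABLE link
        ADD COLUMN degree int,             -- link weight
        ADD COLUMN is_subgenre tinyint(1), -- meaningful only for genre-genre links
        ADD COLUMN role varchar(32);       -- 'developer' / 'publisher' for company-game links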

Of course, I understand that life is pain and a developer, in particular, must not forget to suffer, so development continued slowly but surely, and auxiliary tables kept appearing with enviable regularity. Nevertheless, I wanted to somehow automate the creation of such auxiliary tables and the assembly of a complete object from them. With such thoughts, in early 2010, I came across the article “How FriendFeed uses MySQL to store schema-less data”.

MySQL and schema-less data


The idea of building NoSQL on top of MySQL does not look so crazy even now. At that time MySQL had been in development for years and was a reliable production solution, while specialized NoSQL solutions were only beginning to appear, and I did not want to bet on one of them only to find myself, a couple of years later, alone with an unmaintained product. I think those who, like me, once bet on prototype.js will understand.

At that time we did not even consider MongoDB, for various reasons; in particular, it was not yet recommended for production (the first production-ready release came at the end of the first quarter of 2010). I still advocate this approach: use relatively well-established solutions and keep hand-written analogues to a minimum. But back then there were no established solutions (or so it seemed), and for one of my side projects I wrote something similar to what FriendFeed had. Understand me correctly: I am not proud of it. The idea of building your own thing is tempting exactly until you have to maintain it, fix bugs, optimize, grow its functionality, and adapt it to new versions of the language, libraries, and services. The only thing I regret is not downloading and poking at Mongo back then - a matter of half an hour, an hour at most, that would have paid off for years. Actually, this applies to all new technologies: they do not appear without a reason, and knowing current trends, at the very least, broadens your horizons.

So, a library was written for working with schema-less data, storing all of it in MySQL.

A brief description of what came out of it.
Objects were stored in serialized form in a blob field of the entity table. An additional category field was introduced (an analogue of tables in MySQL and collections in MongoDB) so that objects such as games, companies, etc. could be kept apart (we almost immediately rejected the option of storing, say, all messages inside the topic object: messages get their own category, topics their own), plus two fields for the object's creation and update times. Since everything was written in Perl, the Storable library was used to serialize the data structures (usually a mix of hashes and arrays) from the internal representation into binary form and back. This was done, first, for speed: converting the data to JSON was an order of magnitude slower. Second, the representation was more compact than JSON.

UUIDs were used as ids: their 16 bytes were base64-encoded into a 22-byte text string.
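In SQL terms, the storage looked roughly like this (a sketch; the body, created, and updated column names are made up for illustration):

    -- Sketch of the entity table: the whole object lives in a serialized
    -- blob; only the id, category and timestamps are real columns.
    CREATE TABLE IF NOT EXISTS `entity` (
        `id` binary(22) NOT NULL,         -- base64-encoded 16-byte uuid
        `category` varchar(32) NOT NULL,  -- 'game', 'company', ...
        `body` mediumblob NOT NULL,       -- Storable (later gzipped json) dump
        `created` datetime NOT NULL,
        `updated` datetime NOT NULL,
        PRIMARY KEY (`id`),
        KEY `category` (`category`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;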

Any change to an object could have “triggers” attached - functions that modified both the object itself and other objects related to the current one. For example, when a comment was posted, a trigger could update the total comment count in the topic object.

Since the entity table itself could not be queried (except by primary key and category), indexes were introduced - regular MySQL tables created from the fields of the objects being saved.
    CREATE TABLE IF NOT EXISTS `index_platform` (
        `generation` int(10) unsigned NOT NULL,
        `path_id` varchar(255) NOT NULL,
        `entity_id` binary(22) NOT NULL,
        KEY `generation` (`generation`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

The index was described in the config:

    $db->index(
        'table'      => 'index_platform',
        'properties' => ['generation', 'path_id'],
        'shard_on'   => 'entity_id',
        'categories' => ['platform'],
    );

Then when saving the object:
    my $uuid = $db->put({
        'generation' => 0,
        'path_id'    => 'pc',
        'name'       => [ {'lang' => 'en', 'value' => 'PC'} ],
    });

In index_platform, an entry was created automatically:
     generation | path_id | entity_id
    ------------|---------|-----------
     0          | pc      | $uuid


These indexes could then be queried:
    $db->resultset('index_platform')->search({
        path_id    => {'in' => ['pc', 'xbox']},
        generation => {'in' => [0, 1]},
    }, {
        order_by => 'generation desc',
        limit    => 10,
        page     => 2,
        join     => 1,  # also fetch the objects themselves from the entity table, not just the index rows
    })->all();


An alternative form of the same query:
    $db->resultset('index_platform')->search({
        path_id    => {'in' => ['pc', 'xbox']},
        generation => {'in' => [0, 1]},
    })->order_by('generation desc')->limit(10)->page(2)->join(1)->all();


There were only two kinds of interaction with the database: changes to the objects themselves in the entity table (including deletion), and queries against the indexes as in the example above. JOINs were done purely in software.
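In SQL terms, the software JOIN behind the example above boils down to two queries (a sketch, using the entity columns from the sketch earlier):

    -- Step 1: the index query yields entity ids.
    SELECT entity_id
    FROM index_platform
    WHERE path_id IN ('pc', 'xbox') AND generation IN (0, 1)
    ORDER BY generation DESC
    LIMIT 10 OFFSET 10;    -- page 2, 10 per page

    -- Step 2: fetch the serialized objects for those ids and
    -- deserialize them in application code.
    SELECT id, body
    FROM entity
    WHERE id IN (/* entity ids from step 1 */);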

When deleting an object, it was only marked as deleted, without being physically deleted from the entity table.


Around the middle of 2010, we tried to switch to this method of data storage.

Objects could now be saved in this form:
    $company = {
        'name' => [
            {'lang' => 'ru', 'value' => $name_ru, 'is_primary' => 1},
            {'lang' => 'en', 'value' => 'Bad Pixel'},
        ],
        'description' => [
            {'lang' => 'ru', 'value' => $desc_ru},
        ],
        'link' => [
            {'category' => 'tags', 'id' => $tag_uuid, 'degree' => 3},
            {'category' => 'game', 'id' => $game_uuid, 'role' => 'developer'},
        ],
    };

For the name and link fields, records were automatically created in the index_name and index_link tables. Objects could be of any complexity and nesting, with different sets of fields for objects of the same category. Index tables still had to be created as before, but everything became much simpler. If a field was missing, it was enough to change the code; if that field then had to be queried, an additional index table was created or an existing index was changed. If an index no longer fit, you could simply delete it and build a new one. Eventually I wanted such index tables to be created automatically from a structure described in code.

Such a storage, along with its advantages, had significant drawbacks.


During operation, various bugs of the storage itself surfaced; the most unpleasant was that the Storable serialization format depended on the operating system. This was solved by switching to storing objects as gzip-compressed JSON. It was precisely while fixing this bug that I clearly realized that it does not matter much how the objects themselves are stored. It can be a separate table in the database, or you can simply dump them as JSON files named by primary key and scattered across subfolders (though this deprives the application of scalability, invites race conditions, etc.; on the other hand, one could try Hadoop, but frankly that would be overkill).

The main thing is to be able to build indexes over the objects, as is done, for example, in the Sphinx search engine. Why not use MySQL in much the same way as Sphinx? Why not keep the data as key-value storage and, for searching, sorting, and rendering various lists to the user, build suitable indexes in suitable services? Of course, if you are writing billing, this approach is, to put it mildly, hard to justify, but most web applications are far less demanding about ACID, yet with the classic approach you suffer almost as much as if you were writing billing.

However, the drawbacks of the home-grown storage gradually began to outweigh the benefits; besides, a separate form still had to be drawn in the admin panel for every object - in short, there was not the universality I wanted. On top of that, in 2012 a number of rather political events happened at AG, such as a change of the site's owner and management, and it was decided to write the next version in Python, since the programmers at the new company wrote in that language. There were two options: either turn the current storage into a standalone service, or use an existing key-value store.

MongoDB


How would you feel if it turned out that someone had created an analogue of your library (or, vice versa, you had unknowingly created an analogue of an existing one), and this analogue kept the essential advantages of your solution while lacking its drawbacks? Personally, I was glad. A gorgeous console with full JavaScript support, atomic operations, sharding, automatic index creation for selected fields, libraries for the major programming languages... By then there were already Python frameworks that supported MongoDB or were written specifically for it. And none of it needed to be maintained or developed by us. On top of that, its API was similar to our storage's API.

As a result, starting development from scratch (already as Riot Pixels) in Python in 2013, having chosen the tools well (one of which was MongoDB), we did more in a quarter than in the previous two years. Another choice that seems right to me was the admin panel, which allowed editing objects of any nesting - thanks to it, almost no time was wasted on its development.

I would like to finish with this. It is probably wrong to pit MongoDB against MySQL: they suit different tasks to different degrees, and it simply happened that MongoDB suited this project better. And if it ever turns out that MongoDB's speed or functionality is not enough, nothing prevents using MySQL as a caching layer / index over the data: set it up once and forget it.

Source: https://habr.com/ru/post/232539/

