
MongoDB: Too many fields to index? Use a common index

The essence of the problem


There are situations when documents have many different fields and you need efficient queries across them. For example, consider a document describing a person:

{ _id: 123, firstName: "John", lastName: "Smith", age: 25, height: 6.0, dob: Date, eyes: "blue", sign: "Capricorn", ... } 


With such documents you can select people by eye color, by a certain height, by name, and by other characteristics. But what if a document consists of dozens of fields? What if the fields are not known in advance, or each document has its own set of fields? How do you keep such queries fast with indexes, without building one for every field, which would be far too expensive?
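To make the cost concrete, here is a hedged sketch of the naive per-field approach (the people collection and these particular indexes are this example's assumption, not part of the original setup):

 // One index per queryable field: storage and write overhead grow with
 // every field, and the approach breaks down entirely when each document
 // has its own set of fields.
 > db.people.ensureIndex({ firstName: 1 })
 > db.people.ensureIndex({ lastName: 1 })
 > db.people.ensureIndex({ eyes: 1 })
 > db.people.ensureIndex({ height: 1 })
 // ...and so on, one more index for every field you want to query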

Solution #1: A compound index on field names and values


We design the document schema to store the fields as a list of name/value objects:

 { _id: 123, props: [ { n: "firstName", v: "John"}, { n: "lastName", v: "Smith"}, { n: "age", v: 25}, ... ] } 
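As a convenience, a flat document can be converted into this shape with a small helper; the function and its name are illustrative assumptions, not part of the original article:

 // Hypothetical helper: turns { _id: 123, firstName: "John", age: 25 }
 // into { _id: 123, props: [ { n: "firstName", v: "John" }, { n: "age", v: 25 } ] }
 function toProps(doc) {
     var props = [];
     for (var key in doc) {
         if (doc.hasOwnProperty(key) && key !== "_id") {
             props.push({ n: key, v: doc[key] });
         }
     }
     return { _id: doc._id, props: props };
 }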


To solve the problem, we create a compound index on the name and the value of the objects inside the list. For clarity, let's create 5 million documents with dummy properties prop0 through prop9, each holding a random value from 0 to 999.

 > for (var i = 0; i < 5000000; ++i) {
       var arr = [];
       for (var j = 0; j < 10; ++j) {
           arr.push({ n: "prop" + j, v: Math.floor(Math.random() * 1000) });
       }
       db.generic.insert({ props: arr });
   }
 > db.generic.findOne()
 {
     "_id": ObjectId("515dd3b4f0bd676b816aa9b0"),
     "props": [
         { "n": "prop0", "v": 40 },
         { "n": "prop1", "v": 198 },
         ...
         { "n": "prop9", "v": 652 }
     ]
 }
 > db.generic.ensureIndex({ "props.n": 1, "props.v": 1 })
 > db.generic.stats()
 {
     "ns": "test.generic",
     "count": 5020473,
     "size": 1847534064,
     "avgObjSize": 368,
     "storageSize": 2600636416,
     "numExtents": 19,
     "nindexes": 2,
     "lastExtentSize": 680280064,
     "paddingFactor": 1,
     "systemFlags": 1,
     "userFlags": 0,
     "totalIndexSize": 1785352240,
     "indexSizes": {
         "_id_": 162898624,
         "props.n_1_props.v_1": 1622453616
     },
     "ok": 1
 }


In this case the index is ~1.6 GB, because it stores both the property name and its value. Now let's try to find documents in which prop1 equals 0:

 > db.generic.findOne({ "props.n": "prop1", "props.v": 0 })
 {
     "_id": ObjectId("515dd4298bff7c34610f6ae8"),
     "props": [
         { "n": "prop0", "v": 788 },
         { "n": "prop1", "v": 0 },
         ...
         { "n": "prop9", "v": 788 }
     ]
 }
 > db.generic.find({ "props.n": "prop1", "props.v": 0 }).explain()
 {
     "cursor": "BtreeCursor props.n_1_props.v_1",
     "isMultiKey": true,
     "n": 49822,
     "nscannedObjects": 5020473,
     "nscanned": 5020473,
     "nscannedObjectsAllPlans": 5020473,
     "nscannedAllPlans": 5020473,
     "scanAndOrder": false,
     "indexOnly": false,
     "nYields": 0,
     "nChunkSkips": 0,
     "millis": 252028,
     "indexBounds": {
         "props.n": [ [ "prop1", "prop1" ] ],
         "props.v": [ [ { "$minElement": 1 }, { "$maxElement": 1 } ] ]
     },
     "server": "agmac.local:27017"
 }


This query did not produce the expected result: ~50,000 documents were found in 252 seconds. The reason is that the conditions n = "prop1" and v = 0 do not have to hold for the same embedded document: the result includes any document in which one array element satisfies n = "prop1" and some (possibly different) element satisfies v = 0, which is not at all what was expected.
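To make the mismatch concrete, here is a hedged illustration (the values are invented for this example):

 // This document matches the query { "props.n": "prop1", "props.v": 0 }
 // even though no single array element has both n: "prop1" and v: 0 --
 // "prop1" is matched by one element, the value 0 by another.
 { props: [ { n: "prop0", v: 0 }, { n: "prop1", v: 788 }, ... ] }

You can refine the query with $elemMatch, which requires a single array element to satisfy all of the conditions at once: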

 > db.generic.findOne({"props": { $elemMatch: {n: "prop1", v: 0} }}) 


Now let's check how the index is used and how long the query is executed in MongoDB v2.2:

 > db.generic.find({ "props": { $elemMatch: { n: "prop1", v: 0 } } }).explain()
 {
     "cursor": "BtreeCursor props.n_1_props.v_1",
     "isMultiKey": true,
     "n": 5024,
     "nscannedObjects": 5020473,
     "nscanned": 5020473,
     "nscannedObjectsAllPlans": 5020473,
     "nscannedAllPlans": 5020473,
     "scanAndOrder": false,
     "indexOnly": false,
     "nYields": 0,
     "nChunkSkips": 0,
     "millis": 278784,
     "indexBounds": {
         "props.n": [ [ "prop1", "prop1" ] ],
         "props.v": [ [ { "$minElement": 1 }, { "$maxElement": 1 } ] ]
     },
     "server": "agmac.local:27017"
 }


The query now returns the correct result, 5024 documents, but it is still slow! The explain output shows why: a full range scan, from $minElement to $maxElement, is still used for the v field. To understand why, look at the case without $elemMatch: any combination of array elements satisfying the conditions separately can match, so the index cannot maintain tight bounds for every combination without an explosion in the number of index entries. MongoDB therefore builds the B-Tree from the values of the individual embedded documents and ignores combinations (exactly the behavior that $elemMatch is meant to constrain). But why is the $elemMatch query itself so slow? That was a bug, fixed in MongoDB v2.4 as SERVER-3104. Let's run the same query on the fixed version:

 > db.generic.find({ "props": { $elemMatch: { n: "prop1", v: 0 } } }).explain()
 {
     "cursor": "BtreeCursor props.n_1_props.v_1",
     "isMultiKey": true,
     "n": 5024,
     "nscannedObjects": 5024,
     "nscanned": 5024,
     "nscannedObjectsAllPlans": 5024,
     "nscannedAllPlans": 5024,
     "scanAndOrder": false,
     "indexOnly": false,
     "nYields": 0,
     "nChunkSkips": 0,
     "millis": 21,
     "indexBounds": {
         "props.n": [ [ "prop1", "prop1" ] ],
         "props.v": [ [ 0, 0 ] ]
     },
     "server": "agmac.local:27017"
 }


The query executed in 21 milliseconds, with tight bounds on both props.n and props.v!

Solution #2: One common index


Another way to solve the problem is to store each field in the list as its own { property: value } object. This solution works in both MongoDB v2.2 and v2.4. Create documents of the form:

 > for (var i = 0; i < 5000000; ++i) {
       var arr = [];
       for (var j = 0; j < 10; ++j) {
           var doc = {};
           doc["prop" + j] = Math.floor(Math.random() * 1000);
           arr.push(doc);
       }
       db.generic2.insert({ props: arr });
   }
 > db.generic2.findOne()
 {
     "_id": ObjectId("515e5e6a71b0722678929760"),
     "props": [
         { "prop0": 881 },
         { "prop1": 47 },
         ...
         { "prop9": 717 }
     ]
 }


Build the index:

 > db.generic2.ensureIndex({ props: 1 })
 > db.generic2.stats()
 {
     "ns": "test.generic2",
     "count": 5000000,
     "size": 1360000032,
     "avgObjSize": 272.0000064,
     "storageSize": 1499676672,
     "numExtents": 19,
     "nindexes": 2,
     "lastExtentSize": 393670656,
     "paddingFactor": 1,
     "systemFlags": 1,
     "userFlags": 0,
     "totalIndexSize": 2384023488,
     "indexSizes": {
         "_id_": 162269072,
         "props_1": 2221754416
     },
     "ok": 1
 }


The index turned out to be ~2.2 GB, about 40% larger than in solution #1, because the index stores the entire BSON of each embedded document as an opaque blob. Now run the query:

 > db.generic2.find({ "props": { "prop1": 0 } }).explain()
 {
     "cursor": "BtreeCursor props_1",
     "isMultiKey": true,
     "n": 4958,
     "nscannedObjects": 4958,
     "nscanned": 4958,
     "nscannedObjectsAllPlans": 4958,
     "nscannedAllPlans": 4958,
     "scanAndOrder": false,
     "indexOnly": false,
     "nYields": 0,
     "nChunkSkips": 0,
     "millis": 15,
     "indexBounds": {
         "props": [ [ { "prop1": 0 }, { "prop1": 0 } ] ]
     },
     "server": "agmac.local:27017"
 }


The query completed in 15 ms, faster than the first solution! But there is a catch: the query must describe the entire embedded document. To select documents where prop1 lies between 0 and 9, you need to run a range query over whole subdocuments:

 > db.generic2.find({ "props": { $gte: { "prop1": 0 }, $lte: { "prop1": 9 } } })


This is somewhat inconvenient, and if an embedded document contains other fields, they must also be included in the query, since the embedded documents are indexed and compared as whole blobs.
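A hedged illustration of this pitfall (the extra source field and the document are invented for the example):

 // Suppose one element carries an extra field:
 //   { props: [ { prop1: 0, source: "import" } ] }
 // An exact-subdocument query on { prop1: 0 } alone will NOT match it,
 // because the whole embedded document is compared:
 > db.generic2.find({ "props": { "prop1": 0 } })                      // misses it
 > db.generic2.find({ "props": { "prop1": 0, "source": "import" } })  // matches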
There is also another limitation: it is impossible to index the field values on their own, whereas in solution #1 you can build an index on props.v to find, for example, all documents containing the value 10 anywhere. Solution #2 does not allow this.
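For comparison, a minimal sketch of that value-only index on solution #1's schema (an additional index alongside the compound one built earlier):

 > db.generic.ensureIndex({ "props.v": 1 })
 > db.generic.find({ "props.v": 10 })   // all documents where any property equals 10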

Conclusion


As you can see, MongoDB v2.4 offers a simple and effective way to build a common index over documents with a large (or unknown) number of fields, which you can use in your “Big Data” projects.

Source: https://habr.com/ru/post/177761/

