
MongoDB: $or vs $in - which is faster?

Below the cut is a very small comparison of MongoDB performance when using the $or and $in logical operators in queries. I hope this note saves someone some working time.

The tests were run on MongoDB 2.4.9.
Suppose there is a collection of documents in MongoDB. To keep the point simple, let each document have exactly two fields.
$m = new MongoClient('mongodb://mongodb01,mongodb02,mongodba/?replicaSet=pkrs');
$mdb = $m->selectDB('test');
$collection = $mdb->selectCollection('test');
$collection->drop();
$collection->ensureIndex(array('i' => 1, 'j' => 1));
for ($i = 0; $i < 100; ++$i) {
    for ($j = 0; $j < 100; ++$j) {
        $collection->insert(array('i' => $i, 'j' => $j));
    }
}

The collection holds only 10K documents. Yes, batchInsert could have been used here, but I don't want to complicate the main point of the note.
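For reference, a minimal sketch of the same fill done with the legacy driver's MongoCollection::batchInsert() instead of per-document insert() calls; the connection string is the one from the example above, and 10K small documents fit comfortably into a single batch:

<?php
// Sketch: fill the test collection in one batchInsert() call instead of
// 10K individual insert() calls (legacy PHP driver, MongoCollection::batchInsert()).
$m = new MongoClient('mongodb://mongodb01,mongodb02,mongodba/?replicaSet=pkrs');
$collection = $m->selectDB('test')->selectCollection('test');

$docs = array();
for ($i = 0; $i < 100; ++$i) {
    for ($j = 0; $j < 100; ++$j) {
        $docs[] = array('i' => $i, 'j' => $j);
    }
}
// Sends the documents as a single batch rather than 10K separate inserts.
$collection->batchInsert($docs);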

We need to regularly (several times per second) fetch up to 1000 documents. The selection criterion is a set of unrelated pairs of i and j.
Since I started working with MongoDB less than a month ago, the first thing that came to mind was a query like this:
$orArray = array();
for ($i = 0; $i < 10; ++$i) {
    for ($j = 0; $j < 100; ++$j) {
        $orArray[] = array('i' => $i, 'j' => $j);
    }
}
$query = array('$or' => $orArray);

The fact that the data here goes in order is only for the sake of the example, so as not to get bogged down in business logic. In reality, as noted above, the pairs of i and j are not related to each other in any way and come in arbitrary order.
When I tried to run this query, my eyes widened in unpleasant surprise: the query took more than 2 seconds! As you can see in the code above, the index had been created.
That was completely unacceptable.
I decided to make sure that it was the query itself that was slow, and not the network.
For the test, let's run the following query:
 $query = array('i' => array('$lt' => 10), 'j' => array('$lt' => 100)); 

The result set for this data is the same, but the query now runs in 0.01 seconds.
It became clear that a workaround was needed, and one was found. Given the logic of the query, the obvious idea was to use $in instead of $or. However, I could not find a way to apply $in directly to pairs of values; if such a method exists, I would be very grateful for a hint.
Since I don't know how to apply $in to two fields at once, let's introduce an artificial field (gluing the values of i and j together with an underscore "_"):
$collection->ensureIndex(array('ij' => 1));
for ($i = 0; $i < 100; ++$i) {
    for ($j = 0; $j < 100; ++$j) {
        $collection->insert(array('i' => $i, 'j' => $j, 'ij' => $i.'_'.$j));
    }
}

Our query then becomes the following:
$inArray = array();
for ($i = 0; $i < 10; ++$i) {
    for ($j = 0; $j < 100; ++$j) {
        $inArray[] = $i.'_'.$j;
    }
}
$query = array('ij' => array('$in' => $inArray));

And lo and behold: we get our data in just 0.01 seconds (remember, it all started at "more than 2 seconds").
After a bit of googling, I found the following explanation for this behavior: with the $or construct, MongoDB allegedly runs several separate queries and then merges ("glues") the results together. I'm not sure this explanation is correct, but I haven't found another one yet.

P.S. Conclusion: do not abuse $or.
P.P.S. In the code below you can see how I measured the time. In case anyone doesn't know: calling find() does not execute the query! It only creates a MongoCursor object; the query itself is sent only when the first document is requested. That is why the timestamps are taken on the first iteration of the loop that fetches the documents.
P.P.P.S. If anyone wants to run the tests themselves, here is the full source code:
<?php
$m = new MongoClient('mongodb://mongodb01,mongodb02,mongodba/?replicaSet=pkrs');
$mdb = $m->selectDB('test');
$collection = $mdb->selectCollection('test');

// Fill the collection with 10K documents and a composite index on (i, j)
$collection->drop();
$collection->ensureIndex(array('i' => 1, 'j' => 1));
for ($i = 0; $i < 100; ++$i) {
    for ($j = 0; $j < 100; ++$j) {
        $collection->insert(array('i' => $i, 'j' => $j));
    }
}

// Test 1: $or over 1000 (i, j) pairs
$orArray = array();
for ($i = 0; $i < 10; ++$i) {
    for ($j = 0; $j < 100; ++$j) {
        $orArray[] = array('i' => $i, 'j' => $j);
    }
}
$query = array('$or' => $orArray);
testQuery('OR Query', $collection, $query);

// Test 2: plain range query returning the same documents
$query = array('i' => array('$lt' => 10), 'j' => array('$lt' => 100));
testQuery('Range Query', $collection, $query);

// Refill the collection with the artificial "i_j" field and an index on it
$collection->drop();
$collection->ensureIndex(array('ij' => 1));
for ($i = 0; $i < 100; ++$i) {
    for ($j = 0; $j < 100; ++$j) {
        $collection->insert(array('i' => $i, 'j' => $j, 'ij' => $i.'_'.$j));
    }
}

// Test 3: $in over the glued "i_j" values
$inArray = array();
for ($i = 0; $i < 10; ++$i) {
    for ($j = 0; $j < 100; ++$j) {
        $inArray[] = $i.'_'.$j;
    }
}
$query = array('ij' => array('$in' => $inArray));
testQuery('IN Query', $collection, $query);

// The query is sent only when the first document is fetched, so the time to
// the first document and the time for the rest are measured separately.
function testQuery($testName, $collection, $query)
{
    $cursor = $collection->find($query);
    $cursor->batchSize(1000);
    $start = microtime(true);
    $first = true;
    foreach ($cursor as $doc) {
        if ($first) {
            $time1 = microtime(true);
            $first = false;
        }
    }
    $time2 = microtime(true);
    $resultFirst = $time1 - $start;
    $resultOther = $time2 - $time1;
    echo "{$testName} - First: {$resultFirst} Other: {$resultOther}<br />\n";
}


UPD 1: dim_s advised creating two separate indexes instead of the composite one (and the test above clearly uses the composite index). After doing so, query processing sped up by roughly a factor of 10 (down to 0.2 seconds), but it still loses to the $in variant.
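For clarity, a minimal sketch of that index setup, assuming the same legacy driver API (MongoCollection::deleteIndex() and ensureIndex()):

// Drop the composite index and create two separate single-field indexes instead
$collection->deleteIndex(array('i' => 1, 'j' => 1));
$collection->ensureIndex(array('i' => 1));
$collection->ensureIndex(array('j' => 1));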
Here is what explain outputs:
Explain ($or with the composite index)
/* 0 */
{
    "clauses" : [
        {
            "cursor" : "BtreeCursor i_1_j_1",
            "isMultiKey" : false,
            "n" : 1,
            "nscannedObjects" : 1,
            "nscanned" : 1,
            "nscannedObjectsAllPlans" : 1,
            "nscannedAllPlans" : 1,
            "scanAndOrder" : false,
            "indexOnly" : false,
            "nYields" : 0,
            "nChunkSkips" : 0,
            "millis" : 0,
            "indexBounds" : { "i" : [ [ 1, 1 ] ], "j" : [ [ 1, 1 ] ] }
        },
        {
            "cursor" : "BtreeCursor i_1_j_1",
            "isMultiKey" : false,
            "n" : 1,
            "nscannedObjects" : 1,
            "nscanned" : 1,
            "nscannedObjectsAllPlans" : 1,
            "nscannedAllPlans" : 1,
            "scanAndOrder" : false,
            "indexOnly" : false,
            "nYields" : 0,
            "nChunkSkips" : 0,
            "millis" : 0,
            "indexBounds" : { "i" : [ [ 2, 2 ] ], "j" : [ [ 2, 2 ] ] }
        }
    ],
    "n" : 2,
    "nscannedObjects" : 2,
    "nscanned" : 2,
    "nscannedObjectsAllPlans" : 2,
    "nscannedAllPlans" : 2,
    "millis" : 0,
    "server" : "mongodb01:27017"
}


Explain ($or with two separate indexes on i and j)
/* 0 */
{
    "clauses" : [
        {
            "cursor" : "BtreeCursor i_1",
            "isMultiKey" : false,
            "n" : 1,
            "nscannedObjects" : 100,
            "nscanned" : 100,
            "nscannedObjectsAllPlans" : 300,
            "nscannedAllPlans" : 300,
            "scanAndOrder" : false,
            "indexOnly" : false,
            "nYields" : 0,
            "nChunkSkips" : 0,
            "millis" : 1,
            "indexBounds" : { "i" : [ [ 1, 1 ] ] }
        },
        {
            "cursor" : "BtreeCursor i_1",
            "isMultiKey" : false,
            "n" : 1,
            "nscannedObjects" : 100,
            "nscanned" : 100,
            "nscannedObjectsAllPlans" : 300,
            "nscannedAllPlans" : 300,
            "scanAndOrder" : false,
            "indexOnly" : false,
            "nYields" : 0,
            "nChunkSkips" : 0,
            "millis" : 1,
            "indexBounds" : { "i" : [ [ 2, 2 ] ] }
        }
    ],
    "n" : 2,
    "nscannedObjects" : 200,
    "nscanned" : 200,
    "nscannedObjectsAllPlans" : 600,
    "nscannedAllPlans" : 600,
    "millis" : 2,
    "server" : "mongodb01:27017"
}


Explain ($in with the artificially introduced index)
/* 0 */
{
    "cursor" : "BtreeCursor ij_1 multi",
    "isMultiKey" : false,
    "n" : 2,
    "nscannedObjects" : 2,
    "nscanned" : 3,
    "nscannedObjectsAllPlans" : 2,
    "nscannedAllPlans" : 3,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : { "ij" : [ [ "1_1", "1_1" ], [ "2_2", "2_2" ] ] },
    "server" : "mongodb01:27017"
}
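For completeness, a minimal sketch of how such explain output can be obtained from PHP (assuming the legacy driver's MongoCursor::explain(); the dumps above could just as well have been taken in the mongo shell):

// Sketch: dump the query plan for a given $query from PHP
$explain = $collection->find($query)->explain();
echo json_encode($explain), "\n";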


UPD 2: I ran the test on the latest version of MongoDB, 2.6.
Indeed, the original variant (composite index on the two fields) works much faster there, namely in 0.07 seconds. But the variant with the i_j-style index still runs in 0.006 - 0.01 seconds, i.e. roughly 10 times faster.

Source: https://habr.com/ru/post/219913/

