Map-Reduce, Using MongoDB as an Example

A family of data-processing approaches and methodologies, grouped under the umbrella terms Big Data and NoSQL, has been gaining popularity recently. One of the computational models applied to large volumes of data is Map-Reduce, a technology developed at Google. In this post I will try to describe how this model is implemented in the non-relational DBMS MongoDB.

As for the future of non-relational databases in general and Map-Reduce technology in particular, one could argue about that ad infinitum, and this post is not about that at all. In any case, familiarity with data-processing methods that differ from those of traditional DBMSs is useful for any programmer's general development, just as familiarity with functional programming languages can be useful to programmers who work exclusively with imperative ones.

The non-relational DBMS MongoDB represents data as collections of documents in JSON format and offers various ways of processing them, including its own implementation of the Map-Reduce model. How practical this implementation is for real-world use will be discussed below; for now, suffice it to say that it is perfectly adequate for getting acquainted with the Map-Reduce paradigm itself.
So, what's so special about Map-Reduce?

Suppose we are developing a large online application whose data is stored, distributed, across several servers in all corners of the globe, and each user has a list of interests. We decide to compute the popularity of each interest by simply counting the users who share it. If we used a relational DBMS and kept all users on a single server, a simple query with a group by would give us the answer. With multiple nodes, we would like the grouping to run in parallel, loading all servers evenly. In the world of relational DBMSs and SQL queries that is rather hard to picture, but with Map-Reduce the task is entirely solvable.

Suppose our database has a users collection whose documents look like this:

{ name : "John", age : 23, interests : ["football", "IT", "women"] } 

As output, we want to get a collection of this kind:

 { key: "football", value: 1349 }, { key: "MongoDB", value: 58 }, //... 

During execution, the system applies Map and Reduce operations to the data; both are defined by the programmer. In MongoDB these operations take the form of functions written in JavaScript: the programmer writes the functions, and Mongo manages when and where they are called.

First, the Map operation is applied to every document in the collection, producing <key, value> pairs. These pairs are then passed to the reduce function, grouped by key. The Reduce operation collapses each group into a single <key, value> pair. Both the key and the value can be anything, including objects.

Consider the map function for our example:

function map() {
    for (var i in this.interests) {
        emit(this.interests[i], 1);
    }
}

As you can see, the document to which the Map operation is applied is accessible via this. The emit function passes the next <key, value> pair on for further processing, so one document may produce several pairs. In this example everything is simple: we pass the interest as the key and 1 as the value, since within a single document each interest occurs exactly once. For the sample document above, map emits three pairs: <"football", 1>, <"IT", 1> and <"women", 1>.

The emitted pairs are grouped by key and passed to the reduce function as <key, list of values>. The reduce function for our example looks like this:

function reduce(key, values) {
    var sum = 0;
    for (var i in values) {
        sum += values[i];
    }
    return sum;
}

To get the total, we sum all the elements of the values array. The Reduce operation is not allowed to change the key, so the function simply returns the resulting sum as its result.
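To make the grouping concrete: by the time reduce is called, MongoDB has already gathered the emitted values for one key, so a call is shaped roughly like this (the values are illustrative):

reduce("football", [1, 1, 1, 1])   // returns 4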

An astute reader may ask: why loop over the whole values array and add up its elements when we know each of them equals one? Wouldn't it be easier to return the length of the array? The answer is no, and the explanation sheds light on a key feature of Map-Reduce.

The point is that MongoDB uses the Reduce operation for intermediate aggregation as well. As soon as several pairs with the same key have been formed, MongoDB may run Reduce on them, producing a single <key, value> pair that is then processed together with the rest, as if it had been produced by the Map operation.

This imposes certain requirements on the implementation of the reduce function:

  1. The type of the value returned by reduce must match the type of the value produced by map (the second argument of emit).
  2. The following identity must hold: reduce(key, [A, reduce(key, [B, C])]) == reduce(key, [A, B, C]).
  3. Applying Reduce again to an already-reduced <key, value> pair must not change the result (idempotence).
  4. The order of the values passed to reduce must not affect the result.

The second requirement also shows what can happen in practice. If the pairs <key, B> and <key, C> were produced on one node and <key, A> on another, a preliminary Reduce on the first node cuts network traffic and increases parallelism. But the price is a significant restriction on the reduce function, which must preserve the identity above.
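To see what the second requirement buys, and forbids, in practice, here is a small sketch, runnable in any JavaScript shell, comparing the sum-based reduce from above with the tempting length-based shortcut under a two-stage invocation:

function reduce(key, values) {
    var sum = 0;
    for (var i in values) {
        sum += values[i];
    }
    return sum;
}

// the "shortcut" that merely returns the number of values
function badReduce(key, values) {
    return values.length;
}

// the sum-based reduce survives re-reduction:
reduce("football", [1, 1, 1]);                              // 3
reduce("football", [1, reduce("football", [1, 1])]);        // 3 - same result

// the length-based one does not:
badReduce("football", [1, 1, 1]);                           // 3
badReduce("football", [1, badReduce("football", [1, 1])]);  // 2 - wrong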

Once all the Map and Reduce operations have completed, the output is a collection of elements of the form:

 { key:"football", value: 1349 }, 

To run the operation, declare these two functions in the mongo shell and then execute the command:

db.users.mapReduce(map, reduce, { out: "interests" })
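Putting it all together, a whole session might look like this in the mongo shell; the insert is only there so the example has some data, and the output shown below is what that single document would produce:

// a sample document, so the example has something to chew on
db.users.insert({ name: "John", age: 23, interests: ["football", "IT", "women"] })

var map = function() {
    for (var i in this.interests) {
        emit(this.interests[i], 1);
    }
};

var reduce = function(key, values) {
    var sum = 0;
    for (var i in values) {
        sum += values[i];
    }
    return sum;
};

db.users.mapReduce(map, reduce, { out: "interests" })

// the results land in the "interests" collection
db.interests.find()
// { "_id" : "IT", "value" : 1 }
// { "_id" : "football", "value" : 1 }
// { "_id" : "women", "value" : 1 }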

Consider another task: suppose we want to know the average number of interests among people of each age. The map function in this case could be:

function map() {
    emit(this.age, { interests_count: this.interests.length, count: 1 });
}

The key here is the age, and the value is an object carrying the number of interests and the number of people of that age who share them (for a single document, that count is one).

If you look closely at the requirements on the reduce function, it becomes clear that the arithmetic mean cannot be computed inside it, since averaging does not satisfy the second requirement: avg(A, avg(B, C)) is generally not equal to avg(A, B, C), for instance avg(1, avg(2, 6)) = 2.5, while avg(1, 2, 6) = 3. So all we can do inside reduce is sum the number of interests and the number of users separately:

function reduce(key, values) {
    var sum = 0;
    var count = 0;
    for (var i in values) {
        count += values[i].count;
        sum += values[i].interests_count;
    }
    return { interests_count: sum, count: count };
}

To get the arithmetic mean into the final collection, we can use the Finalize operation: it is applied to the final <key, value> pair obtained once all Reduce operations for a given key have completed:

function finalize(key, reducedValue) {
    return reducedValue.interests_count / reducedValue.count;
}

The command to run it:

db.users.mapReduce(map, reduce, { finalize: finalize, out: "interests_by_age" })
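Once the command completes, the averages can be read back from the output collection (the numbers below are purely illustrative):

db.interests_by_age.find()
// { "_id" : 23, "value" : 2.5 }
// { "_id" : 24, "value" : 3.1 }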

It should be noted that other implementations of Map-Reduce, including Apache Hadoop, impose no such restrictions on the Reduce operation: there, the reduce function always receives the complete list of values for a key, and intermediate aggregation is handled by a separate operation, Combine, which is semantically similar to Reduce.

Now a few words about the practicality of Mongo's native MapReduce implementation. For aggregation, this DBMS offers a powerful tool called the Aggregation Framework, which performs the same aggregations roughly 5-10 times faster than Map-Reduce. Its parallelism, however, ends where sorting and grouping begin, and there are also limits on the amount of RAM an operation may consume.
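For comparison, a sketch of the first task (counting users per interest) in the Aggregation Framework might look like this:

db.users.aggregate([
    { $unwind: "$interests" },      // one document per array element
    { $group: { _id: "$interests", value: { $sum: 1 } } }
])

Here $unwind turns each element of the interests array into its own document, and $group then counts the documents per interest.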

In general, the Aggregation Framework should be used where a fast response is needed, while Map-Reduce is better suited to preprocessing raw data. Mongo also provides ways to interact with Hadoop, whose Map-Reduce implementation is more efficient than the native one. In any case, MongoDB lets you get acquainted with the Map-Reduce model without having to install and configure such a relatively heavyweight tool as Apache Hadoop.

Source: https://habr.com/ru/post/184130/

