When I first encountered MapReduce, I spent a long time looking for real examples of its application. The notorious counting of words in a text, found in every second article about MapReduce, does not qualify as the example we are looking for. Finally, in two Big Data courses on Coursera, I found not only living examples but the theoretical background for a deeper understanding of what is going on. And the opportunity to apply that knowledge was not long in coming.
In this small article I want to share the experience of implementing a product filtering system, classic for most online stores, for a travel portal, where the task was to search and filter across a database of tens of thousands of hotels, each described by a number of parameters and by several dozen provided services out of hundreds possible.
Formulation of the problem
There is a database of several tens of thousands of hotels. Each hotel is described by:
- A number of parameters with a small set of options each. For example, the hotel's star rating: from 1 to 5. Or the meal plans offered: there are not many of those either. And so on. The only difficulty here is that some parameters allow exactly one value per hotel, while others allow any subset of the possible values. But with such a small number of options, this is not a problem yet.
- Price. To simplify the task, the base price is the minimum price for a double room per day with arrival tomorrow. This price can range from 0 (oh, if only...) to thousands. The filter offers several price ranges to choose from, so we need to find the hotels whose price falls within a given interval.
- Regional affiliation. The situation is the same as in the first item, except that the number of possible options for the "country" criterion is around two hundred, and for "city" it already runs into the thousands. The task is eased by the fact that each hotel has exactly one value for the country and city parameters.
- And finally, the most interesting part: the list of services the hotel provides. When I started on this task, the list held more than 700 items. By manually discarding everything unnecessary and grouping the rest, the number of services was cut down to 300, which is still not small. What that threatens us with, we will see below.
Now, the actual task: implement filtering on an arbitrary combination of possible parameter values describing a hotel and, when displaying the list of filters, show the number of matching hotels next to each filter. Each combination of filters must have a distinct URL and produce a reproducible set of hotels.
Searching for hotels by criteria over tens of thousands of objects with relatively static data can be implemented more or less painlessly: manually analyze and optimize each query and build indexes, either on a couple dozen tables in the case of a normalized database, or on one wide table if the problem is solved through denormalization.
But one small "beauty", namely counting the number of hotels in the sample for each possible value of the search parameters when the number of those values approaches 1000, initially caused a mild stupor.
A little combinatorics
A similar task for a price aggregator was solved head-on in the distant past, with a large number of restrictions: a huge table with precomputed combinations of filters and product counts for each combination was stored in the database. The table was fully recalculated every night, and the operation took quite a long time.
In our case, we want something more universal, without restrictions such as a cap on the number of simultaneously applied filters. Why is this condition so important? Let's answer that question armed with Excel and basic combinatorics.
For this purpose Excel has a remarkable function, COMBIN, which returns the number of combinations of a given number of elements taken a given number at a time. For clarity, let's take 10 elements and count the number of possible combinations with 1, 2, 3, and so on elements selected. We get this table:
Selected elements | Combinations
1 | 10
2 | 45
3 | 120
4 | 210
5 | 252
6 | 210
7 | 120
8 | 45
9 | 10
10 | 1
In total: 1023 variants. Thus, for 10 possible filters we would have to create a table with 1023 combinations of filters and, for each combination, calculate the number of hotels matching it. All well and good, but as the number of filters grows, the number of combinations increases to a figure that personally plunged me into shock.
With the number of services equal to 200 (and in reality there are many more) and any possible number of selected services, we get 2^200 - 1 combinations, a number on the order of 10^60. In other words, we would need a table with roughly 10^60 records of precomputed hotel counts, which would also have to be updated every day.
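To double-check the arithmetic, here is a minimal sketch (mine, not from the project) verifying that for n independent filters the number of non-empty combinations is 2^n - 1, i.e. the sum of C(n, k) for k = 1..n:

```javascript
// Minimal sketch: the number of non-empty filter combinations for n filters is 2^n - 1.
// BigInt is needed because 2^200 vastly overflows a 64-bit double.
function combinationCount(n) {
    return (1n << BigInt(n)) - 1n;
}

console.log(combinationCount(10));  // 1023n  -- matches the table above
console.log(combinationCount(200)); // ~1.6 * 10^60 -- hopeless to precompute
```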
Conclusion: keeping something like that in a database, even using bitmaps, and keeping it up to date is simply unrealistic. No, we must look for another way.
We use MongoDB
When we were choosing a database to store 1+ billion hotel prices for the travel portal, MongoDB was tested among the first, because at the time I had just completed a course on that very database and everything seemed painted in iridescent colors: sharding, replication, a document-oriented model, high performance... But it all instantly shattered against reality: the long-hanging ticket at 10gen about locking at the collection level (plus more ticket and another ticket) and the extremely slow concurrent insertion of documents with a large number of composite indexes did their dirty deed. After many experiments the database was loudly sent to the scrap heap. Forever, it seemed. But as often happens in life, every task has its own tool, and MongoDB was waiting in the wings.
Why MongoDB?
For this task, MongoDB with its schema-less, document-oriented approach could not fit better: we represent each hotel as a document with an arbitrary set of attributes and (importantly!) with the ability to store arrays of values for each attribute. This means we can keep the lists of services and other multi-valued parameters in arrays inside the document, index them, and run arbitrary searches over the collection of documents. Of course, other NoSQL databases were also considered, but MongoDB won out with its simplicity and its ability to store, index, and search arrays inside documents.
Here is the trimmed (hello, NDA) structure of the document describing a hotel:
"_id" : ObjectId(), "hotel_id" : int, "country_id" : int, "city_id" : int, "category_id" : int, "services" : [int, int, int...]
A cron job updates this collection from the main database using MongoDB upserts.
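A hedged sketch of what one such sync step might look like in the mongo shell (the collection name and all values are my assumptions, not the real project code):

```javascript
// Sketch of the nightly sync: insert the hotel if missing, otherwise refresh its fields.
db.hotels.update(
    { hotel_id: 42 },              // match by our own hotel id
    { $set: {
        country_id: 1,
        city_id: 100,
        category_id: 5,
        services: [10, 20, 30]     // ids of provided services
    } },
    { upsert: true }               // the upsert mentioned above
);
```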
Getting a list of hotels for an arbitrary set of filters from this collection is then no problem at all. For example:
- {country_id: 1, services: {$all: [10, 20]}} - all hotels from country 1 that provide services 10 and 20 at the same time.
- {category_id: 5, services: {$in: [20, 30]}} - all hotels of category 5 that provide service 20 or 30.
Exactly the same filters are formed not only for services but for all the other parameters describing a hotel. This way we avoid writing ten-level SQL queries against the main database. At the same time, performance stays very high, because on a relatively static collection of thousands or tens of thousands of documents, with reasonable composite indexes and enough RAM, MongoDB copes beautifully.
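For illustration, indexes for the queries above might look something like this (a sketch; the actual set of indexes is my assumption and depends on the real query patterns):

```javascript
// Composite indexes covering the most common filter prefixes.
// ensureIndex is the era-appropriate shell helper; newer shells call it createIndex.
db.hotels.ensureIndex({ country_id: 1, services: 1 });
db.hotels.ensureIndex({ city_id: 1, services: 1 });
db.hotels.ensureIndex({ category_id: 1, services: 1 });
```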
Fine! But where does MapReduce come in?
So, after parsing the parameters of a GET request with filters, we can query MongoDB for the list of hotels. Now comes the turn of building the list of filters and getting the number of hotels matching each filter. The tool is obvious from the article title: MongoDB & MapReduce.
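Before the MapReduce part, here is a hypothetical sketch of how the parsed GET parameters could be turned into such a MongoDB query (the function and parameter names are mine, for illustration only):

```javascript
// Turn parsed GET parameters into a MongoDB filter document.
function buildQuery(params) {
    var query = {};
    if (params.country_id)  query.country_id  = params.country_id;
    if (params.city_id)     query.city_id     = params.city_id;
    if (params.category_id) query.category_id = params.category_id;
    if (params.services && params.services.length) {
        query.services = { $all: params.services }; // hotel must provide all selected services
    }
    return query;
}

// e.g. buildQuery({country_id: 1, services: [10, 20]})
// -> {country_id: 1, services: {$all: [10, 20]}}
```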
I will not dwell long on MapReduce theory. Let me just remind you that this approach to data processing consists of two stages: map and reduce. Each stage can run in parallel on several cores / processors / servers / clusters, operating on a single data array. As the map stage prepares data, it is handed over through shared memory (in our case, through MongoDB) to the reduce stage, where it undergoes final processing and is transformed into the required array.
With this theory in mind, everything turns out to be surprisingly simple in practice. Here is the source code of the map and reduce methods:

```javascript
var map = function() {
    emit({"category": this.category_id}, 1);
    if (this.services) this.services.forEach(function(value) {
        emit({"service": value}, 1);
    });
    if (this.types) this.types.forEach(function(value) {
        emit({"type": value}, 1);
    });
};

var reduce = function(key, values) {
    return Array.sum(values); // sum up the 1s emitted for each key
};
```
How does it work?
Very simple!
The map function receives each of the hotels stored in MongoDB, and for each of its services / meal plans / location types (and so on down the list) calls emit(), which adds a "1" to an in-memory associative array under the key of that service or parameter value.
The reduce function is as simple as it gets: it sums up the ones collected for each element of the associative array produced at the map stage.
That's all!
At the exit, we get an array of all services with the hotel count computed for each service. Given that the whole operation parallelizes perfectly, this array is built in a split second. And let's not forget about caching, of course. In our case caching can be used to its full extent: the information is relatively static, the result set takes little space and can easily be put into memcached under a key derived from a Whirlpool hash of the original GET request.
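Purely as an illustration, the caching layer could look roughly like this in Node.js (the "memcached" client package and Whirlpool support in the crypto module are my assumptions; the original project may use a different stack entirely):

```javascript
var crypto = require('crypto');
var Memcached = require('memcached');             // npm "memcached" package, assumed
var memcached = new Memcached('127.0.0.1:11211');

// Cache key: Whirlpool hash of the raw GET query string
// (requires an OpenSSL build that supports whirlpool).
function cacheKey(queryString) {
    return crypto.createHash('whirlpool').update(queryString).digest('hex');
}

function getFilterCounts(queryString, compute, callback) {
    var key = cacheKey(queryString);
    memcached.get(key, function(err, cached) {
        if (cached) return callback(null, cached);    // cache hit
        var counts = compute();                       // run the MapReduce described above
        memcached.set(key, counts, 3600, function() { // cache for an hour
            callback(null, counts);
        });
    });
}
```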
The main trick is that MapReduce can be applied to part of the hotel collection. That is, we can take the very same query that was built to fetch the list of hotels, pass it to the MapReduce call, and perform all the work on what is often a very small number of hotels. As a pleasant side effect, the resulting array contains only the services present in the current sample, so we can display exactly those filter options and hide the rest, which ultimately has a positive effect on usability.
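A sketch of such a call in the mongo shell, reusing the map and reduce functions above (the filter values are made up):

```javascript
// Run map/reduce over only the hotels matching the current filters.
var result = db.hotels.mapReduce(map, reduce, {
    query: { country_id: 1, services: { $all: [10, 20] } }, // same query as the hotel list
    out: { inline: 1 }                                      // result is small: keep it in memory
});
// result.results -> [ { _id: { service: 10 }, value: 123 }, ... ]
```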
Is it possible to optimize?
Yes, it can. The thing is that when processing a large number of hotels (say, 100,000), each with sets of around 100 parameters in total, 10 million emit() calls will be made, and that can hurt performance even with parallelization. A solution was found:
```javascript
var map = function() {
    var obj = { "categories": {}, "services": {} };
    obj.categories[this.category_id] = 1;
    if (this.services) this.services.forEach(function(value) {
        obj.services[value] = 1;
    });
    emit(1, obj); // a single emit per hotel, with all parameters packed into one object
};
```
In the optimized solution, instead of several emits, each map step packs the hotel's parameters into an object that is passed to a single emit; unpacking, counting, and folding then happen at the reduce stage. Visually, the solution came out much heavier, but the number of emits drops to the number of hotels, naturally at the cost of heavier packing and unpacking operations. As a result, this variant only pays off on very large hotel samples and is guaranteed to lose on samples of up to ten thousand hotels. Given the realities of our project, we decided not to burden the code with extra checks and different MapReduce variants for each case, and left this method in reserve.
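For completeness, the matching reduce for this variant could look roughly like this (my reconstruction based on the description above, not the project's code):

```javascript
// Merge the packed objects, summing the counters per category/service id.
var reduce = function(key, values) {
    var acc = { "categories": {}, "services": {} };
    values.forEach(function(obj) {
        ["categories", "services"].forEach(function(group) {
            for (var id in obj[group]) {
                acc[group][id] = (acc[group][id] || 0) + obj[group][id];
            }
        });
    });
    return acc; // same shape as the emitted values, so re-reducing stays correct
};
```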
What's next?
Of course, this article lays out only the basic principles of fetching a sample of hotels and building a filtering system, greatly simplified from the start by the assumption that hotel prices are also stored in MongoDB. In reality everything is somewhat more complicated, and the real magic begins when the user sets a price filter and we need to find all hotels that match the service filters and whose price for a given number of people / nights / dates falls within a given interval. Especially considering that hotel prices are fetched via a request to a REST service. That magic of working with these billion-record price databases, and how to achieve a response within at most 100 ms on a cold cache and the hardest queries, I plan to gradually reveal in the following articles.
Summing up
With the help of MongoDB and MapReduce, we managed to build a solution that is very light on resources, extremely simple to understand, and scales beautifully.
I will be glad to answer questions!
Thanks for your attention!