📜 ⬆️ ⬇️

MapReduce: more advanced examples, try without any zaumies

In order not to put it off the shelf, I immediately tell several other examples for MapReduce, promised in the topic " MapReduce without any zaumi ". (If you do not fully understand what MapReduce is, read that topic first! Without it, you will not understand)

Let's talk here about the calculations of nationalities in cities, average ratings and drives of students, TIC, PageRank, inbound links, niche keywords, words, synonyms, social networks and mutual friends. We will try to do without mathematical signs and zaumi.

However, the topic itself is complex and yet you have to strain your brains. When you understand - it will be very simple.

Inbound links

Suppose we have the Internet. There are outgoing links on the Internet.

Suppose at the entrance we have such data about OUTGOING links, collected by our spider:

habrahabr.ru -> thematicmedia.ru, apple.ru, microsoft.com, ubuntu.com, yandex.ru
thematicmedia.ru -> habrahabr.ru, autokadabra.ru
autokadabra.ru -> habrahabr.ru, yandex.ru

Those. we know that Habr refers to Apple, MS, Ubuntu and Yandex but who refers to Habr? Yes, the question is primitive, but nevertheless we will decompose it into MapReduce. It will be more interesting further and this example will be required.

The “Map” step (which we write ourselves) does the following:
(One more time - if you do not understand what it is about - read " MapReduce without any zaumi ").

All that goes in the examples after " // " is a comment, only explains where it came from, is not directly related to MapReduce.

map("habrahabr.ru -> thematicmedia.ru, apple.ru, microsoft.com, ubuntu.com") ->
["thematicmedia.ru", "habrahabr.ru"] // thematicmedia.ru habrahabr.ru
["apple.ru", "habrahabr.ru"]

map("thematicmedia.ru -> habrahabr.ru, autokadabra.ru") ->
["habrahabr.ru", "thematicmedia.ru"]
["autokadabra.ru", "thematicmedia.ru"]

The “Reduce” step does not do anything, because at the entrance we already get this:

["autokadabra.ru", ["thematicmedia.ru"]]
["habrahabr.ru", ["thematicmedia.ru", "autokadabra.ru"]] <-- ,

Here is the answer - INCOMING links.

What sites give TIC or PageRank?

Aaaa, TIC, the holy grail of the link trading ( circa 2010). Let's take some conventional rating unit and call it MegaRank (it could be TIC, maybe PageRank, etc. - there is no difference).

Suppose such a theory - each site has a certain MegaRank, which it “transmits” in some quantity, or “does not transmit” and generally “kills” MegaRank. We only know MegaRank of end sites, how to find out which of the initial ones how much MegaRank transmits?

Again we have the initial data Site A has MegaRank 100 and incoming links from sites B, C, D. Site B has MegaRank 0 (we assume that MegaRank is “killed” by bad links, since we cannot exclude this theory) and has incoming links from sites A, D, E, F and so on ...

We decompose the problem like this: Site A received something from B, C, D, which became equal to 100. We assume that each of B, C, D gave site A 33 MegaRank units. Site B has a killed MegaRank, we will assume that its MegaRank = -500, and not zero. Thus, we will try to significantly underestimate those who "kill" MegaRank (say, site A, which has bad advertising and is undervalued by search engines). It turns out that each of the sites A, D, E, F gave -500/4 = -125 MegaRank units to site B.

How to calculate incoming links for outgoing - I have already described above, we will assume that we have already done this ...

So, we have the input data:

A (100) <- B,C,D // MegaRank=100 B,C,D
B (-500) <- A,D,E

Our “Map” does the following:

A (100) <- B, C, D:
B, 100/3 = 33 // B 33 MegaRank B A
C, 33
D, 33

B (-500) <- A, D, E, F:
A, -500/4 = -125
D, -125
E, -125
F, -125

At step Reduce comes:

A (-125) // "" -125 MegaRank
B (33)
C (33)
D (33, -125) // "D" 33-125 = -92 MegaRank
E (-125)
F (-125)

Reduce adds everything up again, we find that A is the most useless site, as well as E, F, which in the absence of other data really refer only to the site with the MegaRank reset. But D, despite the fact that it refers to both A and B - is “suspicious”, but better than just A.

Now we can build them in descending order of the transmitted MegaRank with an example from MapReduce without any zaumi (sorting by popularity).

This example shows much better why MapReduce is needed - we can simultaneously analyze millions of sites with billions of links and run into the problem of limited memory even on one machine, and attach other machines for more parallel power.

(And before you run to ask - yes, I checked the theory on TIC, yes, it works :) I just have this TIC without interest somehow).

Thus - having only data about outgoing links from sites and a certain parameter (TIC, PageRank) of sites to which these links are directed - we can find those sites that convey this parameter well.

Percentage of population in cities

Suppose we have information on city districts, that in the Gdetto district of Gorod1 there live 150 Russians, 190 Belarusians, 3 Udmurts, and in the Tutto district of Gorod1 3 Russians, 5 Belarusians live.

Now our task is to calculate the percentage of the population in cities (we know in the areas of these cities). Problem? Millions of districts, the data is not grouped by city and all together will not fit into the memory ...

Very simple on MapReduce:


map(" Gdetto Gorod1 150 , 190 , 3 ") ->
"gorod1", [ (150, ""), (190, ""), (3, "") ]

map(" Tutto Gorod1 3 , 5 ") ->
"gorod1", [ (3, ""), (5, "") ]

map(" Butto Gorod2 1 ") ->
"gorod2", [ (1, "") ]


At the entrance:
"gorod1", [ (150, ""), (190, ""), (3, ""), (3, ""), (5, "") ]
// , -
"gorod2", [ (1, "") ]

I think that it is obvious that further it is just possible to add up these values, which will quite fit into the memory and be divided by the amount of people in each city.

Another option is how you can do it - you can think it out for yourself.

Who has at least two mutual friends?

Suppose A, B, C, D, E - invented 30 million users of the social network Fingers in the Rosette. Suppose we know that:

B, D, E
B A, D, E
C B, E

How do we find one of them has at least two mutual friends?

Very simple on (guess?) MapReduce. We take over each pair of friends for each person and put them in the first place.


(B,D), A // "" " B D"
(B,E), A
(D,E), A
(A,D), B
(A,E), B
(D,E), B
(B,E), C

“Reduce” receives (and does nothing, except for selecting all the lines in which the second element has several values):

(A,D), (B) // "A,D" "B"
(A,E), (B)
(B,D), (A)
(B,E), (A,C) <-- - "B,E" "" ""
(D,E), (A,B) <--


A and C have common friends B, E
A and B have common friends D, E

Achtung! Algortim O (n 2 ) , that for those who are in the tank - means that with a large amount of input data has a good chance to finish the job a little earlier than the sun goes out. In fact, of course, earlier, but quadratic algorithms are bad.

Find niche topics in keywords

Suppose we have some keywords:

Suppose we want to highlight among them what we could write about. Such a “niche” theme can be considered something that has at least a couple of different keywords with the participation of this word and is in itself a key word. For example, “coffee makers” is a topic, because “coffee makers” and “Babrababr coffee makers”, but “garage” is not a niche, because they themselves are not the key word. “Mile” is also not a “niche”, because it does not have different keywords, except for itself. Well, how? Ready to calculate this without MapReduce? :)

And it is very simple. We take each line and break into words (preferably into n-grams, but this is already). For each of them, we give out “FULL” if this word occupies the entire line, or “PART” if it occupies only a part. Those. from the example above:


, PART // " "
, PART // " "
, FULL // ""

“Reduce” to the input receives:

, (PART)
, (FULL)

Now we just look where there is PART and FULL: “babrababr”, “gate” and “coffee makers” - what we were looking for - our “niches”.

Search for pseudo-synonyms

Suppose we have the Internet. Suppose there are many phrases there, but among them we notice that there are phrases like “bought a big computer”, “bought an expensive computer” ... Here we think that “expensive” and “large” are pseudo-synonyms. We need them to make bad websites and earn WebCash until the Gotoooo search engine throws out our bad websites.

The problem is that there are billions of words, memory is not enough. Yes, and it may turn out to be “bought a computer yesterday”, and “yesterday” is not a “big one” at all, and in general it was here by chance once worried ... What to do? What kind of algorithm to use (okay, okay, remove the stones, I know that you all guessed it). Go!

Input data we take the text and directly in order choose from it three words each, for interest I will leave only what is needed:
" "
" "
" "
" "
" "

Stage 1 (we will need a cascade of two stages):


" * ", "" // " "
" * ", "" // " "
" * ", ""
" * ", ""
" * ", ""

The asterisk is just an asterisk, which means “any word here”, so that Reduce comes like this:

" * ", ["", "", ""]
// "" ""

" * ", ["", ""]

Reduce does nothing.

Stage 2 (cascade):

“Map” - it gets what is written a little higher at the input, and gives the pairs a search from the second values ​​([“big”, “yesterday”, “expensive”]):

, // ["", "", ""]
, // ["", "", ""]
, // ["", "", ""]
, // ["", "", ""]
, // ["", "", ""]
, // ["", "", ""]

, // ["", ""]
, // ["", ""]

On Reduce, they will come as:

, (, , )
, (, )
, (, , )

Do not pay attention that somewhere there are quotes, somewhere there is no - there is no secret meaning in this, just put it somewhere, somewhere else it doesn’t.

Now “Reduce” can only count the words in brackets (the second element of its argument), which will fit into the memory and recline those elements that occur only once:

- (2) // "" - "", 2
- (2)

Achtung! Difficulty O (n 2 ) again!


Suppose we already have two MapReduce, which we need to combine. Suppose one MapReduce considers us the average grade of students in a class of a dozen teachers' journals, and the second - the number of drives to the police for the last year of students, based on data from hundreds of police stations. How do we merge two MapReduce into one result ... in other words, make INNER JOIN or OUTER JOIN.

Make the first MapReduce give (score):
, ("", 3.5)

And the second MapReduce issued (drives):
, ("", 2)

Then transfer all this directly straight in a row:
, ("", 3.5)
, ("", 2)

Then on "reduce" you will get:
, [ ("", 3.5), ("", 2) ]

Instead of conclusion

It’s clear that MegaRanki, niches, friends, synonyms, nations, losers is far from being the only use of MapReduce, but for now, I think this will be enough for you to start thinking how to apply it. I found a lot of uses for it and use it all the time.

As I said before - almost ANY SQL query is decomposed into MapReduce, it just takes a little workout. What for? Then, in order to make faster, and not always, the necessary functions in SQL in general. For example, the same n-gram generator (wordy phrases) from the incoming line ... Maybe I’m certainly mistaken, but the fact that MapReduce is a drop-off and very useful (and amazingly abstrusely described in the network) piece is a fact. I hope you managed to tighten in the network MapReduce.

In the past topic there is a reference implementation (reference implementation) of MapReduce in Python and PHP.

Yoi Haji
View, as always, with Habra ,

(someday I will learn to explain briefly ....)

Source: https://habr.com/ru/post/103490/

All Articles