MapReduce or calculations outside of the memory and processor (try without any zaumi)

I have long wanted to tell you about MapReduce, and no matter how you look at it, it’s such a thought that it just takes horror, but in fact it’s a very simple and useful approach for many purposes. And to realize yourself is not so difficult.

I'll say right away - the topic is for those who have n't figured out what MapReduce is. For those who figure out - nothing will be useful here.

To begin with, the idea of MapReduce was actually born to me personally (although I didn’t know that it was called that, and, of course, it came to me much later than the Googlelets).
')
First I will describe how she was born (the approach was wrong), and then how to do it right.

How to count all the words in Wikipedia (wrong approach)

And she was born, like, probably, everywhere - to count the frequency of words when there is not enough ordinary memory (counting the frequency of all words in Wikipedia). Instead of the word “frequency” there should rather be “the number of occurrences”, but for simplicity I will leave “frequency”.

In the simplest case, we can create a hash (dict, map, hash, associative array, array () in PHP) and count the words in it.

$dict['word1'] += 1

But what to do when the memory under the hash ends, and we counted only one hundredth of all the words?

I solved this problem by considering part of the words, until the memory runs out, I saved the hash to disk. That is, straight line by line in the file:

aardvark | 5
aachen | 2

There was a problem - and how to combine these files? After all, each of them occupies the entire RAM.

At first there was an idea to take only the most popular 1,000,000 words from each file and merge them - it will fit into the RAM and count at least the top of the list (the most popular words). This, of course, worked, but it turned out that millions of lower words were lost, and there were a lot more.

Got the idea to sort the files.

Then we take 20 sorted files, read the first 1000 lines from each of them, they will be about the same words (sorted files). We summarize and form a new hash, it will contain only words starting with “aaa ...” and the like, which we save into new files. We read the following 1000 lines, all the same. There approximately in all files there will be words "aab ..."

Thus, a new file is already formed which is much smaller. However, it will still repeat the words. Again, sort it, read it by 1000 lines, summarize it. It turns out almost the correct file (some words can still be beyond the 1000 lines), repeat a couple of times ... in the end we get a file with very few errors (but they exist).

It is a chore for a long time, but it has never occurred to a better one

Weak spot wrong approach

There was one weak point in this approach - namely, the merging of the original 20 files. How to make it better?

The problem arises from the fact that some words will not be in some files or they will be in different blocks of 1000 lines each. That is, if I could take from all 20 files not the first 1000 lines, but only one line, but with the same word - I would be able to combine all 20 files in one pass.

How to do it? In general, this is the last step of the MergeSort algorithm - the union of sorted lists. If you know - skip.

We take on the first line of all 20 files, look for the minimum first element (word) - it will be the most minimal at all in all, since our files are sorted. Suppose it will be the word "aardvark" We take from all 20 lines that we have read only those that relate to this word "aardvark". And from there, from where we take it out - only in those files we read the second line. Again, we are looking for the minimum among these 20. By analogy, continue until we reach the end of all files.

MapReduce in the simplest form

Actually, I almost invented for myself what Google invented before me a decade ago and called MapReduce.

The invention of bicycles continues to this day.

So there is a line : "foo bar baz bar" .

You need to get the output: { foo: 1, bar: 2, baz: 1 } .

Step one, take a string , break it up into words, and give out such arrays (or, more precisely, “tuples” - “tuples”):

[ 'foo', 1 ]
[ 'bar', 1 ]
[ 'baz', 1 ]
[ 'bar', 1 ]

(I will continue to omit the brackets and quotes where and so it will be clear)
We take them, we sort:

bar, 1
bar, 1
baz, 1
foo, 1

We notice that bar goes twice in a row, so we merge into this view:

bar, (1,1)
baz, (1)
foo, (1)

(1,1) is like an embedded array, that is, technically it is like this: ["bar", [1,1]] .

Then simply add the second elements of the arrays . We get:

bar, 2
baz, 1
foo, 1

Exactly what they wanted.

The main question - what for goat bayan ... or what are we doing here at all and why?

Back to the past

If we imagine that we have a computer in which only 2 lines fit in and it can perform only one operation with a line per minute . (Stop giggling! After you count all the words on Wikipedia at least once - you have the right to laugh at the memory restrictions that you set, it still doesn’t fit, even if you have many gigs, and if it does, count all over the Internet :)).

We can (from "foo bar baz bar" ) make two files like this:

file1.txt
[ 'bar', 1 ]
[ 'foo', 1 ]

file2.txt
[ 'bar', 1 ]
[ 'baz', 1 ]

We have two lines in memory - everything is in order, we’ve met the memory limits.

Now using the step from MergeSort , we can combine these files line by line:

bar, (1,1)
baz, (1)
foo, (1)

At the same time, in memory, each time we have only two lines stored from 2 files - no more is needed.

Actually, what we have done is already MapReduce.

The step, which from the words produces arrays with singles ( , 1 ) - this step is called “Map” .
The step that summarizes (1,1) is the “Reduce” step .

The rest of the steps will be made by the algorithm itself (sorting and merging via MergeSort).

Map, Reduce? What is it?

These steps themselves do not necessarily consist in giving out edinichki in the case of "Map" or add in the case of "Reduce". These are just functions that can take something and produce something. Depending on the purpose.

In this case, “Map” is a function you have written that takes a single word and returns (, 1) .

And “Reduce” is a function you wrote that takes an array (, (1,1)) and returns (, 2) .

Simply put in Python:

  words = ["foo", "bar", "baz"]
 def map1 (word):
   return [word, 1]

 arr = ["foo", [1,1]]
 def reduce1 (arr):
   return [arr [0], sum (arr [1])]

or PHP:

  $ words = array ("foo", "bar", "baz")
 function map1 ($ word) {
   return array ($ word, 1);
 }

 arr = array ("foo", array (1,1))
 function reduce1 (arr) {
   return array ($ arr [0], array_sum ($ arr [1]));
 }

So, we went around the memory limit, but how to get around the speed limit?

Imagine that we have two such computers. We give each of them the source line and speak to the first (more precisely, MapReduce says): count only words in odd places, and second, count words only in even places.

The first gives:

"foo bar baz bar":
foo, 1
baz, 1

The second gives:

"foo bar baz bar":
bar, 1
bar, 1

We (more precisely, MapReduce) collect results from both, sort, then run through MergeSort, as above:

bar, (1,1)
baz, (1)
foo, (1)

Exactly the same result as when one computer counted!

Now we (MapReduce) distribute again to two computers: we give the first only odd lines, the second we give even and we ask each computer to do a Reduce step (add the second digits).

Actually, it is clear that these lines do not depend on each other, so the result will again be what is needed.

The main thing is that two computers worked in parallel and, consequently, two times faster than one of them (if not for the loss of time to transfer data from one to the other).

Premature withdrawal

Phew! So MapReduce - it is needed in order to consider something that either needs to be done faster, or that there is not enough memory (or both).

A more interesting example is sorting by popularity (cascades)

Suppose we want to count the number of words in Wikipedia and simultaneously build a list in the reverse order of their popularity - from the most popular to the most unpopular.

It is clear that all the words of Wikipedia will not fit into the memory, and then for the reverse sorting, then this giant array will not fit into the memory. We will need a MapReduce cascade - the result of the work of the first MapReduce will be fed to the input of the second MapReduce.

To be honest - I do not know whether the word "cascade" is correct, applies specifically to MapReduce. I use this word for myself, because it explains no more than what needs to be done (the result of one waterfall of words falls in MapReduce and cascades immediately into the second MapReduce).

Okay, how to count the words - we already know:

"Foo bar baz foo"

The Map step we wrote gives:

foo, 1
bar, 1
baz, 1
foo, 1

Further MapReduce combines (itself, not you, as a programmer) them in:

bar, (1)
baz, (1)
foo, (1,1)

And the Reduce step we write gives:

bar, 1
baz, 1
foo, 2

Now imagine that we thought the whole of Wikipedia and this array contains billions and billions of words. Sort it in memory will not work. Take another MapReduce , this time the Map will do this trick:

[, 15] -> map () returns -> [-15, ]
[2, 15] -> map () returns -> [-15, 2]
[3, 120] -> map () returns -> [-120, 3]
[4, 1] -> map () returns -> [-1, 4]

What is it for?

MapReduce, before it goes to your Reduce, sorts all these arrays by the first element of the array (which is a negative number). MapReduce will be able to sort even if the entire volume of data does not fit everything in memory - that's the beauty. For all Wikipedia words, you simply cannot make arsort($words) , but MapReduce can.

Why minus before numbers?

Because MapReduce always sorts in ascending order , but we need descending. How, using sorting only in ascending order, sort the numbers in decreasing order? Multiply by minus one before sorting and again by minus one after.

Ascending positive numbers: 1, 15, 120
Ascending negative numbers: -120, -15, -1 (what we need, only with a minus sign, which we then simply remove by multiplying by -1)

The following will come to the input of Reduce:

-120, (3)
-15, (, 2) <-- - MergeSort !
-1, (4)

Charm, but we had two words “frequency” 15 and they were grouped by MergeSort. We will fix it.

Now, in our Reduce, we can only multiply the first number by -1, and then output one array for the first line, two arrays for the second, and one again for the third.

In fact, depending on what embodiment of MapReduce you will use - you may not be able to produce two arrays in the Reduce step, because only one array will be required at the output - then just after the Reduce step do this in your program.

We get:

120, 3
15, ,
15, 2
1, 4

Beauty! What was needed.

Again, remember that the main thing that we bypassed here is that this is an example of four lines, and in Wikipedia there are billions of words that do not fit in the memory.

How to make the simplest MapReduce to play?

In PHP : the simplest example .
In Python, the simplest example (see below about the Python version).

In the code, I indicated what and where it should be to make a more or less complete MapReduce with blackjack ... in the sense of files and MergeSort. However, this is a reference implementation , so to speak, which will allow you to play around and understand how MapReduce works. This is still MapReduce, just this particular implementation is not any better than the usual hash in terms of memory.

I chose PHP, although it is not the most reasonable for this purpose, because almost any programmer can read PHP, and it will be easier to translate it into the desired language.

Hints and cheats

Yes, I recommend storing the JSON array representation (json_encode) line by file - there will be fewer problems with spaces in words, with unicode, numbers and data types, that is:

["foo", 1]
["bar", 1]
["foo", 1]

Hint - in Python, the last step MergeSort has already been implemented is heapq.merge(*iterables) .

That is, to connect 10 files with JSON views is enough:

  items = list (itertools.imap (json.loads, open (filename)) for filename in files)
 for item in heapq.merge (* items):
   # .... reduce (item) ....

In PHP with the implementation of MergeSort, I suspect you need to mess around in fifty lines. Unless of course in the comments, no one is better than the option.

In Python, yield and __iter__ for MapReduce allow you to do very interesting things! For example:

  x = MapReduce ()
 for word in "foo bar" .split ():
    x.send ((word, 1))

 for word, ones in x:
    print word sum (ones)

class MapReduce - you have to write to yourself (I fit in 24 lines in the simplest working form, you can do less - by simplifying iter_group, this is an analogue of the function group_tuples_by_first_element from the example for PHP).

Be careful - this method is not quite classical for MapReduce and it will be difficult to parallelize it on many machines (however, it’s rather trivial to do work with data volumes more than the available memory). The map_reduce(source_data, map1, reduce1) method map_reduce(source_data, map1, reduce1) , where map1 and reduce1 are functions — more correct.

The implementation of Hadoop MapReduce is the most popular solution. (I did not try it, just know what is the most popular).

Afterword

So, I hope my story about “MapReduce without any bumps” will be useful.

MapReduce is a very useful thing for any large scale calculations. Virtually any SQL query, even on several tables, is not very difficult to decompose into MapReduce + join_iterator (about this another time).

If there are strengths - in the next topic I will describe how to use MapReduce to consider more interesting tasks than commonplace words - such as, for example, to count links on the Internet, the frequency of words in the context of other words, people in cities, products according to the prices of hundreds of companies, etc. .

Yes, all ahtung here! MapReduce is patented by Google, but rather for defensive purposes - the same Hadoop they officially allowed to use this method. So - handle with care.

Part two: more advanced examples .

Yoi Haji
View, as always, with Habra ,
2010

(someday I will learn to explain briefly ....)

Source: https://habr.com/ru/post/103467/

All Articles