A long time ago I asked myself: "How efficient is MapReduce, really?" The opportunity came up, so I decided to test it on a cluster of four nodes in the following configuration:
- 3 nodes: Intel Xeon CPU W3530 @ 2.80GHz, 12GB RAM
- 1 node: Intel Xeon CPU X5450 @ 3.00GHz, 8GB RAM
Operating system: Debian; Hadoop 1.2 (from the official site); Java 7 (from Oracle).
Initial data:
- XML file: dumps.wikimedia.org/enwiki/20130904/enwiki-20130904-stub-meta-current.xml.gz (the unpacked file takes 18GB of space)
- 31 million wiki page entries
- bzip2 compresses this file to 2GB
- 593,045,627 lines in the file
Example of one entry:
```xml
<page>
  <title>AfghanistanHistory</title>
  <ns>0</ns>
  <id>13</id>
  <redirect title="History of Afghanistan" />
  <revision>
    <id>74466652</id>
    <parentid>15898948</parentid>
    <timestamp>2006-09-08T04:15:52Z</timestamp>
    <contributor>
      <username>Rory096</username>
      <id>750223</id>
    </contributor>
    <comment>cat rd</comment>
    <text id="74089594" bytes="57" />
    <sha1>d4tdz2eojqzamnuockahzcbrgd1t9oi</sha1>
    <model>wikitext</model>
    <format>text/x-wiki</format>
  </revision>
</page>
```
As a test, I took a simple task that can be solved both in the console with traditional tools and with MapReduce. In short, the problem looks like this:
```shell
time bunzip2 -c /mnt/hadoop/data_hadoop/test.xml.bz2 | grep "<title>" | wc
31127663  84114856 1382659030

real    9m32.953s
user    10m16.779s
sys     0m12.737s
```
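The post does not show the MapReduce job itself, so here is a minimal sketch of the equivalent count in Hadoop Streaming style: a mapper that emits a pair for every `<title>` line and a reducer that sums the counts. The function names and the local sample run are my own illustration, not the code that actually ran on the cluster.

```python
def mapper(lines):
    # Emit a ("titles", 1) pair for every line containing a <title> tag,
    # mirroring the grep "<title>" stage of the console pipeline.
    for line in lines:
        if "<title>" in line:
            yield ("titles", 1)

def reducer(pairs):
    # Sum the counts per key; with a single key this gives the total
    # number of matching lines (the first column of wc above).
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals

if __name__ == "__main__":
    # Local single-process demo on a few lines of the dump format.
    sample = [
        "<page>",
        "  <title>AfghanistanHistory</title>",
        "  <ns>0</ns>",
        "  <title>History of Afghanistan</title>",
    ]
    print(reducer(mapper(sample)))  # {'titles': 2}
```

On the cluster, Hadoop runs many such mappers in parallel, one per input split, which is where the speedup below comes from.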
The same task was solved on the Hadoop cluster in 3 minutes and 40 seconds (yes, with parallel unpacking; the unpacking was done in Java, not natively).
With the file in its unpacked state (18GB), the Hadoop cluster finished in about 2 minutes 30 seconds (fastest run: 2 minutes 12 seconds), and in that case the disks were loaded at 100%.
And something to think about: the file was compressed beforehand with pbzip2, and processed on a single Intel Xeon CPU W3530 @ 2.80GHz:
```shell
time pbzip2 -d -c -p8 /mnt/hadoop/data_hadoop/testpbzip.xml.bz2 | grep "<title>" | wc
31127663  84114856 1382659030

real    2m44.507s
user    21m28.493s
sys     0m19.833s
```
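pbzip2 can decompress in parallel because it writes the file as many independent bzip2 streams concatenated together, and standard tools still read the concatenation as one file. A small illustration with Python's `bz2` module (my own demo, unrelated to the cluster setup): two chunks are compressed independently, and a single decompress call over the concatenation recovers the original data.

```python
import bz2

# Compress two chunks independently, the way pbzip2's workers
# each produce their own bzip2 stream.
part1 = bz2.compress(b"<title>First</title>\n")
part2 = bz2.compress(b"<title>Second</title>\n")

# Concatenating the streams yields a valid .bz2 file; decompress()
# handles multi-stream input and returns the joined original bytes.
data = bz2.decompress(part1 + part2)
print(data.decode())
```

Each independent stream can also be handed to a separate worker, which is exactly what `pbzip2 -d -p8` exploits.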
I'm not going to draw any firm conclusion... but somewhere on the Internet I read that a Hadoop cluster only starts to show its worth from about 4 nodes... they probably had good reason to say so.