Greetings, dear habra-people (and people in general, since I don't aim to draw distinctions). I've had the idea of writing a big article about Hadoop. And not just an article, but one that walks through a real (or almost real) task that respected readers may well find useful and interesting: statistical analysis of a very large body of data, namely the English Wikipedia (the dump weighs in at roughly 24 GB).
There are several problems here. First, the Wikipedia dump is XML, and working with XML on Hadoop is a questionable pleasure, though manageable once you figure it out. Second, this is not yet truly "big" data, but it is already a significant amount, so you have to start thinking about split sizes, the number of map tasks, and so on. Third, I will most likely describe building the system for a "cloud" cluster that lives somewhere out there and needs neither configuring nor administering; unfortunately, not everyone has access to such systems, so it would be worth starting with how to set up a simple but real cluster of your own.
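To give a taste of what "thinking about split sizes" means in practice, here is a minimal sketch, assuming a reasonably recent Hadoop release with the new mapreduce API; the class name and job name are illustrative, not taken from the upcoming article:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wikipedia-stats");
        // Cap each input split at 128 MB: a ~24 GB dump then yields
        // about 24 * 1024 / 128 ≈ 192 map tasks instead of one per HDFS block.
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        // ... mapper, reducer and input/output paths are set up as usual
    }
}
```

Fewer, larger splits mean fewer map tasks with less scheduling overhead; more, smaller splits mean better parallelism on a big cluster. Tuning that trade-off is exactly the kind of thing the article will have to cover.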
Is there interest in this among readers? The topic is interesting, and I sincerely believe that for tasks that require processing very large volumes of data, distributed (grid) computing is, for several reasons at once, practically the only sensible approach. In my own blog I occasionally post assorted geek observations and thoughts on Hadoop, but writing in a personal blog and writing on Habr are, as you know, two very different things.
So: if you are interested, let me know in the comments, and I will gradually get started.
Update: computing tf-idf has been proposed as the task; in my opinion it is a perfectly good example (and one with plenty of practical applications).
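For reference, a minimal sketch of the usual log-scaled tf-idf definition (the parameter names are illustrative, not from the upcoming article):

```java
public class TfIdf {
    /**
     * tf-idf of a term in one document, given corpus-wide counts:
     * termCount - occurrences of the term in the document
     * docLength - total number of terms in the document
     * docFreq   - number of documents containing the term
     * totalDocs - number of documents in the corpus
     */
    public static double tfIdf(long termCount, long docLength,
                               long docFreq, long totalDocs) {
        double tf  = (double) termCount / docLength;          // term frequency
        double idf = Math.log((double) totalDocs / docFreq);  // inverse document frequency
        return tf * idf;
    }
}
```

The term frequencies and document frequencies are exactly the kind of per-article and corpus-wide counts that map and reduce steps over the Wikipedia dump are good at producing.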
Update2: the article is ready :-) Tomorrow I will proofread it once more and post it.
Update3: sigizmund.habrahabr.ru/blog/74792