Yesterday I asked a question in my HabraBlog: are people interested in learning what Hadoop is in terms of its real use? The answer turned out to be yes. Writing did not take long; I put the article together fairly quickly (at least its first part), not least because I have known for a long time what I wanted to write about (I still remember quite well how I fumbled around looking for information when I was starting out with Hadoop). The first article focuses on the basics, though not the ones that are usually talked about :-)
Before reading the article, I strongly recommend studying at least the first and last sources in the reading list: understanding them, or even just reading them, practically guarantees that the article will be understood without problems. Shall we begin?
What is Hadoop?

Tell me, what is the point of writing about this at all? It has been discussed more than once; posts about Hadoop, HDFS and the rest have been started repeatedly. Unfortunately, they usually ended with a rather lengthy introduction and the phrase "To be continued". So: this is the continuation. To some, the topic touched upon in this article may seem completely trivial and uninteresting, but the first step is the hardest: any complex task has to be solved piece by piece. That is exactly the principle we will put into practice over the course of the article. I will also try to avoid writing Hadoop code in this particular article; the code can wait, and the principles behind programs that work with Map/Reduce can be understood on toy examples (besides, at the current pace of drastic changes to the Hadoop API, any code becomes obsolete about a month later).
When I started working with Hadoop, getting an initial grasp of the Map/Reduce ideology was quite difficult for me personally (I deliberately write the phrase this way to emphasize that this is not about the product, but about the principle). The essence and value of the method will become clear at the very end, after we solve a simple task. So, suppose we have a corpus of text documents and we want to compute:
- The number of words in the corpus
- The number of terms in the corpus (by a term I will hereinafter mean a unique word token)
- The number of documents in the corpus
- How many times each term occurs in each document
- How many documents contain each term
Let's face it: the task is simple and can be solved head-on without much trouble; I do not think it would take you more than half an hour to write such a program. Everything becomes more complicated when the texts grow larger, the number of words grows without limit, the number of terms approaches the number of words in the language, and the dictionaries no longer fit in memory. It is time to remember that the first principle of not only developing complex software but of solving problems in general was formulated a very long time ago: "divide and conquer". So, we will divide the task into parts.
To begin with, let us make an assumption (for simplicity; later we will look at how to solve this problem in the more general case) that our input data comes as a text file with a very simple structure:
<w11> <w12> <w13> ... <w1N>
<w21> <w22> <w23> ... <w2M>
...
In other words, each document is a set of words that may (and quite likely do) repeat, and each such set of words occupies one line of the text file. The assumption is a modest one: almost any document can be reduced to this form.
Counting the number of words in the corpus is a simple task, solved in linear time, and we may not even have to solve it separately. The same applies to the number of terms; they practically count themselves. The most interesting task at this stage is counting how many times each term occurs in the corpus.
Yes, yes, you are not mistaken: this is that most popular and thoroughly worn-out task, the first example in every Hadoop tutorial, WordCount. It is as popular as it is simple, so I will not even reproduce it here; you can look it up in the official Hadoop tutorial. In short, at the map step the program forms the following pairs:
map1:
<term1> 1
<term2> 1
<term3> 1
map2:
<term2> 1
<term3> 1
<term3> 1
map3:
<term1> 1
...
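For those who still prefer to see the idea in code, here is a minimal sketch of the map step in plain Python (deliberately not tied to the Hadoop API, which I am avoiding here; the function name and the (term, 1) tuples are illustrative assumptions, since in a real job the pairs would be emitted to the framework rather than returned):

def word_count_map(line):
    # Map step of WordCount: for one document (one line of the input file)
    # emit a (term, 1) pair for every token it contains.
    for token in line.split():
        yield (token, 1)

# Example: list(word_count_map("the cat sat on the mat"))
# -> [('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1)]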
At the reduce step, each reduce task receives a key (that is, <term1>, <term2>, and so on) and the list of all values associated with that key, collected from all map tasks (in this case, simply a list of ones). It will look like this:
<term1> 1,1
<term2> 1,1
<term3> 1,1,1

Summing up these ones (which in fact simply means counting the number of elements in the list), we get the number of occurrences of each term in the corpus:
the 19283
to 3432
from 343
...
...
london 14
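A matching sketch of the reduce step, again in plain Python rather than the Hadoop API (the grouping of values by key, which the framework normally performs between map and reduce, is assumed to have happened already):

def word_count_reduce(term, values):
    # Reduce step of WordCount: we are handed one key (a term) and the list
    # of all values emitted for it by the map tasks; summing the ones gives
    # the number of occurrences of the term in the corpus.
    return (term, sum(values))

# Example: word_count_reduce("term3", [1, 1, 1]) -> ('term3', 3)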
This is already something, although the value of this data is not yet obvious. However, simply counting the number of lines in the resulting file gives us the number of unique terms, and summing all the values in the second column gives us the total number of tokens. It seems we have hardly done anything, and yet we have already obtained several fundamental characteristics of the corpus.
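As a small sketch of that observation (the file name is an assumption, and the format is one "term count" pair per line, as in the listing above):

num_terms = 0    # number of unique terms = number of lines in the output
num_tokens = 0   # total number of tokens = sum of the second column

with open("wordcount_output.txt") as f:   # hypothetical WordCount output file
    for line in f:
        term, count = line.split()
        num_terms += 1
        num_tokens += int(count)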
Next comes classic information retrieval. First, based on the results of WordCount, we build a dictionary, that is, the overall list of corpus terms. Our next task is to determine how often, and exactly which, dictionary terms occur in each of the documents. To do this, we implement a slightly modified version of WordCount that counts terms with respect to a specific document. Probably the easiest way to achieve this is to use a key in the map output that consists of the document identifier (the mapper's input key) and the term:
map1:
1_the 1
1_the 1
1_the 1
1_to 1
...
map2:
2_the 1
2_the 1
2_from 1
...
map3:
37_london 1
...
The reduce for this task is identical to the classic WordCount: it simply sums the values with the same key. As a result, we get:
1_the 3
1_to 1
...
2_the 2
2_from 1
...
37_london 1
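A sketch of the modified map step with the composite "docid_term" key (the document ids and the underscore separator follow the listings above; the reduce step stays exactly the same as in WordCount):

def per_document_map(doc_id, line):
    # Emit (doc_id + "_" + term, 1) so that the same term in different
    # documents ends up under different keys at the reduce step.
    for token in line.split():
        yield ("%s_%s" % (doc_id, token), 1)

# Example: list(per_document_map(37, "london is a big city"))
# -> [('37_london', 1), ('37_is', 1), ('37_a', 1), ('37_big', 1), ('37_city', 1)]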
So what did we get? We got the so-called frequency of terms, much better known as term frequency, abbreviated tf(t, d) (here t and d indicate that the value is computed for a specific term and a specific document). For example, in an article about London the tf value for the word london will probably be higher than in an article about pig breeding (and there it may well be zero; zero frequency is also a frequency). It should probably be noted that what we have obtained is the unnormalized version of this characteristic; to normalize it, the obtained values should be divided by the total number of tokens in the corpus.
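A sketch of that normalization (following the text, each raw count is divided by the total number of tokens in the corpus; the function and variable names are assumptions):

def normalize_tf(per_document_counts, total_tokens):
    # Turn raw per-document term counts into normalized tf values by
    # dividing each count by the total number of tokens in the corpus.
    return {key: count / float(total_tokens)
            for key, count in per_document_counts.items()}

# Example:
# normalize_tf({"1_the": 3, "1_to": 1, "37_london": 1}, total_tokens=5)
# -> {'1_the': 0.6, '1_to': 0.2, '37_london': 0.2}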
As part of our example, we have developed an algorithm for computing one of the most popular statistical characteristics in information retrieval. The value of this approach is that it extends to a corpus of almost any size: it can be computed even on a single machine, or easily parallelized onto a cluster of one and a half to two thousand nodes. So here is the answer to the question formulated at the very beginning of the article: the Map/Reduce ideology lets us break a computationally heavy task into small blocks that can be computed separately, and then combine the results. These blocks can be computed in parallel or sequentially, and it makes no difference: the point is that we have turned one extremely resource-intensive task into a large number of tasks, each of which can be solved on your home computer.
Perhaps this is where I should say the sacred phrase: "To be continued." In the next post we will look at computing the second part of tf-idf, namely the inverse document frequency, after which we will move smoothly on to solving this problem for a really large (no one will find it small) data set.
P.S. A small note that seemed important to me: when writing the Russian version of this article (it was originally written almost in parallel in two languages), I tried to write as much as possible in Russian, but I did not translate many established terms (such as Map/Reduce) and did not even try to translate the names of the phases of this process; hence the map-step and reduce-step. Unfortunately, Russian terminology in this field is not yet fully settled, but the great and mighty Russian language is flexible enough that any student can decline the word "task" through its cases, to say nothing of the programmers who are the target audience of this post.
If anything seemed unclear, please write. When you have been working in a certain area for a long time, your brain gets somewhat "washed out" and you start taking many things for granted. If that has happened anywhere here, write, and I will fix it.
_________________________________________________________________
References for home reading:
- Yahoo! Hadoop Tutorial - I recommend reading it first: there is simply no better documentation at the moment, the official website included.
- Hadoop QuickStart Guide
- Hadoop Map / Reduce Tutorial
- Hadoop and Distributed Computing at Yahoo!
- Term frequency-inverse document frequency - Wikipedia article.
The original photo was published under a Creative Commons license: www.flickr.com/photos/antichrist/3427853501
Update: since the UFO has temporarily disabled the ability to create new blogs, this is published under "Algorithms"; after all, Hadoop is far from the only implementation of Map/Reduce, and there is no Hadoop-specific code here. When the UFO shows mercy, I will create a Hadoop blog and move this post there along with the new articles that are being written.
Update 2: Remember I said "to be continued"? Well, here it is, the continuation itself. Read and comment!