
Implementing a search engine with ranking in Python (Part 3)

In the previous section, we learned how to execute a query against a built index, so we can now get links to the documents that contain what we asked for. But there is a problem: what we get back is just a list of documents that might contain what we need. It is not sorted by how important the information in each document is to us. That is the problem this part deals with.



Ranking query results



The final step in building the search engine is a system for ranking documents by their relevance to the query. This is the hardest part, because it has no direct technical solution: it requires creativity and your own judgment. Here we implement TF-IDF ranking (TF stands for term frequency, IDF for inverse document frequency), which is one of the simplest ways to order our documents. This part focuses on the theory of TF-IDF rather than on the engine's code, but you can examine the final version of the engine on GitHub. The implementation itself is fairly simple, with most of the work done while the index is being built.



Term frequency is the first part of our ranking system, and it is exactly what it sounds like: the number of times each word occurs in a particular document. Term frequency as a metric does not take the query into account: it treats the document as nothing more than an unordered bag of tokens, assuming you can get an accurate picture of it simply by counting how many times each token (word) occurs. This is not a very accurate assumption, but it is widely used in document classification. Formally, it is known as the "bag of words" model.



One simple representation of the bag-of-words model is vectorization of documents. That is, we decompose a document into a vector of length N, where N is the number of unique terms in the document, and each entry is the number of times that particular term occurs in the document. For multiple documents, a more convenient definition of N is the number of unique terms across all documents in the search space. This lets us represent every document as a vector, and all these vectors have the same dimensions.
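As a quick illustration (this is not code from the final engine), here is a minimal sketch of that vectorization; the whitespace tokenization and the helper names are assumptions made only for this example:

```python
from collections import Counter

def build_vocabulary(documents):
    """Collect the N unique terms across all documents in the search space."""
    return sorted({term for doc in documents for term in doc.split()})

def vectorize(document, vocab):
    """Represent a document as a length-N vector of raw term counts."""
    counts = Counter(document.split())
    return [counts[term] for term in vocab]
```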


But wait! Our model currently has a big flaw. Consider these two documents: "let them eat cake" and "let them eat cake, let them eat cake". They are the same, except that the second is just the first repeated, yet their vector representations are very different: [1, 1, 1, 1] versus [2, 2, 2, 2]. To solve this problem, we turn each vector into a unit vector, "normalizing" it by dividing it by its magnitude (computed by taking the square root of the sum of the squares of each entry in the vector). This turns both of our previous vectors into [½, ½, ½, ½], making them identical, as we intended.
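A sketch of that normalization step (again illustrative only, assuming plain Python lists as vectors):

```python
import math

def normalize(vector):
    """Scale a count vector to unit length by dividing it by its magnitude."""
    magnitude = math.sqrt(sum(x * x for x in vector))
    return vector if magnitude == 0 else [x / magnitude for x in vector]

normalize([1, 1, 1, 1])  # [0.5, 0.5, 0.5, 0.5]
normalize([2, 2, 2, 2])  # [0.5, 0.5, 0.5, 0.5] -- identical after normalization
```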



However, this is still not enough. Term frequency has a fatal flaw (a hamartia, for the fans of Greek tragedy). The flaw is that the bag of words treats every term as equally important when representing a document. But that is not the case: the word "and" tells us far less about a document than the words "Shakespeare" or "chlamydia". Yet the word "and" occurs far more often than the word "chlamydia" (at least in the documents I read), which creates a false similarity between documents (since almost all of them contain the word "and").



To handle this, we add something else to our ranking system: inverse document frequency. For a term t, define its document frequency D_t as the number of indexed documents in which t occurs. Then, if we have K documents, the inverse document frequency of the term, IDF_t, is the ratio K / D_t: the total number of documents divided by the number of documents in which the term occurs.



There is one final point: this corrects too much. If we have 10 documents (K = 10), and one word occurs in all 10 of them while another occurs in only 1, the second word comes out as 10 times more important than the first. This linear behaviour is too drastic and will artificially lower similarity by over-correcting. To fix it, we simply wrap the function in a natural logarithm, which "flattens" it and makes the correction smoother. Our function now looks like this: in a collection of K documents, for a term t, IDF_t = ln(K / D_t), where D_t is the document frequency of t and ln is the natural logarithm.
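In code the formula could look like the sketch below; it assumes each document is already a collection of tokens, and relies on math.log being the natural logarithm:

```python
import math

def inverse_document_frequency(term, documents):
    """IDF_t = ln(K / D_t): K documents in total, D_t of them contain the term."""
    k = len(documents)
    d_t = sum(1 for doc in documents if term in doc)
    return math.log(k / d_t) if d_t else 0.0  # term absent from the collection
```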



Implementation note: as you can see, neither of these values depends on the query, and both of them can (and should) be computed in advance for every token (word, term) and document!



Now, to combine term frequency (within a document) and inverse document frequency into a single metric, we simply multiply them. That is, the weight of each token (word, term) in our document collection is defined as TF_t × IDF_t: the frequency with which the term occurs in the document times the inverse document frequency of the term.



Now, if we have a collection of K documents, and these documents contain N unique terms in total, our documents are represented as vectors, each of length N, where the value of each entry (which corresponds to a term) equals the frequency with which the term occurs in the document multiplied by the inverse document frequency of that term in the collection. Each vector has a value of 0 for terms that do not occur in the document (remember that our vectors represent all unique terms in the entire collection). The inverse document frequency, on the other hand, is never 0, because it is a collection-level metric.
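Putting TF and IDF together, here is a sketch of how these document vectors could be precomputed. Raw counts stand in for TF here; the unit-length normalization from earlier is skipped because cosine similarity, introduced below, divides by the vector magnitudes anyway. The function name and the token-list input format are assumptions for the example:

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Build one length-N vector per document, with TF * IDF as each entry.
    `documents` is assumed to be a list of token lists."""
    vocab = sorted({t for doc in documents for t in doc})
    k = len(documents)
    idf = {t: math.log(k / sum(1 for doc in documents if t in doc)) for t in vocab}
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vocab, idf, vectors
```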



Now we do the same for the query: we represent it as a vector in N-dimensional space, just like the documents, and compute the TF-IDF for each term in the query. Obviously, it will be much sparser than the document vectors.



A small digression. We want to compute a similarity score between the query and each document in its result set, and rank the documents in the result set by that score. There are many different approaches, but here I will use what is called cosine similarity (also known as the Ochiai coefficient). It essentially takes the dot product of the query vector and a document vector from the result set and divides it by the product of the magnitudes of the two vectors, which gives the cosine of the angle between them (I might explain the formula poorly "in words", so if you are interested, read the Wikipedia article and this question on StackOverflow for clarification). This is a particularly useful measure because it ignores the magnitudes of the two vectors when computing similarity (unlike, say, Euclidean distance), which matters a lot when one vector is very sparse (our query) and the other is much less sparse (our document).
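A minimal sketch of cosine similarity over plain Python lists (again, not the engine's own code):

```python
import math

def cosine_similarity(a, b):
    """Dot product of two vectors divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)
```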



So, to rank the results, this is what we will do (a short sketch of the whole pipeline follows the list):

  1. First, compute the TF and IDF values for every term in advance, and build a vector of length N for every document, using the product TF × IDF as each entry.
  2. Then, execute the query and get the set of relevant documents (using the technique described earlier).
  3. After that, compute the query vector, which also has length N and uses the product TF × IDF as each of its entries.
  4. Then, compute the similarity between the query and every document in the result set (using cosine similarity), obtaining a score for each document.
  5. Sort the documents by this score.
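Here is a sketch of steps 3-5 wired together, reusing the cosine_similarity and tfidf_vectors helpers sketched above. The parameter names (in particular result_ids, mapping a matching document id to the index of its precomputed vector) are illustrative, not the actual interface of the engine:

```python
from collections import Counter

def rank_results(query_tokens, result_ids, vocab, idf, doc_vectors):
    """Score each document in the result set against the query and sort by score."""
    counts = Counter(query_tokens)
    query_vector = [counts[t] * idf.get(t, 0.0) for t in vocab]
    scored = [(doc_id, cosine_similarity(query_vector, doc_vectors[i]))
              for doc_id, i in result_ids.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)  # best match first
```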


And boom, we're done!



Conclusion



Building a search engine that can scale to Google's size is incredibly difficult. But building a simple one for personal use (or even as a proof of concept) is not that hard at all. In fact, the way search engines build indexes, rank documents, and execute queries is quite understandable, and building one yourself is an exercise worth doing.



The bag-of-words model we used here shows up everywhere. Another great project besides a search engine is a spam filter, a document classifier, an article recommender, or something along those lines. It is a very cool concept, and it is worth thinking about how you would use it to build any of the above (or something cooler).



Anyway, that's the end of the article. If you have feedback or questions, you can leave a comment on the original article or write to the author by email / Facebook / whatever newfangled social network your kids are using these days.

Source: https://habr.com/ru/post/263983/


