In the previous section we built an index, but we still cannot execute queries against it. That is what this part of the article is about.

Querying the Index
There are two types of queries we want to handle: standard (free-text) queries, which match a document if at least one of the query words appears in it, and phrase queries, which match a document only if all of the query words appear in it in the same order.
Before we begin, though, the query should be processed the same way we processed documents when building the index: lowercase all the letters and strip the punctuation. I will not go into this in detail, since it is trivial, but it must be done before executing the query; a small sketch follows below.
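For reference, a minimal normalization helper might look something like this (the function name is my own; the regular expression just has to match whatever you used when building the index):

import re

def normalize(text):
    # Replace punctuation and underscores with spaces, then lowercase,
    # so query terms match the terms stored in the inverted index.
    text = re.sub(r'[\W_]+', ' ', text)
    return text.lower()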
Note: in all the code examples below, every function uses a variable named 'invertedIndex', which was built in the previous part of the article. For a full picture of what is going on, you can look at the final result on GitHub.

We are going to implement standard queries first. A simple way to do this is to split the query into words (tokens, as described above), get for each word the list of documents it occurs in, and then merge all of these lists. Here is how we execute a query for a single word:
def one_word_query(self, word):
    # Normalize the word the same way the documents were normalized.
    pattern = re.compile(r'[\W_]+')
    word = pattern.sub(' ', word)
    if word in self.invertedIndex.keys():
        # The keys of the per-word hash table are the filenames the word occurs in.
        return [filename for filename in self.invertedIndex[word].keys()]
    else:
        return []
This code is pretty simple. All we do here is clean up the input with a regular expression and return the list of keys of the hash table stored under that word in the index, which is the list of file names the word occurs in.
The standard query is now a very simple extension: we run the one-word query for every word and merge the resulting lists, as shown here:
def free_text_query(self, string):
    # Normalize the query string.
    pattern = re.compile(r'[\W_]+')
    string = pattern.sub(' ', string)
    result = []
    for word in string.split():
        # Merge (union) the per-word result lists.
        result += self.one_word_query(word)
    return list(set(result))
If you want a query that only returns documents containing every word of the query, you should intersect the per-word result lists instead of merging them. This is rather trivial to do, and I will leave it as an exercise for the reader; one possible sketch follows below.
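For completeness, here is a minimal sketch of such an "every word must appear" query (the method name all_words_query is my own; it reuses one_word_query and simply intersects the per-word results instead of merging them):

def all_words_query(self, string):
    # Normalize the query the same way as elsewhere.
    pattern = re.compile(r'[\W_]+')
    string = pattern.sub(' ', string)
    # One result set per query word.
    results = [set(self.one_word_query(word)) for word in string.split()]
    if not results:
        return []
    # Keep only documents that appear in every per-word result set.
    return list(set.intersection(*results))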
The last type is the phrase query, which is a bit more complicated, since we must also check that the words appear in the documents in the right order. Here is the code for it (I will explain it below):
def phrase_query(self, string):
    # Normalize the query string.
    pattern = re.compile(r'[\W_]+')
    string = pattern.sub(' ', string)
    listOfLists, result = [], []
    for word in string.split():
        listOfLists.append(self.one_word_query(word))
    # Documents that contain every word of the query.
    setted = set(listOfLists[0]).intersection(*listOfLists)
    for filename in setted:
        temp = []
        for word in string.split():
            # Copy the position list of each query word in this document.
            temp.append(self.invertedIndex[word][filename][:])
        # Shift the positions of the i-th word back by i.
        for i in range(len(temp)):
            for ind in range(len(temp[i])):
                temp[i][ind] -= i
        # If the shifted lists share a position, the words appear consecutively.
        if set(temp[0]).intersection(*temp):
            result.append(filename)
    return self.rankResults(result, string)
Again, we first normalize the input query. Then we run the one-word query for each word of the input and append each result list to our list of lists. Next we build a set named 'setted' by intersecting the first list with all the others (which is simply the intersection of all the lists); this leaves us with an intermediate result: the set of all documents that contain every word of the query.
Now we have to check this intermediate result. For each file in it, we first build a list containing the position list of each query word in that file. Then (attention!) we use two nested for loops to walk through this list of lists: from every position in the i-th list we subtract i, where i is incremented by 1 as we move from one position list to the next. Remember that Python lists preserve order, so this list of lists holds the position lists of the query words in the order in which the words appear in the query. Therefore, if the words occur in the document consecutively and in the right order, then after subtracting i from each position in the i-th list, the intersection of all the modified position lists will contain at least one element, and the file contains the phrase.
Let me give an example:
Suppose the query phrase is “cake is a lie”. Now suppose that, for a specific file, the position lists of the query words are:

[19, 35, 12]
[179, 36, 197]
[221, 37, 912]
Our list of position lists then looks like this:
[[19, 35, 12], [179, 36, 197], [221, 37, 912]]
Now we subtract i from each element of each list, where i is 0 for the first list, 1 for the second list, 2 for the third, and so on:

[[19, 35, 12], [178, 35, 196], [219, 35, 910]]
Now we take the intersection of all these lists; the only value that survives is 35, so the intersection is non-empty and we conclude that the phrase “cake is a lie” occurs in this file. And this is correct: looking at the original position lists, the sequence 35, 36, 37 is exactly where our phrase sits.
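If you want to check this arithmetic yourself, a couple of standalone lines of Python will do it (this snippet is only an illustration, not part of the index class):

positions = [[19, 35, 12], [179, 36, 197], [221, 37, 912]]
# Subtract the list index i from every position in the i-th list.
shifted = [[p - i for p in lst] for i, lst in enumerate(positions)]
print(shifted)                                 # [[19, 35, 12], [178, 35, 196], [219, 35, 910]]
print(set(shifted[0]).intersection(*shifted))  # {35}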
There are many more query options that you could implement yourself (look at Google’s advanced search for inspiration). You can try to add some of them to your search engine.
Our final step is to implement a query parser that lets us combine different types of queries into a single result set. For example, in Google you can enter something like cake "is a lie", which combines a standard query (for the whole string) with a phrase query ("is a lie"). This is also very easy to do: use delimiters (for example, quotation marks) to mark the type of each sub-query, run each of the smaller queries separately, and then intersect all of their result sets to get the final list of documents. A possible sketch follows below.
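As an illustration, a very small parser along these lines might look like this (the method name parse_query and the exact splitting logic are my own assumptions; text inside double quotes is treated as a phrase query, everything else as one standard query, and the partial results are intersected):

def parse_query(self, string):
    # Splitting on double quotes puts the quoted text at the odd indices.
    chunks = string.split('"')
    result_sets = []
    free_text_words = []
    for i, chunk in enumerate(chunks):
        chunk = chunk.strip()
        if not chunk:
            continue
        if i % 2 == 1:
            # Quoted chunk: the exact phrase must occur.
            result_sets.append(set(self.phrase_query(chunk)))
        else:
            free_text_words.append(chunk)
    if free_text_words:
        result_sets.append(set(self.free_text_query(' '.join(free_text_words))))
    if not result_sets:
        return []
    # A document must satisfy every sub-query to appear in the final list.
    return list(set.intersection(*result_sets))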
In the next, final part I will talk about ranking the results and wrap up the series.