Looking through my news feed, I came across a recommendation from Typical Programmer for the article "Implementing a Search Engine with Ranking in Python" by Aakash Japi. It caught my interest, since there is not much material like it on the RuNet, and I decided to translate it. As it is quite long, I will split it into two or three parts. With that, my introduction is done, and I turn to the translation.

Every time I use Quora, I end up seeing at least one question like this: someone asks how Google works and how they could beat it at information retrieval. Most of these questions are not as bold and misguided as that one, but they all express a similar sentiment, and in doing so they betray a significant misunderstanding of how search engines work.
But while Google is incredibly complex, the basic concept of a search engine, one that finds matches and scores (ranks) results relative to a search query, is not particularly difficult, and anyone with basic programming experience can understand it. I don't think it is currently possible to surpass Google at search, but building a search engine is an entirely achievable goal, and in fact it is a rather instructive exercise that I recommend trying.
This is what I will describe in this article: how to build a search engine for local text files that can handle standard queries (at least one of the words in the query appears in the document) and phrase queries (the entire phrase appears in the text), and that can rank results using a basic TF-IDF scheme.
There are two main stages in building a search engine: constructing the index, and then using that index to answer queries. On top of that we can add result ranking (TF-IDF, PageRank, etc.), query/document classification, and perhaps some machine learning to track a user's recent queries and select results based on them, to improve the search engine's quality.
So without further ado, let's get started!
Index building
So the first step in building a text search engine is constructing an inverted index. Let me explain what that is. An inverted index is a data structure that maps tokens to the documents in which they appear. In this context we can simply treat tokens as words, so an inverted index is basically something that takes a word and returns the list of documents in which it occurs.
First, however, we must parse and tokenize (split, word by word) our document set. We do this as follows: for each document we want to add to our index, we strip all punctuation, split the text on whitespace, and build a temporary hash table mapping document names to lists of tokens. We will then transform this hash table several times until we arrive at the final inverted index described above (with one small complication, which I will explain later). Here is the code that does the initial text filtering:
def process_files(self):
    # map each file name to its list of lowercase terms
    file_to_terms = {}
    for file in self.filenames:
        # replace every run of non-word characters with a space, then split
        pattern = re.compile(r'[\W_]+')
        file_to_terms[file] = open(file, 'r').read().lower()
        file_to_terms[file] = pattern.sub(' ', file_to_terms[file])
        file_to_terms[file] = file_to_terms[file].split()
    return file_to_terms
There are two things I have not done here but recommend doing: removing all stop words ("how", "and", "to", etc., which add nothing to a document's relevance) and stemming all words (so that "running" and "runner" both become "run") using an external library (although this will slow down indexing).
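As a rough sketch of what that step might look like: the stop-word set and the naive suffix stripper below are illustrative stand-ins of my own, not the author's code; a real implementation would use a proper stemmer such as NLTK's PorterStemmer.

```python
# illustrative subset only; real stop-word lists are much longer
STOP_WORDS = {'how', 'and', 'to', 'the', 'a', 'of', 'in'}

def normalize(terms):
    """Drop stop words and crudely stem the remaining terms (sketch only)."""
    result = []
    for word in terms:
        if word in STOP_WORDS:
            continue
        # naive suffix stripping; a real stemmer handles far more cases
        for suffix in ('ning', 'ner', 'ing', 'er', 's'):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[:len(word) - len(suffix)]
                break
        result.append(word)
    return result
```

This would be applied to each token list produced by process_files before indexing.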
Now, I said the inverted index maps words to document names, but we also want to support phrase queries: queries not just for words but for words in a specific sequence. To do this, we need to know where each word occurs in each document, so that we can check word order. So, for each word and each document, I store the list of positions at which the word appears in that document, and our final inverted index will look like this:
{word: {documentID: [pos1, pos2, ...]}, ...}
instead of this:
{word: [documentID, ...], ...}
So our first task is to create, for each document, a mapping from words to their positions, and then merge these mappings to build our complete inverted index. It looks like this:
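The listing the author refers to here appears to be missing from this copy. Based on the description that follows, a function along these lines would do it (the name index_one_file is my assumption, not necessarily the author's):

```python
def index_one_file(termlist):
    """Map each word in a document's term list to its list of positions."""
    fileIndex = {}
    for position, word in enumerate(termlist):
        if word in fileIndex:
            fileIndex[word].append(position)
        else:
            fileIndex[word] = [position]
    return fileIndex
```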
This code is fairly straightforward: it takes the document's list of terms, split on spaces (so the words are in their original order), and adds each one to a hash table, where the value is the list of that word's positions in the document. We extend these lists as we walk through the term list, until we have processed every word, leaving us with a table keyed by word strings and mapping to lists of positions of those strings.
Now we need to combine these hash tables. I did this by first creating an intermediate index format,
{documentID: {word: [pos1, pos2, ...]}, ...}
which we then convert to our final index. This is done here:
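This listing is also missing from this copy; here is a sketch consistent with the surrounding description. I have inlined the per-file indexing rather than calling the earlier function, so the block is self-contained (the function name make_indices is my assumption):

```python
def make_indices(file_to_terms):
    """Build {filename: {word: [pos1, pos2, ...]}} from {filename: [terms]}."""
    total = {}
    for filename, terms in file_to_terms.items():
        # per-file index: each word mapped to its positions in this file
        positions = {}
        for pos, word in enumerate(terms):
            positions.setdefault(word, []).append(pos)
        total[filename] = positions
    return total
```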
This code is very simple: it just takes the output of the file_to_terms function and creates a new hash table keyed by file name, whose values are the output of the previous function, producing a nested hash table.
Then, we can actually build our inverted index. Here is the code:
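The listing is again missing from this copy; the sketch below is reconstructed from the step-by-step walkthrough that follows. I have written it as a standalone function taking the intermediate index as a parameter (in the original it presumably reads an attribute on the indexer object):

```python
def full_index(regdex):
    """Invert {filename: {word: [positions]}} into {word: {filename: [positions]}}."""
    total_index = {}
    for filename in regdex:
        for word in regdex[filename]:
            if word in total_index:
                if filename in total_index[word]:
                    # kept only for completeness: each (word, file) pair
                    # has exactly one position list, so this never runs
                    total_index[word][filename].extend(regdex[filename][word])
                else:
                    total_index[word][filename] = regdex[filename][word]
            else:
                total_index[word] = {filename: regdex[filename][word]}
    return total_index
```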
So let's break it down. First, we create an empty hash table (a Python dictionary), and we use two nested loops to iterate over every word in the input hash table. Then we check whether that word is already present as a key in the output hash table. If it is not, we add it, setting as its value another hash table that maps the document (identified, in this case, by the variable filename) to the list of that word's positions.
If it is already a key, we perform another check: whether the current document is already in that word's hash table (the one mapping file names to position lists). If it is, we extend the existing position list with the new one (note that this case is included only for completeness: it will never happen, because each word has only one position list per file). If it is not, we set the position list for this file (filename).
And now we have an index. We can enter a word and get back the list of documents in which it appears. In the next article I will show how to run queries against this index.
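As a small preview of what such a lookup looks like (a minimal sketch of my own; the function name one_word_query is an assumption, and the sample index is fabricated for illustration):

```python
def one_word_query(word, index):
    """Return the documents in which the word appears, per the inverted index."""
    return list(index.get(word, {}).keys())

# a tiny hand-built index in the {word: {filename: [positions]}} format above
index = {'python': {'doc1.txt': [0, 7], 'doc2.txt': [3]}}
print(one_word_query('python', index))  # ['doc1.txt', 'doc2.txt']
```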
All the code used in all the parts (along with the implementation) is available on GitHub.
P.S. That's it for this part. I hope everything is translated clearly and, most importantly, correctly. I welcome comments and advice on both the formatting and the translation.