Want to implement or improve search in your product? Then this article is for you.

Ask a developer, "How would you implement search in your product?" or "How would you build a search engine?" and you will probably hear something like: "Well, we'll just spin up an Elasticsearch cluster; search is easy these days."
But is that really so? Many modern products still have mediocre search. Any search specialist will tell you that few developers have a deep understanding of how search works, and that understanding is often exactly what is needed to improve search quality.
There are plenty of open-source packages and a large body of research, yet only a select few know how to build functional search. Ironically, if you search the web for information on implementing search, you will not find many relevant or meaningful overviews.
Purpose of the article
This text is a collection of ideas and resources that can help you build a search function. The article certainly does not claim to be exhaustive, but I hope your feedback will help refine it (leave a comment or contact me directly).
Drawing on my experience with both general-purpose solutions and highly specialized projects of various scales (at Google, Airbnb and several startups), I will cover some popular approaches, algorithms, techniques and tools.
Underestimating or misunderstanding the scope and complexity of the search problem can lead to a poor user experience, wasted developer time and product failure.
If you cannot wait to get to the practical part, or you already know a lot about this topic, feel free to skip straight to the tools and services section.
Some general thoughts
The article is long, but most of the material rests on four basic principles:
Search is a highly heterogeneous problem:
- Queries vary widely, and the search task depends heavily on the needs of the product.
- On Facebook it is search over a graph of people.
- On YouTube it is search over individual videos.
- Both are very different from search on Kayak, where planning air travel is a task in its own right.
- Google Maps needs to "understand" geospatial data.
- On Pinterest you might look for a photo of a breakfast you could cook someday.
Quality, metrics and the organization of work matter a great deal:
- There is no silver bullet here (not even PageRank) and no magic ranking formula that makes everything work. Query processing is an ever-evolving set of methods and processes that address different aspects of the problem and improve the system, usually gradually and continuously.
- In other words, search is not just building software that does ranking or retrieval (we will discuss both later) over a particular data set. Search engines are usually an evolving pipeline of components that are tuned and change over time and that together form a coherent system.
- In particular, the key to building successful search is to embed evaluation and tuning processes into the product itself and into the development cycle. A search architect should think about processes and metrics, not just about technologies.
Use existing technologies first:
- As with most engineering problems, there is no need to reinvent the wheel: use existing services or open-source tools whenever possible. If a SaaS offering (for example, Algolia or a managed Elasticsearch) satisfies your constraints and fits the budget, use it. At an early stage such a solution will most likely be the best option, even if you later have to tweak it, extend it, or even replace it.
Know what you are buying:
- Even if you use an existing open-source or commercial solution, you should have some sense of the complexity of the search problem and of where the pitfalls lie.
Theory: the search problem
Search is different for every product, and the choice of solution depends on many technical details and requirements. It helps to identify the key parameters of your particular search problem:
- Size: how large is the corpus (the complete set of documents to be searched)? Thousands, millions, billions of documents?
- Media: will you search over text, images, graph relations or geospatial data?
- Corpus control and quality: do you control the document sources, or do they come from a (possibly competing) third party? Are the documents ready for indexing, or do they need to be cleaned and filtered?
- Indexing speed: do you need real-time indexing, or is batch indexing enough?
- Query language: will the queries be structured, or do you need to handle unstructured ones as well?
- Query structure: will queries be text, images, sounds? Postal addresses, record identifiers, people's faces?
- Context dependence: do the results depend on who the user is, their history with the product, their location, the time of day, and so on?
- Suggestions: do you need to support autocomplete for incomplete queries?
- Latency: what are the latency requirements? 100 milliseconds or 100 seconds?
- Access control: is the service completely open, or should each user see only a restricted subset of documents?
- Compliance: are there legal or organizational constraints?
- Internationalization: do you need to support documents in multiple encodings or Unicode? (Hint: always use UTF-8; if you don't, you should know exactly what you are doing and why.) Will you need a multilingual corpus? Multilingual queries?
Thinking through these questions in advance will help you make important design choices when building the individual components of your search engine.
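One way to act on these questions is to write the answers down explicitly before committing to an architecture. Below is a minimal, purely illustrative sketch of such a "search requirements" record; none of the field names or values are prescribed anywhere, they are assumptions made up for the example.

```python
# A minimal sketch of recording the answers to the questions above as an
# explicit requirements spec. Every field and value is an illustrative
# assumption for a hypothetical product, not a standard schema.

from dataclasses import dataclass, field

@dataclass
class SearchRequirements:
    corpus_size: int                 # expected number of documents
    media: list                      # e.g. ["text", "images"]
    realtime_indexing: bool          # real-time vs. batch indexing
    structured_queries: bool         # structured vs. free-text queries
    personalized: bool               # do results depend on the user/context?
    autocomplete: bool               # support for incomplete queries
    latency_budget_ms: int           # target latency for most queries
    access_control: bool             # per-user visibility of documents
    languages: list = field(default_factory=lambda: ["en"])

requirements = SearchRequirements(
    corpus_size=5_000_000,
    media=["text"],
    realtime_indexing=False,
    structured_queries=False,
    personalized=True,
    autocomplete=True,
    latency_budget_ms=200,
    access_control=False,
)
print(requirements)
```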
An indexing pipeline at work.
Theory: the search pipeline
Now let's walk through the subtasks of building a search engine. They are usually handled by separate subsystems that form a pipeline: each subsystem consumes the output of the previous ones and produces input for the next.
This points to an important property of the whole ecosystem: when you change how one subsystem works, you have to evaluate how the change affects the subsystems downstream and possibly adjust them as well.
Here are the most important practical problems you will have to solve.
Index selection
We take a set of documents (for example, the whole web, all Twitter posts or all photos on Instagram), select a potentially smaller subset that is worth considering as search results, include only those documents in the index and discard the rest. This is done almost independently of choosing which documents to show for a given query; the point is to keep the index compact. The following classes of documents, for example, may not belong in the index.
Spam
Search spam comes in many shapes and sizes and is a big topic in its own right, worthy of a separate guide.
Here is a good overview of the taxonomy of web spam.
Unwanted documents
Depending on the scope of the search, some filtering may be required: you may have to drop pornography, illegal materials and so on. The techniques are similar to spam filtering but may also involve special heuristics.
Duplicates
This includes near-duplicates and redundant documents. Locality-sensitive hashing, similarity measures, clustering techniques and even click data can all help here; a minimal example follows.
Here is a good overview of such methods.
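A minimal sketch of near-duplicate detection using word shingles and Jaccard similarity. Real systems typically add MinHash or another form of locality-sensitive hashing so they do not have to compare every pair of documents; the shingle size and threshold here are illustrative assumptions.

```python
# Near-duplicate detection via word shingles and Jaccard similarity.
# Thresholds and shingle size are illustrative.

def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.7) -> bool:
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold

print(is_near_duplicate(
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
))  # True: the two documents share most of their shingles
```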
Useless documents
What counts as useful depends heavily on the search domain, so it is hard to recommend a specific approach. A few considerations: it may be possible to build a utility function for documents; simple heuristics can work (for example, an image consisting only of black pixels is a useless document); and usefulness can also be estimated from user behavior.
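As a toy illustration of the "all-black image" heuristic mentioned above; the pixel representation and the threshold are assumptions made for the example.

```python
# Flag an image that consists only of (near-)black pixels as useless.

import numpy as np

def is_all_black(pixels: np.ndarray, threshold: int = 8) -> bool:
    """pixels: an H x W x 3 uint8 array; True if every channel value is near zero."""
    return bool((pixels < threshold).all())

black = np.zeros((4, 4, 3), dtype=np.uint8)
gray = np.full((4, 4, 3), 128, dtype=np.uint8)
print(is_all_black(black), is_all_black(gray))   # True False
```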
Index building
Most search engines retrieve documents through an inverted index, often called simply the index.
- The index is a mapping from search terms to documents. A term can be a word, an image feature or any other derivative of a document that is suitable for matching queries to documents. The list of documents for a given term is called its posting list; it can be sorted by some score, for example by document quality. (A minimal sketch of such an index follows this list.)
- Figure out whether you need to index data in real time. Many companies with large corpora start with batch indexing and later discover that it does not meet the needs of the product: users expect search results to be fresh.
- For text documents, term extraction usually relies on natural language processing (NLP) techniques such as stop-word lists, stemming and entity extraction; for images and video, computer vision methods are used, and so on.
- In addition, documents yield metadata and statistical information: links to other documents (used in the well-known PageRank signal), topics, term frequencies, document size, mentioned entities, etc. This information can later feed ranking signals or document clustering. Larger systems often maintain several indices, for example for different document types.
- Index formats. The physical structure and layout of the index is a complex topic in itself, because the index can be optimized in many ways: posting lists can be compressed, the data can be memory-mapped via mmap(), and an LSM tree can back a constantly updated index.
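Here is a minimal sketch of the term-to-documents mapping described above: a toy in-memory inverted index with a simple tokenizer and stop-word list. Real indexes also store positions and frequencies, are compressed and live on disk; everything here is illustrative.

```python
# A toy inverted index: term -> sorted posting list of document IDs.

from collections import defaultdict
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}

def tokenize(text: str) -> list:
    """Lowercase, split on non-alphanumerics and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

def build_index(docs: dict) -> dict:
    """docs: {doc_id: text}. Returns {term: sorted list of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "The quick brown fox",
    2: "A lazy brown dog",
    3: "Fox and dog play in the park",
}
index = build_index(docs)
print(index["brown"])  # [1, 2]
print(index["dog"])    # [2, 3]
```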
Query analysis and document retrieval
Most popular search engines accept unstructured queries, which means the system has to extract structure from the query itself. For an inverted index, that means extracting the search terms with NLP techniques.
The extracted terms can then be used to retrieve the relevant documents. Unfortunately, most queries are not formulated very well, so they usually need to be expanded and rewritten as well, for example by correcting spelling or adding synonyms. A minimal sketch of this step follows.
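A minimal sketch of query analysis and retrieval against a toy inverted index: the query is tokenized, rewritten with a tiny synonym table standing in for real query expansion, and the posting lists of the resulting terms are intersected (AND semantics). The index contents and the synonym table are illustrative assumptions.

```python
# Toy query analysis and retrieval over an inverted index.

INDEX = {                      # term -> sorted posting list of doc ids
    "brown": [1, 2],
    "fox":   [1, 3],
    "dog":   [2, 3],
    "park":  [3],
}
SYNONYMS = {"puppy": "dog", "hound": "dog"}

def analyze_query(query: str) -> list:
    """Lowercase, split and apply crude synonym-based rewriting."""
    return [SYNONYMS.get(t, t) for t in query.lower().split()]

def retrieve(query: str) -> list:
    terms = analyze_query(query)
    postings = [set(INDEX.get(t, [])) for t in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))   # docs containing all terms

print(retrieve("brown dog"))   # [2]
print(retrieve("puppy"))       # [2, 3] thanks to synonym expansion
```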
Ranking
Given the list of documents from the previous step, their signals and the processed query, ranking determines the optimal order of those documents.
Originally, most ranking models were hand-tuned weighted combinations of all the document signals. Signal sets can include PageRank, click data, topical relevance and more.
To complicate things further, many of these signals, such as PageRank or signals produced by statistical language models, themselves contain parameters that significantly affect their behavior and that also require manual tuning.
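To make the hand-weighted approach concrete, here is a minimal sketch of a linear scoring function over a few made-up signals. The signal names and weights are illustrative assumptions; in a real system they would be tuned against the quality metrics discussed later.

```python
# A hand-weighted ranking function: the score of a document is a
# weighted sum of its precomputed signals.

WEIGHTS = {"pagerank": 3.0, "term_match": 5.0, "click_rate": 2.0}

def score(signals: dict) -> float:
    """Weighted linear combination of document signals."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

def rank(candidates: dict) -> list:
    """candidates: {doc_id: signals}. Returns doc_ids sorted best first."""
    return sorted(candidates, key=lambda d: score(candidates[d]), reverse=True)

candidates = {
    1: {"pagerank": 0.6, "term_match": 0.4, "click_rate": 0.1},
    2: {"pagerank": 0.2, "term_match": 0.9, "click_rate": 0.3},
}
print(rank(candidates))  # [2, 1]: doc 2 wins on term match and clicks
```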
In recent years, learning to rank (LtR) has been gaining popularity: supervised machine-learning approaches that learn a ranking function from these signals. Popular LtR methods include McRank and LambdaRank from Microsoft, as well as MatrixNet from Yandex.
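As an illustration of the pairwise flavor of LtR (in the spirit of RankNet, a relative of LambdaRank, rather than any of the specific systems named above), here is a minimal sketch that learns a linear scoring function from preference pairs. The feature vectors and training pairs are made up for the example.

```python
# Pairwise learning to rank: learn a weight vector w so that
# score(x) = w . x orders preferred documents above less-preferred ones.

import numpy as np

def train_pairwise(pairs, n_features, lr=0.1, epochs=200):
    """pairs: list of (better_features, worse_features) numpy vectors."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for better, worse in pairs:
            diff = better - worse
            # probability that 'better' outranks 'worse' under current w
            p = 1.0 / (1.0 + np.exp(-w @ diff))
            # gradient step on the pairwise logistic loss
            w += lr * (1.0 - p) * diff
    return w

# Two features per document, e.g. (term match, pagerank).
pairs = [
    (np.array([0.9, 0.2]), np.array([0.3, 0.1])),
    (np.array([0.7, 0.8]), np.array([0.6, 0.1])),
]
w = train_pairwise(pairs, n_features=2)
print(w, w @ np.array([0.9, 0.2]) > w @ np.array([0.3, 0.1]))  # learned weights, True
```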
In semantic search and ranking, an approach based on vector spaces has also been gaining popularity. The idea is to learn low-dimensional vector representations (embeddings) of documents and then build a model that maps queries into the same vector space.
Retrieval then reduces to finding the documents whose vectors are closest to the query vector under some metric (for example, Euclidean distance), and that distance serves as the rank. If the mapping of both documents and queries is constructed well, documents are selected not because they contain some simple pattern (such as a shared word) but because they are close to the query in meaning.
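A minimal sketch of the vector-space approach. The embed() function here is a toy bag-of-words projection standing in for a learned embedding model; the vocabulary and documents are illustrative assumptions.

```python
# Vector-space retrieval: embed documents and the query into the same
# space, then return the documents closest to the query by Euclidean distance.

import numpy as np

VOCAB = ["fox", "dog", "park", "brown", "play"]
DOCS = {
    1: "the quick brown fox",
    2: "a lazy brown dog",
    3: "fox and dog play in the park",
}

def embed(text: str) -> np.ndarray:
    """Toy embedding: L2-normalized bag-of-words counts over a tiny vocabulary."""
    words = text.lower().split()
    v = np.array([float(words.count(w)) for w in VOCAB])
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def nearest(query: str, k: int = 2) -> list:
    """Return the k doc ids whose vectors are closest to the query vector."""
    q = embed(query)
    doc_vectors = {doc_id: embed(text) for doc_id, text in DOCS.items()}
    distances = {d: float(np.linalg.norm(q - v)) for d, v in doc_vectors.items()}
    return sorted(distances, key=distances.get)[:k]

print(nearest("brown fox"))   # doc 1 comes first: it is closest in this toy space
```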

Indexing pipeline control
To keep the index and the search function fresh, all the parts of the pipeline described above usually have to run under constant supervision.
Managing the pipeline can be challenging because the system consists of many moving parts. A pipeline is not just data movement: over time the code of its modules, the data formats and the assumptions baked into the data change as well.
The pipeline can run in batch mode on a regular or irregular schedule (if real-time indexing is not required), in streaming mode (if it is), or on specific triggers.
Some sophisticated search engines (Google, for example) run pipelines in several tiers on different time scales: a frequently changing page (cnn.com, say) is re-indexed more often than a static page that has not changed in years.
Serving systems
The ultimate goal of a search engine is to accept queries and use the index to return properly ranked results. Serving systems can be very complex and full of technical detail, but here are a few key aspects of this part of a search engine.
- Latency. Users notice delays. Google studied this extensively and found that an extra 300 ms of latency (100 → 400 ms) reduces the number of searches by 0.6%; they recommend making sure that most requests are served in no more than 200 ms. There is a good article on this topic. Now the hard part: the system may have to gather documents from many machines, combine them into what can be a very long list and then sort that list in ranking order. And if that is not hard enough, note that ranking can be query-dependent, so while sorting, the system is not just comparing two numbers but running more complex computations (see the sketch after this list).
- Result caching. Good performance is often impossible without it, but caching is not simple: the index may have been updated or some results blacklisted while the cache still serves stale output. Cache invalidation is a puzzle of its own, and so is warm-up: a search system usually cannot handle the full request stream with an empty ("cold") cache, so the cache has to be warmed up before it starts taking traffic. In general, a cache complicates the system's behavior, and choosing its size and replacement policy is a hard problem.
- Availability. Often defined as the ratio uptime / (uptime + downtime). When the index is distributed, the system usually has to query every shard for its part of the results, so if one shard is unavailable the whole system is broken. The more machines serve the index, the higher the probability that one of them goes down and takes the whole system with it (the sketch after this list includes this calculation).
- Managing multiple indices. Large systems may split the index into shards, and also by media type and indexing cadence (fresh versus long-term indices), and then combine their results.
- Blending results of different kinds. Google, for example, mixes in results from maps, news and so on.
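To make two of these points concrete, here is a minimal sketch that merges already-ranked results from several shards and shows how per-shard availability compounds across shards. The shard contents, scores and the 99.9% figure are illustrative assumptions.

```python
# Merging ranked per-shard results and estimating overall availability.

import heapq

def merge_shard_results(shard_results, k=5):
    """Each shard returns (score, doc_id) pairs sorted by descending score.
    Merge them lazily and keep only the global top-k."""
    merged = heapq.merge(*shard_results, key=lambda pair: -pair[0])
    return [doc_id for _, doc_id in list(merged)[:k]]

shard_a = [(0.9, "a1"), (0.4, "a2")]
shard_b = [(0.8, "b1"), (0.7, "b2"), (0.1, "b3")]
print(merge_shard_results([shard_a, shard_b], k=3))  # ['a1', 'b1', 'b2']

def system_availability(per_shard_availability: float, num_shards: int) -> float:
    """If every shard must respond, availability multiplies across shards."""
    return per_shard_availability ** num_shards

# 100 shards at 99.9% each leave only ~90.5% availability for the whole system.
print(round(system_availability(0.999, 100), 3))
```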
Human rating. Yes, search engines still need this kind of work.
Quality, evaluation and refinement
So, you have launched your indexing pipeline and your search servers, and everything works well. Unfortunately, infrastructure is only the beginning of the road to a good search experience.
The next step is to evaluate how well the search actually works and then to improve it continuously. That starts with deciding what a "good" result means for your product and then measuring it. Commonly used quality metrics include:
- Precision and recall measure how well the retrieved set of documents matches the set you expected to find.
- The F-measure (in particular F₁) combines precision and recall into a single value.
- Mean average precision (MAP) quantifies the relevance of the top returned results.
- Normalized discounted cumulative gain (nDCG) is similar to MAP but weights a result's relevance by its position in the list.
- "Long" and "short" clicks show how useful the results are to real users.
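A minimal sketch of computing several of the offline metrics just listed for a single query, given the ranked results and a set of relevance judgements. The sample judgements and grades are illustrative assumptions.

```python
# Precision, recall, F1 and nDCG for a single query.

import math

def precision_recall_f1(retrieved, relevant):
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

def ndcg(ranked_gains, k=None):
    """ranked_gains: relevance grades in the order the system returned them."""
    k = k or len(ranked_gains)
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal else 0.0

retrieved = ["d1", "d2", "d3", "d4"]    # system output, best first
relevant = {"d1", "d3", "d9"}           # judged relevant documents
grades = [3, 0, 2, 1]                   # graded relevance of the retrieved docs
print(precision_recall_f1(retrieved, relevant))  # (0.5, 0.667, 0.571)
print(round(ndcg(grades), 3))
```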
Human relevance judgements are still needed: offline metrics such as MAP and nDCG are computed against labels produced by human raters. And evaluation should not be a one-off exercise: build the measurement of search quality into the engineering process itself, so that every change to the pipeline can be judged by its effect on the metrics.
So what should you use?
Based on everything above, a rough plan:
1. If your constraints allow it, start with a SaaS offering such as Algolia or a hosted Elasticsearch: it is the cheapest way to get working search, provided the defaults cover your needs for latency, customization, corpus size and budget.
2. If you need a self-hosted or more controllable system, Elasticsearch is a sensible default choice.
3. If you need to pre-process and transform documents before indexing (for example, extracting content from HTML pages), a framework such as Spark is a good fit.
Tools and services
SaaS
Algolia: a proprietary SaaS that indexes your content and provides an API to search it. It is fast and easy to integrate, and for many products it is the quickest way to get decent search working.
Lucene: the most popular information-retrieval library. It implements query analysis, index construction, retrieval and ranking, and it is written in Java. There is a C port, Lucy.
- Solr: a complete search server based on Lucene; part of the Hadoop ecosystem of tools.
- Hadoop: the most widely used open-source MapReduce system, originally designed as an indexing-pipeline framework for Solr. Spark is gradually replacing it for batch data processing. EMR is Amazon's managed MapReduce offering on AWS.
- Elasticsearch: another search server based on Lucene (there are plenty of Elasticsearch-versus-Solr comparisons). It gets so much attention that many people think of ES as soon as search comes up, and for good reason: it is well supported, has an extensive API, integrates with Hadoop and scales well. There is an open-source version and an Enterprise version, and ES is also available as a SaaS from several providers.
- Xapian: an information-retrieval library written in C++.
- Sphinx: a full-text search server with an SQL-like query language; it can also integrate with MySQL as a storage engine.
- Nutch: an open-source web crawler. It can be used together with Solr and is the tool behind Common Crawl.
- Lunr: a compact embedded search library for client-side web applications.
- SearchKit: a library of web UI components for building search interfaces on top of Elasticsearch.
- Norch: a search engine for Node.js built on LevelDB.
- Whoosh: a full-text search library implemented in pure Python.
- OpenStreetMap has its own search (geocoding) software stack.
Literature
- Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto, a solid academic treatment of the field.
- Information Retrieval: Implementing and Evaluating Search Engines by S. Büttcher, C. Clarke and G. Cormack.
- Learning to Rank by Tie-Yan Liu, an overview of LtR methods.
- Managing Gigabytes (1999), an older but still relevant book on indexing and compressing document collections.
- Text Retrieval and Search Engines, an online course on Coursera.
- Indexing the World Wide Web: The Journey So Far (PDF), a 2012 overview by Ankit Jain and Abhishek Das of Google.
- Why Writing Your Own Search Engine is Hard, a 2004 article by Anna Patterson.
- https://github.com/harpribot/awesome-information-retrieval, a curated list of information-retrieval resources by Harshal Priyadarshi.
- Daniel Tunkelang's blog on search and discovery.
This was my modest attempt to draw at least a rough map for those who are starting to build a search engine. If I missed something important, please write.