
What every developer should know about search

Want to implement or refine a search function? You are in the right place.



Ask a developer: "How would you implement search in your product?" or "How do you build a search engine?" You will probably hear something like this: "Well, we'll just spin up an Elasticsearch cluster: search is easy these days."

But is it? In many modern products, search is still poorly implemented. A real search specialist will tell you that few developers understand deeply how search works, and that this knowledge is often necessary to improve search quality.

There are many open source software packages, and a great deal of research has been done, but only a select few know how to build search that works well. Funnily enough, if you search the Internet for information about implementing search, you will not find relevant and meaningful overviews.

Purpose of the article


This text can be seen as a collection of valuable ideas and resources that can help you build a search function. The article certainly does not claim to be exhaustive, but I hope your feedback will help improve it (leave a comment or contact me).

Drawing on experience with general-purpose solutions and highly specialized projects of various scales (at Google, Airbnb and several startups), I will cover some popular approaches, algorithms, methods and tools.

Underestimating or misunderstanding the scope and complexity of the search problem can lead to a poor user experience, wasted developer time and a failed product.

Translated by Alconost

If you are eager to move on to practice, or you already know a lot about this topic, you may want to skip straight to the tools and services section.

Some general thoughts


This article is long, but most of the material in it rests on four basic principles:

Search is a very heterogeneous task.

Quality, metrics and the organization of work are of great importance.

Start with existing technologies.

Even if you buy a solution, study in detail what you are buying.



Theory. Search task


Every product has its own kind of search. The choice of solution depends on many technical specifications and requirements. It helps to pin down the key parameters of your specific search task:

  1. Size: how big will the corpus be (the full set of documents to search)? Thousands, millions, billions of documents?
  2. Media: will the search be over text, images, graph relationships or geospatial data?
  3. Corpus control and quality: are the document sources under your control, or do you receive them from a third party (possibly a competitor)? Are the documents fully ready for indexing, or do they need to be cleaned up and filtered?
  4. Indexing speed: do you need real-time indexing, or is batch indexing enough?
  5. Query language: will the queries be structured, or do you also need to understand unstructured ones?
  6. Query structure: will the queries be text, images or sounds? Perhaps postal addresses, record identifiers or people's faces?
  7. Context dependence: do the search results depend on who the user is, their history with the product, where they are, what time it is, and so on?
  8. Suggestions: do you need support for incomplete queries?
  9. Latency: what are the latency requirements for the service? 100 milliseconds or 100 seconds?
  10. Access control: is the service completely open, or should users see only a limited subset of documents?
  11. Compliance: are there any legal or organizational restrictions?
  12. Internationalization: do you need to support documents in multibyte encodings or Unicode? (Hint: always use UTF-8, and if you do not, you should know exactly what you are doing and why.) Will you need to support a multilingual corpus? And multilingual queries?

Thinking through these questions in advance will help you make important choices when designing and building the individual components of a search engine.


The indexing pipeline at work.

Theory. Search engine


It is time to go through the subtasks of building a search engine. They are usually solved by separate subsystems that form a pipeline: each subsystem takes the output of the previous subsystems as its input and produces input for the subsystems that follow.

This leads to an important property of the whole ecosystem: when you change how one subsystem works, you need to assess how the change affects the subsystems downstream of it and, possibly, change their behavior too.

Let us look at the most important practical tasks that have to be solved.

Index selection


We take a set of documents (for example, the entire Internet, all Twitter messages or all photos on Instagram), select a potentially smaller subset of documents that are worth considering as search results, and include only those in the index, discarding the rest. This step keeps the index compact and is almost independent of choosing which documents to show to the user. For example, the following classes of documents may not belong in an index.

Spam


Search spam comes in all shapes and sizes and is a large topic that deserves a guide of its own. Here is a good overview of the taxonomy of Internet spam.

Unwanted documents


With some restrictions on the search domain, filtering may be required: you will have to drop pornography, illegal materials and so on. The methods are similar to spam filtering but may also include special heuristics.

Copies


This includes near-duplicates and redundant documents. Locality-sensitive hashing, similarity measures, clustering methods and even click data can help here. Here is a good overview of such methods.
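As a rough illustration of the similarity-measure idea, here is a minimal sketch of near-duplicate detection using word shingles, Jaccard similarity and a MinHash signature (one common building block of locality-sensitive hashing). The function names, the shingle size and the number of hash functions are assumptions made for this example, not something prescribed by the article.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles (overlapping word windows) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed: int, item: str) -> int:
    """One member of a family of hash functions, selected by the seed."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items: set[str], num_hashes: int = 64) -> list[int]:
    """For each seeded hash function keep the minimum hash over all items.

    Expects a non-empty set (shingles() never returns an empty set).
    """
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimated_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
exact = jaccard(shingles(doc_a), shingles(doc_b))
approx = estimated_similarity(
    minhash_signature(shingles(doc_a)), minhash_signature(shingles(doc_b))
)
print(f"exact Jaccard: {exact:.2f}, MinHash estimate: {approx:.2f}")
```

The exact Jaccard computation is fine for comparing a pair of documents. The point of the signature is that it is short and fixed-size, so it can be stored per document and compared, or bucketed, cheaply across a large corpus.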

Useless documents


The definition of usefulness depends on the search domain, so it is hard to recommend specific approaches. A few considerations may help. It may be possible to build a utility function for documents. You can try heuristics: for example, an image containing only black pixels is a clear example of a useless document. Usefulness can also be assessed from user behavior.

Index building


Most search engines retrieve documents using an inverted index, which is often simply called the index.
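To make the idea concrete, here is a minimal sketch of an inverted index: a mapping from each term to the set of documents that contain it. The tokenizer, the function names and the sample documents are assumptions for illustration. Real engines add positions, term frequencies, compression and much more.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Very naive tokenizer: lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map every term to the set of ids of the documents that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_all_terms(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of documents that contain every term of the query (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = {
    1: "Search engines build an index",
    2: "An inverted index maps terms to documents",
    3: "Ranking orders documents",
}
index = build_inverted_index(docs)
print(search_all_terms(index, "inverted index"))
```

Lookups touch only the posting sets of the query terms, not every document, which is why this structure scales to large corpora.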


Query analysis and document retrieval


Most popular search engines accept unstructured queries. This means the system must extract structure from the query itself. With an inverted index, the search terms have to be extracted from the query using NLP methods.

The extracted terms can then be used to retrieve relevant documents. Unfortunately, queries are usually not very well formulated, so they need to be further expanded and rewritten, for example as follows:


Ranking


Given a list of documents (obtained in the previous step), their signals and the processed query, this step produces the optimal ordering of those documents (this is called ranking).

Originally, most ranking models in use were hand-weighted combinations of all the document signals. The signal set can include PageRank, click data, relevance information and more.
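A hand-weighted model of this kind can be sketched as a simple weighted sum. The signal names and the weights below are invented for the example. In a real system they would be chosen, and repeatedly re-tuned, by hand.

```python
# Hypothetical hand-picked weights for three illustrative signals.
WEIGHTS = {"pagerank": 0.5, "click_rate": 0.3, "text_match": 0.2}


def score(signals: dict[str, float]) -> float:
    """Weighted sum of the document's signals; missing signals count as 0."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


def rank(candidates: list[dict]) -> list[dict]:
    """Order candidate documents by descending score."""
    return sorted(candidates, key=lambda doc: score(doc["signals"]), reverse=True)


candidates = [
    {"id": "a", "signals": {"pagerank": 0.9, "click_rate": 0.1, "text_match": 0.2}},
    {"id": "b", "signals": {"pagerank": 0.2, "click_rate": 0.9, "text_match": 0.9}},
    {"id": "c", "signals": {"pagerank": 0.1, "click_rate": 0.1, "text_match": 0.1}},
]
print([doc["id"] for doc in rank(candidates)])
```

Even in this toy version, a small change to one weight can reorder the results, which hints at why manual tuning becomes hard as the number of signals grows.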

To make life harder, many of these signals, such as PageRank and signals produced by statistical language models, have parameters that significantly affect how the signal behaves. These also require manual tuning.

In recent years, learning to rank (LtR) has become increasingly popular: signal-based discriminative supervised approaches. Popular examples of LtR include McRank and LambdaRank from Microsoft, and MatrixNet from Yandex.

A new approach based on vector spaces is also gaining popularity in semantic search and ranking. The idea is to learn a low-dimensional vector representation for each document and then build a model that maps queries into the same vector space.

Retrieval then comes down to finding the several documents closest to the query vector under some metric (for example, Euclidean distance). That distance serves as the rank. If the mapping of both documents and queries is built well, documents are selected not by the presence of some simple pattern (for example, words) but by how close they are to the query in meaning.
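The retrieval step of the vector-space approach can be sketched as a brute-force nearest-neighbor search. The two-dimensional vectors below are made up by hand purely for illustration. In practice both document and query vectors would come from a trained model, and large corpora would typically use an approximate nearest-neighbor index instead of a full scan.

```python
import math


def nearest_documents(
    query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2
) -> list[tuple[str, float]]:
    """Return the k documents closest to the query by Euclidean distance.

    The result is a list of (doc_id, distance) pairs, nearest first.
    """
    scored = sorted(
        (math.dist(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()
    )
    return [(doc_id, distance) for distance, doc_id in scored[:k]]


# Hand-made toy embeddings (an assumption for the example, not model output).
doc_vecs = {
    "cats": [1.0, 0.0],
    "dogs": [0.9, 0.1],
    "cars": [0.0, 1.0],
}
query_vec = [1.0, 0.05]
for doc_id, distance in nearest_documents(query_vec, doc_vecs):
    print(doc_id, round(distance, 3))
```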



Indexing pipeline management


To keep the search index and the search function up to date, all the parts of the pipeline discussed above usually have to be kept under constant control.

Managing the search pipeline can be challenging, because the whole system consists of many moving parts. A pipeline is not just data moving through: over time, the code of the modules, the formats and the assumptions baked into the data change as well.

The pipeline can run in batch mode on a regular or irregular schedule (if you do not need real-time indexing), in streaming mode (if you cannot do without real-time indexing), or in response to certain triggers.

Some complex search engines (Google, for example) use several layers of pipelines operating on different time scales: a frequently changing page (such as cnn.com) is indexed more often than a static page that has not changed in years.
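One way to picture the idea of different time scales is a tiered refresh schedule, sketched below. The tier thresholds, the `Page` fields and the example URLs are all hypothetical. Nothing here describes how any real search engine schedules its pipelines.

```python
from dataclasses import dataclass

HOUR = 3600
DAY = 24 * HOUR

# Hypothetical tiers: (minimum observed changes per day, refresh interval in seconds).
TIERS = [
    (1.0, HOUR),      # changes at least daily: refresh hourly
    (0.1, DAY),       # changes every few days: refresh daily
    (0.0, 30 * DAY),  # practically static: refresh monthly
]


@dataclass
class Page:
    url: str
    changes_per_day: float  # estimated change rate (assumed to be known)
    last_indexed: float     # timestamp in seconds


def refresh_interval(page: Page) -> int:
    """Pick the refresh interval of the first tier whose threshold the page meets."""
    for min_rate, interval in TIERS:
        if page.changes_per_day >= min_rate:
            return interval
    return TIERS[-1][1]


def due_for_reindex(pages: list[Page], now: float) -> list[str]:
    """Return URLs whose last indexing is older than their tier's interval."""
    return [p.url for p in pages if now - p.last_indexed >= refresh_interval(p)]


now = 10_000_000.0
pages = [
    Page("https://news.example/", changes_per_day=24.0, last_indexed=now - 2 * HOUR),
    Page("https://static.example/", changes_per_day=0.001, last_indexed=now - 2 * DAY),
    Page("https://blog.example/", changes_per_day=0.5, last_indexed=now - 25 * HOUR),
]
print(due_for_reindex(pages, now))
```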

Serving systems


The ultimate goal of a search engine is to accept queries and use the index to return appropriately ranked results. Serving systems can be very complex and involve many technical details, but I will still mention a few key aspects of this part of a search engine.



Human evaluation. Yes, search engines still need this kind of work.

Quality, evaluation and refinement


So, you have your own indexing pipeline and search servers running, and everything works well.


  1. [...] SaaS [...]

  2. [...] Elasticsearch [...]

  3. [...] HTML [...] Spark [...]

SaaS


Algolia: a SaaS search product accessed through an API.



Lucene: an open source search library. A C port also exists: Lucy.





Literature



This was my modest attempt to draw at least a somewhat useful "map" for those who are starting to build a search engine. If I have missed something important, write to me.


About the translator

This article was translated by Alconost.


Source: https://habr.com/ru/post/339894/

