
Putting in a word about video search

Today, on this fine spring day, I want to write not only about video search, but also about the technical side of making Sphinxsearch work in a high-load Django project.


Search


It is worth starting, perhaps, with the business problem statement:



And non-functional requirements:



This story is about how sphinxsearch is used at Rutube.


On video relevance and mathematics


When people talk about searching the Internet, they usually mean searching for textual information. With video, things are much worse. A person usually has a very specific visual image in mind, which he then has to translate into query text himself. The other people, who uploaded the videos to the site, translated the contents of those videos into titles and descriptions, and you're lucky if those aren't "test", "sdfsdf" or "111". In any case, only a minimum of textual information is available, plus occasionally some metadata supplied by editors and partner users. So if you are a search developer, questions like "why doesn't the query 'rp' find 'Real Guys'?" will haunt you at night. A special test utility helps us answer such questions.
It is a page that, for a given search query, returns not only the videos with all the fields stored in the index, but also the ranker's output with the values of every ranking factor for each document. For this, PACKEDFACTORS(), SNIPPETS() and CALL KEYWORDS are requested. The data on this page is usually enough to substantiate mathematically what is called irrelevance (there it is, the magic word that only developers understand in the mathematical sense, and everyone else in a spiritual one!), that is, to explain why an irrelevant video ended up above a relevant one.
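
For the curious, here is a minimal sketch of the kind of SphinxQL such a debug page might run under the hood. The index name video, the field names and the ranking expression are assumptions for illustration, not Rutube's actual configuration; searchd is assumed to listen on its SphinxQL port 9306, and any PEP-0249 MySQL connector will do.

    import pymysql  # searchd speaks the MySQL protocol, so a regular connector works

    conn = pymysql.connect(host='127.0.0.1', port=9306, charset='utf8')
    cur = conn.cursor()
    query = 'real guys'

    # per-document ranking factors; PACKEDFACTORS() is only available with ranker=expr
    cur.execute(
        "SELECT id, WEIGHT(), PACKEDFACTORS() FROM video "
        "WHERE MATCH(%s) OPTION ranker=expr('sum(lcs*user_weight)*1000+bm25')",
        (query,))
    for doc_id, weight, factors in cur.fetchall():
        print(doc_id, weight, factors)

    # how the engine tokenizes the query and how frequent each keyword is
    cur.execute("CALL KEYWORDS(%s, 'video', 1)", (query,))
    print(cur.fetchall())

    # highlighted snippets for a document's text against the same query
    cur.execute("CALL SNIPPETS(%s, 'video', %s)", ('some video description text', query))
    print(cur.fetchall())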


A Django database backend for the search engine


Since Sphinxsearch can be talked to with a MySQL client, why not write a database backend for Django that would build the necessary search queries and then return the results as Django models? I am not suggesting that everyone try this, but it really is not that scary. Those who are still scared, or simply not interested, are welcome to skip straight to the next section.


As usual, there is something useful on GitHub. django-sphinx-db, for example, helped us get started with the engine directly from Django models. In Django 1.8 the private part of the database backend implementation was changed significantly, which made porting django-sphinx-db problematic. As a result, the django-sphinxsearch project appeared; it could use a bit more developer attention, but it is already used in production. This, by the way, illustrates the difficulty of maintaining a backend: a new version of Django comes out and everything falls apart, because "they're my guts, I'll change them as I please". So you start over from the beginning.


It looks like this:


  1. Find a PEP-0249-compatible connector to the database: MySQL-python, psycopg, python_monetdb, depending on what Django is being hooked up to.
  2. Take the most similar existing backend and inherit from it. In the case of sphinxsearch this is MySQL, a.k.a. django.db.backends.mysql.
  3. The hardest part: teach SQLCompiler to generate code compatible with the database the backend targets. This concerns quoting, the ability to specify table names without a schema, the LIMIT / OFFSET syntax, and so on. Here I want to address a separate "ugh" to the Django developers for the fact that SQLCompiler.as_sql, which assembles a string from QuerySet.query, is a monolith of almost 100 lines; as a result, in order to turn LIMIT ... OFFSET ... into LIMIT offset, count, a regular expression has to be run over the result of calling the base class method.
  4. Add QuerySet methods that provide search-specific functionality. For example, SphinxQuerySet.match adds to self.query a match structure containing the data needed to build a SphinxQL expression. The match field is cloned and modified along with the QuerySet and is finally used in SphinxWhereNode.make_atom to generate part of the query string (a condensed sketch follows the list). Nothing complicated, you just need to write tests and keep a good debugger at hand.
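
To make the last item a little more concrete, here is a condensed sketch of what a match() method can look like. It is a simplification for illustration, not the actual django-sphinxsearch source; in a real backend the extra attribute also has to survive Query.clone(), which is usually solved by subclassing the Query class, and the compiler consumes it when rendering WHERE MATCH('...').

    from django.db import models

    class SphinxQuerySet(models.QuerySet):
        def match(self, expression):
            """Attach a full-text MATCH() expression to the query (sketch)."""
            clone = self._clone()
            # stash the expression on the Query object; the compiler picks it up
            # later when building the WHERE clause
            if not hasattr(clone.query, 'match'):
                clone.query.match = []
            clone.query.match.append(expression)
            return clone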

Ranking search results


Search results are usually sorted by how well they match the search query. How do you compute that? For example, you can take the number of words present in both the document and the query: the more words in the intersection, the better the result fits the query. You can take into account not just the number of matching words but also their order. And if you also account for each word's "rarity", it gets even better: prepositions and conjunctions in the query and in the document will no longer affect the output. A lot of such characteristics, useful and not so useful, have been invented, so in the general case it is reasonable to use a weighted sum of the values of all the characteristics the engine computes.


Besides the characteristics that tie a specific document to the search query, you can, regardless of the query, give extra weight to documents with certain properties. For example, push up videos uploaded in good quality, or give more weight to fresher or more frequently watched videos.


So we add to the query:


 SELECT weight() + a * view_count + b * age AS my_weight, ... ORDER BY my_weight DESC OPTION ranker=expr('...'); 

and the sort order of the results is under your complete control.
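
From the Django side, roughly the same thing can be expressed like this. This is a sketch that assumes the backend and the SphinxQuerySet described above and the SearchVideo model mentioned further below; the coefficients and field names are made up, and the OPTION ranker=expr(...) part is left to the backend.

    A, B = 0.1, -0.01  # illustrative boost for view count and penalty for age

    results = (
        SearchVideo.objects
        .match('real guys')
        .extra(
            select={'my_weight': 'weight() + %s * view_count + %s * age' % (A, B)},
            order_by=['-my_weight'],
        )[:20]
    )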


That is how simply the output gets tweaked:



QuerySet.iterator()


[image: a cartoon frame, (c) Soyuzmultfilm]
(the editors hint that these "tweaks" are not enough for them)


If tuning the search query is not enough, you can nail results in place. For some queries this is outright critical functionality, so like it or not, you have to implement mechanisms for manipulating search results.


  1. We check whether there are videos that the editors would like to see in the results for the current query, and get the positions they should occupy.
  2. We run the search query, excluding the "nailed" results.
  3. We change the QuerySet.iterator() method so that in the "normal" state it yields results from sphinxsearch, and in certain positions the "nailed" clips instead (for example, an episode of "Real Guys"; no comment). A sketch follows the list.
  4. If the search yields nothing but irrelevant results, you can, for example, show something from the database instead of similar clips, say, a list of episodes of the same series. For that it is enough for the main Video model to share its fields with the SearchVideo search result model.
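
A simplified sketch of item 3, the iterator() override that splices pinned videos into fixed positions. The class and its base are illustrative (in practice the base would be the search QuerySet, and the pinned entries would come from the editorial table), not the production code.

    from django.db import models

    class PinnedSearchQuerySet(models.QuerySet):
        """QuerySet that yields editorially pinned items at fixed positions."""

        def __init__(self, *args, **kwargs):
            super(PinnedSearchQuerySet, self).__init__(*args, **kwargs)
            self.pinned = {}  # {position: model instance}

        def pin(self, position, obj):
            clone = self._clone()
            clone.pinned = dict(self.pinned)
            clone.pinned[position] = obj
            return clone

        def iterator(self, *args, **kwargs):
            position = 0
            for item in super(PinnedSearchQuerySet, self).iterator(*args, **kwargs):
                while position in self.pinned:   # a pinned item owns this slot
                    yield self.pinned[position]
                    position += 1
                yield item
                position += 1
            # pinned items that point past the end of the organic results
            for pos in sorted(p for p in self.pinned if p >= position):
                yield self.pinned[pos]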

Technical limitations


A few words about what you cannot do in sphinxsearch. The strangest, yet explainable, "you can't": you cannot return the whole result set except for one or several documents. A plain fullscan is possible, but a fullscan with WHERE MATCH('~document_id') is not. The software forbids it: too inefficient, they say.


There are two quirks around limits: first, SELECT * without an explicit LIMIT returns 20 results, roughly like repr(queryset); second, to reach the 100,500th element you have to add OPTION max_matches=100500 to the query. Inside the engine there is a partial sort whose default window size is 1000, so a request with a larger offset is an error.
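
For illustration, a deep-paging query has to raise the window explicitly, otherwise searchd rejects the offset. A sketch; the index name is assumed, and the connection is the same SphinxQL one as in the debug-page example above.

    import pymysql

    conn = pymysql.connect(host='127.0.0.1', port=9306, charset='utf8')
    cur = conn.cursor()
    # without OPTION max_matches this fails once the offset exceeds the 1000-document window
    cur.execute(
        "SELECT id FROM video WHERE MATCH(%s) "
        "LIMIT 100490, 10 OPTION max_matches=100500",
        ('real guys',))
    print(cur.fetchall())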


There are many odd restrictions on numeric operations with attributes. For example, you can write float_field <> 3.1415 in SELECT, but not in WHERE. What can you do, that's the parser for you. We fight our way through with QuerySet.extra().
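
One possible workaround through QuerySet.extra(): compute the comparison as an expression in the SELECT list and then filter on the resulting alias. A sketch; the duration field is made up, and it assumes the SearchVideo model and match() method described earlier.

    qs = (
        SearchVideo.objects
        .match('real guys')
        # the float comparison is allowed as a SELECT expression...
        .extra(select={'duration_ok': 'duration <> 3.1415'})
        # ...and the computed alias can then be used in WHERE
        .extra(where=['duration_ok = 1'])
    )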


The most annoying "no": you cannot rely on the search not crashing at the most unpleasant moment. We had a case when searchd restarted right after receiving a query containing the number "13". This is especially unpleasant on pages where search results are not the main content. We got by with a generator that, in case of an OperationalError, quietly and peacefully returns an empty response.
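
The wrapper itself is tiny; a minimal sketch, assuming the caller simply iterates over the search QuerySet:

    from django.db import OperationalError

    def safe_search(queryset):
        """Yield search results, or nothing at all if searchd has gone away."""
        try:
            for item in queryset:
                yield item
        except OperationalError:
            return  # searchd crashed or timed out; pretend the result set is empty

    # usage: the page keeps rendering even if the search backend is down
    videos = list(safe_search(SearchVideo.objects.match('13')[:10]))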


Under load


When there is a lot of data on the site and it changes very often, you cannot just take and reindex the entire site every 5 minutes; you have to be smarter. For those who have already "eaten the dog" on search, this section will not be very interesting, since these things are mostly well known, but I will describe them briefly and to the point:


  1. main + delta + killlist. main is the primary index; it contains the entire site and is rebuilt once a day. delta contains only the documents updated since the last indexing of the main index. killlist is a list of documents to exclude from the previous index.


     # remember the indexing server's IP
     sql_query_pre = set @ip = substring_index(user(), '@', -1);
     # KILL-list for the delta index: the documents that got into the delta index,
     # i.e. those updated since main was last indexed, are suppressed in main
     sql_query_killlist = select id from ... where ... and last_updated_ts > @last_index_ts

  2. global_idf. If there are several local indexes, the global_idf = 1 parameter must be set; otherwise the Inverse Document Frequency is computed separately for each index, and "rare" words from the delta index push up results that should not be there.
  3. Multiple search servers. We did not get fancy with replicating index files between search servers; each server simply indexes the database on its own. The data does go out of sync occasionally, and a more robust solution has even been designed, but we have not gotten around to it yet. That solution: an RT index that is updated simultaneously on all search servers when a message about a document change arrives. Pros: instant indexing, no data divergence in the normal state. Cons: wildly complex code for initiating those messages, because a document in the search index gathers data from about 15 database tables, plus the need for message handlers on every search server.
  4. Server load. Of course, it is better not to reach 100% CPU load, but if that has become the norm, there is the max_predicted_time option, which caps the theoretical query execution time. Relevance suffers, but it cuts off part of the problems. Possible, but better avoided, because "God sees everything". God here is the editorial staff, and "everything" is, for example, very strange "similar" videos appearing on a page. To fight temporary overloads, it makes sense to set a connection timeout (CONN_TIMEOUT) on the Django connection so that the search does not slow down everything else; a sketch of such a settings entry follows the list.
  5. Administering the lists of synonyms and exceptions. We lay them out over WebDAV and apply them at the next indexing (with index rotation).
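
A hedged sketch of what the search entry in Django's DATABASES might look like with aggressive timeouts. The ENGINE path is the one django-sphinxsearch is assumed to expose, and the option names are those of the underlying MySQL connector; adjust to whatever your backend actually accepts.

    DATABASES = {
        # ... the regular 'default' database goes here ...
        'sphinx': {
            'ENGINE': 'sphinxsearch.backend.sphinx',  # assumed backend module path
            'HOST': '127.0.0.1',
            'PORT': '9306',
            'OPTIONS': {
                'connect_timeout': 1,  # fail fast if searchd is overloaded
                'read_timeout': 2,     # passed straight through to the MySQL connector
            },
        },
    }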

About modifying RT indexes


Still, sphinxsearch is not a database. But the ability to change the data in it does exist.
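
A minimal sketch of what such changes look like over SphinxQL; the RT index and column names are illustrative, not Rutube's schema.

    import pymysql

    conn = pymysql.connect(host='127.0.0.1', port=9306, charset='utf8')
    cur = conn.cursor()

    # REPLACE rewrites the whole document: all fields and attributes must be supplied
    cur.execute(
        "REPLACE INTO video_rt (id, title, view_count) VALUES (%s, %s, %s)",
        (42, 'Real Guys, episode 1', 100500))

    # UPDATE changes numeric attributes in place, without touching full-text fields
    cur.execute("UPDATE video_rt SET view_count = %s WHERE id = %s", (100501, 42))

    # and documents can simply be deleted
    cur.execute("DELETE FROM video_rt WHERE id = %s", (42,))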



After three years of using sphinxsearch in production, the entire search team has come to love it with all their hearts. It is perhaps the only project in which all the problems are so strange and entertaining :)



Sphinxsearch at Rutube is also used for storing and processing logs in a home-grown analogue of Kibana, and it works quite fast, by the way. When there is time, we will write about that too.



Source: https://habr.com/ru/post/281653/

