Today, on this fine spring day, I want to write not only about video search, but also about the technical details of working with Sphinxsearch in a high-load Django project.
It is worth starting, perhaps, with the business problem and the non-functional requirements. This is the story of how sphinxsearch is used at Rutube.
When people talk about search on the Internet, they usually mean searching for textual information. With video, things are much worse. Usually a person has in mind a very specific visual image, which translates poorly into the text of a query. Other people, who uploaded the videos to the site, translated the contents of the video into the title and description, and it's good if those aren't "test", "sdfsdf" or "111". In any case, only a minimum of textual information is available, plus sometimes some metadata supplied by the editors and partner users. So if you are a search programmer, questions like "why doesn't the query 'rp' find 'Real Guys'?" will haunt you at night. A special test utility helps us answer such questions.
This is a page that, for a given search query, returns not only the videos with all the fields in the index, but also information from the ranker with the values of all ranking factors for each document. For this, PACKEDFACTORS(), SNIPPETS() and CALL KEYWORDS are requested. The data from this page is usually enough to mathematically justify what is called irrelevant (there it is, the magic word that only programmers understand in a mathematical sense, and everyone else in a spiritual one!), that is, to justify why an irrelevant video ended up above a relevant one.
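What such a page does under the hood can be sketched roughly like this (the connection parameters, the video_index name and the ranker expression are assumptions for illustration):

import pymysql

# searchd's SphinxQL listener speaks the MySQL protocol
conn = pymysql.connect(host='127.0.0.1', port=9306)
cur = conn.cursor()

# per-document values of all ranking factors (needs an expression ranker)
cur.execute("SELECT id, WEIGHT(), PACKEDFACTORS() FROM video_index "
            "WHERE MATCH('rp') OPTION ranker=expr('sum(lcs)*1000+bm25')")

# highlighted snippets for the texts of the found documents
cur.execute("CALL SNIPPETS('document title and description', 'video_index', 'rp')")

# per-keyword statistics: how the query was tokenized, docs/hits per term
cur.execute("CALL KEYWORDS('rp', 'video_index')")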
Since Sphinxsearch supports the MySQL protocol, why not write a database backend for Django that would build the necessary search queries and then return the results as Django models? I am not suggesting that everyone try this, but it really is not that scary. And those who are still scared, or simply not interested, can move straight to the next section.
As usual, there is something useful on github. django-sphinx-db, for example, helped get started with the engine directly from Django models. In Django 1.8 the private part of the database backend implementation changed significantly, which made porting django-sphinx-db problematic. As a result, the django-sphinxsearch project appeared; it lacks a bit of development attention, but it is already used in production. This, by the way, is an example of the difficulty of supporting a backend: a new version of Django comes out, and everything falls apart, because "these are my guts, I change them as I please." So you have to start all over again.
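Wiring such a backend into a project might look roughly like this (a sketch; the exact engine module path and the port are assumptions):

DATABASES = {
    'default': {
        # the regular relational database lives here
    },
    'sphinx': {
        'ENGINE': 'sphinxsearch.backend.sphinx',  # assumed module path
        'HOST': '127.0.0.1',
        'PORT': 9306,                             # searchd SphinxQL listener
    },
}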
The backend is inherited from django.db.backends.mysql, since sphinxsearch speaks the MySQL protocol; its SQLCompiler is taught to generate code compatible with the database the backend is written for. This concerns quoting, the ability to reference table names without specifying a schema, the LIMIT/OFFSET syntax, and so on. Here I want to say a separate "fie" to the Django developers: the SQLCompiler.as_sql method, which assembles a string from QuerySet.query, is a monolith of almost 100 lines; as a result, to turn LIMIT count OFFSET start into LIMIT start, count, a regular expression has to be run over the result of calling the base class method.

SphinxQuerySet.match adds a self.match structure to self.query, containing the data needed to build a MATCH expression. The match field is cloned and modified along with the QuerySet, and is finally used in SphinxWhereNode.make_atom
to generate part of the query string. Nothing complicated; you just need to write tests and keep a good debugger at hand.
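A minimal sketch of these mechanics, with names simplified relative to the real django-sphinxsearch code:

from django.db import models

class SphinxQuerySet(models.QuerySet):
    def match(self, expression):
        # the match data must survive QuerySet cloning
        clone = self._clone()
        clone.query.match = expression  # stashed on the Query object
        return clone

# At compile time the stored expression is turned into part of the WHERE
# clause, roughly: WHERE MATCH('expression') AND <the ordinary lookups>.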
Search results are usually sorted by how well they match the query. How is that computed? For example, you can take the number of words present in both the document and the query: the more words in the intersection, the better the result fits the query. You can take into account not just the number of matching words, but their order as well. And if you also account for each word's "rarity", it gets even better: prepositions and conjunctions in the query and in the document stop affecting the results. Many such factors have been invented, useful and not so useful, so in the general case it is reasonable to use a weighted sum of the values of all the factors that the engine computes.

In addition to the factors that relate a specific document to a search query, you can, independently of the query, give additional weight to documents with certain attributes. For example, boost videos uploaded in good quality, or give more weight to fresher or more watched videos.
So we add to the query:

SELECT WEIGHT() + a * view_count + b * age AS my_weight, ... ORDER BY my_weight DESC OPTION ranker=expr('...');

and the sort order of the results is under your complete control.
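Through the Django backend the same weighting can be expressed roughly like this (a sketch; the Video model, the field names and the coefficients are illustrative, and match() is the backend method mentioned above):

# Illustrative: boost much-watched and fresh videos on top of text relevance.
qs = (Video.objects
      .match('real guys')
      .extra(select={'my_weight': 'weight() + 10 * view_count + 5 * age'},
             order_by=['-my_weight']))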
This is how simple ranking "knobs" are wired up (the editors hint that these knobs are not enough for them).
If tuning the search query is not enough, you can nail the desired results in place. For some queries this is outright critical functionality, so like it or not, you have to implement mechanisms for manipulating search results.
To do this, we had to override the QuerySet.iterator() method so that in the "normal" state it yields results from sphinxsearch, and in certain places those very "nailed" clips (for example, we have a pinned episode of "Real Guys". No comments.) A sketch of such an override follows below.

Let's talk a little about what you cannot do in sphinxsearch. The strangest, and at the same time explainable, "you cannot": you cannot get the entire result set minus one or several documents. A plain fullscan can be done, but a fullscan with WHERE MATCH('!document') cannot: a purely negative full-text query is forbidden by the software as inefficient, they say.
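A minimal sketch of the iterator override idea; PINNED (a mapping from query text to pinned ids), match_text and the Video model are hypothetical names, and the real implementation is more careful:

PINNED = {'rp': [4242]}  # hypothetical: query text -> pinned document ids

class PinnedSphinxQuerySet(SphinxQuerySet):
    def iterator(self):
        pinned_ids = PINNED.get(getattr(self, 'match_text', ''), [])
        # "nailed" clips come first, fetched from the main database
        for video in Video.objects.filter(pk__in=pinned_ids):
            yield video
        # then the ordinary sphinxsearch results, minus duplicates
        for video in super(PinnedSphinxQuerySet, self).iterator():
            if video.pk not in pinned_ids:
                yield video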
There are two quirks with limits: first, SELECT * without LIMIT returns 20 results, roughly like repr(queryset); second, to reach the 100500th element, you need to add OPTION max_matches=100500 to the query. Inside, the engine runs a partial sort with a default window size of 1000, so a request with a larger offset is an error.
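For example (raw SphinxQL over the MySQL protocol; the index name is illustrative):

import pymysql

conn = pymysql.connect(host='127.0.0.1', port=9306)  # searchd SphinxQL listener
cur = conn.cursor()
# without OPTION max_matches this fails: the default sort window is 1000
cur.execute("SELECT id FROM video_index WHERE MATCH('rp') "
            "LIMIT 100490, 10 OPTION max_matches=100500")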
There are many strange restrictions on numeric operations with attributes. For example, you can write float_field <> 3.1415 in SELECT, but not in WHERE. What can you do: parser peculiarities. We fight through QuerySet.extra().
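The workaround, sketched: compute the comparison as a SELECT expression and then filter on the computed attribute (the Video model and the field names are illustrative):

# float_field <> 3.1415 is rejected in WHERE, but allowed in SELECT,
# and Sphinx lets you filter on the computed alias.
qs = Video.objects.extra(
    select={'neq': 'float_field <> 3.1415'},
    where=['neq = 1'],
)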
The most annoying "you cannot": you cannot rely on the search not crashing at the most inopportune moment. We had a case when searchd restarted immediately upon receiving a query containing the number "13". This is especially unpleasant on pages where search results are not the main content. We got by with a generator that, in the case of an OperationalError, quietly and peacefully returns an empty result.
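Roughly, such a generator looks like this (a sketch):

from django.db import OperationalError

def fail_safe(queryset):
    """Yield search results; pretend the search is empty if searchd died."""
    try:
        for item in queryset:
            yield item
    except OperationalError:
        return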
In a situation where the site has a lot of data and it changes very often, you cannot just take and reindex the entire site every 5 minutes. You have to be smarter. For those who have already cut their teeth on search, this section will not be very interesting, since these things are mostly well known, but I will describe them briefly and to the point:
main + delta + killlist: main is the primary index, contains the entire site, rebuilt once a day; delta contains only the documents updated since the last indexing of main; killlist is the list of documents to exclude from the previous index.
# the indexer's IP is taken from the connection user
sql_query_pre = set @ip = substring_index(user(), '@', -1);
# the delta index's KILL-list contains documents that changed after the last
# indexing of main, so delta's versions supersede those in main
sql_query_killlist = select id from ... where ... and last_updated_ts > @last_index_ts
There is the max_predicted_time option, which limits the theoretical query execution time. Relevance suffers, but it lets you cut off some of the problem queries. You can do this, but better not, because "God sees everything": God here is the editorial board, and "everything" is, for example, the appearance of very strange "related" videos on a page. To fight temporary overloads, it makes sense to set CONN_TIMEOUT in the Django connection settings, so that the search does not slow down everything else.

Still, sphinxsearch is not a database, though it does offer the ability to change the data in it.
UPDATE allows you to update attributes of fixed length, even, by the way, for "on-disk" indices. For everything else, REPLACE is used, which marks the old version of the document as deleted and appends the new one to the end. On the Django side it is not all smooth: first, queryset.update(field=value) works only for numeric attributes; second, REPLACE should be formed as a bulk insert and is closer in syntax to INSERT, and therefore has to be generated with SQLInsertCompiler. In general, there is something to think about.
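Both update paths, sketched (the 'sphinx' connection alias, the model and the index name are illustrative, and the model is assumed to be routed to the sphinx connection):

from django.db import connections

# numeric, fixed-width attribute: a plain SphinxQL UPDATE via the ORM
Video.objects.filter(pk=42).update(view_count=1000)

# anything else: REPLACE the whole document
cur = connections['sphinx'].cursor()
cur.execute(
    "REPLACE INTO video_index (id, title, view_count) VALUES (%s, %s, %s)",
    [42, 'new title', 1000],
)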
After three years of using sphinxsearch in production, the entire search team has come to love it with all their heart. Perhaps this is the only project in which all the problems are so strange and entertaining :)

Sphinxsearch is also used at Rutube for storing and processing logs in our home-grown Kibana counterpart; it works quite fast, by the way. When there is time, we will tell about that too.
Source: https://habr.com/ru/post/281653/