
As already noted here, Sphinx 2.0.1 was recently released. The release happened in a bit of a hurry, because, "completely unexpectedly" (much like an exam session or New Year's Eve), a book for beginners describing exactly this new version got published. A book about trunk would be a bit too eccentric, so we had to ship the release quickly. It is good that we had spent a month or two preparing specifically for the release: we fixed bugs and mostly refrained from breaking features. In this note I will cover the innovations in version 2.0.1 and the plans for the next version; read on for the details.
New features
The 5 most noticeable (in my personal opinion) of the 37 new features turned out to be these:
- detection of sentence and paragraph boundaries during indexing, plus the SENTENCE and PARAGRAPH operators when searching;
- support for hierarchical (!) zones inside documents during indexing, plus the ZONE operator when searching;
- a new dictionary type, which greatly accelerates indexing with substring search enabled (indexing is up to 5-10 times faster, and the index is smaller);
- improved support for string attributes: ORDER BY, GROUP BY, collations;
- improved SphinxQL syntax: getting rid of magic, moving toward SQL'92.
About indexing sentences and paragraphs (index_sp = 1, SENTENCE, PARAGRAPH)
Sometimes you need to search with a restriction to the appropriate text unit: (mother SENTENCE washed SENTENCE frame), (uncle PARAGRAPH fell ill). A couple of operators have been added to the query language for this. An operator argument (the thing to the left or right of SENTENCE or PARAGRAPH) can be one of three things: a plain keyword; a phrase ("gone with the wind" SENTENCE circulation); or the operator itself (mother SENTENCE washed SENTENCE frame).
Shoving an arbitrary subexpression in there is quite nontrivial for technical reasons. For those curious how query matching works inside: in the query evaluation tree these operators receive a stream of keyword occurrences from their subtrees and filter that stream. The trouble is that after filtering for occurrence within one sentence or paragraph, it is generally impossible to say with confidence whether the filtered occurrences still match or not; they would have to be rechecked somehow, which is technically rather difficult (if it suddenly turns out to be easy for someone, send a working patch and get a job offer on the spot). Therefore the occurrences must be atomic: then each one either matches, or is killed by the filter entirely. Hence either a word or a phrase.
To distinguish sentence and paragraph boundaries, they need to be detected at indexing time and saved into the index. This is done with the index_sp = 1 directive.
A paragraph boundary is considered to be any of a number of block-level HTML tags hardcoded into the engine. Therefore, to index paragraphs, you also need to enable the HTML stripper (html_strip = 1), since that is where HTML is processed.
Sentence boundaries are detected from the text itself, so the stripper is not required for them. A boundary is always assumed at question and exclamation marks, and, with a number of exceptions, at periods. The exceptions handle abbreviations like USA or, say, Goldman Sachs Srl, names like John D. Doe, and the like. They can make mistakes, of course; as we accumulate feedback, we will keep adding exceptions. But overall the tests worked out well.
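To make this concrete, here is a minimal sketch; the index name books is invented for illustration, and only the directives discussed above are shown:

    # sphinx.conf fragment: enable sentence and paragraph indexing
    index books
    {
        # ... source, path and other settings ...
        index_sp   = 1   # detect and store sentence/paragraph boundaries
        html_strip = 1   # needed for PARAGRAPH: paragraphs come from block-level HTML tags
    }

    -- SphinxQL: match the words only within one sentence
    SELECT * FROM books WHERE MATCH('mother SENTENCE washed SENTENCE frame');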
About indexing zones
It happens that people want to put text with some kind of internal structure inside their documents, a structure that does not map onto a simple fixed list of fields. A typical example is a book: chapters, sections, subsections, appendices, footnotes, etc.
Starting from 2.0.1, there is support for such arbitrary structure (inside any ordinary field). It is turned on by the index_zones directive and also relies on the HTML stripper, so you will need to enable that too. When a document is indexed, the boundaries of all zones marked with the chosen tags are saved: where each h1 begins and ends, for example, or each appendix. When searching, you can correspondingly limit the search to zones: (ZONE:h1 hello world).
There can be any number of zones. Zones can be arbitrarily nested. Only the tag length is limited, to approximately 120 bytes. Not every tag in sight is indexed, only those explicitly listed in the directive. You can specify either an exact tag or a prefix mask: index_zones = h*, chapter, section, appendix. In the search operator you can specify several zones, just as with fields: (ZONE:(h1,appendix) hello world). There are no restrictions on the names, i.e. you can index HTML with such zones, or XML. Unlike what XML requires, it is technically fine if zones overlap (for example, <one> two <two> three </one> four </two> five six). It is strictly required, however, that every opened zone be closed.
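Again a sketch with an invented index name; only the zone-related directives are shown:

    # sphinx.conf fragment: index zone boundaries
    index books
    {
        # ... source, path and other settings ...
        html_strip  = 1                       # the stripper is what parses the tags
        index_zones = h*, chapter, appendix   # exact tags plus a prefix mask
    }

    -- SphinxQL: search only inside h1 or appendix zones
    SELECT * FROM books WHERE MATCH('ZONE:(h1,appendix) hello world');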
About the new dictionary
In the tenth year of development we finally bolted on a dictionary in which (drum roll)... the keywords themselves are stored. Previously, only hashes were stored. Strictly speaking, the words themselves are still not needed for searching: the hash replacement works fine. However, with the old dictionary type (dict = crc), to search for substrings we had to index all those substrings in advance (since looking for substrings in a hash is hopeless). Such pre-indexing of all possible substrings, by the way, gives the fastest possible search. However, indexing time and index size suffer badly: it comes out up to 5-10 times slower than "regular" indexing, and the index swells accordingly. For small indexes, in the 1-3 GB of data caliber (which seems to be enough to index slightly less than all the torrents in the world), this is still tolerable. For 100+ GB of text it no longer is. And then, in negotiations with a certain client, it turned out that substring search is occasionally needed on collections of exactly that size. Well, so we bolted on a new dictionary; what else could we do.
It is enabled with the dict = keywords directive in the index settings. Compared to regular (non-substring!) indexing, it is about 1.3 times slower (on our internal tests; YMMV and all that). Compared to indexing with prefixes or, even worse, infixes, it accordingly flies: 3+ times faster, and it eats less disk space.
By the way, for obvious reasons the dictionary (the .spi file), which by default is fully cached in memory, is significantly smaller. So it also eats less memory.
For such a hellish speedup of indexing and savings of disk and memory you do, of course, have to pay with something. In theory, search speed should suffer, and nothing else. This is because with dict = keywords every keyword "with an asterisk" is automatically expanded internally into a large, fat OR over all the words found in the dictionary by the given mask. The query gets more complex, so CPU and disk time go up. The more words the mask VASYA* matches, the longer we will be searching for Vasya. In practice, however, the new dictionary may well turn out faster. This is because if the old index decidedly did not fit in memory and, for every sneeze, took a leisurely walk into iowait, read the disk, rattled the drive heads, flushed caches and only then returned from the syscall, while the new index does fit (or at least caches much better), then the new "slow" queries against memory will still be faster than the old "fast" queries against the disk. Better ten times once than never ten times!
The degree of expansion, by the way, is controlled by a separate new setting, expansion_limit. By default it is 0, i.e. no restrictions. If you do not want queries like A* to be replaced with an OR over a million words, killing the daemon stone dead, it is better to set some reasonable expansion_limit. When a limit is in place, the top-N most frequent words are taken.
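A sketch of the relevant settings; the values are illustrative rather than recommendations, and I am assuming here that expansion_limit lives in the searchd section:

    # sphinx.conf fragment: keywords dictionary plus bounded wildcard expansion
    index big_collection
    {
        # ... source, path and other settings ...
        dict           = keywords   # store real keywords instead of CRC hashes
        min_prefix_len = 3          # allow prefix wildcards such as VASYA*
    }

    searchd
    {
        # ... listen, pid_file and other settings ...
        expansion_limit = 32        # expand a wildcard into at most the 32 most frequent matches
    }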
About string attributes
We already had string attributes, but there was little you could do with them, which was not right. We attached support for ORDER BY and GROUP BY over string attributes; it was about time. (WHERE remains to be attached, but when you have full-text search, that is not the most pressing need.) Sorting and grouping via SphinxAPI works too.
Since strings are not numbers, and are compared differently depending on the language and case-sensitivity requirements, we also had to attach collations, i.e., roughly speaking, different string comparison functions. Since implementing a pile of collations by hand is, let's be frank, far too tedious, we made the minimal gentleman's set: binary, utf8_general_ci, libc_ci, libc_cs. The first two compare strings either bluntly byte by byte, or by the "general" (language- and case-ignoring) UTF-8 rules, respectively. The latter two cynically use libc and the locale. I was surprised to discover, by the way, that setting LOCALE=ru_RU.utf-8 at daemon startup does not just work, plus the needed locale is often not even installed out of the box. So we had to attach a collation_libc_locale directive for selecting the locale at daemon startup, and to study the locale -a command and various other cryptic apt-get and yum invocations.
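How it fits together, sketched with invented index and attribute names:

    # sphinx.conf fragment: locale used by the libc_ci / libc_cs collations
    searchd
    {
        # ...
        collation_libc_locale = ru_RU.UTF-8
    }

    -- SphinxQL: sort and group by a string attribute
    SELECT * FROM products WHERE MATCH('phone') ORDER BY brand ASC;
    SELECT brand, COUNT(*) c FROM products GROUP BY brand;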
About SphinxQL
In previous versions, SphinxQL was essentially a very thin wrapper on top of the "old" search engine available through SphinxAPI. Because of this, all sorts of residual phenomena were present: the id and weight columns were forcibly added to every result set; when grouping, the magic columns @groupby and @count were added; and the order of attributes explicitly requested in SELECT could be violated (the order in the query was replaced with the order in the index).
Such phenomena conflict with the SQL'92 standard and with common sense, so it was decided to clean them up. Queries stay the same, but the response differs: no extra magic columns are added any more, and WEIGHT() and COUNT(*) must now be requested explicitly. The order of returned attributes is now exactly the requested one, not the one the indexer chose when building the index.
But! Suddenly(tm), you cannot just change behavior and break existing applications; you must give people a chance to update their applications (and only then suddenly change and break everything). Hence the mysterious directive compat_sphinxql_magics appeared. It defaults to 1 (respond "as before", with the magic columns), but in the bright future it should be 0 everywhere (respond the new way, as ANSI SQL bequeathed).
At startup, the daemon complains with a warning about its own default value. This symbolizes the need to keep up with progress, and is by design.
The list of changes (it is quite small) is in the documentation, in the special section about updating SphinxQL; in principle everything should be quite intuitive. Our goal is the good old well-known SQL. Where you previously wrote SELECT * FROM ..., you now write SELECT *, WEIGHT() `weight` FROM ...; where previously GROUP BY alone sufficed, you now explicitly write COUNT(*) myalias and update the application to read the myalias column, and so on.
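A small before-and-after sketch of the same grouping query (index and attribute names are invented):

    -- old, magic-flavored behavior (compat_sphinxql_magics = 1):
    -- id, weight, @groupby and @count show up in the result set by themselves
    SELECT * FROM products WHERE MATCH('phone') GROUP BY brand;

    -- new, SQL'92-flavored behavior (compat_sphinxql_magics = 0):
    -- everything must be requested explicitly
    SELECT *, WEIGHT() `weight`, COUNT(*) mycount
    FROM products WHERE MATCH('phone') GROUP BY brand;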
New packages
Compiling by hand is boring, so we are gradually learning to build official binary packages for different platforms. Release 2.0.1 is already built for RH/CentOS, Win32 and MacOS. Planned and in testing (within the next 1-2 weeks, I hope) are packages for Ubuntu and Win64. Beyond that, imagination fails, and we will have to run a poll about which other platforms people want official binaries for.
All sorts of other new features
In "whirlwind tour" format, I will run through a number of other potentially interesting bits.
We made multithreaded and distributed building of snippet batches, for the case when you need to quickly build them from large documents lying on the server. It kicks in when dist_threads is set in the server config and load_files is passed in the snippets request; so far it works only through the API.
We made UDF support. You can write functions in C, plug them into the server on the fly, and use them in expressions in SELECT. The interface is quite similar to MySQL's, although the type system is different. Porting a UDF from MySQL to Sphinx is a fairly trivial task.
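Loading is MySQL-like; a hedged sketch, with the function and library names invented:

    -- SphinxQL: attach a UDF from a shared library on the fly
    CREATE FUNCTION myrank RETURNS INT SONAME 'udfexample.so';

    -- ... then use it in SELECT expressions
    SELECT id, myrank(group_id) r FROM products ORDER BY r DESC;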
We made a new log format, query_log_format = sphinxql. All search queries (coming through both the API and QL) are converted to valid SphinxQL syntax and written down. Unlike the regular log, it records filters, SELECT expressions, errors, etc. Convenient for debugging and profiling. We are going to finish up logging all non-search queries too (snippets etc.).
We made the blend_mode directive, so that sequences containing blended characters can be indexed in several different ways. (For example, so that the "word" @sphinxsearch# indexes all three variants @sphinxsearch, sphinxsearch#, @sphinxsearch#, and not just the last one.)
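A hedged config sketch; the exact trim option names here are from memory and may differ in your version:

    # sphinx.conf fragment: treat @ and # as blended characters
    index tweets
    {
        # ...
        blend_chars = @, #
        # index @sphinxsearch# as-is, plus the head- and tail-trimmed variants
        blend_mode  = trim_none, trim_head, trim_tail
    }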
We made a watchdog, so that in threaded mode the daemon can pull itself up by the hair if something goes wrong. It can be disabled.
We made support for loading id32 indexes into an id64 daemon. We are going to forcibly enable --enable-id64 always, and this is preparation for that.
We made support for multi-queries, UPDATE on attributes, DESCRIBE, SHOW TABLES, DELETE ... WHERE id IN (...), and all sorts of other nice little things in SphinxQL.
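A quick sampler of those statements (index and attribute names invented):

    -- SphinxQL odds and ends
    SHOW TABLES;
    DESCRIBE products;
    UPDATE products SET price_bucket = 3 WHERE id = 123;
    DELETE FROM products WHERE id IN (1, 2, 3);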
The English stemmer itself was made several times faster; with morphology = stem_en, overall indexing sped up about 1.3 times.
Snippets over large, fat files were optimized; together with the stemmer improvements they got 1.5, or in places even 2, times faster.
And bugs were fixed beyond counting, of course.
What (and when) to expect in 2.0.2
I'd like to break the vicious tendency to release releases once a year, and the rest of the time to live in a trunk. Therefore, 2.0.2 with a
fool and poetess bugfixes and small new features are going to be released this summer. There are automatic tests, automatic assemblies for all kinds of platforms are doing. Now someone else would automatically write the documentation, and then absolutely hoo.
Besides the ever-present tasks of "bugs and speed" (making bugs and eliminating speed, aha), there are short-term plans for a number of RT features (substring search support, MVA, a few other things), some secret work on search quality, and other "minor" improvements. Which, by the way, brings us smoothly to the next point.
How to shout into our ear
Features usually appear in Sphinx in one of four ways. First, sometimes they spontaneously generate from dirty laundry and a handful of wheat, but in recent centuries (after the abolition of ether and phlogiston) this happens very rarely. Second, sometimes we ourselves sit and think: shouldn't we make such-and-such a feature? And we don't, of course. We think better of it and make a different one. Third, sometimes a client comes and says: I badly want this feature, I am even ready to pay money, that is how awful things are. It is not always possible to talk the client out of it and to overcome greed, so we have to bolt the feature on. Fourth, sometimes users simply write: why haven't you done this yet? We think: indeed, why haven't we done that feature yet? We sit, think, smoke, and again don't do it, of course. But we put it into the plan, and later we actually do it!
So, for the features you need to appear, you need to tell us about them: loudly, clearly and periodically. The bugtracker can be used for this in two ways. First, requests for new features should be filed there. Second, you should subscribe to existing requests (Monitor Bug). On the main page of the bugtracker there is an inconspicuous Top Monitored Bugs list. We do not yet have any other good automated way to determine the importance of features, so we intend to glance there, and we ask you to "vote" for features in that bugtracker.
Summary
We made a handful of features and released 2.0.1-beta. In parallel we wrote a book about it for beginners, and are thinking about a next book for advanced users. We want to defeat our worst enemy (ourselves) and, in the next release, again do a lot, but, for a change, do it quickly. You can help with this in every possible way. Traditionally, we await any feedback and the characteristic flood of detected bugs.
So it goes.