We have tested, built binary packages and published Sphinx 2.0.2-beta (an open-source search engine used on a bunch of websites). We plan to release Sphinx 2.0.3-release in mid-December (a revolutionary change!), and we are also busy preparing a (free) Sphinx user meetup on December 4 in St. Petersburg. Please register for the meetup via the link above and submit a cool talk through our contact form. Details on those ~30 new features, and on the plans, deadlines and release cycle for the upcoming versions, are below the cut.
About features
Clearly, most of those 30 features are small conveniences: a new flag that lets you replay logs when the server clock has jumped for some reason, support for snippets whose source files are spread across the cluster (rather than copied to every machine), support for 256 text fields (up from 32), and so on. But once again there are a few items that we consider relatively large and important overall.
Added support for MVA64 attributes. A “classic” MVA is a set of 32-bit unsigned values; the new MVA64 is, accordingly, a set of 64-bit signed values. Two obvious applications come to mind right away: a) making collisions practically impossible if the MVA used to store CRC32 hashes of strings, b) storing any additional data there. I am sure you will find plenty of less obvious and more interesting uses. MVA64 is supported in both disk and RT indexes.
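To put a rough number on application (a), here is a minimal Python sketch. It is only an illustration under stated assumptions: the helper names are hypothetical, BLAKE2b truncated to 8 bytes is just one arbitrary choice of 64-bit hash, and the collision estimate is the standard birthday approximation, not anything computed by Sphinx itself.

```python
import hashlib
import math
import zlib


def crc32_u32(text):
    """32-bit unsigned CRC32 of a string: the kind of value a classic MVA holds."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


def hash_i64(text):
    """64-bit signed hash of a string, sized to fit an MVA64 value.

    Assumption: BLAKE2b with an 8-byte digest; any decent 64-bit hash would do.
    """
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, byteorder="big", signed=True)


def collision_probability(n_values, bits):
    """Birthday approximation of the chance that n random hashes collide."""
    pairs = n_values * (n_values - 1) / 2
    return -math.expm1(-pairs / 2 ** bits)


tags = ["sphinx", "search", "realtime"]
mva32 = sorted(crc32_u32(t) for t in tags)  # values for a classic MVA
mva64 = sorted(hash_i64(t) for t in tags)   # values for an MVA64

p32 = collision_probability(100_000, 32)
p64 = collision_probability(100_000, 64)
```

With 100,000 distinct strings, the estimate for 32-bit hashes comes out above one half, while for 64-bit hashes it is vanishingly small.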
By the way, MVA attributes are now also supported in RT indexes, as is index_exact_words. In general, everything that was previously missing from RT is being added step by step.
Added support for dict = keywords in RT indexes. This means that RT indexes can now search by keyword prefix (word*). The min_prefix_len and min_infix_len directives, which already exist for disk indexes and pre-index every possible substring, were deliberately not implemented for RT. They hit indexing hard everywhere, but with disk indexes the extra cost lands on the (relatively large) disk, whereas with RT it lands on precious memory, which is always in short supply. Inflating disk requirements several times over for the sake of substring search is something I could somehow live with; doing the same to memory requirements is not. Now, with dict = keywords, we get substring search (by word prefix) and the memory stays intact.
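The trade-off described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not how Sphinx actually lays out its dictionaries: `expand_prefixes` and `prefix_lookup` are hypothetical helpers, one mimicking "store every prefix up front" and the other mimicking "keep whole keywords sorted and scan the matching range at query time".

```python
from bisect import bisect_left


def expand_prefixes(word, min_prefix_len):
    """Every prefix of `word` that a pre-expanded prefix index would store."""
    return [word[:n] for n in range(min_prefix_len, len(word) + 1)]


def prefix_lookup(sorted_keywords, prefix):
    """Find keywords starting with `prefix` in a sorted list of whole keywords."""
    start = bisect_left(sorted_keywords, prefix)
    matches = []
    for keyword in sorted_keywords[start:]:
        if not keyword.startswith(prefix):
            break  # sorted order: nothing further can match
        matches.append(keyword)
    return matches


keywords = sorted({"search", "searching", "sphinx", "snippet", "server"})

# Approach 1: pre-index all prefixes of length >= 3 (more entries to store).
stored_with_expansion = {p for w in keywords for p in expand_prefixes(w, 3)}

# Approach 2: store only the keywords and resolve word* at query time.
hits = prefix_lookup(keywords, "sea")
```

In this toy example the pre-expanded set holds 20 entries for 5 keywords, while the sorted keyword list holds just the 5 keywords and still answers the prefix query.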
Another interesting new thing is ATTACH INDEX. It lets you take a disk index full of data, define a new empty RT index, and convert the disk index into the RT one. After that, the data disappears from the disk index but appears in RT, and you can then work with the RT index as usual. This is quite handy for a fast initial import of large amounts of data, or for quickly restoring an RT index if something suddenly happens to it (knock on wood): rebuilding a disk index in one go is clearly much faster than inserting records into RT one at a time, or even in small batches. Physically, the operation boils down to renaming files, so it is very fast. Strictly speaking, the functionality implemented so far (a one-time conversion) would be more accurately called CONVERT. But we plan to develop this further and make it possible to import large data sets into a non-empty RT index too. So we staked out the ATTACH keyword right away, for the future.
The UPDATE statement now supports fuller conditions in the WHERE clause. You can now run queries like UPDATE myindex SET deleted = 0 WHERE MATCH ('test'), or, say, ... WHERE vendor = 123. In other words, updating a thousand records by condition just became a thousand times easier. Like the previously existing update of column values by ID, the new UPDATE works in both RT and disk indexes.
And finally, the last “big” feature on the list is the ability to write your own relevance formulas and set them on the fly (expression based ranker). In previous versions, the relevance available through WEIGHT() essentially came down to choosing one of several pre-written rankers (PROXIMITY_BM25, SPH04, etc.). Of course, WEIGHT() could then be plugged into an expression and combined with other document attributes. But it was impossible to influence how WEIGHT() itself was computed, or to otherwise combine ranking factors that are computed per field rather than for the whole document. And there were not that many factors to begin with.
Now you can. The ranking formula can be set per individual query if you like. In addition, considerably more ranking factors are available. All the built-in rankers can be emulated with the new “scriptable” ranker. There are examples in the documentation; here is one:
$client->SetRankingMode(SPH_RANK_EXPR, "sum(lcs*user_weight)*1000+bm25");
Surprisingly, it runs much faster than I expected. I expected slowdowns of up to several times; in practice I see a slowdown of 1.1x to 1.3x on a small test collection of 1,000,000 blog posts, and that is compared with compiled C++ code which, moreover, computes far fewer factors. Pretty good, I think.
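To make the formula from the example above more concrete, here is a rough Python sketch of what it evaluates for a single matched document. This is only an illustration with made-up numbers, not the engine's implementation: the function name and the factor values are hypothetical, `lcs` and `user_weight` are treated as per-field factors (the latter being the weight assigned to the field by the user), and `bm25` as a per-document factor.

```python
def expr_rank(fields, bm25):
    """Evaluate sum(lcs*user_weight)*1000+bm25 for one matched document.

    `fields` is a list of (lcs, user_weight) pairs, one per full-text field;
    `bm25` is a single per-document value.
    """
    return sum(lcs * user_weight for lcs, user_weight in fields) * 1000 + bm25


# Hypothetical factor values for a document with two fields (title, content).
doc_fields = [
    (2, 10),  # title: lcs = 2, user_weight = 10
    (1, 1),   # content: lcs = 1, user_weight = 1
]
weight = expr_rank(doc_fields, bm25=612)  # (2*10 + 1*1) * 1000 + 612
```

The sum() aggregates the per-field product over all fields, which is exactly the kind of per-field combination that was not possible with the fixed rankers.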
About development plans
The 2.0.x branch is now frozen: no new features will go into it, only bug fixes and regular releases containing those fixes. The next release is scheduled in 1 month; after that, releases will come either on a schedule again, at intervals of 1-2 months (if enough fixes accumulate), or simply as fixes accumulate.
From now on, all new features will go into trunk; the next version is 2.1.1. No release date has been set for it yet. But a number of features are already in active development, so we can tease them right now. Substring search (*word*), and not only prefix search (word*), using dict = keywords is already in the works. We may (just may) add wildcard support while we are at it. We are working on an interesting improvement for clusters with a bunch of agents, so that requests are sent to them in parallel (currently this is still serialized). Plus, secret work is underway to fix a well-known library and improve support for Russian morphology.
About releases
Features aside, we have once again shaken up our internal processes for testing, building and rolling out releases. The shake-up seems to have worked, so the next version, 2.0.3-release, will not come out as usual, “when it is ready”, but on schedule, in 1 month, in mid-December 2011. If your boss won't let you install a version without that tag, there will be one in a month.
You can also tell him that the current tag is, in fact, less a beta and more of an rc. That is, there were no known major or serious bugs in 2.0.2-beta at the time of release. Test coverage of the previously existing functionality has, as usual, only grown, so for “just searching” it should be more stable than before. So in principle it could have been called a Release Candidate, but I decided not to complicate the set of tags.
We added some new features again, and our policy is that in this case the release tag is held back until, in addition to our internal testing, the version has been tested by live people from the community. So grab the new version, try it out, and be sure to email us about bugs if you happen to run into any.
In the morning in the bug tracker, in the evening in the trunk!
About the conference
All of this in more detail, both what is new and how to use the old properly, plus, I hope, plenty about the near future, is something you can not only read about in rare blog posts but also hear live at the user conference. We are holding it for the second time, still for free (the first time taught me nothing!!!), but this time, for a change, it is not in Moscow but in St. Petersburg, on Sunday, December 4th. We kindly ask attendees to register as early as possible, and we ask speakers not to be shy about writing to us with proposals for talks and/or lightning talks.
Cheers, everyone. Until the next releases and, I hope, meeting in person at the conference.