By popular demand from the working masses, I will tell you about two new mega-features in Sphinx. Suggestions for future stories are welcome in the comments.
Both features were added in version 0.9.9-rc2, published in early April 2009. The version is traditionally (all too) stable: there are no known serious bugs, the tests pass, and so on. Brave souls, as well as commercial customers with support contracts, are already using it successfully in production, despite the tag saying something other than "release".
1st mega-feature. Sphinx now supports the MySQL network protocol (internal version number 10, which is used by all server and client versions from MySQL 4.1 and 5.x onward).
What does this mean, translated back into plain language? Essentially, a third method of accessing searchd has appeared, and a particularly simple and accessible one at that. Previously, search queries could be made either through the native PHP / Perl / Java / Ruby / ... APIs, or through SphinxSE. Now, in addition, they can be made through any MySQL client, including the familiar command-line client, PHP's mysql_connect() and mysql_query(), Perl DBI, and so on. And it is compatible with all clients from MySQL 4.1 up to the present.
That is, the API and SphinxSE become optional; a lot (eventually, everything) can be done with what look like ordinary MySQL queries. In addition, support for persistent connections appears automatically. This matters when the average query is served very fast, but there are MANY such queries. The overhead of a network connection plus fork() typically falls in the 0.001 to 0.01 second range, depending on index sizes, the operating system, and so on. At 1M queries per day and 0.001 seconds of overhead per query, that is an extra 1000 CPU-seconds per day, which hardly matters either way. At 50M queries and 0.01 sec/query, it is already about 5 days of CPU time... something worth fighting for.
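The arithmetic above is easy to check (a throwaway sketch; 86400 is just the number of seconds in a day):

```python
# Connection overhead cost at the two traffic levels from the text.
light = 1_000_000 * 0.001    # 1M queries/day at 1 ms overhead each
heavy = 50_000_000 * 0.01    # 50M queries/day at 10 ms overhead each

print(light)                 # 1000.0 CPU-seconds per day
print(heavy / 86400)         # roughly 5.8 CPU-days per day of traffic
```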
A couple of important points about network connections. First, all connections, regardless of protocol, are capped from above by the max_children limit. Second, in the current version, connections over the MySQL protocol are considered interactive, and their timeout is automatically raised to 900 seconds (instead of the 1 second used by the native Sphinx protocol). Accordingly, be careful with the limits, or you can accidentally strangle searchd with idle connections.
Enough theory, on to practice. How do you configure and try it? You need to add literally one line to sphinx.conf, specifying the interface address, the port, and the protocol name:
listen = localhost:3307:mysql41
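For context, a sketch of how this might sit inside the searchd section of sphinx.conf, keeping the native protocol available alongside (the 9312 port here is just a conventional choice for the native API; adjust to taste):

```
searchd
{
    # native Sphinx API protocol, for the PHP/Perl/... clients
    listen = localhost:9312

    # MySQL wire protocol, for any MySQL client
    listen = localhost:3307:mysql41
}
```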
After that, restart searchd, and hooray, you can connect to it with the familiar client. Note the -h 127.0.0.1 switch; on Unix it will not fly without it: by default libmysqlclient connects to a UNIX socket rather than a TCP port, so -P 3307 alone is not enough. The easiest way to check that you have reached Sphinx is the Server version field.
$ searchd --stop
...
$ searchd
...
$ mysql -h 127.0.0.1 -P 3307
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 0.9.9-rc2 (r1785)
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql>
The MySQL client sends query text to the server without any checks or changes, so technically it was possible to invent some brand-new and absolutely unique query language. However, humanity invented SQL long ago, and many people even know it a little. Therefore, instead of yet another bicycle with square wheels, what got bolted on was...
2nd mega-feature. When working over the MySQL protocol, Sphinx supports regular SQL syntax. With a bunch of restrictions, of course; full SQL'92 support is out of the question for now. Still, the subset of SQL already supported lets you do the most important thing: write almost any search query. Support for the rest of searchd's functionality (updating attributes, building snippets, etc.) will be added gradually as the project evolves.
All searches are done with the SELECT statement, and almost all the familiar SELECT clauses are supported. You can evaluate arbitrary expressions; WHERE, GROUP BY, ORDER BY, LIMIT, and so on all work. In fact, Sphinx has been able to do all of that for the last two or three years; it is just that now the query can also be written in the usual way. In addition, a number of Sphinx-specific syntax extensions are supported. Here is an example:
SELECT *, @weight+userkarma*1000 AS myweight
FROM mainindex, deltaindex
WHERE MATCH('@title hello @content world')
  AND userid NOT IN (123,456,98,76,54)
  AND hidden!=0
  AND postkarma>=5
GROUP BY userid
WITHIN GROUP ORDER BY myweight DESC
ORDER BY myweight DESC, postdate ASC
LIMIT 100,20
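To issue queries like this from application code, you talk to searchd exactly as you would to a MySQL server. Here is a hedged Python sketch: build_sphinxql() is a hypothetical string-assembly helper of my own, not part of any Sphinx API, shown mostly to underline the clause order this dialect expects:

```python
def build_sphinxql(select, indexes, match, filters=(),
                   group_by=None, within_order=None,
                   order_by=None, limit=(0, 20)):
    """Assemble a SphinxQL SELECT in the clause order Sphinx expects.
    Filters are AND-ed together, since OR is not supported in WHERE."""
    where = ["MATCH('%s')" % match] + list(filters)
    parts = ["SELECT " + select,
             "FROM " + ", ".join(indexes),
             "WHERE " + " AND ".join(where)]
    if group_by:
        parts.append("GROUP BY " + group_by)
        if within_order:
            parts.append("WITHIN GROUP ORDER BY " + within_order)
    if order_by:
        parts.append("ORDER BY " + order_by)
    parts.append("LIMIT %d,%d" % limit)
    return " ".join(parts)

q = build_sphinxql(
    select="*, @weight+userkarma*1000 AS myweight",
    indexes=["mainindex", "deltaindex"],
    match="@title hello @content world",
    filters=["userid NOT IN (123,456,98,76,54)",
             "hidden!=0", "postkarma>=5"],
    group_by="userid", within_order="myweight DESC",
    order_by="myweight DESC, postdate ASC", limit=(100, 20))
```

The resulting string can then be fed through mysql_query(), Perl DBI, or any other MySQL client, exactly as with a regular MySQL server.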
The syntax is somewhat stricter than regular SQL, and not without rough edges.
- Expressions can only be listed immediately after SELECT.
- Every non-trivial expression must be given an explicit alias with AS.
- COUNT(*) and COUNT(DISTINCT col), on the contrary, cannot be aliased; in expressions you have to refer to them by the magic names count and distinct, respectively (the same goes for id and weight).
- In the WHERE / GROUP BY / ORDER BY clauses you cannot write expressions; you can only refer to existing columns or aliased expressions.
- For ORDER BY, an explicit direction (ASC or DESC) must be specified.
- There are a number of restrictions on WHERE conditions, since they are translated directly into filters; the most noticeable is that OR is not supported, only AND.
Still, it is perfectly usable as is. All of the above will be corrected gradually, bringing the syntax closer and closer to SQL.
There are also deliberately designed differences in syntax.
- Listing indexes separated by commas means Sphinx-style querying of several indexes at once, not an SQL-style JOIN.
- The special function MATCH() passes the full-text query and can occur at most once. If it is absent, the so-called full scan mode kicks in, which iterates over all entries in the index, applying WHERE / GROUP BY / ORDER BY as usual.
- There is always an implicit LIMIT; the default is LIMIT 0,20.
- The WITHIN GROUP ORDER BY extension has been added, which lets you control which "best" row is picked within each group when using GROUP BY.
In addition to the full-blown SELECT, a number of simpler SQL statements are also supported.
There is a SHOW WARNINGS statement that shows the warnings generated by the previous query. An error message is returned right away and is available through mysql_error(); warnings, however, have to be fetched with a separate query. That separate query can and should be issued optionally, after checking the result of mysql_warning_count() or mysql_info().
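A minimal Python sketch of that flow (fetch_warnings() is a hypothetical helper of mine, and cursor stands for any DB-API-style cursor from a MySQL client library; the stub exists only so the helper can run without a live searchd):

```python
def fetch_warnings(cursor, warning_count):
    """Issue SHOW WARNINGS as a separate query, but only when the
    client has reported a non-zero warning count for the last query."""
    if not warning_count:
        return []
    cursor.execute("SHOW WARNINGS")
    return cursor.fetchall()   # rows of (level, code, message)

# Stand-in cursor, just to exercise the helper offline.
class StubCursor:
    def execute(self, query):
        self.last_query = query
    def fetchall(self):
        return [("warning", 1000,
                 "index dist1: agent localhost:3313: connect() timed out")]

rows = fetch_warnings(StubCursor(), warning_count=1)
```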
mysql> select * from dist1;
+------+--------+----------+------------+
| id   | weight | group_id | date_added |
+------+--------+----------+------------+
|    1 |      1 |        1 | 1231721236 |
|    2 |      1 |        1 | 1231721236 |
|    3 |      1 |        2 | 1231721236 |
|    4 |      1 |        2 | 1231721236 |
+------+--------+----------+------------+
4 rows in set, 1 warning (1.13 sec)
mysql> show warnings;
+---------+------+--------------------------------------------------------+
| Level   | Code | Message                                                |
+---------+------+--------------------------------------------------------+
| warning | 1000 | index dist1: agent localhost:3313: connect() timed out |
+---------+------+--------------------------------------------------------+
1 row in set (0.00 sec)
And this, by the way, is what an error looks like.
mysql> select * from test;
ERROR 1064 (42000): unknown local index 'test' in search request
There is a SHOW STATUS statement, which shows assorted statistics. The counters, of course, differ from MySQL's, but the output format is the same. LIKE is not supported yet.
mysql> show status;
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| uptime             | 1018  |
| connections        | 6     |
| maxed_out          | 0     |
| command_search     | 0     |
| command_excerpt    | 0     |
| command_update     | 0     |
| command_keywords   | 0     |
| command_persist    | 0     |
| command_status     | 0     |
| agent_connect      | 1     |
| agent_retry        | 0     |
| queries            | 1     |
| dist_queries       | 1     |
| query_wall         | 1.123 |
| query_cpu          | OFF   |
| dist_wall          | 1.123 |
| dist_local         | 0.100 |
| dist_wait          | 1.006 |
| query_reads        | OFF   |
| query_readkb       | OFF   |
| query_readtime     | OFF   |
| avg_query_wall     | 1.123 |
| avg_query_cpu      | OFF   |
| avg_dist_wall      | 1.123 |
| avg_dist_local     | 0.100 |
| avg_dist_wait      | 1.006 |
| avg_query_reads    | OFF   |
| avg_query_readkb   | OFF   |
| avg_query_readtime | OFF   |
+--------------------+-------+
29 rows in set (0.00 sec)
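If you script against SHOW STATUS, note that a MySQL client hands you the Value column as plain strings. A hedged sketch of coercing them (parse_status() is a hypothetical helper, not anything Sphinx ships):

```python
def parse_status(rows):
    """Turn SHOW STATUS (name, value) string pairs into a dict,
    coercing numeric-looking values and keeping others as strings."""
    out = {}
    for name, value in rows:
        try:
            out[name] = float(value) if "." in value else int(value)
        except ValueError:
            out[name] = value    # e.g. 'OFF' when the counter is disabled
    return out

status = parse_status([("uptime", "1018"),
                       ("query_wall", "1.123"),
                       ("query_cpu", "OFF")])
```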
Some counters in the example above show OFF. That is because searchd was started without the special --iostats and --cpustats switches. They are off by default, since they theoretically add overhead (they definitely add extra calls to gettimeofday() and clock_gettime(), respectively). How big that overhead turns out to be in practice, nobody has measured yet. A great opportunity to be a pioneer, I think!
There are already a few known issues, though none of them seem serious so far. The most amusing one, in my opinion: some frameworks reportedly insist on sending all sorts of unnecessary queries like SET NAMES on connect, and cannot be talked out of it. Those queries, of course, fail; and the frameworks fail right after them. Well, once hands reach it, a stub will be added.
Meanwhile, migrating certain kinds of SQL queries from MySQL to Sphinx has become especially simple. On queries that shovel through all the available data, Sphinx comes out up to 1.5-3x faster (benchmark one, benchmark two). Previously they had to be rewritten into API calls; now that is no longer necessary.
And that is while using just one core; you can use two.