Hyper Estraier - a small search engine for the lazy

Small, because compared with Sphinx its speed is not impressive; for the lazy, because everything about it is very simple.
What caught my attention, despite the modest characteristics?
1. Real-time indexing.
2. Document attributes that can be used in searching and in sorting the results.
3. Ease of use and compact, clear documentation (studying it took a couple of days; a quick skim of the docs was what prompted a closer look at the product).

My impressions of Hyper Estraier:

This search engine is written by FalLabs, one of whose products I tested recently. Its short User's Guide (34 screens on a laptop, two thirds of which describe parameters and configs) mentions some interesting features, and that tempted me to experiment.
Personally, I spent about a day studying the documentation, a few minutes on installation, and half an hour figuring out the simplest mode of operation: indexing three documents and playing with the results.
Another day went into writing a quick-and-dirty program to index a real, production-sized data set and assess performance, and one more day into studying and testing the client-server architecture.
Standard installation with default parameters:
$ ./configure
$ make
$ make install
Required dependencies:
- libiconv - part of glibc;
- zlib - for data compression;
- QDBM - an embedded database from the same FalLabs. It is installed the same way, using the steps above.

Indexing.
To be indexed, a document must be in the "document draft" format, the engine's own format, conceptually close to an HTTP message: header, empty line, text.
The header lists the attributes in the form "@attribute=value", one attribute per line.
The body is plain text. The file is encoded in UTF-8.
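
As an illustration of the layout just described, here is a minimal Python sketch that assembles a document draft. It assumes attribute lines are written as "@name=value" without spaces around the equals sign; the helper names make_draft and write_draft, the attribute names, and the example URL are placeholders chosen for this example, not something taken from the User's Guide.

```python
def make_draft(attributes, text):
    """Build a document draft: '@name=value' header lines, an empty line, then the body."""
    header_lines = []
    for name, value in attributes.items():
        # One attribute per line, so a line break inside a value would corrupt the header.
        if "\n" in str(value):
            raise ValueError(f"attribute {name!r} must fit on one line")
        header_lines.append(f"@{name}={value}")
    # Header, empty line, plain-text body.
    return "\n".join(header_lines) + "\n\n" + text.rstrip("\n") + "\n"


def write_draft(path, attributes, text):
    """Write the draft to disk as UTF-8, the encoding the format expects."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(make_draft(attributes, text))


# Hypothetical forum post; the attribute names and values are illustrative only.
draft = make_draft(
    {"uri": "http://forum.example/post/1", "title": "First post"},
    "Hello, this is the plain text of the post.",
)
print(draft)
```

Keeping the string builder separate from the file writer makes it easy to check the output before any file is handed to the indexer.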

Usage.
1. The simplest option is the command line.
The estcmd utility creates a search index, manages it, and runs searches. With the -vh option the search result is quite readable: it is printed with snippets in a multipart format. The first line is the block separator. The first block is the result header: the total number of matching documents, the search time, the number of documents for each word, and so on. Parsing this output is easy and pleasant.
The package also includes a simple CGI script for this mode, so you can search in a more familiar way, through a browser.
If you want something prettier, parse the output with whatever tool you find convenient.
2. A more complex option is client-server.
It is preferable when several users work with the search index at once. In addition, the first option spends some time opening the database on every call, whereas here that cost is avoided, and caching of recent queries significantly speeds up repeated calls.
Search interfaces for this option:
- a C API (documented with simple examples);
- a web interface for simple searching and database management;
- the estcall command-line utility, which in effect sends the same HTTP requests to the server; its search results look like those described in the previous item.
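
For the command-line variant, parsing the multipart output could look roughly like the Python sketch below. It relies only on the property mentioned above, that the first line of the output is the block separator and the first block is the result header. The function names, the index name passed to estcmd, and the SAMPLE text are invented for illustration; the sample is not real estcmd output, and tolerating a suffix on the closing separator is an assumption.

```python
import subprocess


def run_search(index_dir, phrase):
    """Run estcmd search -vh and return its output as text (requires estcmd on PATH)."""
    result = subprocess.run(
        ["estcmd", "search", "-vh", index_dir, phrase],
        capture_output=True, text=True, encoding="utf-8", check=True,
    )
    return result.stdout


def split_blocks(output):
    """Split multipart output into blocks, treating the first line as the separator."""
    lines = output.splitlines()
    if not lines:
        return []
    separator = lines[0]
    blocks, current = [], []
    for line in lines[1:]:
        # Assumption: the closing separator may carry a suffix, so match by prefix.
        if line.startswith(separator):
            blocks.append(current)
            current = []
        else:
            current.append(line)
    if current:  # output that does not end with a separator line
        blocks.append(current)
    return blocks


# Made-up sample in the shape described in the text, NOT captured from estcmd.
SAMPLE = """\
--------[SEP]--------
HIT\t2
TIME\t0.01
--------[SEP]--------
URI: http://forum.example/post/1
snippet one
--------[SEP]--------
URI: http://forum.example/post/2
snippet two
--------[SEP]--------:END
"""

blocks = split_blocks(SAMPLE)
header, documents = blocks[0], blocks[1:]
print(len(documents), "result blocks; header lines:", header)
```

In real use, the text returned by run_search would be passed to split_blocks instead of SAMPLE.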

Speed of work

Testing took place on the same server as last time: Opteron 2218, 2.6 GHz, 8 GB RAM, SAS HDDs of 73 GB + 143 GB.
This time all the work was done on one of the 143 GB drives.
Source data: 3,224,992 posts from the forums of one project, about 700 MB in total.
Indexing. The data was exported in batches of 5000 files, converted to UTF-8, and indexed.
- the first option, the command line: 6 hours, almost to the minute;
- the second option, with files fed to the server one at a time: about 10.5 hours.
Slow? Yes. Compared with Sphinx it is a turtle. But for the initial build of an index the time is quite acceptable. And how many projects have volumes like that? For ongoing indexing of new documents such speeds are more than enough. I did not find figures for LiveJournal, but the main page of Liveinternet.ru currently (11:47, 03/01/2011) reports "4518 messages in the last hour" for diaries. That is about 75 posts per minute, or a little over one post per second, well below the indexing speed obtained with the second option (about 85 posts per second). Even so, I would not pick this engine for a site of LiRu's scale, but how many sites have traffic like that?
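
The indexing rates follow directly from the figures above. A small Python sketch of the arithmetic, using only the post count and the two durations from the test (the function name is chosen here for illustration):

```python
POSTS = 3_224_992  # forum posts indexed in the test


def posts_per_second(posts, hours):
    """Average indexing throughput over the whole run."""
    return posts / (hours * 3600)


cli_rate = posts_per_second(POSTS, 6.0)      # first option: command line
server_rate = posts_per_second(POSTS, 10.5)  # second option: files fed to the server

print(f"command line: {cli_rate:.0f} posts/s")
print(f"server:       {server_rate:.0f} posts/s")
```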

Disk usage:
- the index built with the first option takes about 5.3 GB on disk;
- the index built with the second option takes about 6.3 GB.
I do not understand why they differ. Perhaps it is related to the server's ability to work with several indexes at once, which it calls "nodes".
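
To put these sizes in proportion to the roughly 700 MB of source text, here is a quick Python calculation. It assumes 1 GB = 1024 MB; the inputs are the rounded figures quoted above, so the ratios are approximate.

```python
SOURCE_MB = 700          # approximate size of the source posts
CLI_INDEX_GB = 5.3       # index built from the command line
SERVER_INDEX_GB = 6.3    # index built through the server

MB_PER_GB = 1024  # assumption: binary units

cli_ratio = CLI_INDEX_GB * MB_PER_GB / SOURCE_MB
server_ratio = SERVER_INDEX_GB * MB_PER_GB / SOURCE_MB
extra = SERVER_INDEX_GB / CLI_INDEX_GB - 1  # how much larger the server index is

print(f"command-line index: {cli_ratio:.1f}x the source size")
print(f"server index:       {server_ratio:.1f}x the source size")
print(f"server index is {extra:.0%} larger")
```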

Search speed.
Unfortunately, I was not able to collect detailed statistics here; I did not bombard the server with queries. I can only share subjective impressions:
1. The first queries against a freshly built index take fairly long, about one and a half seconds.
2. Repeating the same query, or moving between result pages of a query, takes no more than a hundredth of a second.
3. Multi-word queries take longer. For example, a five-word query (I was looking for leftovers of the latest spam run) took about 0.17 seconds per page.
I ran all queries through the web interface of the search server.

Conclusions.

The power of this search engine is enough for most sites, except perhaps large media outlets, LJ-scale blog platforms, and the like.
Installing, configuring, and operating it is quite simple and requires neither advanced skills nor a dedicated specialist.
In fact, Hyper Estraier can index any document from which text can be extracted. The documentation references parser programs for several other formats. It also comes with its own crawler for indexing web pages.

I'm going to run it on a group of forums with traffic of 10-20 thousand posts and comments per day.

P.S. I did not cover all the features of Hyper Estraier. If I understand correctly, it can search several index nodes at once and can also run across multiple machines. So the real "power" of the engine may be much higher than what I was able to achieve. For those who enjoy stress testing, there is work left to do :)

Source: https://habr.com/ru/post/113084/
