Google, of course, looks good, but corporate resources cannot be opened up for public access, and buying a Google Mini with its limitations is not an option either. Yet search over a database of impressive size (4 gigabytes of text) is still needed. And if, on top of full-text search, you also need to search by some parameters, Google Mini will not help at all and things get downright scary.
But do not panic! Sphinx comes to the rescue: an open source search engine that can be bolted onto almost anything without much effort.
The beauty of Sphinx is that it does not index a whole site by recursively following links and pulling in information (sometimes completely unnecessary); instead it indexes the database directly, in a convenient format.
Sphinx consists of three parts: the indexer, the search utility for searching from the command line, and the searchd daemon, to which we will send search queries. The distribution also includes a set of examples of using Sphinx from different languages.
For starters, we should probably install Sphinx.
- Download the sources from here
- Unpack (tar xzf sphinx-0.9.8.tar.gz)
- Configure (./configure --prefix=/path/to/sphinx)
- Build and install (make install)
Now a little about setup and usage.
We will explore the search engine with a simple and clear example: a possible simplified structure of Habrahabr =)
Let a topic have the following structure:
Field      Type
-----------------------
id         int(11)
blog_id    int(11)
subject    varchar(255)
content    longtext
And let's add tags to the topics; where would we be without them, this is Web 2.0 after all:
Field      Type
-------------------
topic_id   int(11)
tag_id     int(11)
The hardest search task we have to solve, and one that Sphinx copes with successfully, is searching the texts of topics that carry certain tags within a particular blog.
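To make the target query concrete, here is a minimal sketch of what "search topic texts with a given tag in a given blog" means in plain SQL. It assumes an in-memory SQLite database standing in for MySQL, the table names Topics and tags used in the config further on, and a few made-up rows. The LIKE scan is exactly the slow approach that Sphinx is meant to replace, and it does no morphology at all.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database from the article.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Topics (id INTEGER PRIMARY KEY, blog_id INTEGER,
                         subject TEXT, content TEXT);
    CREATE TABLE tags (topic_id INTEGER, tag_id INTEGER);
""")

# Hypothetical sample data, invented for illustration only.
conn.executemany("INSERT INTO Topics VALUES (?, ?, ?, ?)", [
    (1, 10, "Sphinx intro", "full-text search with sphinx"),
    (2, 10, "Cooking", "nothing about search engines here"),
    (3, 20, "Sphinx tips", "more sphinx tricks"),
])
conn.executemany("INSERT INTO tags VALUES (?, ?)",
                 [(1, 42), (2, 42), (3, 42), (3, 7)])


def naive_search(conn, text, tag_id, blog_id):
    """Return ids of topics in blog_id, tagged tag_id, whose content contains text."""
    rows = conn.execute(
        "SELECT t.id FROM Topics t "
        "JOIN tags g ON g.topic_id = t.id "
        "WHERE t.content LIKE ? AND g.tag_id = ? AND t.blog_id = ? "
        "ORDER BY t.id",
        (f"%{text}%", tag_id, blog_id),
    )
    return [row[0] for row in rows]


print(naive_search(conn, "sphinx", 42, 10))
```

On 4 gigabytes of text a query like this has to scan every row, which is why the rest of the article moves both the text and the tag/blog conditions into a Sphinx index.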
Now let's actually configure Sphinx for our database.
Open the config (PREFIX/etc/sphinx.conf). We will need to configure two parts:
- Sources
- Indexing options
A source, as the name suggests, is where Sphinx takes the data for indexing from. Sources come in two types: a DBMS (if I am not mistaken, only MySQL and PostgreSQL are supported now) or XML data fed in through xmlpipe. We work with MySQL.
Define the source:
source main
{
    type = mysql
    sql_host = localhost
    sql_user = root
    sql_pass =
    sql_db = test
    sql_port = 3306
    # up to this point, I think everything is clear
    # now we need to give Sphinx the SQL query with which
    # it will pull information from our database
    sql_query = \
        SELECT id, subject, content FROM Topics
    # and the query used to fetch a single topic by its id
    sql_query_info = SELECT * FROM Topics WHERE id = $id
}
We have a source; now let's create an index. There can be many indexes, but for now we only need one.
index main
{
    source = main # use the source main
    path = /var/lib/sphinx/main # the place where the index will be stored
    docinfo = extern # we will need this later
    morphology = stem_ru # enable morphology so that the query "habrahabr"
    # also returns "habrahabra" and other inflected forms
    min_word_len = 1 # minimum word length; what if we want to search for prepositions?
    charset_type = utf-8 # this one is obvious: the encoding
    html_strip = 0 # whether to strip HTML.
    # The documentation says that it works correctly only on
    # perfect HTML, but to be honest, I have never tried it
}
The settings of the indexer and the search daemon can be left at their defaults.
Run the indexer:
bin/indexer --all
The --all option means that all indexes need to be reindexed.
The index is built, so we can try searching. Let's use the search utility:
bin/search habrahabr
And we get a list of all documents that mention habrahabr.
The search engine is ready! Now let's solve the problem mentioned earlier: searching topics with specific tags in a particular blog.
To do this, Sphinx needs to know which tags belong to which topics and which blog each topic is in.
You can add any number of attributes to each document. We have two of them: a tag and a blog. Let's tell Sphinx about them by adding the following lines to the source config:
# tags
sql_attr_multi = \
    uint tag from query; \
    SELECT topic_id, tag_id FROM tags
# blogs
sql_attr_multi = \
    uint blog from query; \
    SELECT id, blog_id FROM Topics
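As a rough mental model of what a multi-valued attribute is, the rows returned by such a query (document id first, value second) can be thought of as being grouped into a set of values per document. The sketch below only illustrates that idea; it is not how Sphinx stores attributes internally, and the function name and sample rows are made up.

```python
from collections import defaultdict


def build_mva(rows):
    """Group (document_id, value) rows into a set of values per document."""
    mva = defaultdict(set)
    for doc_id, value in rows:
        mva[doc_id].add(value)
    return dict(mva)


# Hypothetical result of "SELECT topic_id, tag_id FROM tags".
tag_rows = [(1, 42), (1, 7), (2, 42), (3, 5)]
tags_by_topic = build_mva(tag_rows)
print(tags_by_topic)
```

Topic 1 ends up with two tag values and topics 2 and 3 with one each, which is what allows a later filter on a single tag id to match a topic that carries several tags.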
After that, we reindex the database and use the search utility to check the result:
bin/search -f tag 42 search words
Sphinx will give us a list of topics marked with tag number 42 and containing “search words”.
Unfortunately, the search utility cannot restrict a search by several attributes at once, but this is possible through the API (if there is interest, I will describe the API separately).
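Until then, here is a small pure-Python model of what filtering by several attributes amounts to. It assumes the semantics I understand the API filters to have (SetFilter in the bundled client libraries, as far as I recall): a document must pass every filter, and a multi-valued attribute passes a filter if any of its values is in the accepted set. This is not the Sphinx API itself, and the documents and attribute values are hypothetical.

```python
def matches(doc_attrs, filters):
    """Return True if the document passes every filter.

    doc_attrs: attribute name -> set of integer values (multi-valued attribute)
    filters:   attribute name -> set of accepted values
    """
    return all(
        doc_attrs.get(attr, set()) & accepted  # non-empty intersection passes
        for attr, accepted in filters.items()
    )


# Hypothetical documents that already matched the full-text query.
docs = {
    1: {"tag": {42, 7}, "blog": {10}},
    2: {"tag": {42}, "blog": {20}},
    3: {"tag": {5}, "blog": {10}},
}

# Tag 42 AND blog 10: the combination the search utility cannot express.
filters = {"tag": {42}, "blog": {10}}
hits = sorted(doc_id for doc_id, attrs in docs.items() if matches(attrs, filters))
print(hits)
```

Only topic 1 carries tag 42 and sits in blog 10, so it is the only one that survives both filters.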
I would like to add that Sphinx has a rather powerful query "language". In extended mode, you can specify which fields to match against, use expressions with parentheses, sort and group search results, define your own sorting function, and much more.
Full documentation in English is here.
Well, that's all.
P.S. This is my first topic. Was the debut a success? =)