Today, in the era of Web 2.0, as sites accumulate more and more content, developers face the challenge of implementing full-text search.
There are a few options:
- use widgets from search engine vendors (Google, Yandex, etc.): easy to integrate, an interface familiar to users, morphology support, spelling correction, and faster indexing of the site by the search engines themselves, but usually limited configuration options and an unavoidable indexing delay;
- use the tools built into the DBMS (for example, the FULLTEXT index in MySQL): fairly easy to implement, an always up-to-date search index, and full control over configuration and appearance, but most often very poor performance on large data sets, no morphology support, or, in the worst case, no such facility in the DBMS at all;
- use a separate library / full-featured search system.
The third option seems to be the best, because it combines the advantages of the other two. It is not without drawbacks, though: the library has to be installed, and sometimes a daemon must be running (for example, with Sphinx), which may be unacceptable.
There are many solutions, each with its own advantages and disadvantages. I would like to look more closely at the relatively little-known Xapian library.
Overview
This open-source (GPL), cross-platform library is written in C++, with bindings for Python, PHP, Ruby, Perl, Java, Tcl, and C#.
Library features:
- full Unicode support;
- Boolean search, ranked search, wildcard search, synonyms, and sorting of results;
- stemming support for 15 languages (including Russian);
- spelling correction for queries (for example, the query xapain is replaced with xapian);
- search for documents similar to a given one;
- out-of-the-box support for indexing documents in various formats (PDF, HTML, RTF, Microsoft Office, OpenDocument, even RPM and Debian packages); filters for new formats are easy to add.
In a sense, the main drawback of Xapian is its bindings for programming languages other than C++. The binding code is generated with SWIG, so its API mirrors the C++ version exactly.
Fortunately, for Python there is Xappy, a simple and effective wrapper that takes care of all the dirty work.
Installation
The first step is to install Xapian itself, its Python binding, and Xappy. Most GNU/Linux distributions already have all the necessary packages in their repositories; for example, on Ubuntu 10.10 you need to install:
sudo apt-get install libxapian15 python-xapian python-xappy
Xappy is also available via easy_install or pip:
sudo pip install xappy
Indexing
Let's try to index something:
import xappy
When a connection is opened for indexing, a new search index database is created (or an existing one is opened): a directory containing a set of files. The database format does not depend on the operating system.
After opening the connection, you must declare the index fields: name, type, and other attributes.
The field type can be:
- INDEX_FREETEXT: the field holds text that only needs to be indexed, without storing the text itself. Additional attributes can be given for this field type, for example language = 'ru' to take the morphology of the language into account and weight = 5 to set the field's weight in ranking;
- INDEX_EXACT: the field holds an exact value (suitable for exact-match lookups, for example by book identifier); the text is stored in the index;
- SORTABLE: results can be sorted by this field. By default, values are compared lexicographically, regardless of what they contain. This can be changed with the type = 'date' attribute to sort dates (in the format YYYYMMDD, YYYY-MM-DD or YYYY/MM/DD) or type = 'float' to sort real numbers (in any format Python can parse);
- COLLAPSE: results can be grouped by this field (analogous to GROUP BY in SQL; for example, to find the single best-matching document in each category);
- STORE_CONTENT: similar to INDEX_FREETEXT, except that the text is also stored in the index.
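The field setup described above might look like the following minimal sketch. It assumes the Xappy 0.5 API (`IndexerConnection`, `add_field_action`, `FieldActions`); the helper name `open_indexer` and the field names `title`, `category` and `date` are hypothetical and chosen only for illustration.

```python
def open_indexer(index_path):
    """Open (or create) a search index and declare its fields.

    A sketch assuming the Xappy 0.5 API; field names are hypothetical.
    """
    import xappy  # imported lazily, so this file loads even without Xappy

    connection = xappy.IndexerConnection(index_path)

    # Free-text field with Russian stemming and a higher ranking weight.
    connection.add_field_action('title', xappy.FieldActions.INDEX_FREETEXT,
                                language='ru', weight=5)
    # Keep the original title so it can be shown in search results.
    connection.add_field_action('title', xappy.FieldActions.STORE_CONTENT)
    # Exact-match field that is also used for grouping results.
    connection.add_field_action('category', xappy.FieldActions.INDEX_EXACT)
    connection.add_field_action('category', xappy.FieldActions.COLLAPSE)
    # Sort by date instead of the default lexicographic order.
    connection.add_field_action('date', xappy.FieldActions.SORTABLE,
                                type='date')
    return connection
```

A field may receive several actions, which is why `title` and `category` each appear twice in the sketch.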
A document is added with code like this:
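A minimal sketch of this step, assuming the Xappy 0.5 API (`UnprocessedDocument`, `Field`, and `IndexerConnection.add`). The helper name `add_article` and the fields `title`, `category` and `date` are hypothetical; `connection` is expected to be an already opened indexing connection.

```python
def add_article(connection, title, category, date):
    """Index one document and return the identifier assigned to it.

    A sketch assuming the Xappy 0.5 API; field names are hypothetical.
    """
    import xappy  # imported lazily, so this file loads even without Xappy

    doc = xappy.UnprocessedDocument()
    # Each Field pairs a declared field name with a value for this document.
    doc.fields.append(xappy.Field('title', title))
    doc.fields.append(xappy.Field('category', category))
    doc.fields.append(xappy.Field('date', date))
    # No id was set on the document, so one is generated on add.
    return connection.add(doc)
```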
Each document must have a unique identifier. By default it is assigned automatically, but you can specify your own:
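Setting your own identifier could be sketched as follows, again assuming the Xappy 0.5 API, where an `UnprocessedDocument` has an `id` attribute. The helper name `add_article_with_id`, the `title` field and the identifier format are hypothetical.

```python
def add_article_with_id(connection, doc_id, title):
    """Index one document under a caller-supplied identifier.

    A sketch assuming the Xappy 0.5 API; names are hypothetical.
    """
    import xappy  # imported lazily, so this file loads even without Xappy

    doc = xappy.UnprocessedDocument()
    # Setting the id before adding replaces the automatically generated one.
    doc.id = doc_id
    doc.fields.append(xappy.Field('title', title))
    return connection.add(doc)
```

Using the primary key from the main database as the identifier makes it straightforward to map a search hit back to its record.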
After adding documents, you need to write all changes to disk and close the connection:
connection.flush()
connection.close()
That's it, the index is created!
Search
To search an existing index, you need to open a search connection to the index database:
import xappy
New documents may have been indexed after the search connection was opened. In that case, you need to reopen the connection to see the current state of the database:
connection.reopen()
There are several methods for building a search query (see the SearchConnection class); the simplest is query_parse:
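A minimal sketch of a search, assuming the Xappy 0.5 API (`SearchConnection`, `query_parse`, and `search` with a start and end rank). The helper names `open_searcher` and `run_query` and the default limit of 10 are hypothetical.

```python
def open_searcher(index_path):
    """Open an existing index for searching (sketch, assumes Xappy 0.5)."""
    import xappy  # imported lazily, so this file loads even without Xappy

    return xappy.SearchConnection(index_path)


def run_query(connection, text, limit=10):
    """Parse a user-entered query string and return the top `limit` hits."""
    # query_parse turns the raw query string into a query object.
    query = connection.query_parse(text)
    # Fetch results ranked from 0 up to (but not including) `limit`.
    return connection.search(query, 0, limit)
```

The connection is left open on purpose: the returned results are read from it afterwards, so it should be closed only once they have been processed.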
For fields of type STORE_CONTENT or INDEX_EXACT, the stored contents can be output directly, which makes it possible, for example, to skip fetching the matching records from the main database by ID and rely on the search index alone:
for results_item in results:
    print(results_item.data['title'])
Related Links
Of course, this is not all that Xapian can do. These and other features are covered in more detail in the Xappy 0.5 documentation; you can also consult the official Xapian documentation, and some additional material is available in this Xapian blog (in English).