
Indexing and searching with Xapian in .NET

If Xapian is unfamiliar to you, I recommend reading a short introductory article.
In short, Xapian is a set of tools written in C++ for indexing textual information and searching the resulting database. It does not require an installed server; its libraries are enough. It can handle huge amounts of information (tested up to 1.5 TB), measured in millions of documents. It competes with Sphinx and Apache Lucene.
Of these three products, I chose Xapian because it can be used from .NET.



First, download the Xapian DLLs for .NET.
Then download one more helper DLL, zlib1.dll. Without it, exceptions are thrown when you try to access the wrapper DLL compiled in C++.
That is all you need to get started. Create a project and add XapianCSharp.dll to References right away. Add _XapianSharp.dll and zlib1.dll to the project as content, and set Copy to Output Directory to Copy always for both.

We write two functions to test it:
    ....
    using System;
    using System.IO;
    using System.Text;
    using Xapian;
    ....
    // Path to the Xapian database
    string xapianBase = "H:\\XapianDB\\xap.db";
    ....

    // Index every file in the given folder
    private void IndexFolder(string path)
    {
        try
        {
            if (Directory.Exists(path))
            {
                string[] files = Directory.GetFiles(path);
                using (WritableDatabase database = new WritableDatabase(xapianBase, Xapian.Xapian.DB_CREATE_OR_OPEN))
                using (TermGenerator indexer = new TermGenerator())
                using (Stem stemmer = new Stem("russian"))
                {
                    indexer.SetStemmer(stemmer);
                    foreach (string file in files)
                    {
                        using (Document doc = new Document())
                        {
                            // Store the file path as document data so search results can show it
                            doc.SetData(file);
                            indexer.SetDocument(doc);
                            // Index the file contents (read as Windows-1251)
                            indexer.IndexText(File.ReadAllText(file, Encoding.GetEncoding(1251)));
                            // Add the document to the database
                            database.AddDocument(doc);
                        }
                    }
                }
            }
        }
        catch (Exception ex)
        {
            Write("Exception: " + ex.ToString());
        }
    }

    // Search the index for the given query string
    private void Search(string searchText)
    {
        try
        {
            // Open the database for reading
            using (Database database = new Database(xapianBase))
            using (Enquire enquire = new Enquire(database))
            using (QueryParser qp = new QueryParser())
            using (Stem stemmer = new Stem("russian"))
            {
                Write(stemmer.GetDescription());
                qp.SetStemmer(stemmer);
                qp.SetDatabase(database);
                qp.SetStemmingStrategy(QueryParser.stem_strategy.STEM_SOME);
                using (Query query = qp.ParseQuery(searchText))
                {
                    Write("Parsed query is: " + query.GetDescription());
                    enquire.SetQuery(query);
                    // Request the first 100 matches
                    MSet matches = enquire.GetMSet(0, 100);
                    Write(String.Format("{0} results found.", matches.GetMatchesEstimated()));
                    Write(String.Format("Matches 1-{0}:", matches.Size()));
                    // Iterate over the matches and print each one
                    MSetIterator m = matches.Begin();
                    while (m != matches.End())
                    {
                        Write(String.Format("{0}: {1}% docid={2} [{3}]\n",
                            m.GetRank() + 1, m.GetPercent(), m.GetDocId(), m.GetDocument().GetData()));
                        ++m;
                    }
                }
            }
        }
        catch (Exception ex)
        {
            Write("Exception: " + ex.ToString());
        }
    }


Implement the Write function however you like.
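The article does not show Write, so here is one minimal way it could look. The Log class and its Output field are hypothetical names, not part of the original project. This is a sketch that assumes plain text output is enough; in a WinForms application you would more likely append the message to a text box.

```csharp
using System;
using System.IO;

// Hypothetical helper: the article leaves Write unspecified.
public static class Log
{
    // Destination for messages; the console by default, but any TextWriter
    // (a file, a StringWriter) can be assigned instead.
    public static TextWriter Output = Console.Out;

    // Writes one message followed by a line break.
    public static void Write(string message)
    {
        Output.WriteLine(message);
    }
}

public static class LogDemo
{
    public static void Main()
    {
        Log.Write("Parsed query is: example");
    }
}
```

Inside the class that holds IndexFolder and Search, a one-line private Write method that forwards to Log.Write would keep the calls in the listing above unchanged.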

Now create a directory with text files, or use an existing one, call IndexFolder(directory_name), and wait for the files to be indexed. After that you can call Search, passing a string of search keywords separated by spaces.

Testing.

Hardware configuration:
Intel Pentium III, 996 MHz
RAM: 256 MB

Number of indexed files: 641489
Total size of indexed files: 2.38 GB

File indexing time: more than a week. Keep the hardware in mind: on a 4-core machine the operation would most likely take several hours, and swapping also reduces performance several-fold.
[Image: load during indexing]

Average search time

Number of words   Search   Time
1                 1        1883 ms
1                 2        28 ms
1                 3        31 ms
2                 1        175 ms
2                 2        36 ms
2                 3        41 ms
3                 1        1074 ms
3                 2        35 ms
3                 3        37 ms


The results are very encouraging, especially for such a weak machine.
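The article does not say how the search times were measured. One plausible way, sketched here as an assumption, is to time repeated runs of the same query with System.Diagnostics.Stopwatch. SearchTimer and TimeRuns are hypothetical names, and the Thread.Sleep call is only a stand-in for a real call such as Search("some words").

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical timing helper; not part of the original project.
public static class SearchTimer
{
    // Runs the action the given number of times and returns the
    // elapsed time of each run in milliseconds.
    public static long[] TimeRuns(Action action, int runs)
    {
        long[] timings = new long[runs];
        for (int i = 0; i < runs; i++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            timings[i] = sw.ElapsedMilliseconds;
        }
        return timings;
    }

    public static void Main()
    {
        // Placeholder workload standing in for a real search call.
        long[] ms = TimeRuns(() => Thread.Sleep(20), 3);
        for (int i = 0; i < ms.Length; i++)
        {
            Console.WriteLine("Search {0}: {1} ms", i + 1, ms[i]);
        }
    }
}
```

Recording each run separately, rather than only the mean, keeps a slow first search distinguishable from much faster repeats of the same query.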

Sources:
Article on CodeProject
Official site

Source: https://habr.com/ru/post/113381/

