Why we created a replacement for the old document search systems

Since the late 2000s, we have been automating processes for the security departments of large companies. In almost all of them, one of the key security tasks was checking potential customers and counterparties for reliability. The check involved regularly searching for information about companies or people in a huge body of text. That body was (and still is) a few tens of millions of documents in different formats and from different sources: references, reports and extracts in pdf, doc, xls and txt formats, and sometimes scans stored as pdf, tiff and so on. In general, being able to quickly find information about a company or person in such a dataset is crucial for any business.

We have come a long way from using dtSearch to building a full-fledged solution of our own. In this article we want to share our experience.

To automate the verification process we used our own software, but for full-text search in documents we relied on dtSearch. A few words about that choice (we made it in 2010, and it stayed with us until the autumn of 2016):

The choice was between Cross, Copernic, Archivist, dtSearch and several exotic solutions.
A comparison of query speed on a large amount of data showed an obvious winner: dtSearch.
At that time, dtSearch had the most advanced query syntax, which allowed us to implement all the "subtleties" of information retrieval.
dtSearch has an API in the form of a C# library, which we used to integrate the engine into our system. It was not the most convenient option, but at the time it was the most acceptable one.

What happened next

Years passed, our system grew, and dtSearch gradually became a bottleneck and a source of problems:

Data volumes grew continuously and search speed fell with them; by the end of 2016 some queries took 5 minutes each, which is absolutely unacceptable.
dtSearch does not recognize text in scanned documents (it has no OCR), and there were more and more such documents, so a lot of information was lost.
dtSearch incorrectly indexes files encoded in CP866.
dtSearch does not always tokenize phrases, numbers, dates and words correctly, so information can be lost, for example when searching for compound last names or phone numbers.
Our systems gradually moved from the ASP.NET MVC / C# / MSSQL stack to the more modern React / Node.js / Python / ElasticSearch / MongoDB, while dtSearch can only be integrated through its C++ or C# API, which made integration difficult (we really wanted REST).
The dtSearch indexer required a full-featured Windows Server.
dtSearch cannot run in a cluster, which matters at huge volumes. We had to keep one very powerful machine just for dtSearch.
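The CP866 problem is easy to reproduce in general terms. The sketch below is not a description of what dtSearch does internally (we do not know that); it only shows, under the assumption that an indexer decodes CP866 bytes with the wrong code page, why the indexed text stops matching the query. The sample name and the helper `decode_with_fallback` are hypothetical.

```python
# Russian text as it might appear in an old DOS-era report.
original = "Иванов Иван"

# The bytes on disk if the file was saved in CP866.
raw = original.encode("cp866")

# Decoding with the wrong code page (Windows-1251 here) does not fail,
# it silently produces different characters, so a search for the
# original name finds nothing.
wrong = raw.decode("cp1251", errors="replace")

# Decoding with the correct code page restores the text.
right = raw.decode("cp866")


def decode_with_fallback(data, encodings=("utf-8", "cp866")):
    """Try each encoding in order and return the first clean decode.

    A deliberately naive heuristic: strict UTF-8 rejects most legacy
    single-byte text, so it is tried first; CP866 accepts any byte
    sequence, so it has to come last.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


print(repr(wrong))
print(repr(right))
print(repr(decode_with_fallback(raw)))
```

A real indexer would need a more careful detection step, because several single-byte Cyrillic code pages accept the same bytes without raising an error.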

The list goes on, but everything else is trivial compared to the problems listed above.

So at a certain point we realized that we could not go on like this and needed either to find an alternative or to create our own solution. To our great regret, the search for alternatives turned up nothing sensible: the products that existed in 2010 had not advanced much, and the new ones that had appeared (LucidWorks Fusion, SearchInform, etc.) did not impress us at all.

Next, we considered building a full-text search module for our system on Apache Tika + ElasticSearch or Apache Solr, which would by and large have solved our problem. However, we could not shake the thought that the market still had no good solution with fast search, OCR and convenient interfaces.

So, without hesitation, we decided to create our own open-source solution that would make life easier for everyone. That is how Ambar was born.
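To give an idea of what such a module spends most of its time doing, here is a minimal sketch of a search request body in the Elasticsearch query DSL. It assumes the extracted document text is stored in a field named `content`; that field name, the function `build_fuzzy_query` and the sample query are hypothetical, and this is not the query Ambar actually sends.

```python
import json


def build_fuzzy_query(text, field="content", size=10):
    """Build an Elasticsearch request body for a typo-tolerant search.

    `field` is an assumed name for the field holding extracted text.
    """
    return {
        "size": size,
        "query": {
            "match": {
                field: {
                    "query": text,
                    # Let Elasticsearch pick the allowed edit distance
                    # based on the length of each term.
                    "fuzziness": "AUTO",
                    # Require every term of the query to be present.
                    "operator": "and",
                }
            }
        },
        # Ask for matching fragments so the UI can show them in context.
        "highlight": {"fields": {field: {}}},
    }


body = build_fuzzy_query("Ivanov Ivan")
print(json.dumps(body, indent=2))
```

The body would be sent to the index's `_search` endpoint. Everything before that step (extracting text from pdf, doc and scanned files) is the part that a tool like Apache Tika, plus OCR, would have to cover.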

Ambar - full-text document search system

Ambar interface

During development we kept in mind all the problems that had plagued us with dtSearch. Our main requirements for the system were therefore that it be easy, intuitive, powerful and scalable. From the start we targeted volumes of tens and hundreds of millions of files; a prerequisite was fast search that takes no more than half a second, regardless of the complexity of the query and the number of documents.

The release took place in January 2017, when we launched Ambar at our first major client.

The main points about our system that are important to know:

Very fast search that takes the peculiarities of the language into account: for example, a fuzzy search query over more than ten million files takes about one hundred milliseconds.
Easy and intuitive interface for both search and administration.
Support for all common (and not so common) file formats, plus de-duplication.
Best-on-the-market PDF parsing with smart page type detection (scan / text).
Advanced OCR.
Advanced full-text analyzer: you will no longer lose information because of incorrect tokenization of dates, phone numbers, etc.
Simple REST API for easy integration with anything.
Available as a cloud version or as an installation on your own hardware.
When installed on your own hardware, it can be deployed as a cluster and scaled to petabytes of data.
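The tokenization point is worth a small illustration. The sketch below is not Ambar's analyzer; it is a hypothetical regex-based tokenizer showing the general idea of keeping dates, phone numbers and hyphenated last names as single tokens instead of splitting them on punctuation. The patterns are deliberately simple and would need tuning for real data.

```python
import re

# Order matters: the more specific patterns are tried first.
TOKEN_RE = re.compile(
    r"""
    (?P<date>\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b)   # day.month.year
  | (?P<phone>\+?\d[\d ()-]{8,}\d)                  # digits with separators
  | (?P<word>\w+(?:-\w+)*)                          # words, incl. hyphenated
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into (kind, value) tokens.

    Phone numbers are normalized to digits only, so that differently
    formatted spellings of the same number produce the same token.
    """
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        value = match.group()
        if kind == "phone":
            value = re.sub(r"\D", "", value)
        tokens.append((kind, value.lower()))
    return tokens


sample = "Call Ivanov-Petrov at +7 (495) 123-45-67 before 01.02.2017"
for token in tokenize(sample):
    print(token)
```

With a naive split on non-alphanumeric characters, the same sample would break the phone number into five unrelated numeric fragments and the last name into two separate words, which is exactly how information gets lost at search time.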

In the near future we plan to add the ability to read and index email, and to start developing the analytical part of the system by adding named-entity recognition (full names, addresses, document numbers, identification numbers, phone numbers).

→ Project Description and Contacts

→ Project page on GitHub

→ Our blog, where we share all the interesting facts and developments

Thank you for your attention!

Source: https://habr.com/ru/post/325786/
