
Why we created a replacement for the old document search systems

Since the late 2000s, we have been automating processes for the security departments of large companies. In almost all of them, one of the key security tasks was checking potential customers and counterparties for reliability. The check involved regularly searching for information about companies or people in a huge body of text. This body was (and still is) a few tens of millions of documents in different formats and from different sources: references, reports and extracts in pdf, doc, xls and txt formats, sometimes scans in pdf, tiff and so on. In general, the ability to quickly find information about a company or person in this dataset is crucial for any business.


We have come a long way from using dtSearch to a full-fledged solution of our own. In this article we want to share our experience.


To automate the verification process itself we used our own software, but for full-text search across documents we relied on dtSearch. A few words about that choice (we made it in 2010, and dtSearch stayed with us until the autumn of 2016):



What happened next


Years passed, our system grew, and dtSearch gradually became a bottleneck and a source of problems:



The list could go on, but everything else is minor compared with the problems listed above.


At a certain point we realized that we could not go on like this and had to either find an alternative or build our own solution. To our great regret, the search for alternatives turned up nothing sensible: the products that existed in 2010 had not advanced much, and the new ones that had appeared (LucidWorks Fusion, SearchInform, etc.) did not impress us at all.


Next, we looked at the option of building a full-text search module for our system on top of Apache Tika + ElasticSearch or Apache Solr, which would generally have solved our problem. However, we could not shake the thought that the market still lacked a good solution with fast search, OCR and convenient interfaces.
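To make the "extract, then index" option above a little more concrete, here is a minimal sketch of the idea in plain Python. It is only an illustration under stated assumptions: `extract_text` is a hypothetical stand-in for a text extractor such as Apache Tika, and `InvertedIndex` is a toy stand-in for a search engine such as ElasticSearch or Solr. It does not use their APIs and is not how dtSearch or Ambar works internally.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def extract_text(raw: bytes) -> str:
    """Hypothetical stand-in for a text extractor such as Apache Tika.

    It only decodes plain-text bytes; a real extractor would also handle
    pdf, doc, xls and scanned images (via OCR).
    """
    return raw.decode("utf-8", errors="ignore")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class InvertedIndex:
    """Toy inverted index: maps each token to the ids of documents containing it."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for token in tokenize(text):
            self.postings[token].add(doc_id)

    def search(self, query: str) -> Set[str]:
        """Return ids of documents that contain every token of the query."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


# Made-up file names and contents, for illustration only.
index = InvertedIndex()
index.add("report-1.txt", extract_text(b"Acme Ltd annual report"))
index.add("extract-2.txt", extract_text(b"Registry extract for Beta LLC"))

print(index.search("acme report"))
```

A real engine adds relevance ranking, morphology, phrase queries and distributed storage on top of this idea, which is what makes searching tens of millions of documents practical.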


In the end, without much hesitation, we decided to build our own open-source solution that would make life easier for everyone. That is how Ambar was born.


Ambar - full-text document search system


Ambar interface


During development we kept in mind all the problems that had plagued us with dtSearch. Our main requirements for the system were therefore that it be simple, intuitive, powerful and scalable. From the start we targeted volumes of tens and hundreds of millions of files, and a fast search was a prerequisite: no more than half a second, regardless of the complexity of the query and the number of documents.


The release took place in January 2017. We then launched Ambar at our first major client.


The main points about our system that are important to know:



In the near future, we plan to add the ability to read and index email content and to start developing the analytical part of the system by adding recognition of named entities (full names, addresses, document numbers, identification numbers, phone numbers).


→ Project Description and Contacts


→ Project page on GitHub


→ Our blog, where we share all the interesting facts and developments


Thanks for your attention!



Source: https://habr.com/ru/post/325786/

