
“Active Search”: how we chose the search engine for the DLP system

In the course of its work, a DLP system intercepts huge amounts of information every day: emails from employees, records of user actions on workstations, information about file resources stored on the organization's network, and alerts about unauthorized data transfers outside the organization. But all this information is useful only if the DLP system implements a high-quality search across the entire array of intercepted communications. Since the first version of our DLP solution was released in 2000, we have changed the search mechanism in the archive several times. Today we want to talk about which technologies we used, what advantages and disadvantages we saw in them, and why we ultimately moved away from them. Perhaps our experience will be useful to someone.


In the most general terms, our requirement for the search engine is this: it should be convenient and allow the user to quickly navigate through the entire array of collected information.


Beginning of the path and basic needs: Oracle ConText, Tsearch2 and Russian Context Optimizer


In the early stages of the product's development, when we monitored only corporate mail, we used the Oracle DBMS to store the intercepted data. The search was quite simple and was implemented with the standard Oracle ConText (later Oracle Text), supplied as part of Oracle Database. We used this solution for many years, because at the time it fully met our needs.

The user could build a search query over a set of attributes of the intercepted data (message and attachment sizes, file names, data types, values of arbitrary mail and MIME headers and their parameters, number of attachments) and search the text of messages, even though Oracle ConText had no Russian morphology support. The results of such queries were of fairly good quality, but the mechanism had shortcomings that were critical for us: low query speed and the absence of Russian morphology.
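To make the scheme concrete, here is a minimal sketch of what such a combined attribute-and-text query might look like, assuming a hypothetical messages table with an Oracle Text (ConText) index on the body column; the schema, column and bind names are ours for illustration, not the product's:

```python
# Hypothetical schema: messages(id, subject, body, message_size, attachment_count)
import oracledb  # pip install oracledb (the successor of cx_Oracle)

conn = oracledb.connect(user="dlp", password="secret", dsn="dbhost/orclpdb1")
cur = conn.cursor()

# Attribute filters are ordinary SQL predicates; CONTAINS() hands the
# text part of the query over to the Oracle Text index.
cur.execute(
    """
    SELECT id, subject
      FROM messages
     WHERE attachment_count > :att_cnt
       AND message_size < :max_size
       AND CONTAINS(body, :term, 1) > 0
    """,
    att_cnt=0,
    max_size=1_048_576,
    term="confidential",
)
for msg_id, subject in cur:
    print(msg_id, subject)
```

Note that without morphology support, a CONTAINS query like this matches only the exact word form given, which is precisely the shortcoming described above.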
Later, with the strengthening of PostgreSQL’s market position, we began using it as an alternative to Oracle. Together with PostgreSQL, we obtained the Tsearch2 full-text search module, which already supported Russian morphology and finally allowed us to implement full-text search through captured letters using word forms. This significantly reduced the time that the security officer had to spend on working with the system. It was no longer necessary to perform many queries, independently sorting out possible word forms. In addition, Tsearch worked faster on small data volumes than Oracle ConText, and for some customers it was a more convenient option.
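For readers who want to picture the difference: Tsearch2's functionality was later built into PostgreSQL itself, and a word-form-aware search looks roughly like the sketch below (table and column names are illustrative):

```python
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=dlp user=dlp password=secret host=dbhost")
cur = conn.cursor()

# The 'russian' configuration stems both the document and the query,
# so a single query matches every inflected form of the word.
cur.execute(
    """
    SELECT id, subject
      FROM letters
     WHERE to_tsvector('russian', body) @@ plainto_tsquery('russian', %s)
    """,
    ("конфиденциальный",),
)
for msg_id, subject in cur.fetchall():
    print(msg_id, subject)
```

In practice the tsvector would be precomputed into an indexed column rather than built on the fly, but the idea is the same.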

But there were serious drawbacks as well.


While waiting for new solutions, we did our own research and constantly analyzed the search mechanisms appearing on the market. At some point we found an alternative to Oracle Text: the Russian Context Optimizer product, which allowed us to build a context index with Russian morphology. It was somewhat slower, but now we could implement search by word forms in Oracle-managed databases as well.

Speed is critical: Sphinx


As time went on, we mastered the interception of new communication channels, began monitoring user actions at workstations and checking compliance with document storage rules on the local network. Many new data types appeared and storage volumes grew, so the old search mechanisms became slow and inefficient.

The standard facilities of the DBMS could no longer provide the required speed, and the question of building an external index arose. After a series of experiments, we chose the Sphinx search engine (SQL Phrase Index) for this purpose. It could be used with databases running both Oracle and PostgreSQL.

With Sphinx we performed full-text search. To speed up search and support complex queries, we split each query in two: first a full-text, word-form search through the external index, and then a search by message attributes in the database over the resulting candidates. Thanks to Sphinx, the execution time of a number of queries dropped significantly: from tens of minutes (or even hours) to a few minutes. But it required major modifications on our side.
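Schematically, the two-phase query might look like the sketch below; the index name, table layout and connection settings are assumptions for illustration. Sphinx speaks the MySQL wire protocol (SphinxQL), by default on port 9306, so an ordinary MySQL client library can query it:

```python
import pymysql   # pip install pymysql
import psycopg2  # pip install psycopg2-binary

# Phase 1: full-text, word-form search in the external Sphinx index.
sphinx = pymysql.connect(host="127.0.0.1", port=9306, user="")
with sphinx.cursor() as cur:
    cur.execute("SELECT id FROM messages_idx WHERE MATCH(%s) LIMIT 1000",
                ("confidential",))
    ids = [row[0] for row in cur.fetchall()]

# Phase 2: attribute search over the candidates in the DBMS.
if ids:
    pg = psycopg2.connect("dbname=dlp user=dlp")
    with pg.cursor() as cur:
        cur.execute(
            "SELECT id, sender, subject FROM messages"
            " WHERE id = ANY(%s) AND attachment_count > 0",
            (ids,),
        )
        for row in cur.fetchall():
            print(row)
```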

In addition, we saw a number of drawbacks in Sphinx itself.


We continued to monitor the market for interesting search mechanisms and, among other things, tested the SOLR search engine on the DLP file storage. In short, SOLR demonstrated no advantages over Sphinx, and we stayed with Sphinx.

Not only fast, but also smart: Elasticsearch


At some point, Elasticsearch caught our attention. Out of the box it offered many features that we had had to implement on top of Sphinx ourselves. In particular, Elasticsearch supported a distributed architecture. It allowed us to perform really fast search, in seconds rather than minutes, by building an index over the most frequently used data. The load tests we ran showed that an Elasticsearch search across 17 million messages took less than a second. At the same time, the user could still query headers and structural information; that part remained implemented by means of the database.
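For comparison with the Sphinx sketch above, here is roughly what an equivalent query looks like through the official Python client; the index and field names are again our illustrative assumptions:

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

# A match query analyzes the search phrase (morphology included, if the
# analyzer is configured that way) and runs against the inverted index.
resp = es.search(
    index="messages",
    query={"match": {"body": "confidential"}},
    size=20,
)
print(resp["hits"]["total"])
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_source"].get("subject"))
```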

Overall, the transition to Elasticsearch not only "overclocked" the search, but also unloaded the development team, which no longer had to build these tools itself to give users what they needed. Moreover, Elasticsearch is not just a search engine and a store: around it there are various analytical tools, such as visualization, a log collector and so on.

We believe that a very important direction of DLP development, one we must actively work on, is automating the work of the user, i.e. the security officer. And one of the necessary conditions for automation is the availability of ready-made data slices for decision-making, which let the user receive information from a pre-configured dashboard immediately instead of manually running standard queries every time.

Using Elasticsearch tools alone, we built desktops for the manager and the administrator with a variety of dashboards presenting the data most interesting from an information-security point of view: a drop in a person's "trust level", statistics on the most frequently sent files, and employee contacts uncharacteristic of their usual circle. The entire event-and-incident area is also implemented on Elasticsearch tools. Here's what it looks like:


[Figure: Manager's desktop]

[Figure: Analyst's desktop]

[Figure: Communications heat map]

[Figure: Change of the employee's trust level in their dossier]
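As an illustration of what feeds such dashboards, a ready-made slice like "most frequently sent files" can be expressed as a single terms aggregation; the index and field names below are assumptions, not the product's actual schema:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# size=0: we only need the aggregation buckets, not the individual events.
resp = es.search(
    index="events",
    size=0,
    aggs={"top_files": {"terms": {"field": "file_name.keyword", "size": 10}}},
)
for bucket in resp["aggregations"]["top_files"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```

A dashboard then simply renders such pre-configured aggregations instead of making the officer type out the query each time.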

In general, as our experience has shown, for DLP tasks Elasticsearch next to its predecessor Sphinx is like a concept car next to a Lada Kalina. But, of course, with the proviso that the search runs over huge DLP archives for such a specific and complex task as investigating information-security incidents. If you as a developer do not face such tasks, Sphinx may be a perfectly acceptable option.

Future plans


But no matter how good Elasticsearch is, we continue to run various studies aimed at improving our search. Right now we are optimizing our use of the existing Elasticsearch tools and researching the use of neural networks as a search tool. So who knows, maybe someday we will continue this article with a chapter on artificial intelligence...

Source: https://habr.com/ru/post/340874/

