
How we set up search over MSSQL data using Elasticsearch and Logstash

We share the details of how we built a good search for a closed corporate social network under the following constraints:


• the data is stored in different columns of MSSQL tables,
• there was no search over it before,
• and moving the data out is expensive: the whole system is tied to MSSQL, and using third-party services is not an option for information security reasons.


Our criterion for a good search: even if the user types a query with a typo or gets the group name slightly wrong, they still find what they were looking for.


Looking ahead, we also had to think about searching for hashtags as separate words, synonym search, ranking of results, and returning intermediate results on the fly.




Why Elasticsearch?


MSSQL's standard tools (and non-standard ones, for that matter) could not deliver this kind of search. MSSQL does have Full-Text Search, but that solution would have caused problems down the road. First, the client does not run the latest version of SQL Server, and search in older versions does not work as well as in the new ones. Second, handling typos, hashtags, and ranking would have had to be implemented by hand.


We also rejected SharePoint, because it can only search through lists. Creating yet another list and keeping it up to date turned out to be too complicated, and lists, especially at large volumes, are not known for reliability or speed. On top of that, SharePoint's full-text search is inferior in quality to MSSQL's.


Using any third-party service was ruled out for security reasons, so the choice fell on Elasticsearch.


How to index data in Elasticsearch?


We needed to index the data in Elasticsearch and keep it up to date: when the data changes, the Elasticsearch index should pick the changes up automatically, without rewriting everything, only what is new or modified.


It became clear that we had to look for a solution based on the JDBC driver, the only obvious way to reach MSSQL data from Elasticsearch.


The first thing we came across was Elastic-JDBC. But its latest update dates back to August 2016 and targets Elasticsearch 2.3.4. Tying a new solution to an already outdated version is obviously the wrong way to go, and we could not find any other ready-made tools that work with Elasticsearch directly.


Loading data through Logstash


These days, loading data into Elasticsearch is usually Logstash's job, and a ready-made solution exists for it: the JDBC input plugin. It seemed that all that remained was to install and configure it correctly, but a few minor difficulties came up. We describe them below.


Logstash setup: our experience and tips


Download the plugin and, following the instructions, place it next to Logstash. The setup is described in reasonable detail on the plugin's page, but there are some subtleties.


This is what the configuration file looks like:


input {
  jdbc {
    jdbc_driver_library => "etc/logstash/bin/sqljdbc42.jar"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_connection_string => "connection string"
    jdbc_user => "user"
    jdbc_password => "password"
    statement => "SELECT id, name, timestamp FROM [TableName] WHERE timestamp > :sql_last_value"
    schedule => "* * * * *"
    tracking_column => "timestamp"
    tracking_column_type => "timestamp"
    use_column_value => true
  }
}

The non-standard part worth noting is statement: here we specify which columns we index. :sql_last_value is a special value that is rewritten on every iteration; the plugin stores the maximum value it saw during the load. It can be either a number or a timestamp, so we can either key on id and index only new records, or key on a DateTime column and track the time of the last update, so that records changed or added after that moment get indexed. You can specify not just one input but as many as there are tables to load, as in the sketch below.
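For example, a minimal sketch of several inputs in one file (the [Groups] and [News] tables and the type values are illustrative, not our actual schema); the type field can then be used in output to route each table to its own index:

input {
  jdbc {
    # ... driver, connection, schedule and tracking settings as above ...
    statement => "SELECT id, name, timestamp FROM [Groups] WHERE timestamp > :sql_last_value"
    type => "group"
  }
  jdbc {
    # ... driver, connection, schedule and tracking settings as above ...
    statement => "SELECT id, title, timestamp FROM [News] WHERE timestamp > :sql_last_value"
    type => "news"
  }
}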


We advise you to configure the output so that the document id in Elasticsearch matches the id in your database. This rules out data duplication on repeated loads.
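A minimal sketch of such an output (the host and index name are illustrative):

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "groups"
    # reuse the database id as the Elasticsearch _id, so repeated loads
    # update existing documents instead of creating duplicates
    document_id => "%{id}"
  }
}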


Problems encountered


Logstash is built to work with logs and has no notion of deleting them: who in their right mind would delete logs? That functionality simply did not exist, so we had to implement deletion by hand, tracking deletions in the database and removing the corresponding documents from Elasticsearch.
Changes and additions were handled like this: we added a timestamp column (the last-modified date) to the MSSQL database. Whenever something changes, this date and time is set to the current moment, and the plugin tracks the latest changes: once a minute it polls the database, remembering the timestamp of the last change it has seen.


Good advice


A few tips for implementing this. If you have a Code First database, add an interface:


internal interface IDateModified
{
    DateTime TimeStamp { get; set; }
}

Implement this interface on all entities whose tables will be indexed in Elasticsearch. If you have Database First, you can edit the .tt file to add the interface.
Then, when saving the context, it becomes very easy to track changes to the tables you care about:


var trackables = context.ChangeTracker.Entries<IDateModified>()
    .Where(t => t.State == EntityState.Modified || t.State == EntityState.Added);

foreach (var item in trackables)
{
    // bump the last-modified stamp that the JDBC input plugin tracks
    item.Entity.TimeStamp = DateTime.UtcNow;
}

You can also track EntityState.Deleted and delete from Elasticsearch at the same time as from the database.
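A minimal sketch of what that can look like (MyDbContext, the Id extraction, and the groups/group index and type names are our illustrative assumptions, not code from the project); it goes over Elasticsearch's REST API, so it does not depend on a specific .NET client:

using System.Data.Entity;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

static async Task DeleteFromElasticAsync(MyDbContext context, HttpClient http)
{
    // collect the ids of the entities about to be deleted, before SaveChanges runs
    var deletedIds = context.ChangeTracker.Entries<IDateModified>()
        .Where(t => t.State == EntityState.Deleted)
        .Select(t => ((dynamic)t.Entity).Id)   // assumes every entity exposes an Id
        .ToList();

    foreach (var id in deletedIds)
    {
        // ES 5.x-style endpoint: DELETE /<index>/<type>/<id>
        await http.DeleteAsync($"http://localhost:9200/groups/group/{id}");
    }
}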


Summary


As a result, we get up-to-date MSSQL data on the fly: on each query we first go to Elasticsearch, fetch the IDs of the records that satisfy the search criteria, then check that they are present in the database and pull them out of it quite quickly.
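As an illustration of that flow, here is a minimal sketch assuming the NEST .NET client (the GroupDoc document type, the groups index, the Name field, and the EF context are our illustrative assumptions):

using System;
using System.Linq;
using Nest;

var searchText = "markting";  // a query with a typo, per our "good search" criterion

var client = new ElasticClient(
    new ConnectionSettings(new Uri("http://localhost:9200")).DefaultIndex("groups"));

// fuzzy full-text search in Elasticsearch, tolerant of typos
var response = client.Search<GroupDoc>(s => s
    .Query(q => q.Match(m => m
        .Field(f => f.Name)
        .Query(searchText)
        .Fuzziness(Fuzziness.Auto))));

// take only the document ids from Elasticsearch...
var ids = response.Hits.Select(h => int.Parse(h.Id)).ToList();

// ...and fetch the full records from the MSSQL database
var groups = context.Groups.Where(g => ids.Contains(g.Id)).ToList();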


There are plenty of articles on setting up search indexes in Elasticsearch; we looked to this article, which describes the full-text search capabilities. Storing all the data in Elasticsearch would also have been desirable, but we did not have that option: our data sits in Elasticsearch not in full, only the fields that the search needs to run over.


In the settings we specified which data takes part in the search: we search across news, groups, and hashtags, and within a group by its name and description, while ignoring things like the number of participants.
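As a sketch of what such settings can look like (the index, type, and field names below are illustrative, not our actual mapping), an Elasticsearch 5.x-style mapping that keeps the participant count out of the search:

PUT groups
{
  "mappings": {
    "group": {
      "properties": {
        "name":         { "type": "text" },
        "description":  { "type": "text" },
        "hashtags":     { "type": "text" },
        "participants": { "type": "integer", "index": false }
      }
    }
  }
}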


That is how we set up search over MSSQL data. In the future we want to add search over the social network's messages.


P.S. As a bonus, here is the link to the Russian morphology plugin that we used.



Source: https://habr.com/ru/post/347966/

