
How to distinguish shampoo from champignons, and skewers from champagne... Elasticsearch - searching for products in store databases

Task


One of the main tasks of an application for storing and analyzing purchases is to find the same or very similar products in a database that collects messy, hard-to-decipher product names taken from receipts. There are two types of input query:


  1. A specific name with abbreviations that only cashiers at a local supermarket, or avid shoppers, can understand.
  2. A natural-language query entered by the user into the search box.

Queries of the first type usually come from products on the receipt itself, when the user wants to find cheaper alternatives. Our task is to select the closest analogue of the product from the receipt among other nearby stores. Here it is important to pick the most appropriate brand of the product and, if possible, the volume.



The second type of query is a plain user request to find a specific product in the nearest store. Such a query can be very general and need not describe the product uniquely, and slight deviations from it are acceptable: for example, if a user is looking for 3.2% milk and our database only has 2.5% milk, we still want to show at least that result.



Peculiarities of the dataset - problems to solve


The information on a product receipt is far from perfect: it is full of obscure abbreviations, grammatical errors, typos, assorted transliterations, Latin letters in the middle of Cyrillic words, and character sequences that only make sense to the internal systems of a particular store.
For example, a baby apple-and-banana puree with cottage cheese can easily appear on a receipt like this:



About the technology


Elasticsearch is a popular and readily accessible technology for implementing search. It is a search engine with a JSON REST API, built on Lucene and written in Java. Elastic's main advantages are speed, scalability, and fault tolerance. Engines of this kind are used for complex search over document collections, for example search that takes the morphology of a language into account, or search by geo-coordinates.


Directions for Experiments and Improvements


To understand how the search can be improved, you need to break the search system down into its individually configurable components. In our case the structure of the system looks like this:


  1. The input search string passes through an analyzer, which splits it in a certain way into tokens - the search units that are matched against data stored in the same tokenized form.
  2. Then the search itself runs over these tokens for every document in the existing database. When a token is found in a particular document (which is also stored in the database as a set of tokens), its relevance is computed according to the selected Similarity model (we will call it the relevance model). This may be simple TF-IDF (term frequency - inverse document frequency), or another, more complex or domain-specific model.
  3. In the next step, the points scored by the individual tokens are aggregated. The aggregation parameters are set by the query semantics: examples include extra weights (boosts) for certain tokens, a requirement that a token must be present, and so on. The result of this stage is the score - the final estimate of how relevant a particular document from the database is to the original query (see the _explain example below).
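Elasticsearch lets you inspect steps 2 and 3 for any particular document with the _explain API, which returns the full scoring breakdown. A minimal sketch, assuming a hypothetical products index, a document with id 1, and a placeholder query text (the endpoint path is the one used in recent Elasticsearch versions):

GET /products/_explain/1
{
  "query": {
    "match": { "name": "milk 3.2%" }
  }
}

The response shows which tokens matched and how the similarity model and the aggregation of per-token scores produced the final score for that document.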


So the search machinery breaks down into three separately configurable components, each with its own ways and means of improvement:


  1. Analyzers
  2. Similarity model
  3. Query-time improvements

Next, we consider each component separately and go through the specific parameter settings that helped improve search over product names.


Query-time improvements


To understand what can be improved at query time, here is an example of the initial query:


{ "query": { "multi_match": { "query": "  105", "type": "most_fields", "fields": ["name"], "minimum_should_match": "70%" } }, “size”: 100, “min_score”: 15 } 

We use the most_fields query type because we anticipate needing a combination of several analyzers for the "product name" field. This query type combines search results across different fields of an object that contain the same text analyzed in different ways. The alternatives are the best_fields and cross_fields query types, but they are designed for searching across genuinely different attributes of an object (for example, name and description), whereas our task is to account for different aspects of a single attribute - the product name.


What can be customized:
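As an illustration of these knobs, here is the same kind of query with per-field boosts, minimum_should_match, the result size, and the min_score cut-off spelled out. The sub-field name.ngram and the weight 2 are made-up placeholders rather than the values from our final model:

GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "milk 3.2%",
      "type": "most_fields",
      "fields": ["name", "name.ngram^2"],
      "minimum_should_match": "70%"
    }
  },
  "size": 100,
  "min_score": 15
}

The ^2 syntax multiplies the contribution of matches in that sub-field, which is how different analyses of the same name can be weighted against each other.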



Analyzers


The analyzer processes the input string in three stages and outputs tokens - the search units:



First, transformations happen at the character level of the string: characters can be replaced, deleted, or added. Then the tokenizer comes into play and splits the string into tokens; the standard tokenizer splits on punctuation and whitespace. In the final stage the resulting tokens are filtered and post-processed.
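The _analyze API is a convenient way to check what tokens a given chain of stages produces. A minimal sketch with the standard tokenizer and a lowercase token filter (not our final configuration, just the default building blocks):

POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Chocolate w/ crushed hazelnuts 25pcs"
}

The response lists the resulting tokens, which makes it easy to compare analyzer variants directly on real receipt names.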


Let us consider which variations of these stages proved useful in our case.


Char filters
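One character-level fix that the receipt data clearly invites is replacing Latin look-alike letters that end up in the middle of Cyrillic words. A hedged sketch with a mapping char filter; the particular letter pairs and the sample name are illustrative:

POST /_analyze
{
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["A => А", "C => С", "E => Е", "O => О", "P => Р", "X => Х"]
    }
  ],
  "tokenizer": "standard",
  "text": "CЫP ПЛАВЛ."
}

After the mapping, both stray letters in the garbled "CЫP" become genuine Cyrillic, so the name produces the same token as a correctly typed one.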


Tokenizer
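Heavily abbreviated names rarely contain whole dictionary words, so an edge_ngram tokenizer (the one whose size range is tuned in the final-model experiments below) is a natural candidate. The min_gram/max_gram values here are assumptions; choosing them is exactly the kind of experiment described later:

POST /_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 3,
    "max_gram": 8,
    "token_chars": ["letter", "digit"]
  },
  "text": "chocolate 25pcs"
}

Each word is turned into its prefixes of length 3 to 8 ("cho", "choc", ..., "chocolat"), so an abbreviated "choc." on a receipt can still meet the full word "chocolate" on a common token.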


Token filter
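At the token-filter stage the workhorse is plain lowercasing. The sketch below wires the three stages into one custom analyzer and attaches it to the name field as a sub-field; every name here (the index, the analyzer, the sub-field) is an illustrative assumption, and the mapping uses the typeless syntax of recent Elasticsearch versions:

PUT /products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "latin_to_cyrillic": {
          "type": "mapping",
          "mappings": ["A => А", "C => С", "E => Е", "O => О", "P => Р", "X => Х"]
        }
      },
      "tokenizer": {
        "name_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 8,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "name_ngram_analyzer": {
          "type": "custom",
          "char_filter": ["latin_to_cyrillic"],
          "tokenizer": "name_edge_ngram",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "ngram": {
            "type": "text",
            "analyzer": "name_ngram_analyzer"
          }
        }
      }
    }
  }
}

With such a mapping, name and name.ngram hold the same text analyzed differently, which is exactly the situation the most_fields query type above is designed for.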



Similarity model


The relevance model determines how much the match of a single token contributes to the relevance of a database object with respect to the query. If, say, the token 'for' matches both the query and a product in the database, that is not bad, but it says little about whether the product actually fits the query. Matches of different tokens thus carry different value, and the relevance model is what accounts for this. Elastic offers many different models; if you dig into them properly, you can pick one that fits your specific case very well. The choice can be driven by how many high-frequency words you have (like that same token 'for'), how important long tokens are (is longer better? worse? irrelevant?), what kind of results you want to see in the end, and so on. Examples of models available in Elastic are TF-IDF (the simplest and most intuitive), Okapi BM25 (an improved TF-IDF and the default model), Divergence from randomness, Divergence from independence, and others. Each model also has tunable parameters. After several experiments, the default Okapi BM25 model showed the best result, but with parameters different from the preset ones.


Use of categories


In the course of work on the search, a very important extra piece of information about each product became available - its category. You can read more about how we implemented categorization in the article "How I understood that I eat a lot of sweets, or classification of goods by receipts in the app". While previously the search relied only on comparing product names, it is now possible to also compare the category of the query with the category of a product in the database.
Possible ways to use this information: require a category match (implemented as a filter on the results), give extra points in the score to products with a matching category (similar to what is done with numbers), or sort the results by category (matching ones first, then everything else). In our case the last option worked best. This is because our categorization algorithm is too good for the second method and not good enough for the first. We trust the algorithm enough to want a category match to be a serious advantage: if it merely added points to the score (the second method), products with a matching category would move up the list but would still lose to some products whose names matched better. The first method does not suit us either, because categorization errors are still possible: sometimes a query simply does not contain enough information to determine the category correctly, or there are too few products of that category within the user's radius. In such cases we still want to show results from another category that are nevertheless relevant to the product name.
The sorting approach is also good because it does not "spoil" the products' scores in the search results and lets us keep using the calculated minimum score threshold without any obstacles.
The specific implementation of sorting can be seen in the final query.
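For comparison, the second option (extra points for a matching category) could be expressed as a bool query in which the category clause only adds to the score. A hedged sketch, assuming the category of the query has already been predicted and products carry a keyword field category; the field names, boost, and category value are placeholders:

GET /products/_search
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "milk 3.2%",
          "type": "most_fields",
          "fields": ["name", "name.ngram^2"],
          "minimum_should_match": "70%"
        }
      },
      "should": {
        "term": {
          "category": { "value": "dairy", "boost": 10 }
        }
      }
    }
  }
}

As described above, sorting won out over this variant precisely because the boost mixes category evidence into the score, where it still competes with strong name matches.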


Final model


Selecting the final search model involved generating various search engine configurations, evaluating them, and comparing them. Most often, the comparison varied a single parameter. Suppose, in one particular case, we needed to find the best size for the edge_ngram tokenizer (that is, to pick the most effective range of N). To do this, we generated identical search engines that differed only in the size of this range, after which we could identify the best value for the parameter.
The models were evaluated with the nDCG (normalized discounted cumulative gain) metric, which measures ranking quality. A predefined set of queries was sent to each search model, and the nDCG metric was then computed over the returned results.
Analyzers that are included in the final model:






The default model (Okapi BM25) was chosen as the relevance model, with the parameter b = 0.5; the default value is 0.75. This parameter controls how strongly the length of the product name normalizes the tf (term frequency) variable. A smaller value works better in our case, because a long name very often does not dilute the significance of a single word. That is, the word "chocolate" in the name "chocolate with crushed hazelnuts in a package of 25 pcs" does not lose its value just because the name is rather long.
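In index settings this amounts to declaring a BM25 similarity with the overridden b and referencing it from the name field. A minimal sketch; the index and similarity names are hypothetical:

PUT /products
{
  "settings": {
    "index": {
      "similarity": {
        "name_bm25": {
          "type": "BM25",
          "b": 0.5
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "similarity": "name_bm25"
      }
    }
  }
}

Only the length-normalization parameter b is changed here; k1 stays at its default of 1.2.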


The final query looks like this:



The best combination of analyzers and their weights was determined experimentally.


The sorting puts products with a matching category at the top of the results, followed by all the others. Within each subset, the ordering by score is preserved.


For simplicity, this query uses a fixed score threshold of 15, although in our system we use the calculated parameter described earlier.
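Putting the pieces together, a query with this structure might look roughly like the sketch below. The sub-field names, weights, and the predicted category are illustrative assumptions rather than the actual values of our final model; the script sort puts category matches first without touching the scores themselves, so the min_score cut-off keeps working:

GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "milk 3.2%",
      "type": "most_fields",
      "fields": ["name", "name.ngram^2"],
      "minimum_should_match": "70%"
    }
  },
  "sort": [
    {
      "_script": {
        "type": "number",
        "order": "desc",
        "script": {
          "lang": "painless",
          "source": "doc['category'].size() != 0 && doc['category'].value == params.cat ? 1 : 0",
          "params": { "cat": "dairy" }
        }
      }
    },
    { "_score": { "order": "desc" } }
  ],
  "size": 100,
  "min_score": 15
}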


In the future


There are ideas for improving the search by tackling one of the most annoying problems of our algorithm: the existence of a million and one different ways to abbreviate a five-letter word. It could be addressed by preprocessing the names to expand the abbreviations. One option is to try matching an abbreviated name from our database against one of the names in a reference database of "correct" full names. This approach has its own specific obstacles, but it looks like a promising direction.



Source: https://habr.com/ru/post/433070/

