
Comparison of products using Elasticsearch for a competitor price monitoring service

Back in 2017, the idea arose to build a service for monitoring competitors' prices. What was meant to set it apart from similar services was daily automatic product matching. Probably because there is almost no information on how to do this, existing price monitoring services offered only manual matching, done either by the customers themselves or by the service's operators at a price of 0.2 to 1 ruble per match. A realistic case of, say, 10 sites with 20,000 products each inevitably requires automation, since manual matching is too slow and expensive.

Below is an approach to automatic matching with Elasticsearch, illustrated on several competing pharmacies.

Environment description


  1. OS: Windows 10
  2. Database: Elasticsearch 6.2
  3. Client for requests: Postman 6.2

Elasticsearch setup


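The mapping of the product fields and the custom analyzer are configured in a single request. The analyzer strips HTML tags and replaces commas with dots before tokenizing, then splits tokens on delimiters, adds synonyms and lowercases the result.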

PUT http://localhost:9200/app
{
  "mappings": {
    "product": {
      "properties": {
        "name": { "type": "text", "analyzer": "name_analyzer" },
        "manufacturer": { "type": "text" },
        "city_id": { "type": "integer" },
        "company_id": { "type": "integer" },
        "category_id": { "type": "integer" }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "name_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "char_filter": [ "html_strip", "comma_to_dot_char_filter" ],
            "filter": [ "word_delimeter_filter", "synonym_filter", "lowercase" ]
          }
        },
        "filter": {
          "synonym_filter": {
            "type": "synonym_graph",
            "synonyms": [ ", ", ", ", ", ", ", , ", ", ", ", , , ", ", ", ", ", ", , ", ", , ", ", , , -, -", ", , ", ", , , ", ", ", ", ", ", , ", ", , ", ", ", ", ", ", ", ", ", ", , , ", ", ", ", g", "ml, " ]
          },
          "word_delimeter_filter": {
            "type": "word_delimiter",
            "type_table": [ ". => DIGIT", "- => ALPHANUM", "; => SUBWORD_DELIM", "` => SUBWORD_DELIM" ]
          }
        },
        "char_filter": {
          "comma_to_dot_char_filter": {
            "type": "mapping",
            "mappings": [ ", => ." ]
          }
        }
      }
    }
  }
}

For example, let us see how the “name_analyzer” analyzer splits the drug name “Hyoxysone 10mg + 30mg / g ointment for external use of the 10g tube” into tokens, using the _analyze request.

POST http://localhost:9200/app/_analyze
{
  "analyzer": "name_analyzer",
  "text": " 10+30/      10"
}

Result (the token "g" has type SYNONYM because the synonym filter added it at the same position as the original token):

{
  "tokens": [
    { "token": "", "start_offset": 0, "end_offset": 9, "type": "<ALPHANUM>", "position": 0 },
    { "token": "10", "start_offset": 10, "end_offset": 12, "type": "<ALPHANUM>", "position": 1 },
    { "token": "", "start_offset": 12, "end_offset": 14, "type": "<ALPHANUM>", "position": 2 },
    { "token": "30", "start_offset": 15, "end_offset": 17, "type": "<ALPHANUM>", "position": 3 },
    { "token": "", "start_offset": 17, "end_offset": 19, "type": "<ALPHANUM>", "position": 4 },
    { "token": "g", "start_offset": 20, "end_offset": 21, "type": "SYNONYM", "position": 5 },
    { "token": "", "start_offset": 20, "end_offset": 21, "type": "<ALPHANUM>", "position": 5 },
    { "token": "", "start_offset": 22, "end_offset": 26, "type": "<ALPHANUM>", "position": 6 },
    { "token": "", "start_offset": 27, "end_offset": 30, "type": "<ALPHANUM>", "position": 7 },
    { "token": "", "start_offset": 31, "end_offset": 40, "type": "<ALPHANUM>", "position": 8 },
    { "token": "", "start_offset": 41, "end_offset": 51, "type": "<ALPHANUM>", "position": 9 },
    { "token": "", "start_offset": 52, "end_offset": 56, "type": "<ALPHANUM>", "position": 10 },
    { "token": "10", "start_offset": 57, "end_offset": 59, "type": "<ALPHANUM>", "position": 11 },
    { "token": "g", "start_offset": 59, "end_offset": 60, "type": "SYNONYM", "position": 12 },
    { "token": "", "start_offset": 59, "end_offset": 60, "type": "<ALPHANUM>", "position": 12 }
  ]
}

Filling with test data


_Bulk request

POST http://localhost:9200/_bulk
{ "index": { "_index": "app", "_type": "product", "_id": 195111 } }
{ "name": " 10+30/      10", "manufacturer": "   ", "city_id": 1, "company_id": 2, "category_id": 1 }
{ "index": { "_index": "app", "_type": "product", "_id": 195222 } }
{ "name": "     10 +30 /: 10 ", "manufacturer": "", "city_id": 1, "company_id": 3, "category_id": 1 }
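
The body of a _bulk request is newline-delimited JSON: an action line followed by a source line for each document, ending with a newline. Below is a minimal Python sketch of how such a body could be assembled. The helper name build_bulk_body and the sample product values are hypothetical; only the index name app and the type product come from the request above.

```python
import json

def build_bulk_body(products, index="app", doc_type="product"):
    """Serialize products into the newline-delimited body used by _bulk."""
    lines = []
    for product in products:
        source = dict(product)      # copy so the caller's dict is not mutated
        doc_id = source.pop("id")   # the id goes into the action line only
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk body has to end with a newline character.
    return "\n".join(lines) + "\n"

# Hypothetical sample values, for illustration only.
sample = [
    {"id": 1, "name": "sample ointment 10g", "manufacturer": "sample maker",
     "city_id": 1, "company_id": 2, "category_id": 1},
]
body = build_bulk_body(sample)
```

The resulting string would be sent as the request body with the Content-Type header set to application/x-ndjson.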

Searching for matches


Suppose our client's product, for which we want to find all similar competitor products, has the following characteristics:

 { "name": "     10 +30 /   10 ", "manufacturer": "   ", "city_id": 1, "company_id": 1, "category_id": 1 } 

Using a reference list of medicines, we extract the drug name from the product name, in this case the word "hyoxysone". This word becomes a mandatory criterion.

We also extract all the numbers from the name, “10 30 10”; they are a mandatory criterion as well. Moreover, if a number occurs twice in the name, it must also occur twice in the matched product, otherwise the chance of matching the wrong product increases.
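
The number rule can be sketched in Python as follows. The function names and the regular expression are assumptions made for illustration; the article does not show how the numbers are actually extracted.

```python
import re
from collections import Counter

# Integers or decimals; a decimal comma is treated like a dot,
# mirroring the comma_to_dot_char_filter of the analyzer.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")

def extract_numbers(name):
    """Return the numbers found in a product name, in order, duplicates kept."""
    return [n.replace(",", ".") for n in NUMBER_RE.findall(name)]

def same_numbers(name_a, name_b):
    """True if both names contain the same numbers the same number of times."""
    return Counter(extract_numbers(name_a)) == Counter(extract_numbers(name_b))

numbers = extract_numbers(
    "Hyoxysone 10mg + 30mg / g ointment for external use, tube 10g"
)
phrase = " ".join(numbers)  # candidate value for a match_phrase clause
```

Counter compares the numbers as multisets, so a candidate that contains only one of the two 10s is rejected even though every distinct number is present.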

_Search query

GET http://localhost:9200/app/product/_search
{
  "query": {
    "bool": {
      "filter": [
        { "terms": { "company_id": [ 2, 3, 4, 5, 6, 7, 8 ] } },
        { "term": { "city_id": { "value": 1, "boost": 1 } } },
        { "term": { "category_id": { "value": 1, "boost": 1 } } }
      ],
      "must": [
        {
          "bool": {
            "should": [
              { "match": { "name": { "query": "    + /   ", "boost": 1, "operator": "or", "minimum_should_match": 0, "fuzziness": "AUTO" } } }
            ],
            "must": [
              { "match": { "name": { "query": "", "boost": 2, "operator": "or", "minimum_should_match": "70%", "fuzziness": "AUTO" } } },
              { "match_phrase": { "name": { "query": "10 30 10", "boost": 2, "slop": 100 } } }
            ]
          }
        }
      ],
      "should": [
        {
          "bool": {
            "should": [
              { "match": { "manufacturer": { "query": "   ", "boost": 1, "operator": "or", "minimum_should_match": "70%", "fuzziness": "AUTO" } } },
              { "match": { "manufacturer": { "query": "alenta armacevtika ", "boost": 1, "operator": "or", "minimum_should_match": "70%", "fuzziness": "AUTO" } } }
            ]
          }
        }
      ]
    }
  },
  "highlight": { "fields": { "name": {} } },
  "size": 50
}
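
Since the query has the same shape for every product, it could be generated from the product's fields. The Python sketch below is one possible way to do that, under stated assumptions: build_match_query and its parameters are hypothetical names, the nested bool clauses of the request above are flattened into a single bool, the default boost of 1 is omitted, the optional name clause receives the full product name, and the second manufacturer spelling is left out.

```python
def build_match_query(product, drug_name, numbers, competitor_ids, size=50):
    """Build a _search body for one client product (simplified sketch).

    drug_name: mandatory drug name taken from the medicines reference list.
    numbers:   numbers extracted from the product name, duplicates kept.
    """
    fuzzy = {"operator": "or", "fuzziness": "AUTO"}
    return {
        "query": {
            "bool": {
                # Filters narrow the candidates without affecting the score.
                "filter": [
                    {"terms": {"company_id": list(competitor_ids)}},
                    {"term": {"city_id": product["city_id"]}},
                    {"term": {"category_id": product["category_id"]}},
                ],
                "must": [
                    # Mandatory: the drug name, matched fuzzily ...
                    {"match": {"name": dict(fuzzy, query=drug_name, boost=2,
                                            minimum_should_match="70%")}},
                    # ... and the numbers; the large slop lets other
                    # words sit between them.
                    {"match_phrase": {"name": {"query": " ".join(numbers),
                                               "boost": 2, "slop": 100}}},
                ],
                "should": [
                    # Optional clauses only raise the score of a candidate.
                    {"match": {"name": dict(fuzzy, query=product["name"])}},
                    {"match": {"manufacturer": dict(
                        fuzzy, query=product["manufacturer"],
                        minimum_should_match="70%")}},
                ],
            }
        },
        "highlight": {"fields": {"name": {}}},
        "size": size,
    }
```

Keeping the mandatory criteria in must and everything else in should means that the remaining words and the manufacturer influence only the ranking, not whether a candidate is returned.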

The response contains the ids of the matching products, their names with highlighted fragments, and the score, which is useful for analytics.
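
Reading those fields from the response could look like the sketch below. The layout used (hits.hits with _id, _score, _source and highlight) is the standard Elasticsearch search response; the function name, the sample values and the score threshold are hypothetical.

```python
def extract_candidates(response, min_score=0.0):
    """Return id, score, name and highlighted fragments for each hit."""
    candidates = []
    for hit in response["hits"]["hits"]:
        if hit["_score"] < min_score:
            continue  # skip weak matches
        candidates.append({
            "id": hit["_id"],
            "score": hit["_score"],
            "name": hit["_source"]["name"],
            # The highlight key may be absent, so fall back to an empty list.
            "fragments": hit.get("highlight", {}).get("name", []),
        })
    return candidates

# Hypothetical response excerpt, for illustration only.
sample_response = {
    "hits": {"hits": [
        {"_id": "195111", "_score": 12.5,
         "_source": {"name": "sample ointment 10g"},
         "highlight": {"name": ["sample <em>ointment</em> <em>10g</em>"]}},
        {"_id": "195222", "_score": 3.0,
         "_source": {"name": "other ointment 10g"}},
    ]}
}
best = extract_candidates(sample_response, min_score=5.0)
```

A score threshold like min_score would have to be tuned on real data; it is shown here only as one way to separate likely matches from candidates that still need manual review.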


Conclusion


The described method will certainly not give 100% matching accuracy, but it greatly reduces the effort of manual matching. It is also suitable for tasks that do not require absolute accuracy.
In general, refining the search query with additional heuristics and extending the synonym list can bring the result close to satisfactory.
In addition, performance tests run on an old i7 showed good results: 10 search queries against an index of 200,000 products complete within a couple of seconds. A live version of this drug example can be found here.

Suggest your own approaches to product matching in the comments.

Thanks for your attention!

Source: https://habr.com/ru/post/428814/

