ElasticSearch and reverse search. Percolate API

The question of automatic categorization comes up in the development of a great many sites. Of course, you can always have a person do it, and the result will initially be much better than a machine's, but what if you have to categorize hundreds or thousands of “goods” in real time?
Then the job has to be handed to the machine. There are not many options, and writing your own AI is a waste of time for 99.9% of tasks.

If you are interested in how to solve this with the help of ElasticSearch, read on.

If you are not familiar with ElasticSearch, I recommend the excellent article "Fast full-text search ElasticSearch" by brujeo.

General idea

At SmartProgress, we implemented categorization as groups that bring users' goals together by common interest. But how do we match these groups (there are already more than 100) to a user's goal so that the user is offered at most 3 groups to choose from, each containing the most relevant goals?

The simplest option would be to bind goals to a group by tags, for example, but in practice this does not work as well as we would like. Besides, making users fill in tags is justified only in the IT sphere.

Suppose we have the category “Programming on Ruby on Rails”. A search query for it, written by the rules of the simple query string, looks like this:
Ruby | RoR | "Ruby on Rails" | " Ruby"~4 | " " -php -java -net
A short note on the query syntax: | means OR, "..." matches the entire phrase, and ~N allows the words of the phrase to be separated by up to N other words.

If you need to find all the “goods” (in our case, goals) that match this query, you simply run a search. But what if you need to find all the categories that fit a particular item? This is where the Percolate API comes to the rescue.

Percolate API

To be honest, my acquaintance with ElasticSearch began with this feature. Before that I had worked only with Sphinx, which cannot do reverse search.
So after reading the documentation I did not really understand what it was or how to work with it, and Google had very little information, especially for versions > 1.X. But perseverance won (there is life beyond page 5 of Google).

I will try to explain in simple terms how it works:

We create an index or take an existing one.
We add to it one or more documents with the special type .percolator, any unique id, and a body containing our query (example below).
Next, we send a request to _percolate and see which categories the “product” fits.
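
The three steps above can be modelled without ElasticSearch at all. The sketch below is a toy in-memory model of reverse search, not how ElasticSearch implements it: ToyPercolator, register and percolate are hypothetical names, and the matching rule (at least one listed word present, no excluded word present) is a deliberate simplification of the simple query string.

```python
import re


class ToyPercolator:
    """Toy model of reverse search: queries are stored, documents are matched against them."""

    def __init__(self):
        # query id -> (words of which at least one must occur, words that must not occur)
        self.queries = {}

    def register(self, query_id, any_of, none_of=()):
        # Step 2: store a query under a unique id (lower-cased for matching).
        self.queries[query_id] = (
            {w.lower() for w in any_of},
            {w.lower() for w in none_of},
        )

    def percolate(self, doc):
        # Step 3: tokenize the document and test it against every stored query.
        text = " ".join(str(v) for v in doc.values())
        words = set(re.findall(r"\w+", text.lower()))
        return sorted(
            query_id
            for query_id, (any_of, none_of) in self.queries.items()
            if words & any_of and not words & none_of
        )


percolator = ToyPercolator()  # Step 1: the "index" that holds the queries
percolator.register("ruby-on-rails", any_of=["ruby", "ror"], none_of=["php", "java"])
percolator.register("php", any_of=["php", "laravel"])

doc = {"name": "Learn Ruby on Rails", "description": "Build a blog with Ruby"}
matches = percolator.percolate(doc)
print(matches)
```

The point of the model is the direction of the lookup: the queries are the stored data, and the document is what gets "searched with".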

Working example

Let's try it in action:

- Create an index "test" (without mapping, we will not need it)

 curl -XPUT 'http://localhost:9200/test'

- Create .percolator

 curl -XPUT 'http://localhost:9200/test/.percolator/simple-search' -d ' { "query" : { "simple_query_string" : { "query" : "Ruby | RoR | \"Ruby on Rails\" | \" Ruby\"~4 | \"  \" -php -java -net", "analyzer" : "simple", "fields" : ["name^5", "description"], "default_operator" : "and" } }, "language" : "ru" }'

More details:
test - the index
.percolator - the type
simple-search - the ID (can be either an integer or a string)
"query" - the search query
simple_query_string - the query type used for the search
"fields" : ["name^5", "description"] - the fields to search, with a boost factor of 5 for the "name" field, since it usually holds the most important information
"language" : "ru" - an additional parameter; such parameters are optional, there can be several of them, of any type, and they are used to filter the result

In fact, a .percolator document is the same kind of object as any other in the index, so a mapping can be applied to it as well.
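
Hand-written JSON inside a shell string is easy to break (a stray trailing comma or a missed escape is enough). One way around this, shown here as a sketch, is to build the body as a data structure and serialize it. The field names come from the curl example above; the query string is shortened for readability, and the variable names are my own.

```python
import json

# Field names and values follow the curl example above; "language" is the
# extra metadata field that the percolate request later filters on.
percolator_body = {
    "query": {
        "simple_query_string": {
            # Shortened version of the article's query, for illustration only.
            "query": 'Ruby | RoR | "Ruby on Rails" -php -java -net',
            "analyzer": "simple",
            "fields": ["name^5", "description"],
            "default_operator": "and",
        }
    },
    "language": "ru",
}

# json.dumps takes care of quoting and escaping the inner double quotes.
payload = json.dumps(percolator_body)
print(payload)
```

The resulting string can be sent as the request body with any HTTP client.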

- Run the search:

 curl -XPOST 'http://localhost:9200/test/category/_percolate?pretty' -d ' { "doc" : { "name" : " Ruby on Rails  ", "description" : "    Ruby" }, "filter" : { "term" : { "language" : "ru" } } }'

Answer:

 { "took" : 5, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "total" : 1, "matches" : [ { "_index" : "test", "_id" : "simple-search" } ] }

There may be several matches.
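
On the application side, the "matches" array is what you turn into suggested categories. A minimal sketch, assuming the response shape shown above: the three ids in the sample response are made up for illustration, and matched_ids is a hypothetical helper.

```python
import json

# Hypothetical response with three matches; the ids are made up for illustration.
response_text = """
{
  "took": 5,
  "_shards": {"total": 5, "successful": 5, "failed": 0},
  "total": 3,
  "matches": [
    {"_index": "test", "_id": "ruby-on-rails"},
    {"_index": "test", "_id": "web-development"},
    {"_index": "test", "_id": "learn-programming"}
  ]
}
"""


def matched_ids(response_text, limit=None):
    """Return the ids of the matching percolator queries, optionally capped at `limit`."""
    response = json.loads(response_text)
    ids = [match["_id"] for match in response.get("matches", [])]
    return ids if limit is None else ids[:limit]


category_ids = matched_ids(response_text, limit=2)
print(category_ids)
```

The cap is applied in the order the matches arrive, so this sketch does not rank categories by relevance.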

This method is great if for some reason you do not want to send all your data to ElasticSearch, or if you want to test your percolator.

Search by existing data (ElasticSearch >= v1.0)

Let's add one entry to the test index:

 curl -XPUT 'http://localhost:9200/test/category/1' -d ' { "name" : " Ruby on Rails  ", "description" : "    Ruby" }'

And see which categories this entry fits:

 curl -XGET 'http://localhost:9200/test/category/1/_percolate?pretty'

Answer:

 { "took" : 4, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "total" : 1, "matches" : [ { "_index" : "test", "_id" : "simple-search" } ] }

Well, we have managed to do a “reverse search”. Where can this be used? There are plenty of applications, from tricky selections to event reminders; you are limited only by your imagination and RAM.

A fly in the ointment

Unfortunately, not everything is as wonderful as it looks from the outside. Yes, it works, but there are downsides:

All .percolator queries are stored in RAM.
Each percolated document is indexed in RAM.
Execution time is linear in the number of .percolator queries.

Yes, percolators support replication like any other ElasticSearch object, but this mechanism still has to be used with great care.

A couple of simple tips to avoid out of memory:

If your query can be made simpler, for example by using a match / bool query, do so; the query string language is rather slow compared to plain value comparison.
Use filters and narrow the search as much as the application logic allows; it will save you some memory.
Do not create too many .percolator queries; if you have thousands of them, you should either rethink your logic or stock up on RAM.
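
To illustrate why filters help, here is a toy model (not a measurement of ElasticSearch itself) in which a metadata filter is checked before the query is evaluated. The percolate function, the query ids and the word sets are hypothetical; the "language" tag mirrors the one used in the article's example.

```python
def percolate(doc_words, queries, language=None):
    """Match a document against stored queries; return (matching ids, queries evaluated)."""
    evaluated = 0
    matches = []
    for query_id, query in queries.items():
        # Cheap metadata check first: skip queries the filter rules out.
        if language is not None and query["language"] != language:
            continue
        evaluated += 1  # only these queries pay the matching cost
        if doc_words & query["any_of"]:
            matches.append(query_id)
    return matches, evaluated


# Hypothetical category queries, each tagged with a language as in the article.
queries = {
    "ruby-ru": {"language": "ru", "any_of": {"ruby", "ror"}},
    "ruby-en": {"language": "en", "any_of": {"ruby", "rails"}},
    "php-en": {"language": "en", "any_of": {"php"}},
}
doc_words = {"learn", "ruby", "on", "rails"}

unfiltered = percolate(doc_words, queries)
filtered = percolate(doc_words, queries, language="ru")
print(unfiltered, filtered)
```

Without the filter every stored query is evaluated; with it, only the queries tagged "ru" are, which is the linear cost from the list of downsides shrinking with the candidate set.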

Useful information

Whats new in percolator - an excellent presentation from ES developers, which very clearly explains the essence of technology
Percolator API - official documentation page
DHC - REST HTTP API Client - a great plugin for Google Chrome that lets you talk to ES quickly and conveniently

Another article of mine on ES: ElasticSearch - data aggregation

P.S. I am not an ES guru, so I welcome any comments and additions.

Source: https://habr.com/ru/post/226749/
