
How we built search on Elasticsearch at vulners.com


As we wrote earlier, Elasticsearch is the main database behind search on the site. Search in Elasticsearch is very fast, and many useful features for working with data are available out of the box: full-text search, fuzzy search, various aggregation methods, and so on.

Unlike classical SQL databases or NoSQL stores such as MongoDB, Elasticsearch makes it very convenient to run a fuzzy search across the whole document. This is done with the Query DSL. There are several query types for full-text search across a document; on our site we use query_string. This query type supports Lucene syntax, which lets both us and our users write complex Google-style queries. Here are some examples:

title:apache AND title:vulnerability
type:centos cvss.score:[8 TO 10]

You can send a simple query like this and be done:
{ "query": { "query_string": { "query": "exploit wordpress" } } } 

But when you first start using query_string, you will find that the search does not return what you expect. How do you get clear, sensible results out of Elasticsearch?

This is where we first run into the Elasticsearch concept of relevance, also known as score. A very detailed description is available on the official website, but in short, the score shows how well a document matches your query. Each document found carries a _score field, and search results are automatically sorted by it. In most cases this is appropriate, but what if the user wants to sort by date? Then the JSON request needs an additional sort field, like this:

 { "query": { "query_string": { "query": "hackapp" } }, "sort": "published" } 

This means the field has to be sent separately in the POST request, forcing the user to type it somewhere or pick it from, say, a drop-down list. But why not do it right in the search query? This brings us to hack number 1: we use a regular expression to look for a phrase of the form (order|sort):\w+ in the search string, cut it out, and pass the specified field as an additional parameter in the JSON.
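A minimal Python sketch of this hack is given here for illustration. The function names split_sort and build_body are hypothetical, and the pattern is only one reasonable reading of (order|sort):\w+; the site's actual code is not shown in the article.

```python
import re

# Assumed directive syntax: "order:<field>" or "sort:<field>" anywhere in the input.
SORT_RE = re.compile(r"\b(?:order|sort):(\w+)")


def split_sort(raw_query):
    """Return (query text without the directive, sort field or None)."""
    match = SORT_RE.search(raw_query)
    if match is None:
        return raw_query.strip(), None
    # Cut the directive out and collapse the whitespace left behind.
    remainder = raw_query[:match.start()] + raw_query[match.end():]
    return " ".join(remainder.split()), match.group(1)


def build_body(raw_query):
    """Build a request body, adding "sort" only when a directive was found."""
    text, sort_field = split_sort(raw_query)
    body = {"query": {"query_string": {"query": text}}}
    if sort_field is not None:
        body["sort"] = sort_field
    return body
```

With this sketch, the input "hackapp order:published" would produce the same body as the sorted request shown earlier, while a query without a directive is passed through unchanged.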

We also spotted a handy dork used by our great colleagues at Wallarm: last N days. We liked it immediately, because it lets you quickly see, for example, the vulnerabilities from the last month. As you can guess, this can also be written directly in the search box: a regular expression pulls it out and substitutes it into the request. There is no need for clever date arithmetic either; the condition can be given in the form {"gte": "now-3d/d"}. Along the way we discover new query constructs in Elasticsearch: bool and filter. After two hacks, we already have this query:

 { "query": { "bool": { "filter": { "range": { "published": { "gte": "now-3d/d" } } }, "should": { "query_string": { "query": "wordpress" } } } } } 
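The following Python sketch shows one way the "last N days" phrase could be turned into the range filter above. The exact phrase syntax, the regular expression, and the name build_recent_query are assumptions made for illustration; only the resulting JSON shape comes from the article.

```python
import re

# Assumed phrase syntax: "last 3 days" (or "last 1 day") typed in the search box.
LAST_DAYS_RE = re.compile(r"\blast\s+(\d+)\s+days?\b", re.IGNORECASE)


def build_recent_query(raw_query):
    """Move a "last N days" phrase out of the text and into a date range filter."""
    match = LAST_DAYS_RE.search(raw_query)
    if match is None:
        return {"query": {"query_string": {"query": raw_query.strip()}}}

    days = int(match.group(1))
    remainder = raw_query[:match.start()] + raw_query[match.end():]
    text = " ".join(remainder.split())

    # Date math such as "now-3d/d" is evaluated by Elasticsearch itself,
    # so no date arithmetic is needed on the client side.
    bool_query = {"filter": {"range": {"published": {"gte": f"now-{days}d/d"}}}}
    if text:
        bool_query["should"] = {"query_string": {"query": text}}
    return {"query": {"bool": bool_query}}
```

Calling build_recent_query("wordpress last 3 days") yields the query shown above. When nothing but the phrase is typed, the sketch keeps only the filter, since an empty query_string would not be useful.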

The search seems to work, but the notorious relevance still leaves much to be desired, and we want to tune it. Enter the concept of boost: depending on a particular criterion, we can influence the final score. With query_string, the easiest way is to list the fields together with a coefficient for each. This is done with the additional fields parameter:

 { "query": { "bool": { "filter": { "range": { "published": { "gte": "now-3d/d" } } }, "should": { "query_string": { "query": "wordpress", "fields": [ "title^2", "type^3", "affectedPackage.packageName^3", "affectedSoftware.name^3", "_all" ], "default_operator": "AND" } } } } } 

Note that a coefficient of 2 does not mean the score is multiplied by exactly 2; it makes a match in that field count for more relative to the other fields.

Along the way we also set the default_operator parameter, so that the words in the query are combined with AND by default.
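As an illustration, the fields list with its coefficients can be generated from a plain mapping instead of being written by hand. This is a sketch only: the helper name query_string_clause and its parameters are hypothetical, while the field^boost notation and the _all catch-all field are taken from the query above.

```python
def query_string_clause(text, field_boosts, catch_all="_all"):
    """Build a query_string clause with per-field boosts and AND semantics.

    field_boosts maps a field name to its coefficient, for example
    {"title": 2, "type": 3}; catch_all is appended without a boost.
    """
    fields = [f"{name}^{boost}" for name, boost in field_boosts.items()]
    fields.append(catch_all)
    return {
        "query_string": {
            "query": text,
            "fields": fields,
            "default_operator": "AND",
        }
    }
```

Keeping the coefficients in one mapping makes it easier to experiment with different values while tuning relevance.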

The search got better, but we ran into cases where articles that mention a topic many times rise to the top and push the more important new vulnerabilities or exploits out of the user's view. We decided to fix this in two ways. The first is to add a boost based on document type. We want to list only those types that should lower or raise the final score, so this condition may go unmatched. To do this, use a bool query with a must condition that has to match (the user query) and an optional should condition that lists the relevant types.

 { "query": { "bool": { "must": [ { "bool": { "minimum_should_match": 1, "should": [ { "query_string": { "default_operator": "AND", "fields": [ "id^4", "title^3", "_all" ], "query": "wordpress" } } ] } } ], "should": [ { "bool": { "minimum_should_match": 0, "should": [ { "term": { "boost": 2.5, "type": "exploit" } }, { "term": { "boost": 2, "type": "software" } }, { "term": { "boost": 0.3, "type": "info" } } ] } } ] } } } 
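The structure above can be assembled programmatically, as in the Python sketch below. The function names are hypothetical, and the shape of each term clause is copied as-is from the article's query rather than checked against any particular Elasticsearch version.

```python
def type_boost_clause(type_boosts):
    """Optional clause: each listed document type nudges the score up or down."""
    return {
        "bool": {
            "minimum_should_match": 0,  # none of the types has to match
            "should": [
                {"term": {"boost": boost, "type": doc_type}}
                for doc_type, boost in type_boosts.items()
            ],
        }
    }


def boosted_query(user_clause, type_boosts):
    """Require the user's query (must) and add the optional type boosts (should)."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"bool": {"minimum_should_match": 1, "should": [user_clause]}}
                ],
                "should": [type_boost_clause(type_boosts)],
            }
        }
    }
```

For example, passing the query_string clause for "wordpress" together with {"exploit": 2.5, "software": 2, "info": 0.3} reproduces the request shown above.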

On top of that, we wanted newer documents to appear nearer the top of the results. A linear factor is not enough here, so instead of a plain query we use function_score. The query built so far goes into the query element inside function_score, and the function itself must also be specified. We use the modified field and a Gaussian decay. In this case the current date serves as the origin. Elasticsearch supports such decay functions for numeric, date, and geolocation fields, and you can set any origin, which is a big plus of using Elasticsearch. In total, our request takes the following form:

 { "from": 0, "query": { "function_score": { "functions": [ { "weight": 1 }, { "gauss": { "modified": { "scale": "12w" } } } ], "query": { "bool": { "must": [ { "bool": { "minimum_should_match": 1, "should": [ { "query_string": { "default_operator": "AND", "fields": [ "id^4", "title^3", "_all" ], "query": "wordpress" } } ] } } ], "should": [ { "bool": { "minimum_should_match": 0, "should": [ { "term": { "boost": 2, "type": "unix" } }, { "term": { "boost": 2.5, "type": "exploit" } }, { "term": { "boost": 2, "type": "software" } }, { "term": { "boost": 2, "type": "nvd" } }, { "term": { "boost": 0.3, "type": "info" } } ] } } ] } } } }, "size": 20, "sort": [ { "_score": { "order": "desc" } }, { "published": { "order": "desc" } } ] } 
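To get a feel for what the gauss function does to older documents, here is a small Python sketch of the Gaussian decay formula as described in the Elasticsearch documentation. It assumes the default decay of 0.5 and an offset of 0, since the query above sets neither, and it measures distance in days so that the "12w" scale becomes 84.

```python
import math


def gauss_decay(distance, scale, offset=0.0, decay=0.5):
    """Multiplier for a value lying `distance` away from the origin.

    The curve is shaped so that the multiplier equals `decay`
    at a distance of exactly `scale` beyond the offset.
    """
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    effective = max(0.0, abs(distance) - offset)
    return math.exp(-(effective ** 2) / (2.0 * sigma_sq))


SCALE_DAYS = 12 * 7  # "12w" expressed in days

for age_days in (0, 42, 84, 168):
    print(age_days, round(gauss_decay(age_days, SCALE_DAYS), 4))
```

Under these assumptions, a document modified today keeps its full score, one modified 12 weeks ago gets about half, and one modified 24 weeks ago gets roughly one sixteenth, so the falloff is much steeper than a linear factor would give.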

One final touch remains: we add highlight to make it easier to see why a given document was selected, and with that our small query is ready:

 { "from": 0, "query": { "function_score": { "functions": [ { "weight": 1 }, { "gauss": { "modified": { "scale": "12w" } } } ], "query": { "bool": { "must": [ { "bool": { "minimum_should_match": 1, "should": [ { "query_string": { "default_operator": "AND", "fields": [ "id^4", "title^3", "_all" ], "query": "wordpress" } } ] } } ], "should": [ { "bool": { "minimum_should_match": 0, "should": [ { "term": { "boost": 2, "type": "unix" } }, { "term": { "boost": 2.5, "type": "exploit" } }, { "term": { "boost": 2, "type": "software" } }, { "term": { "boost": 2, "type": "nvd" } }, { "term": { "boost": 0.3, "type": "info" } } ] } } ] } } } }, "size": 20, "sort": [ { "_score": { "order": "desc" } }, { "published": { "order": "desc" } } ], "highlight": { "fields": { "*": { "fragment_size": 100, "number_of_fragments": 4, "post_tags": [ "</span>" ], "pre_tags": [ "<span class=\"vulners-highlight\">" ], "require_field_match": false } } } } 

The bottom line: Elasticsearch is easy to learn for basic queries, but if you need relevant and flexible search across documents with completely different content, be prepared to be patient.

Source: https://habr.com/ru/post/310688/

