Elasticsearch is a search engine with a JSON REST API that uses Lucene and is written in Java. A description of all the advantages of this engine is available on the official website. From here on, I will refer to Elasticsearch as ES.
Engines like this are used for complex searches over a collection of documents, for example, searches that take into account the morphology of a language, or searches by geo coordinates.
In this article I will cover the basics of ES using the indexing of blog posts as an example. I will show how to filter, sort, and search documents.
In order not to depend on the operating system, I will make all requests to ES using cURL. There is also a plugin for Google Chrome called Sense.
The text contains links to the documentation and other sources. At the end there are links for quick access to the documentation. Definitions of unfamiliar terms can be found in the glossary.
To get started, we first need Java. The developers recommend installing a Java version newer than Java 8 update 20 or Java 7 update 55.
The ES distribution is available on the developer's website. After unpacking the archive, you need to run bin/elasticsearch. Packages for apt and yum are also available, and there is an official image for Docker. Read more in the installation documentation.
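For example, a minimal way to run ES in Docker might look like this (a sketch; elasticsearch is the official image on Docker Hub, and the 2.2 tag is my assumption, chosen to match the version used below):

# Run ES in a container and publish the HTTP port
docker run -d --name es -p 9200:9200 elasticsearch:2.2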
After installation and launch, let's check that it works:
# If you use docker-machine, then:
# export ES_URL=$(docker-machine ip dev):9200
export ES_URL=localhost:9200

curl -X GET $ES_URL
We will receive approximately the following response:
{ "name" : "Heimdall", "cluster_name" : "elasticsearch", "version" : { "number" : "2.2.1", "build_hash" : "d045fc29d1932bce18b2e65ab8b297fbf6cd41a1", "build_timestamp" : "2016-03-09T09:38:54Z", "build_snapshot" : false, "lucene_version" : "5.4.1" }, "tagline" : "You Know, for Search" }
Add a post to ES:
# Create a document of type post with id 1 in the blog index.
# The ?pretty parameter makes the response human-readable.
curl -XPUT "$ES_URL/blog/post/1?pretty" -d'
{
  "title": "Веселые котята",
  "content": "<p>Смешная история про котят<p>",
  "tags": [
    "котята",
    "смешные истории"
  ],
  "published_at": "2014-09-12T20:44:42+00:00"
}'
Server response:
{ "_index" : "blog", "_type" : "post", "_id" : "1", "_version" : 1, "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "created" : false }
ES automatically created the blog index and the post type. You can draw a rough analogy: an index is a database, and a type is a table in that database. Each type has its own mapping, just as each relational table has a schema. The mapping is generated automatically when a document is indexed:
# Get the mapping of the blog index
curl -XGET "$ES_URL/blog/_mapping?pretty"
In the server's response, I have added the field values of the indexed document as comments:
{ "blog" : { "mappings" : { "post" : { "properties" : { /* "content": "<p> <p>", */ "content" : { "type" : "string" }, /* "published_at": "2014-09-12T20:44:42+00:00" */ "published_at" : { "type" : "date", "format" : "strict_date_optional_time||epoch_millis" }, /* "tags": ["", " "] */ "tags" : { "type" : "string" }, /* "title": " " */ "title" : { "type" : "string" } } } } } }
It is worth noting that ES does not distinguish between a single value and an array of values. For example, the title field contains just a title, while the tags field contains an array of strings, yet they are represented identically in the mapping.
We'll talk more about mapping later.
# Get the document of type post with id 1 from the blog index
curl -XGET "$ES_URL/blog/post/1?pretty"
{ "_index" : "blog", "_type" : "post", "_id" : "1", "_version" : 1, "found" : true, "_source" : { "title" : " ", "content" : "<p> <p>", "tags" : [ "", " " ], "published_at" : "2014-09-12T20:44:42+00:00" } }
New keys appeared in the response: _version and _source. In general, all keys starting with _ are service fields.
The _version key shows the document version. It is needed for the optimistic locking mechanism. For example, we want to change a document that has version 1. We send the modified document and indicate that we are editing a document with version 1. If someone else has also edited the document with version 1 and submitted their changes before us, ES will not accept our changes, because it already stores the document with version 2.
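A minimal sketch of this check with the document above: we resend the document, claiming that we last saw version 1. If the stored version is different, ES rejects the update with a version conflict error (HTTP 409):

# Update document 1 only if its current version is still 1
curl -XPUT "$ES_URL/blog/post/1?version=1&pretty" -d'
{
  "title": "Веселые котята",
  "content": "<p>Смешная история про котят<p>",
  "tags": [ "котята", "смешные истории" ],
  "published_at": "2014-09-12T20:44:42+00:00"
}'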
The _source key contains the document we indexed. ES does not use this value for search operations, because indexes are used for searching. To save space, ES stores the source document in compressed form. If we only need the id and not the entire source document, we can disable the storage of the source.
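Disabling the source is done in the mapping. A sketch, not used further in this article; the blog_no_source index name is hypothetical:

# Create an index whose post type does not store _source
curl -XPUT "$ES_URL/blog_no_source" -d'
{
  "mappings": {
    "post": {
      "_source": { "enabled": false }
    }
  }
}'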
If we do not need the additional information, we can fetch just the content of _source:
curl -XGET "$ES_URL/blog/post/1/_source?pretty"
{ "title" : " ", "content" : "<p> <p>", "tags" : [ "", " " ], "published_at" : "2014-09-12T20:44:42+00:00" }
You can also select only specific fields:
# Get only the title field
curl -XGET "$ES_URL/blog/post/1?_source=title&pretty"
{ "_index" : "blog", "_type" : "post", "_id" : "1", "_version" : 1, "found" : true, "_source" : { "title" : " " } }
Let's index some more posts and execute more complex queries.
curl -XPUT "$ES_URL/blog/post/2" -d' { "title": " ", "content": "<p> <p>", "tags": [ "", " " ], "published_at": "2014-08-12T20:44:42+00:00" }'
curl -XPUT "$ES_URL/blog/post/3" -d' { "title": " ", "content": "<p> <p>", "tags": [ "" ], "published_at": "2014-07-21T20:44:42+00:00" }'
# Get the title and published_at of the most recent post
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "size": 1,
  "_source": ["title", "published_at"],
  "sort": [{"published_at": "desc"}]
}'
{ "took" : 8, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : null, "hits" : [ { "_index" : "blog", "_type" : "post", "_id" : "1", "_score" : null, "_source" : { "title" : " ", "published_at" : "2014-09-12T20:44:42+00:00" }, "sort" : [ 1410554682000 ] } ] } }
We got the most recent post. size limits the number of documents in the result. total shows the total number of documents matching the request. sort in the output contains an array of integers by which the sorting was performed; here the date has been converted to an integer. More information about sorting can be found in the documentation.
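For comparison, a sketch of the same query with the opposite sort order, which should return the oldest post:

# Get the title and published_at of the oldest post
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "size": 1,
  "_source": ["title", "published_at"],
  "sort": [{"published_at": "asc"}]
}'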
Since version 2, ES does not distinguish between filters and queries; instead, the concept of contexts has been introduced.
The query context differs from the filter context in that a query generates a _score and is not cached. I will show what _score is later.
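Both contexts can be combined in a single bool query. A sketch, assuming the blog index and sample documents from above: the match clause runs in the query context and contributes to _score, while the range clause runs in the filter context:

# Posts that mention 'история', limited to those published since August 1st
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "query": {
    "bool": {
      "must":   { "match": { "content": "история" } },
      "filter": { "range": { "published_at": { "gte": "2014-08-01" } } }
    }
  }
}'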
Let's use the range query in the filter context:
# Get posts published on September 1st or later
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "filter": {
    "range": {
      "published_at": { "gte": "2014-09-01" }
    }
  }
}'
Let's use the term query to find the ids of documents containing the specified word:
# Get posts whose tags contain 'котята'
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "_source": [
    "title",
    "tags"
  ],
  "filter": {
    "term": {
      "tags": "котята"
    }
  }
}'
{ "took" : 9, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ { "_index" : "blog", "_type" : "post", "_id" : "1", "_score" : 1.0, "_source" : { "title" : " ", "tags" : [ "", " " ] } }, { "_index" : "blog", "_type" : "post", "_id" : "3", "_score" : 1.0, "_source" : { "title" : " ", "tags" : [ "" ] } } ] } }
Our three documents contain the following in the content field:
<p> <p>
<p> <p>
<p> <p>
Let's use the match query to find the ids of documents containing the given word:
# _source: false means that we do not need the _source of the found documents
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "_source": false,
  "query": {
    "match": {
      "content": "история"
    }
  }
}'
{ "took" : 13, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : 0.11506981, "hits" : [ { "_index" : "blog", "_type" : "post", "_id" : "2", "_score" : 0.11506981 }, { "_index" : "blog", "_type" : "post", "_id" : "1", "_score" : 0.11506981 }, { "_index" : "blog", "_type" : "post", "_id" : "3", "_score" : 0.095891505 } ] } }
However, if we search for "истории" ("stories") in the content field, we will find nothing, because the index contains only the original word forms, not their stems. To get a quality search, we need to configure an analyzer.
The _score field shows relevance. If the request is executed in the filter context, the _score value will always be 1, which means full compliance with the filter.
Analyzers are needed to convert the source text into a set of tokens.
Analyzers consist of one Tokenizer and zero or more optional TokenFilters. The Tokenizer may be preceded by several CharFilters. The Tokenizer breaks the source string into tokens, for example, on spaces and punctuation. A TokenFilter can change tokens, delete them, or add new ones, for example, keep only the stem of a word, remove prepositions, or add synonyms. A CharFilter changes the entire source string, for example, strips HTML tags.
ES ships with several standard analyzers, for example, the russian analyzer.
Let's use the _analyze API to see how the standard and russian analyzers transform the string "Веселые истории про котят" ("Funny stories about kittens"):
# The standard analyzer
# The text is passed URL-encoded
curl -XGET "$ES_URL/_analyze?pretty&analyzer=standard&text=%D0%92%D0%B5%D1%81%D0%B5%D0%BB%D1%8B%D0%B5%20%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D0%B8%20%D0%BF%D1%80%D0%BE%20%D0%BA%D0%BE%D1%82%D1%8F%D1%82"
{ "tokens" : [ { "token" : "", "start_offset" : 0, "end_offset" : 7, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "", "start_offset" : 8, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "", "start_offset" : 16, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "", "start_offset" : 20, "end_offset" : 25, "type" : "<ALPHANUM>", "position" : 3 } ] }
# The russian analyzer
curl -XGET "$ES_URL/_analyze?pretty&analyzer=russian&text=%D0%92%D0%B5%D1%81%D0%B5%D0%BB%D1%8B%D0%B5%20%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D0%B8%20%D0%BF%D1%80%D0%BE%20%D0%BA%D0%BE%D1%82%D1%8F%D1%82"
{ "tokens" : [ { "token" : "", "start_offset" : 0, "end_offset" : 7, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "", "start_offset" : 8, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "", "start_offset" : 20, "end_offset" : 25, "type" : "<ALPHANUM>", "position" : 3 } ] }
The standard analyzer split the string on spaces and lowercased everything. The russian analyzer removed insignificant words, lowercased the tokens, and kept only the stems of the words.
Let's see which Tokenizer, TokenFilters, and CharFilters the russian analyzer uses:
{ "filter": { "russian_stop": { "type": "stop", "stopwords": "_russian_" }, "russian_keywords": { "type": "keyword_marker", "keywords": [] }, "russian_stemmer": { "type": "stemmer", "language": "russian" } }, "analyzer": { "russian": { "tokenizer": "standard", /* TokenFilters */ "filter": [ "lowercase", "russian_stop", "russian_keywords", "russian_stemmer" ] /* CharFilters */ } } }
Let's describe our own analyzer based on russian that will strip HTML tags. We'll call it default, because an analyzer with this name is used by default.
{ "filter": { "ru_stop": { "type": "stop", "stopwords": "_russian_" }, "ru_stemmer": { "type": "stemmer", "language": "russian" } }, "analyzer": { "default": { /* html */ "char_filter": ["html_strip"], "tokenizer": "standard", "filter": [ "lowercase", "ru_stop", "ru_stemmer" ] } } }
First, all HTML tags are removed from the source string, then the standard tokenizer splits it into tokens, the resulting tokens are lowercased, insignificant words are removed, and the remaining tokens are reduced to the stems of the words.
Above we described the default analyzer. It will apply to all string fields. Our post contains an array of tags, so the tags would also be processed by the analyzer. Since we look up posts by exact tag match, we need to disable analysis for the tags field.
Let's create a blog2 index with an analyzer and a mapping in which analysis of the tags field is disabled:
curl -XPOST "$ES_URL/blog2" -d' { "settings": { "analysis": { "filter": { "ru_stop": { "type": "stop", "stopwords": "_russian_" }, "ru_stemmer": { "type": "stemmer", "language": "russian" } }, "analyzer": { "default": { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "ru_stop", "ru_stemmer" ] } } } }, "mappings": { "post": { "properties": { "content": { "type": "string" }, "published_at": { "type": "date" }, "tags": { "type": "string", "index": "not_analyzed" }, "title": { "type": "string" } } } } }'
Let's add the same 3 posts to this index (blog2). I will omit this process, because it is identical to adding documents to the blog index.
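For completeness, here is a sketch of how this could be done in one request with the _bulk API instead of three separate PUTs, using the sample documents from above (note that --data-binary preserves the newlines that the bulk format requires):

# Index all three posts into blog2 in a single bulk request
curl -XPOST "$ES_URL/blog2/post/_bulk" --data-binary '
{"index": {"_id": "1"}}
{"title": "Веселые котята", "content": "<p>Смешная история про котят<p>", "tags": ["котята", "смешные истории"], "published_at": "2014-09-12T20:44:42+00:00"}
{"index": {"_id": "2"}}
{"title": "Веселые щенки", "content": "<p>Смешная история про щенков<p>", "tags": ["щенки", "смешные истории"], "published_at": "2014-08-12T20:44:42+00:00"}
{"index": {"_id": "3"}}
{"title": "Как у меня появился котенок", "content": "<p>Душераздирающая история про бедного котенка<p>", "tags": ["котята"], "published_at": "2014-07-21T20:44:42+00:00"}
'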
Let's get acquainted with another type of query:
# Search for the word 'истории'
# query -> simple_query_string -> query contains the search string
# the title field has a weight of 3
# the tags field has a weight of 2
# the content field has the default weight of 1
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
  "query": {
    "simple_query_string": {
      "query": "истории",
      "fields": [
        "title^3",
        "tags^2",
        "content"
      ]
    }
  }
}'
Since we now use an analyzer with Russian stemming, this query returns all the documents, even though they contain only the word 'история'.
The query string may contain special characters, for example:
"\"fried eggs\" +(eggplant | potato) -frittata"
Query syntax:
+ signifies AND operation
| signifies OR operation
- negates a single token
" wraps a number of tokens to signify a phrase for searching
* at the end of a term signifies a prefix query
( and ) signify precedence
~N after a word signifies edit distance (fuzziness)
~N after a phrase signifies slop amount
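A sketch using the | operator on the same blog2 index and the sample documents as reconstructed above; thanks to stemming, this should find both the posts about kittens and the post about puppies:

# Find posts about kittens or puppies
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
  "query": {
    "simple_query_string": {
      "query": "котята | щенки",
      "fields": [
        "title^3",
        "tags^2",
        "content"
      ]
    }
  }
}'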
# Find posts that do not contain the word 'котята'
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
  "query": {
    "simple_query_string": {
      "query": "-котята",
      "fields": [
        "title^3",
        "tags^2",
        "content"
      ]
    }
  }
}'
# returns the post with id 2
If you find such articles interesting, have ideas for new ones, or have proposals for cooperation, I will be glad to hear from you in a PM or by email at m.kuzmin+habr@darkleaf.ru.
Source: https://habr.com/ru/post/280488/