
Fast full-text search with Elasticsearch

When developing high-load sites or corporate systems, the problem of building a fast and convenient search engine often arises. Below are what I consider the most important requirements for such an engine:


I want to tell you about the new search engine Elasticsearch, which fully meets all these requirements. The article contains a brief description, a link to an authoritative presentation, and a walkthrough of installing and working with the engine.

To date, there are many different implementations of such systems, each with its own pros and cons. Since I am not an expert in this field, please take everything below as nothing more than my subjective opinion.

So, I recently came across a presentation by Andrei Zmievski, in which he described the capabilities of Elasticsearch. The presentation can be viewed here (in English).
Project site http://www.elasticsearch.org/

Unfortunately, I could not find any information about it in Russian.

What is it?


In fact, it is a new front end to the well-known Lucene index. Its main advantage over competitors is flexibility and ease of use: adding information to the index and searching it are done with simple HTTP requests.
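To illustrate just how simple those HTTP requests are, here is a minimal Python sketch that builds (but does not send) the PUT request that would index one document. The index, type, and id in the URL mirror the curl examples later in this article, and a local node on the default port 9200 is assumed:

```python
import json
import urllib.request

# Document to index; the field names follow the examples in this article.
doc = {"firstname": "Piotr", "surname": "Petrov"}

# Build the PUT request that would store the document as
# /habrahabr/users/1 on a local Elasticsearch node (default port 9200).
req = urllib.request.Request(
    "http://localhost:9200/habrahabr/users/1",
    data=json.dumps(doc).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)

# Sending it would be urllib.request.urlopen(req); here we only inspect it.
print(req.get_method(), req.full_url)
```

The same URL scheme (index/type/id) is used by all the curl commands below.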

Installation and examples of working with the engine


I was interested in this topic and I decided to personally test this wonderful engine.
So let's get started

Installation

	 1. Download the archive (http://www.elasticsearch.org/download/) and unpack it
	 2. Start the server 
		 Unix: bin/elasticsearch -f
		 Windows: bin/elasticsearch.bat
	 3. Check the server 
		 curl -X GET http://localhost:9200/
	 If everything works, the server returns a JSON response with some information about the node.

Indexing data

As an example, let's create an index of Habr users.

Add data about the first user
 $ curl -XPUT 'http://localhost:9200/habrahabr/users/1' -d '
 { 
   "firstname": "Piotr",
   "surname": "Petrov",
   "birthDate": "1981-01-01",
   "location": "Moscow, Russian Federation",
   "skills": ["PHP", "HTML", "C++", ".NET", "JavaScript"]
 }'

Add data about the second user
 $ curl -XPUT 'http://localhost:9200/habrahabr/users/2' -d '
 { 
   "firstname": "Ivan",
   "surname": "Sidorov",
   "birthDate": "1978-12-13",
   "location": "Briansk, Russian Federation",
   "skills": ["HTML", "Ruby", "Python"]
 }'

Add a third user
 $ curl -XPUT 'http://localhost:9200/habrahabr/users/3' -d '
 { 
   "firstname": "Stepan",
   "surname": "Fomenko",
   "birthDate": "1985-06-01",
   "location": "Ukraine",
   "skills": ["HTML", "XML", "Java", "JavaScript"]
 }'

Search: try in action

For reference, here are a few simple search examples. The engine fully lives up to the "elastic" in its name: you can build a wide variety of queries. More information about queries can be found on the project website: www.elasticsearch.org/guide/reference/api

The pretty=true parameter formats the response in a more readable form.

example 1: find all users named Ivan
 $ curl -XGET 'http://localhost:9200/habrahabr/users/_search?q=firstname:Ivan&pretty=true'

example 2: find all users from Ukraine who know PHP. A term query matches a single field and is not analyzed, so the two conditions are combined in a bool query (and the values are lowercased to match terms produced by the standard analyzer):
 $ curl -XGET 'http://localhost:9200/habrahabr/users/_search?pretty=true' -d '
 { 
     "query": { 
         "bool": { 
             "must": [ 
                 {"term": {"location": "ukraine"}}, 
                 {"term": {"skills": "php"}} 
             ] 
         } 
     } 
 }'
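Such query bodies can of course also be built programmatically rather than typed by hand. A minimal Python sketch, mirroring the bool/term structure and field names from the example above:

```python
import json

# Build the search body: both term clauses must match.
# Field names follow the user documents from this article.
query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"location": "ukraine"}},
                {"term": {"skills": "php"}},
            ]
        }
    }
}

# This JSON string is what would be sent as the request body of _search.
body = json.dumps(query)
print(body)
```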

example 3: find users from Russia
 $ curl -XGET 'http://localhost:9200/habrahabr/users/_search?q=location:Russian%20Federation&pretty=true'

example 4: count the number of users from Russia
 $ curl -XGET 'http://localhost:9200/habrahabr/users/_count?q=location:Russian%20Federation&pretty=true'
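The %20 in the examples above is just URL escaping of the space in "Russian Federation". A small Python sketch of how such a query string can be built safely instead of escaping by hand:

```python
from urllib.parse import urlencode

# urlencode percent-escapes the ':' and the space in the q parameter
# (spaces become '+', which Elasticsearch accepts in query strings).
params = {"q": "location:Russian Federation", "pretty": "true"}
qs = urlencode(params)

url = "http://localhost:9200/habrahabr/users/_count?" + qs
print(url)
```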

P.S. UTF-8 is supported without any problems.

Testing with a large amount of data


Unfortunately, I do not have much experience with other search engines, so I have no way to compare them with Elasticsearch. Out of curiosity, I decided to create an index of 5,000,000 users.

A simple script to populate the index (the data is generated, but the information is more or less similar to real data):

<?php
ini_set('max_execution_time', 36000);

class userGenerator
{
    // Countries used for the generated "location" field
    private $countries = array(
        'Russian Federation', 'Ukraine', 'Germany', 'France',
        'Lithuania', 'Latvia', 'Poland', 'Finland', 'Sweden'
    );

    public function run($cnt)
    {
        for ($i = 0; $i < $cnt; $i++) {
            $query = $this->generateQuery($i);
            echo "generating user " . $i . " ... ";
            exec($query);
            echo "done" . PHP_EOL;
        }
    }

    private function generateQuery($id)
    {
        // Birth dates are spread over roughly 40 years starting from 1960
        $date = new DateTime('1960-01-01');
        return 'curl -XPUT \'http://localhost:9200/habrahabr/users/' . $id . '\' -d \'
{
    "id" : "' . $id . '",
    "firstname" : "' . ucfirst($this->generateWord(10)) . '",
    "surname" : "' . ucfirst($this->generateWord(10)) . '",
    "birthDate" : "' . $date->modify('+' . rand(0, 14600) . ' days')->format('Y-m-d') . '",
    "location" : "' . $this->generateWord(10) . ', ' . $this->countries[array_rand($this->countries)] . '",
    "skills" : ["' . strtoupper($this->generateWord(3)) . '", "' . strtoupper($this->generateWord(4)) . '", "' . strtoupper($this->generateWord(3)) . '"]
}\'';
    }

    private function generateWord($length)
    {
        $letters = array(
            "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
            "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"
        );
        $word = '';
        for ($i = 0; $i < $length; $i++) {
            $word .= $letters[rand(0, 25)];
        }
        return $word;
    }
}

$generator = new userGenerator();
$generator->run(5000000);
echo "complete";
?>


Creating the index on my home (not very powerful) PC took about 5 hours. Considering that I did not tune or optimize anything at all, I think the result is quite good; besides, index generation time is not particularly critical for me. I think that if I dug into the settings and optimized my script to send bulk requests instead of single ones (see the bulk API in the documentation), the time would drop several times over. And if the process were also parallelized, it could be cut to about an hour.
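The bulk API mentioned above takes a newline-delimited body: an action line followed by the document source, for each document. A sketch in Python of building such a payload (the index and type names follow this article; this only builds the body, it does not send it to a server):

```python
import json

def bulk_payload(index, doc_type, docs):
    """Build a newline-delimited _bulk body: for every (id, source) pair,
    emit an action line followed by the document's source line."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps(
            {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"

payload = bulk_payload("habrahabr", "users", [
    (1, {"firstname": "Ivan", "skills": ["HTML", "Ruby"]}),
    (2, {"firstname": "Piotr", "skills": ["PHP"]}),
])
print(payload)
```

The resulting string would be POSTed to http://localhost:9200/_bulk, replacing many single PUTs with one request.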

Check the number of entries in the index
  curl -XGET 'http://localhost:9200/habrahabr/users/_count?q=*&pretty'
 {
   "count": 5128888,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   }
 }


Check the speed of adding a new record
  time curl -XPUT 'http://localhost:9200/habrahabr/users/5128889' -d '
 { 
   "firstname": "Basil",
   "surname": "Fedorov",
   "birthDate": "1975-07-11",
   "location": "Riga, Latvia",
   "skills": ["PERL", "PYTHON", "ActionScript"]
 }'
 {"ok": true, "_index": "habrahabr", "_type": "users", "_id": "5128891", "_version": 2}

 real 0m0.007s
 user 0m0.004s
 sys 0m0.000s


Check the speed of information retrieval
 time curl -XGET 'http://localhost:9200/habrahabr/users/_search?q=location:Riga&pretty'
 {
   "took": 5,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 1,
     "max_score": 8.854725,
     "hits": [{
       "_index": "habrahabr",
       "_type": "users",
       "_id": "5128891",
       "_score": 8.854725, "_source": 
 { 
   "firstname": "Basil",
   "surname": "Fedorov",
   "birthDate": "1975-07-11",
   "location": "Riga, Latvia",
   "skills": ["PERL", "PYTHON", "ActionScript"]
 }
     }]
   }
 }
 real 0m0.011s
 user 0m0.004s
 sys 0m0.000s


 $ time curl -XGET 'http://localhost:9200/habrahabr/users/_count?q=location:Germany&pretty'
 {
   "count": 570295,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   }
 }
 real 0m0.079s
 user 0m0.004s
 sys 0m0.000s


Conclusions

In my opinion, the engine is fast, high-quality, and easy to use. It feels much faster than, say, Zend_Search_Lucene.

In this article I described only a small part of its functionality: the simplest and most basic operations. Beyond the scope of this article are transactions, replication, filters, and many other useful features. It is also worth mentioning that Java and PHP client libraries (and possibly libraries for other languages) have already been written for this engine.

P.S. I apologize for the somewhat clumsy wording and terminology.

Source: https://habr.com/ru/post/122531/

