Elasticsearch as a NoSQL database

Can the Elasticsearch search server be used as a NoSQL database? Answering this question means examining its various properties, including those it deliberately gave up in order to become one of the most flexible, fast and scalable search engines. But first we need to settle on what the term NoSQL itself means, since it is interpreted differently depending on the context.

What is NoSQL?

The NoSQL community offers the following definition: a next-generation database that is non-relational, distributed, open source and horizontally scalable. Unfortunately, this definition cannot be called precise.

To begin with, it is not really about SQL. The Hive query language was clearly inspired by SQL. The same can be said of the Esper query language, although it operates on streams rather than relations. PostgreSQL has an interesting history: it was originally called Postgres, used QUEL as its query language and was an object-relational DBMS, and today it has many features that let it serve as a document store.

Nor is it about ACID: the definition of NoSQL says nothing about transactions. HyperDex is a NoSQL database that aims to provide ACID transactions. MySQL is undoubtedly an SQL database, yet its history includes questionable interpretations of what ACID actually means.

Relations. Most NoSQL databases do not support joins the way traditional relational databases do and leave this work to the user. But some databases do perform joins themselves, for example RethinkDB, Hive and Pig. The graph database Neo4j also works with relationships, by traversing the relationships (edges) of a graph. Elasticsearch has a concept of query-time joins for parent/child relationships and of index-time joins, which are implemented using the nested type.

Distribution. SQL databases are usually not distributed, while NoSQL databases usually are. There are, however, projects (node.js NoSQL, ejdb) that aim to be a kind of "NoSQLite", a NoSQL counterpart of SQLite. Still, the new generation of databases generally seeks to be distributed in one way or another.

In short, the concept of NoSQL cannot be defined precisely enough to say conclusively whether Elasticsearch is a NoSQL store. At the time of writing, nosql-database.org already listed more than 20 such databases.

Next we look at some important properties and see how Elasticsearch implements them.

No transactions

Lucene, on which Elasticsearch is built, supports transactions, but Elasticsearch does not have transactions in the usual sense of the word. That is, you cannot roll back a submitted document, and you cannot index a group of documents atomically. Elasticsearch does, however, have a write-ahead log that makes operations durable without requiring an expensive Lucene commit. You can also specify the consistency level of indexing operations, that is, how many replicas must acknowledge the operation before it returns. The default is quorum, i.e. floor(n / 2) + 1.
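
As a small worked example of the quorum formula above, the following sketch computes how many shard copies would have to acknowledge a write. The function name quorum is hypothetical and is not part of any Elasticsearch API; it mirrors only the formula given in the text, and any special cases the server may apply are not modeled.

```python
def quorum(copies: int) -> int:
    """Return the majority of `copies` shard copies: floor(n / 2) + 1.

    Hypothetical helper for illustration; not an Elasticsearch API.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return copies // 2 + 1


if __name__ == "__main__":
    # Print the required number of acknowledgements for a few cluster sizes.
    for n in (1, 3, 5, 7):
        print(f"{n} copies -> {quorum(n)} acknowledgements required")
```

With an odd number of copies the quorum is a strict majority, so two disjoint groups of copies can never both reach it.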

Elasticsearch provides data manipulation and search in near real time. By default, there is a delay of about one second between indexing, updating or deleting data and the appearance of these changes in search results. This distinguishes Elasticsearch from SQL systems, in which all changes are visible as soon as the transaction completes.
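
The near-real-time behavior described above can be pictured with a toy model: writes go into a buffer and only become searchable after a periodic refresh. This is a deliberately simplified sketch, not how Elasticsearch is implemented; the class NearRealTimeIndex and its methods are hypothetical names chosen for illustration.

```python
class NearRealTimeIndex:
    """Toy model: documents are searchable only after refresh() is called."""

    def __init__(self):
        self._buffer = []      # indexed, but not yet visible to search
        self._searchable = []  # visible to search

    def index(self, doc: str) -> None:
        # The write is accepted immediately but is not searchable yet.
        self._buffer.append(doc)

    def refresh(self) -> None:
        # In a real system this would run periodically (about once a second).
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, term: str) -> list:
        return [doc for doc in self._searchable if term in doc]


if __name__ == "__main__":
    idx = NearRealTimeIndex()
    idx.index("hello world")
    print(idx.search("hello"))  # nothing visible before the refresh
    idx.refresh()
    print(idx.search("hello"))  # the document is visible after the refresh
```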

Optimistic concurrency control is achieved by specifying the version of the document being submitted.
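
To make the idea of version-based optimistic concurrency control concrete, here is a minimal in-memory sketch. It assumes a store that keeps a version number per document and rejects a write whose expected version is stale. VersionedStore, VersionConflict and the expected_version parameter are hypothetical names for illustration, not the Elasticsearch API.

```python
class VersionConflict(Exception):
    """Raised when a write is based on an outdated document version."""


class VersionedStore:
    """Toy document store with per-document version numbers."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, body)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, body, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # If the caller states which version it read, reject stale writes.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, body)
        return new_version


if __name__ == "__main__":
    store = VersionedStore()
    v1 = store.put("order-1", {"status": "new"})
    store.put("order-1", {"status": "paid"}, expected_version=v1)
    try:
        # A second writer that still holds version 1 is rejected.
        store.put("order-1", {"status": "cancelled"}, expected_version=v1)
    except VersionConflict as err:
        print("conflict:", err)
```

No locks are held between reading and writing; a conflicting writer simply fails and has to re-read the document and retry.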

Elasticsearch is designed for speed. Distributed transactions are expensive, and doing without them makes many things easier. If you can accept somewhat stale data and do not need everyone to observe the same timeline, a great deal can be cached, and Elasticsearch caches extensively, which is a large part of why it is so fast and so well liked.

Data schema flexibility

Elasticsearch does not require you to specify a data schema in advance. It is enough to send a JSON document, and the server will infer the field types itself. This works well for numeric and boolean values and for timestamps. Strings are processed with the standard analyzer, which is adequate for basic operations.

Whether being "schemaless" (in the sense that you do not have to define the schema yourself) really amounts to having a "flexible schema" is debatable. To build an excellent search and analytics system, you should design your own data schema. Elasticsearch has an extensive set of powerful tools for this, for example dynamic templates and multi-field objects. For more information, read the article about mapping.
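
The kind of type guessing described above can be sketched as follows. This is only an illustration of the idea, not the detection logic Elasticsearch actually uses; the functions guess_type and guess_mapping and the type labels they return are assumptions made for the example.

```python
import json
from datetime import datetime


def guess_type(value):
    """Guess a field type from a decoded JSON value (illustrative only)."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)  # treat ISO 8601 strings as timestamps
            return "date"
        except ValueError:
            return "string"  # would be analyzed with a default analyzer
    if isinstance(value, dict):
        return "object"
    return "unknown"


def guess_mapping(document_json: str) -> dict:
    """Return a field -> guessed type mapping for a JSON document."""
    document = json.loads(document_json)
    return {field: guess_type(value) for field, value in document.items()}


if __name__ == "__main__":
    doc = '{"name": "Alice", "age": 30, "active": true, "joined": "2014-05-14T10:00:00"}'
    print(guess_mapping(doc))
```

Numbers, booleans and timestamps are unambiguous, while every other string falls into a single generic bucket, which is where a hand-designed schema starts to pay off.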

Relationships and constraints

Elasticsearch is a document-oriented database: the entire graph of objects you want to search must be indexed, which means that documents must be denormalized before indexing. Denormalization increases retrieval performance (since no join queries are needed) but requires more disk space (because redundant information is stored) and makes it harder to keep data consistent and up to date (any change has to be applied to every document that contains the changed object). It is, however, an ideal option when a document is written once and read many times.

For example, suppose you have a database that stores customers, orders and products, and you want to find orders by product name and user name. This is solved by indexing orders together with all the necessary information about the user and the products. Searching is then easy, but what happens if you want to change a product name? In a properly normalized relational database it is enough to update the product, which is exactly what makes such databases convenient. In a denormalized document database, you have to update every order that contains this product.

In other words, with document-oriented databases such as Elasticsearch, you design the mapping and store the documents in a form that is optimized for search and retrieval.
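
The order example above can be sketched with plain Python dictionaries standing in for denormalized documents. The data, field names and helper functions are made up for illustration: searching is a simple scan of self-contained documents, whereas renaming a product has to touch every order that embeds it.

```python
# Denormalized orders: customer and product data are embedded in each order.
orders = [
    {"order_id": 1, "customer": "alice",
     "products": [{"id": "p1", "name": "Widget"}]},
    {"order_id": 2, "customer": "bob",
     "products": [{"id": "p1", "name": "Widget"}, {"id": "p2", "name": "Gadget"}]},
    {"order_id": 3, "customer": "alice",
     "products": [{"id": "p2", "name": "Gadget"}]},
]


def find_orders(orders, customer, product_name):
    """Search needs no join: every order already carries all the data."""
    return [
        order for order in orders
        if order["customer"] == customer
        and any(p["name"] == product_name for p in order["products"])
    ]


def rename_product(orders, product_id, new_name):
    """Renaming must rewrite every order containing the product.

    Returns the number of orders that had to be updated.
    """
    touched = 0
    for order in orders:
        changed = False
        for product in order["products"]:
            if product["id"] == product_id:
                product["name"] = new_name
                changed = True
        if changed:
            touched += 1
    return touched


if __name__ == "__main__":
    print(find_orders(orders, "alice", "Widget"))
    print(rename_product(orders, "p1", "Super Widget"), "orders rewritten")
```

In a normalized relational schema the same rename would be a single-row update on the products table.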

As mentioned earlier, Elasticsearch has a concept of query-time joins for parent/child relationships and of index-time joins based on the nested type. We will probably cover this in more detail in a later article; in the meantime, you can look at Martijn van Groningen's presentation "Document relations with Elasticsearch".

Most relational databases also let you define constraints that determine what is consistent and what is not. For example, they can enforce referential integrity and uniqueness, or require that the sum of an account's transactions be positive. Document-oriented databases generally offer nothing of the kind, and Elasticsearch is no exception.

Robustness

A database must be robust, especially if it is the primary store of information. Ideally, it should be possible to cancel a resource-intensive request, and, of course, the database should never stop working unless you want it to.

Unfortunately, Elasticsearch, like the components it is built from, currently handles OutOfMemory errors poorly. We cover this in more detail in the article "Elasticsearch in Production, OutOfMemory-Caused Crashes". It is important to give Elasticsearch enough memory and to be careful about running queries with unknown memory requirements on a production cluster.

Although this is likely to improve as Elasticsearch evolves, remember that Elasticsearch was built for speed, on the assumption that the server has RAM to spare.

Distribution

See also: Elasticsearch in Production, Networking

Before Shay Banon created Elasticsearch, he worked on Compass. At a certain point he realized that turning Compass into a distributed search engine would be too difficult, and he started building Elasticsearch from scratch. Elasticsearch is designed to be distributed and to scale easily so that it can handle large amounts of data on readily available hardware.

Elasticsearch is remarkably easy to get started with, even for those who are new to distributed systems, although such systems are inherently complex. We will discuss this point in more detail in the following sections.

The very nature of distributed systems implies that there are many things that can go wrong. Different databases also make different trade-offs: some aim for strong consistency, others for constant availability, even though they may return incorrect results for a short or even a long time. In theory, a database rarely runs into problems and resolves them quickly when it does. In practice, as Kyle Kingsbury showed in his series on the perils of network partitions, a database that appears to work well under normal conditions can behave in surprising ways once failures occur.

In terms of consistency, availability and partition tolerance, Elasticsearch is a CP system (consistency and partition tolerance), for a rather weak definition of "consistency". If read-only operations dominate, Elasticsearch lets you achieve AP behavior (availability and partition tolerance) by lowering the minimum master nodes setting so that a quorum is not required. Normally, however, you want to require that a majority of the nodes in the cluster be available. Without that requirement, writes to a misconfigured cluster that has split into separate parts, a so-called split-brain cluster, can lead to irrecoverable data loss. This is by no means specific to Elasticsearch; other systems share this trait.
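
The split-brain risk can be illustrated with a few lines of arithmetic. The sketch below assumes that a group of nodes may elect a master only if it can see at least minimum master nodes master-eligible nodes; the function names are hypothetical and nothing here calls Elasticsearch.

```python
def can_elect_master(visible_nodes: int, minimum_master_nodes: int) -> bool:
    """A partition can elect a master only if it sees enough eligible nodes."""
    return visible_nodes >= minimum_master_nodes


def split_brain_possible(partition_sizes, minimum_master_nodes: int) -> bool:
    """True if more than one partition could elect its own master."""
    electing = [
        size for size in partition_sizes
        if can_elect_master(size, minimum_master_nodes)
    ]
    return len(electing) > 1


if __name__ == "__main__":
    # A three-node cluster split into groups of two nodes and one node.
    partitions = [2, 1]
    for setting in (1, 2):
        print(
            f"minimum master nodes = {setting}: "
            f"split brain possible = {split_brain_possible(partitions, setting)}"
        )
```

With the setting at a majority (2 of 3), only the larger side can elect a master and the smaller side stops accepting writes. With the setting at 1, both sides stay available but can diverge.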

Elasticsearch has its own master election algorithm. It is quite simple and not particularly robust, which, unfortunately, can cause serious trouble in the real world of network problems. At Found, we manage hundreds of clusters and see master election problems quite often, so we are actively working on moving master election to ZooKeeper, which we already use for many other purposes.

From a scaling point of view, an index consists of one or more shards. The number of shards is specified when the index is created and cannot be changed afterwards, so it should be chosen with the expected growth in mind. As more nodes are added to an Elasticsearch cluster, it sensibly rebalances and relocates the shards. In that sense, Elasticsearch is easy to scale.
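
One way to see why a fixed shard count matters is to assume that documents are routed to shards by hashing, for example shard = hash(id) mod number_of_shards. The sketch below is built on that assumption; zlib.crc32 is only a stand-in hash function, and shard_for and count_moved are hypothetical helpers, not Elasticsearch code.

```python
import zlib


def shard_for(doc_id: str, number_of_shards: int) -> int:
    """Route a document to a shard by hashing its id (stand-in hash: CRC32)."""
    if number_of_shards < 1:
        raise ValueError("number_of_shards must be at least 1")
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


def count_moved(doc_ids, old_shards: int, new_shards: int) -> int:
    """Count documents whose target shard changes with the shard count."""
    return sum(
        1 for doc_id in doc_ids
        if shard_for(doc_id, old_shards) != shard_for(doc_id, new_shards)
    )


if __name__ == "__main__":
    doc_ids = [f"doc-{i}" for i in range(1000)]
    moved = count_moved(doc_ids, 5, 6)
    print(f"{moved} of {len(doc_ids)} documents would land on a different shard")
```

Under such a scheme, changing the number of shards would send most documents to a different shard, so existing data could no longer be found where it was written. Moving whole shards between nodes does not have this problem, because the routing result stays the same.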

Security

See also: Elasticsearch in Production, Security

Elasticsearch has no authentication or authorization features. You should assume that anyone who can connect to your Elasticsearch cluster effectively has superuser rights, especially if scripting is enabled.

Summary

Of course, Elasticsearch can be used as a primary store if the limitations described above are not a problem for you. A good example is Logstash, a fantastic log management tool. It stores logs in Elasticsearch and can also store them elsewhere. Logs are written once and read many times. If there are no updates, there is no need for transactions, integrity constraints and so on.

What about systems like Postgres that support both full-text search and ACID transactions (other examples are the full-text capabilities of MySQL, MongoDB, Riak, etc.)? You can implement basic search in Postgres, but there is a huge gap between it and Elasticsearch, both in performance and in features. As discussed in the section on transactions, Elasticsearch can cut corners and cache aggressively without worrying about multi-version concurrency control and other complications. Search is more than finding a keyword in a piece of text. It means applying domain knowledge to build good relevance models, giving an overview of the whole result space and offering things like spell checking and auto-completion, all while staying very fast.

Elasticsearch is usually used alongside another, primary database that has a stronger focus on constraints, correctness and robustness and that is updated transactionally. The data is written to the primary database first and then, asynchronously, to Elasticsearch. Keeping the two in sync will be discussed in more detail in the next article. At Found, we usually use ZooKeeper and PostgreSQL as primary stores and complement them with Elasticsearch for excellent search.

As with everything else, there is no single database that suits all of your data. To do the job well, you need to know the strengths and weaknesses of your data stores.

Recommended reading

Shay Banon: The Future of Compass & Elasticsearch // www.kimchy.org/the_future_of_compass

P.S. Thanks to the translation editor, Anastasia Gordok.

Source: https://habr.com/ru/post/222765/
