
Databases and NoSQL (Chapter 4 of the book "Scalable and high-performance web applications")

In this chapter we discuss databases, both relational and NoSQL, that run on a single machine. This single-machine mode of operation is the building block from which distributed databases are constructed.

Accumulation of knowledge


The most important criterion for choosing a database is the knowledge accumulated about it. For example, the relational model is several decades old, and over this time a huge amount of knowledge has been accumulated about how to fit this or that subject area into it. Around widespread databases such as MySQL, an equally large body of knowledge has grown up about how they behave under different conditions and about which systems can, in principle, be built on top of them, with or without modification.

You should never discount this wealth. A new, unknown system will have to go through its own period of knowledge accumulation within each specific development team. How does the subject area fit into this database's data model? How does the database behave under different conditions, with different access patterns? How does it behave when hardware fails? How far can it be pushed beyond its standard use cases? Where can you get advice, or draw on someone else's experience with it, for free or for a fee?

Relational data model


The relational data model is several decades old. In the relational model, a database consists of tables, which consist of rows and columns. Each column has its own data type (string, number, boolean, date, text, binary blob). All rows of a table have the same structure.
Usually each type of object is stored in a separate table (for example, a user table or a project table), and each object has a unique identifier. The identifier can be either synthetic, i.e. just a number, or derived from the subject area, such as a person's passport number or a book's ISBN. In practice, synthetic identifiers are usually preferred.

Objects (tables) can be linked to each other using identifiers. For example, if we have a department table and an employee table, then the department table contains a department identifier, and the employee table contains an employee identifier plus the identifier of the department the employee belongs to. In relational database theory this case is called “one-to-many” (one department has many employees).

The “many-to-many” case is also possible. For example, there is a table of projects and a table of developers. Many developers can work on one project, and one developer can work on several projects. In this case a third table is usually created: a link table with two fields, the project identifier and the developer identifier. Each relationship between a developer and a project is expressed as a row in the link table. If a developer has not yet been assigned to any project, there is simply no entry for them in the link table. Both kinds of relationship are sketched below.
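
To make this concrete, here is a minimal sketch of both kinds of relationship using Python's built-in sqlite3 module; all table and column names are invented for this example.

    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces REFERENCES only with this pragma
    conn.executescript("""
        -- One-to-many: each employee references exactly one department.
        CREATE TABLE department (
            id   INTEGER PRIMARY KEY,   -- synthetic identifier
            name TEXT NOT NULL
        );
        CREATE TABLE employee (
            id            INTEGER PRIMARY KEY,
            name          TEXT NOT NULL,
            department_id INTEGER NOT NULL REFERENCES department(id)
        );

        -- Many-to-many: a link table with one row per developer-project pair.
        CREATE TABLE project   (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE developer (id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
        CREATE TABLE project_developer (
            project_id   INTEGER NOT NULL REFERENCES project(id),
            developer_id INTEGER NOT NULL REFERENCES developer(id),
            PRIMARY KEY (project_id, developer_id)
        );
    """)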

Relational database servers provide standard statements for accessing data in tables, such as SELECT, INSERT, UPDATE, and DELETE; individual servers also provide additional operators. Data can be retrieved from tables by many different criteria. There is a “core” of the SQL standard that almost all servers support, and each specific database server also offers its own extensions of the standard.
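
A small sketch of all four statements against a throwaway in-memory table (names again invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, department_id INTEGER)")

    # INSERT a row; SELECT it back by a criterion; UPDATE it; DELETE it.
    conn.execute("INSERT INTO employee (name, department_id) VALUES (?, ?)", ("Alice", 1))
    print(conn.execute("SELECT id, name FROM employee WHERE department_id = ?", (1,)).fetchall())
    conn.execute("UPDATE employee SET department_id = ? WHERE name = ?", (2, "Alice"))
    conn.execute("DELETE FROM employee WHERE name = ?", ("Alice",))
    conn.commit()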

Access optimization


The “rigidity” of the relational data model allows for various optimizations of data access. The most obvious example is creating indexes on table columns for quick access. For example, you can create an index on the “department ID” column of the employee table, and then the operation “get the list of employees of a particular department” will run faster. An index is simply a materialized data structure (see Chapter 1), such as a B-tree or a hash table. It is important to understand how this data structure works so that you can reason about how it will behave in a given case.
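
In SQLite, for instance, indexes are B-trees, and you can watch the query planner switch from a full table scan to an index search; a sketch (index and column names are invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, department_id INTEGER)")

    # Without an index, this query scans the whole table.
    print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM employee WHERE department_id = 7").fetchall())

    # After the index is created, SQLite answers the same query via a B-tree lookup.
    conn.execute("CREATE INDEX idx_employee_department ON employee (department_id)")
    print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM employee WHERE department_id = 7").fetchall())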

Normalization and denormalization of the data model play an important role in the design of a relational database. A database is in normal form when no information is duplicated. For speed and efficiency the database is sometimes denormalized, and duplicate information then appears in it.

For example, suppose we have a customer table and a sales table, and some customers are considered “important” because they have bought goods worth more than N in total. We could extract the list of sales for a customer each time, sum the amounts, and compare the total with N. For speed, though, we can instead add an “important” flag to the customer table and keep it consistent at all times: for example, when recording a new sale, check whether the running total has exceeded N and, if so, set the flag to TRUE. If there are programming errors, such fields can fall out of sync, and the database then has to be “repaired”.
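
A hedged sketch of how such a flag might be maintained; the threshold N and all names are invented, and the key point is that the sale and the flag update happen in one transaction:

    import sqlite3

    IMPORTANT_THRESHOLD = 10_000  # the hypothetical N

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, important INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE sale (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
        INSERT INTO customer (id) VALUES (1);
    """)

    def record_sale(customer_id, amount):
        with conn:  # one transaction: the sale and the denormalized flag stay in sync
            conn.execute("INSERT INTO sale (customer_id, amount) VALUES (?, ?)",
                         (customer_id, amount))
            total = conn.execute("SELECT SUM(amount) FROM sale WHERE customer_id = ?",
                                 (customer_id,)).fetchone()[0]
            if total > IMPORTANT_THRESHOLD:
                conn.execute("UPDATE customer SET important = 1 WHERE id = ?", (customer_id,))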

Successful denormalization can greatly increase performance. However, denormalization is not a panacea; as the example above shows, it can also have negative consequences.

How do you work efficiently with data structures such as hierarchical trees or graphs in a relational database? Vast experience has accumulated in this area over many years. For example, hierarchical trees can be stored efficiently using the materialized-path technique.
Key-value-style data can be stored either as an obvious three-column “object_id-key-value” table or (sometimes) as a “wide” table.
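
A minimal sketch of the materialized-path idea: each node stores the full path from the root, so an entire subtree becomes a single prefix query (the “1/2/” path format is just one common convention):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, path TEXT NOT NULL, name TEXT)")
    conn.executemany("INSERT INTO node (id, path, name) VALUES (?, ?, ?)", [
        (1, "1/",     "root"),
        (2, "1/2/",   "child"),
        (3, "1/2/3/", "grandchild"),
    ])

    # The whole subtree rooted at node 2 is one prefix query, which an index
    # on "path" can often serve efficiently.
    subtree = conn.execute(
        "SELECT id, name FROM node WHERE path LIKE ?", ("1/2/%",)).fetchall()
    print(subtree)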

For some data structures, beyond a certain size, it is practically impossible to stay within a relational database efficiently, and specialized solutions have to be used. For example, a friendship graph connecting a billion people is almost impossible to process with standard graph algorithms inside the relational model, even on modern hardware.

Other data models


One of the meanings of the term “NoSQL” is a departure from the relational model in favor of more specialized (or more generalized) data models. For example, traditionally successful NoSQL systems include key-value stores such as Redis or Memcache. Their data model is extremely simple: essentially an associative array where the keys are strings and the values can contain arbitrary data. Like any associative array, such systems support a limited set of operations: read the value by key, set the value for a key, delete the key and its associated value. The operation “list all keys” may not be supported at all.
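
A sketch of this entire interface, assuming a Redis server on localhost and the redis-py client; the key name is invented:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    r.set("user:42:name", "Alice")   # set the value for a key
    print(r.get("user:42:name"))     # read the value back: b'Alice'
    r.delete("user:42:name")         # delete the key and its value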

Another example of successful NoSQL systems is document stores. Objects in such stores are usually associative arrays of free structure, meaning that essentially different objects can be stored in the same “table”. Examples of systems in this class are MongoDB and Cassandra. Depending on what data is actually stored in a particular database, its performance can vary greatly; for example, if you “optimize” such a table by storing objects of a single type in it, it can behave much more predictably and efficiently.
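
For illustration, a sketch using pymongo, assuming a local MongoDB server; all database, collection, and field names are invented:

    from pymongo import MongoClient

    db = MongoClient("localhost", 27017).example_db

    # Two documents with different structures live in the same collection.
    db.things.insert_one({"kind": "user", "name": "Alice", "age": 30})
    db.things.insert_one({"kind": "book", "title": "SQL in a Nutshell", "pages": 600})

    print(db.things.find_one({"kind": "user"}))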

A third example of specialized NoSQL systems is graph databases. They are purpose-built for processing one particular data structure, and usually for working with large volumes of data (at small volumes a standard relational implementation can do just fine).

A very important example of NoSQL systems is ordinary file systems, such as Ext4 or NTFS. They are designed to store objects of free-format content in a hierarchical structure. Databases themselves, both relational and NoSQL, usually use file systems to store their contents, and the interaction between these two subsystems sometimes becomes important in one way or another.

Another important class is full-text search systems, such as Elasticsearch or the Google Search Engine.

Large volumes and complex algorithms


The fundamental problem in designing a system that uses databases is that practically any design works on relatively small amounts of data, and almost any design gradually stops working as the amount of data grows large. This means that as the system develops and its data volume increases, you have to rethink how the data is handled, change the storage model, or even replace the database server with a different one.

Traditionally it is believed that each order-of-magnitude increase in data volume requires a redesign of the database. Sometimes people try to fight this by designing the database for two or three orders of magnitude ahead, but this is not always fully achievable. In the general case, coping with growing data volumes is an unsolved engineering problem.

Another common problem is a suddenly emerging need to apply new algorithms to existing data, usually with high speed requirements. For example, a company stores all its sales information in a form suitable for accounting and monthly reports. Then business needs demand analyzing the sales history daily or even hourly and making decisions based on that analysis: which stores to ship goods to, which advertising campaigns to launch, what else to offer to people who buy certain goods. Such algorithms may require fundamentally changing how the data is stored, while maintaining compatibility with the existing system and existing data. Working under such conditions is, in the general case, an unsolved engineering problem.

Behavior in case of equipment failure


Any piece of hardware will fail sooner or later: disks, memory, processor, power supply, and so on. In this chapter we consider the case of a single physical machine on which the database server runs. Suppose this machine suddenly loses power. After power is restored, it boots again and starts the database server. What happens to the data?

Each database system, relational and NoSQL, has its own strategy for handling such failures.

Generally speaking, a “zero strategy” is possible, where all data is simply lost and the database comes back empty. An example of a highly successful NoSQL system with this strategy is Memcache.

ACID: Atomicity, consistency, isolation, and durability

Relational database systems traditionally support one strategy or another that provides the set of guarantees known as ACID: atomicity, consistency, isolation, durability. These terms refer to transaction processing.

A transaction is a set of operations that are treated as a whole. The classic example of a transaction is a transfer of money between two bank accounts: we must decrease the amount in one account and simultaneously increase the amount in the other, as in the sketch below.
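
Here is a minimal sketch of such a transfer in sqlite3 (the account table and amounts are invented); the point is that both updates live or die together:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 0)])
    conn.commit()

    # "with conn" makes the two UPDATEs one transaction: they commit together,
    # and if either raises an exception, both are rolled back.
    with conn:
        conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE account SET balance = balance + 30 WHERE id = 2")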

Atomicity is the guarantee that, whatever the hardware does, either both of these operations are performed or neither is. That is, even if a power surge occurs in the very microsecond after we have “withdrawn money from one account”, then after the database reboots and comes back into operation we will again see the original amount in the source account.

Consistency is the least well-defined of the guarantees. Moreover, the same term is also used in the statement of the CAP theorem (discussed below), where it means something different, though related. Most generally, consistency guarantees some “reasonable” behavior of the database, such that the programmer gets no nasty surprises when working with it, including in the event of hardware failures.

Isolation means that while a transaction is executing, other concurrently running operations “do not see” its intermediate state. For example, suppose we have computed the total amount across all accounts. Now, if we start a transfer and have withdrawn money from one account, and in that very microsecond another process computes the “total amount across accounts” again, it will see the same total as before, not a smaller one. The sketch below demonstrates this with two connections.
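
A sketch of this effect with two sqlite3 connections to the same (hypothetical) database file; the behavior described relies on SQLite's default rollback-journal mode:

    import sqlite3

    # Two independent connections to the same database file.
    writer = sqlite3.connect("/tmp/accounts.db")
    reader = sqlite3.connect("/tmp/accounts.db")

    writer.execute("CREATE TABLE IF NOT EXISTS account (id INTEGER PRIMARY KEY, balance INTEGER)")
    writer.execute("DELETE FROM account")
    writer.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 100)])
    writer.commit()

    # Start a transfer but do not commit: money has left account 1 only
    # inside the writer's uncommitted transaction.
    writer.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")

    # The reader still sees the old, consistent state: the total is 200, not 170.
    print(reader.execute("SELECT SUM(balance) FROM account").fetchone())

    writer.commit()  # only now does the change become visible to other connections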

Durability means that after a transaction has completed successfully, its results will not be lost under any circumstances. For example, we execute a money transfer, close the transaction, and receive a message from the server that the transaction completed successfully. A microsecond later, a power surge occurs. Durability ensures that when the machine boots again and returns to working order, the record of the transfer will still be in the database.

Traditionally, databases that support ACID also allow it to be relaxed to some extent via so-called transaction isolation levels. For example, at the “read uncommitted” level, parallel transactions can “see” the intermediate states of other transactions.
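
As a sketch of choosing a level, here is how it can be requested through psycopg2 against a hypothetical local PostgreSQL server (credentials invented); note that PostgreSQL in particular treats “read uncommitted” as “read committed”, so dirty reads never actually occur there:

    import psycopg2
    from psycopg2 import extensions

    conn = psycopg2.connect("dbname=test user=postgres")
    conn.set_isolation_level(extensions.ISOLATION_LEVEL_READ_UNCOMMITTED)

    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM pg_stat_activity")  # runs at the requested level
    print(cur.fetchone())
    conn.rollback()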

Loosening guarantees

In general, weakening guarantees often makes it possible to increase efficiency, at the price of results that must be interpreted with care (for example, they may be insufficiently accurate or simply incorrect). In some cases this is justified: if we want to show the total number of registered users on a site, we are generally not interested in the exact value; it is enough to say “about a hundred” or simply show “some” number, because nobody will ever be able to confirm or refute any specific value.

Many NoSQL systems simply decline to support ACID and instead declare their own unique set of guarantees, which can lie anywhere on the spectrum from “none” to more or less close to “full ACID”. For example, some versions of some systems may simply leave the database in a damaged state after a machine failure, so that manual or semi-automatic recovery is required after a reboot, and there is no guarantee that all previously written data will survive.

Guarantees weakened at the level of an individual machine can be “restored”, and even used as the foundation of a substantially more reliable system, by combining physical machines into a network and operating them in a special mode. More on this below.

Parallel database access


Typically a database has many clients that perform read and write operations simultaneously. The database must uphold its guarantees in this situation. For example, relational databases usually provide transaction isolation (see above).

Supporting concurrent access to a database often requires substantial effort from the server's developers, who must ensure that such access is both fast and reliable. Many different algorithms and data structures underlie concurrent access.

For example, to add a record to a table, we need to allocate a new page in the table and also update the index. If at the same time another client adds another record, should it allocate a different page, or use the same one? Should the two index updates be combined into one? What if the first client started a transaction, announced the addition of a record, waited two seconds, and then rolled the transaction back? What if one client increased the value of a field by one while a second client decreased it? And what if a power surge can occur at any microsecond, and the system must return to a “correct” state after a reboot, despite all the numerous combinations of intermediate states? The sketch below illustrates the increment/decrement case.
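
For the increment/decrement case, one standard answer is to push the arithmetic into the statement itself, so the server can serialize the two updates; a sketch with an invented counter table:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE counter (id INTEGER PRIMARY KEY, value INTEGER)")
    conn.execute("INSERT INTO counter VALUES (1, 0)")
    conn.commit()

    # Race-prone pattern: read the value, compute in the client, write it back.
    # Two clients doing this concurrently can lose one of the updates.
    value = conn.execute("SELECT value FROM counter WHERE id = 1").fetchone()[0]
    conn.execute("UPDATE counter SET value = ? WHERE id = 1", (value + 1,))

    # Safe pattern: each change is a single atomic statement, so any
    # interleaving of an incrementing and a decrementing client yields
    # the correct net result.
    conn.execute("UPDATE counter SET value = value + 1 WHERE id = 1")
    conn.execute("UPDATE counter SET value = value - 1 WHERE id = 1")
    conn.commit()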

Upholding these guarantees under concurrent access while maintaining performance is a huge and complex engineering task. All database servers solve it with more or less standard approaches, but the specific implementation of those approaches, and the subtleties around them, differ in each server.

Traditionally it is believed that an order-of-magnitude increase in the number of simultaneous database clients requires a revision of the database's architecture. In the general case, this too is an unsolved engineering problem.

Administrative functions


All database servers provide many administrative functions related to the life of the server on an individual machine. Among them: backup; restore from backup; compacting the space occupied by tables; distributing data files across different disks and file systems; network access to the database server (see also the corresponding chapter of this book) and the efficiency of such access.
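
As one small illustration of a programmatic backup, SQLite exposes its online-backup API directly through Python's sqlite3 module (file paths are invented):

    import sqlite3

    src = sqlite3.connect("/var/lib/app/main.db")       # live database
    dst = sqlite3.connect("/var/backups/main-copy.db")  # backup destination

    # Consistent online copy, safe while other clients keep working;
    # Connection.backup() has been available since Python 3.7.
    src.backup(dst)

    dst.close()
    src.close()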

Some servers can also make effective use of special operating system features (which are often, in turn, designed specifically for database servers). A typical example is support for asynchronous I/O.

There are also administrative functions related to combining physical machines into a network, for example, configuring the replication topology and managing machines in clusters. See below for details.

All of these and many other functions are implemented differently in different database servers. Administrative functions, how well developed and how convenient they are, are an important criterion when choosing a database server for a specific task.

Modern database servers provide many performance-tuning knobs. Comparing the speed of different databases under special conditions is a fascinating but not always meaningful exercise.

It is important to understand that any server tuned for particular conditions can always be “brought to its knees” by changing the data access patterns, increasing the number of clients, or increasing the amount of stored data. Access patterns change as the system evolves; the number of clients grows with the system's popularity; the volume of stored data usually grows too. All of this makes earlier benchmarks and tuning victories irrelevant, and the fine-tuning process has to be repeated, sometimes along with rethinking the data access architecture.

Distributed Databases


As we have said, any physical machine can break at any moment. In addition, any physical machine has a limit to the performance it can provide. These two circumstances force us to combine machines into a network and treat them as a distributed database.

Distributed databases force you to rethink all the issues we discussed for the single-machine case: the data model, the data access protocol, and the guarantees provided in the event of equipment failure.

We will discuss this issue in more detail in the next chapter of our book.

Source: https://habr.com/ru/post/233761/

