
Data access speed: the battle for the future

For as long as it has existed, humankind has been collecting information, analyzing it and storing it in some form so that it can be passed on to descendants. The evolution of our consciousness became possible largely because of this: each new generation did not have to rediscover what had already been understood before it. Starting with the most ancient information carriers, Egyptian papyrus and Sumerian cuneiform tablets, humankind accumulated more and more information. There were times in history when, as a result of wars and cataclysms, part of the accumulated knowledge was destroyed or lost; progress then stopped and humanity was thrown back in its development. The real revolution and breakthrough was the invention of mass printing, which made it possible to spread information to a large audience, which in turn led to explosive growth in science and art and raised the consciousness of all humanity to a higher level. The development of technology in the twentieth century brought new information carriers: punched cards, punched tape, hard magnetic disks and so on. More and more information was transferred from ledgers to electronic media. A need arose to organize and manage access to this data, and so the first DBMSs appeared.

The relational data model proposed in 1970 by E. F. Codd set the trend in database development for a long time and has fully met the requirements of business up to the present day. Since 1970, relational databases have come a long way and met many challenges along the way. Constantly growing volumes of data led to methods for faster access to the necessary data: indexes, storing data in sorted form, and so on. These methods coped with their task quite successfully and have not lost their relevance to this day. However, the rapid growth of storage capacity and the falling cost of storage have made databases of tens of terabytes no longer unusual; they are perceived as commonplace. Business cannot allow this data to sit as dead weight: ever-increasing competition forces it to look for new approaches to its field of activity, because, as the popular saying goes, "who owns the information owns the world." And when it comes to time, the count is not in days or even hours but in minutes: whoever can obtain the necessary information quickly will win.
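To see why sorted storage and indexes help, here is a textbook illustration, not tied to any particular DBMS: finding a value in a sorted column takes only a logarithmic number of comparisons, while an unsorted column has to be scanned row by row. The data below is synthetic.

```python
import bisect

# A sorted column of even ids, standing in for an indexed / sorted-on-disk column.
sorted_ids = list(range(0, 10_000_000, 2))
target = 7_654_322

# Full scan: looks at rows one by one until it hits the target.
scan_steps = next(i for i, v in enumerate(sorted_ids) if v == target) + 1

# Binary search over the sorted column: roughly log2(N) comparisons.
pos = bisect.bisect_left(sorted_ids, target)
found = sorted_ids[pos] == target

print(f"full scan touched {scan_steps:,} rows; binary search found it: {found}")
```

With five million rows the scan touches millions of values, while the binary search needs only a couple of dozen comparisons; the gap only widens as volumes grow.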

But not all modern databases are ready for the new volumes: the old methods are no longer as effective. The main component that slows down the database as a whole is the storage device. Unfortunately, it is the capabilities of the hard disk that now stand in the way of further progress in extracting useful information from data sets of tens of terabytes. Storage technology is simply not keeping pace with the growth in the amount of data that needs to be analyzed. Flash drives are still quite expensive and have significant shortcomings, in particular limited write endurance, which prevents them from being used as corporate storage devices for databases. In this article I propose to discuss the methods that modern analytical databases use to overcome the shortcomings of existing technology. I would like to leave the rich family of NoSQL databases for a separate article, so as not to confuse the approaches: databases with the NoSQL model are still quite exotic for traditional analytical systems, although they have gained some popularity in certain tasks. The main interest here is in databases with a traditional relational data model, meeting the requirements of ACID and intended for Big Data analytics, and how they respond to the modern challenge.

It is clear that the data used by analytical databases should be properly prepared and organized, since it is difficult to extract any patterns from chaos. There are exceptions to this rule, which could be the subject of another article. Let us assume that the data has been prepared by some ETL process and loaded into the data warehouse. How, then, can modern analytical databases provide access fast enough that reading several terabytes, or tens of terabytes, does not take days?
Massively parallel processing (MPP)

A massively parallel architecture is built from individual nodes, where each node has its own processor, memory and communication facilities that allow it to talk to the other nodes. A node in this case is a separate database working in concert with all the others. Going down a level, each node is a process or set of processes that constitutes a database and performs its own separate task, contributing to the common result. Because each node has its own infrastructure (processor and memory), we do not run into the traditional limitation of databases that are, in essence, a single node with access to the entire volume of stored data. Here each node stores its own portion of the data and works with it, providing the fastest possible access. How to distribute the entire volume of data evenly across all nodes is a topic for a separate discussion.
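As an illustration, here is a minimal sketch of key-based distribution; it is not any particular product's algorithm, and the node count, key name and row format are all invented for the example. Rows are assigned to nodes by hashing a distribution key, so each node stores and later scans only its own portion:

```python
import hashlib

NUM_NODES = 4

def node_for(distribution_key: str) -> int:
    """Map a distribution key to one of the nodes deterministically."""
    digest = hashlib.md5(distribution_key.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

# Each "node" here is just an in-process list standing in for a real segment database.
nodes = [[] for _ in range(NUM_NODES)]

rows = [
    {"customer_id": "C-1001", "amount": 250},
    {"customer_id": "C-1002", "amount": 90},
    {"customer_id": "C-1003", "amount": 410},
    {"customer_id": "C-1001", "amount": 75},
]

for row in rows:
    nodes[node_for(row["customer_id"])].append(row)

for i, portion in enumerate(nodes):
    print(f"node {i}: {portion}")
```

A skewed distribution key, one with only a few distinct values, would leave some nodes overloaded and others idle, which is exactly why the choice of key deserves the separate discussion mentioned above.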

Theoretically, if each node is given its own processor and disk, the maximum data reading speed will equal the sum of the reading speeds of all the storage devices, which makes it possible to achieve acceptable response times even for queries that have to analyze ultra-large volumes of information. In practice, to achieve better utilization, several nodes live on a single server and share its resources.
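A back-of-the-envelope calculation shows the effect; the disk speed and node count below are assumed numbers chosen for illustration, not measurements of any real system:

```python
# All figures are illustrative assumptions.
disk_read_mb_s = 150          # sequential read speed of one disk, MB/s
nodes = 40                    # number of MPP nodes, each with its own disk
data_tb = 10                  # volume that a query has to scan

data_mb = data_tb * 1024 * 1024
single_node_hours = data_mb / disk_read_mb_s / 3600
cluster_minutes = data_mb / (disk_read_mb_s * nodes) / 60

print(f"one node:  {single_node_hours:.1f} hours to scan {data_tb} TB")
print(f"{nodes} nodes: {cluster_minutes:.1f} minutes")
```

Under these assumptions a single disk needs roughly a day to scan 10 TB, while forty disks working in parallel finish in about half an hour.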

Obviously, in a database built on such an architecture, one node (or each of them) must be able to accept a request from the user, distribute it to all the remaining nodes, wait for their responses and return the combined answer to the user as the result of the query.
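In outline this is a scatter-gather scheme. The sketch below is only a toy model of it: the "query" is just a sum, the per-node data is invented, and threads stand in for real network communication between nodes.

```python
from concurrent.futures import ThreadPoolExecutor

node_data = [
    [250, 90, 410],      # portion stored on node 0
    [75, 120],           # portion stored on node 1
    [310, 45, 60, 500],  # portion stored on node 2
]

def run_on_node(portion):
    """Stand-in for a node executing its local part of the query."""
    return sum(portion)

def coordinator_query():
    # Scatter: dispatch the query to all nodes in parallel.
    with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
        partials = list(pool.map(run_on_node, node_data))
    # Gather: merge the partial results into the final answer.
    return sum(partials)

print(coordinator_query())  # -> 1860
```

Real MPP databases do the same thing with far more machinery: query plans are shipped to the nodes, and partial aggregates or intermediate result streams are merged by the coordinating node.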

The advantage of this architecture lies on the surface: almost linear scalability, both horizontal and vertical.

The disadvantage is that considerable effort is required to build software capable of exploiting all the advantages of such an architecture, and this is the reason for the high cost of such products.

Analytical relational databases using MPP:

1. EMC Greenplum
One of the best solutions, with a powerful set of features that allow it to be configured for almost any task.
2. Teradata
A well-known solution with a solid track record in the market. Its high cost compared to competitors is not justified by significant advantages.
3. HP Vertica
The solution's advantages are at the same time its disadvantages: a large amount of redundant (duplicated) data that has to be stored, a focus on a narrow range of tasks, and the absence of some important functionality.
4. IBM Netezza
An interesting and fairly fast solution. The disadvantages are that it is a fully hardware solution built on a proprietary platform that is partially obsolete, and there are questions about its scalability.

Each of these solutions deserves a separate review, if readers are interested. In any case, these four products set the trend in the sector of MPP solutions with a shared-nothing architecture. Their example shows the direction of further technologies aimed at processing extra-large amounts of data: a completely new class of databases has appeared, designed specifically for handling tens of terabytes.

However, a second direction has also appeared, one that circumvents the limitations imposed by the capabilities of hard drives.

IMDB

An in-memory database is a database that works with data held entirely in memory. As is well known, RAM is much faster than ordinary hard drives, providing a high-performance storage medium with very high read and write speeds. Despite this, few are willing to store their data entirely in RAM. The reason is that such memory costs much more than hard drives, and, just as importantly, all the data disappears as soon as the power is turned off. For a long time, databases working with data in memory were auxiliary and served as a buffer, storing short-term data needed only for online processing. However, the falling cost of this type of memory has spurred interest in databases of this kind. It is commonly held that Big Data begins at around a terabyte, and until recently there were no solutions of this type that could work with sufficiently large volumes. In 2011, however, SAP introduced its HANA database, which supported up to 8 terabytes of uncompressed data; theoretically, with compression, the usable volume can be raised to about 40 terabytes. Another representative of IMDB technology is the TimesTen solution from Oracle. Both solutions offer extensive functionality and are the most advanced products in the field of in-memory RDBMS.
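The compression figure above rests on the fact that analytical data compresses well in memory. One classic technique is dictionary encoding of low-cardinality columns; the sketch below is purely illustrative, with synthetic data and simplified size accounting, and does not claim to reproduce how HANA or TimesTen actually store data.

```python
def dictionary_encode(column):
    """Replace repeated string values with small integer codes."""
    dictionary = {}   # value -> code
    codes = []        # one small code per row instead of a full string
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        codes.append(dictionary[value])
    return dictionary, codes

# A low-cardinality column (e.g. country), typical of analytical fact tables.
column = ["Germany", "France", "Germany", "Germany", "France"] * 200_000

dictionary, codes = dictionary_encode(column)

# Rough size estimate: 1 byte per character for raw strings,
# 1 byte per code (fewer than 256 distinct values here).
raw_bytes = sum(len(v) for v in column)
encoded_bytes = len(codes) + sum(len(v) for v in dictionary)
print(f"raw: {raw_bytes:,} bytes, encoded: {encoded_bytes:,} bytes, "
      f"ratio: {raw_bytes / encoded_bytes:.1f}x")
```

On this toy column the encoded form is several times smaller, which is the kind of effect that lets an in-memory engine hold far more raw data than its physical RAM size would suggest.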

Thus, companies are ready to accept the challenge of Big Data. Solutions have already been invented and tested that make it possible, within an acceptable time, to answer the questions posed by analysts, marketers and managers using information accumulated over decades. New classes of databases are being created to handle very large amounts of data, and new methods are being developed to improve the speed of data access.

At the same time, modern realities show that a relational database cannot be an all-in-one solution. That is why, in addition to the data warehouse that stores and processes large volumes of information, a company should also have OLTP databases for running its day-to-day operations.

Source: https://habr.com/ru/post/147743/

