
Dremel. How does Google think in real-time?

Dremel is a scalable system for near-real-time query processing, designed for the analysis of read-only (immutable) data [4].

The authors of research paper [4], which describes the basic principles and architecture of Dremel (and among whom, apparently, are our compatriots Sergey Melnik and Andrey Gubarev), state that the system is capable of running aggregation queries over trillion-row tables in seconds while scaling to thousands of CPUs and petabytes of data.

UPD1: A déjà-vu picture for attentive readers appears below.


Dremel is Google's proprietary product, in operation since 2006 and officially presented to the community at the Very Large Data Bases (VLDB) conference in 2010.
Dremel is used for analyzing documents collected by the web crawler, tracking application-install data in the Android Market, the Google Books service, and spam analysis.

In 2012, Google opened access to Dremel for developers through the Google BigQuery service [14].

Data model


Like many systems of this class, Dremel scales to several thousand nodes, runs on commodity hardware, provides reliable data storage, fault tolerance and replication, works with immutable data, and has its own SQL-like query language.

Dremel uses the Google File System (GFS) distributed file system for data storage; the unit of storage is the tablet.

But the key innovation of Google Dremel is, without a doubt, its data model, which [4] calls "nested columnar storage", i.e. column-oriented storage of nested data.

The basic principle of nested columnar storage: data whose schema is explicitly defined is stored column by column (column-striped), together with the data associated with it.

Let us examine the advantages and disadvantages of such a choice.

Column-oriented storage allows large amounts of data to be read and compressed efficiently (since values of the same type are stored together), compared with row-oriented storage. On the other hand, writes to column stores are more expensive than to row-oriented ones, and supporting transactional changes is a non-trivial task.
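To make the compression argument concrete, here is a toy Python sketch (unrelated to Dremel's actual on-disk format) that compresses the same synthetic records laid out row-wise and column-wise:

```python
import zlib

# Toy dataset: the same 10,000 records laid out row-wise and column-wise.
# This only illustrates the compression argument above, not Dremel's format.
records = [{"country": "US", "lang": "en", "views": i % 7} for i in range(10_000)]

# Row-oriented layout: values of different types are interleaved.
row_layout = "|".join(f'{r["country"]},{r["lang"]},{r["views"]}' for r in records)

# Column-oriented layout: values of one column are stored contiguously.
col_layout = "|".join([
    ",".join(r["country"] for r in records),
    ",".join(r["lang"] for r in records),
    ",".join(str(r["views"]) for r in records),
])

for name, data in [("row-oriented", row_layout), ("column-oriented", col_layout)]:
    compressed = zlib.compress(data.encode())
    print(f"{name}: {len(data)} bytes raw, {len(compressed)} bytes compressed")
# Homogeneous, repetitive columns typically compress noticeably better.
```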

Storing data together with its associated (including nested) data, combined with the columnar layout, increases read speed. At the same time, this storage method implies some data redundancy and, as a consequence, the need to keep the different copies of the data consistent.

Google's engineers solved this problem as follows: Dremel works with immutable data, and the redundancy problem is addressed by introducing two additional levels for each column: the definition level and the repetition level.


Source of illustration [4, Figure 2]


Source of illustration [4, Figure 3]
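As an illustration, here is a small Python sketch of one striped column. The values and levels are taken from the example record in [4, Figures 2–3] (the Name.Language.Code column of record r1); the split_records helper is a simplified reassembly for this one column, not Dremel's actual algorithm:

```python
# The Name.Language.Code column of record r1 from [4, Figures 2-3],
# written as (value, repetition_level, definition_level) triples.
# definition_level: how many optional/repeated fields on the path are present;
# repetition_level: at which repeated field on the path the value repeated.
name_language_code = [
    ("en-us", 0, 2),  # first value of a new record (r = 0)
    ("en",    2, 2),  # repeats at Language, the 2nd repeated field on the path
    (None,    1, 1),  # a Name with no Language: only Name (d = 1) is defined
    ("en-gb", 1, 2),  # a new Name within the same record (r = 1)
]

def split_records(column):
    """Group a striped column back into per-record value lists:
    repetition_level == 0 marks the start of a new record."""
    records, current = [], None
    for value, r, d in column:
        if r == 0:
            if current is not None:
                records.append(current)
            current = []
        if value is not None:
            current.append(value)
        # NULL entries (d below the maximum) only mark missing optional fields.
    if current is not None:
        records.append(current)
    return records

print(split_records(name_language_code))  # [['en-us', 'en', 'en-gb']]
```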

Each column consists of a set of blocks; each block contains compressed values together with the integer definition levels and repetition levels. Thus Dremel, knowing everything about the structure and location of the data needed to execute a query, can obtain the necessary information "in place" (in the terminology of [4], "in situ").

Encoding each column block with the <value, repetition_level, definition_level> scheme means that the storage structure itself carries the record assembly algorithm: a finite state machine whose states are the readers of the individual columns and whose transitions are driven by the repetition levels of the next values.


Source of illustration [4, Figure 4]
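Below is a minimal Python sketch of such an assembly automaton for just two columns, DocId and Name.Url, with definition levels omitted for brevity. The column contents follow the example records in [4, Figure 2]; the function itself is only an illustration of the idea, not Dremel's implementation:

```python
# Each column is a list of (value, repetition_level) pairs; after reading a
# value, the automaton decides which column to read next by looking at the
# next value's repetition level.
doc_id   = [(10, 0), (20, 0)]
name_url = [("http://A", 0), ("http://B", 1), (None, 1), ("http://C", 0)]

def assemble(doc_id_col, name_url_col):
    records, i, j = [], 0, 0
    while i < len(doc_id_col):
        # State "DocId": read one value, then transition to the Name.Url reader.
        record = {"DocId": doc_id_col[i][0], "Name": []}
        i += 1
        # State "Name.Url": keep consuming values while they repeat within the
        # current record (repetition level > 0); level 0 starts the next record.
        while j < len(name_url_col):
            value, r = name_url_col[j]
            if r == 0 and record["Name"]:
                break  # transition back to the DocId reader
            record["Name"].append({"Url": value})
            j += 1
        records.append(record)
    return records

for rec in assemble(doc_id, name_url):
    print(rec)
# {'DocId': 10, 'Name': [{'Url': 'http://A'}, {'Url': 'http://B'}, {'Url': None}]}
# {'DocId': 20, 'Name': [{'Url': 'http://C'}]}
```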

Query execution


To execute queries, Dremel uses a multi-level serving tree architecture: the root server receives a query from the client, reads the metadata of the required tablets, and routes the query to the next level of the tree, and so on until the query reaches the leaves of the serving tree. The leaf servers interact directly with GFS to read the required data.
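To illustrate why such a tree suits aggregation queries, here is a toy Python sketch (the tablet contents and the two-level tree are made up for illustration, not taken from [4]): leaves compute partial aggregates over their tablets, and upper levels merely merge them:

```python
from collections import Counter

# A toy sketch of how an aggregation such as
#   SELECT word, COUNT(*) FROM T GROUP BY word
# can be pushed down a serving tree.

tablets = [
    ["spam", "ham", "spam"],
    ["ham", "eggs"],
    ["spam", "eggs", "eggs"],
]

def leaf_query(tablet):
    """A leaf server scans its tablet (read from GFS) and returns partial counts."""
    return Counter(tablet)

def merge(partials):
    """An intermediate or root server sums the partial aggregates of its children."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

# The root routes the query to the leaves and merges their results.
print(merge(leaf_query(t) for t in tablets))
# Counter({'spam': 3, 'eggs': 3, 'ham': 2})
```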

Since many different queries are executed in Dremel at the same time, a query dispatcher schedules them based on their priorities and balances the load. The query dispatcher is also responsible for handling tablets that become unavailable during query execution and for re-dispatching "slow" tablets.

Another interesting decision is that the query result is returned not after 100% of the records have been processed, but earlier: [4] cites 98% as the typical figure. On the one hand, this frees Dremel from waiting for the "slow" leaves of the query tree; on the other hand, it introduces some error into the final result, which is quite acceptable for OLAP systems and hardly acceptable in OLTP systems.
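A toy sketch of this early-return policy (the collect function and its threshold parameter are illustrative, not Dremel's actual interface):

```python
# The dispatcher stops waiting once a configurable fraction of tablets
# (e.g. 98%) has reported, trading a small error for not blocking on stragglers.

def collect(results_iter, total_tablets, threshold=0.98):
    processed, collected = 0, []
    for result in results_iter:          # results arrive as leaves finish
        collected.append(result)
        processed += 1
        if processed / total_tablets >= threshold:
            break                        # stop waiting for the slow tablets
    return collected, processed / total_tablets

results, coverage = collect(iter(range(100)), total_tablets=100)
print(len(results), f"{coverage:.0%}")   # 98 98%
```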

Experimental results


In experiments run on about 3,000 compute nodes over a table of 85 billion rows, Dremel executed queries orders of magnitude faster than MapReduce jobs running over both record-oriented and column-oriented storage.


Source of illustration [4]

In addition, [4] states that during the experiments the scan throughput on some queries approached 100 billion records per second.

Impact on Open Source


The Open Source analogues of Dremel include Cloudera Impala , Apache Drill , Parquet (Twitter).

List of sources*


[4] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, et al. Dremel: Interactive Analysis of Web-Scale Datasets. Proceedings of the VLDB Endowment, 2010.
[14] Google BigQuery - Google Developers.
* A complete list of the sources used in preparing this series.

Dmitry Petukhov
MCP, PhD student, IT zombie,
a man with caffeine in place of red blood cells.

Source: https://habr.com/ru/post/207234/

