Spanner is a geographically distributed, highly scalable, multi-version database with support for distributed transactions. It was developed by Google engineers for the corporation's internal services. The research paper [8], which describes the basic principles and architecture of Spanner, was presented at the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI) in 2012.
Spanner is an evolutionary development of its NoSQL predecessor, Google Bigtable. Spanner itself belongs to the family of NewSQL solutions. The research paper [8] states that the Spanner design allows the system to scale to millions of compute nodes across hundreds of data centers and to work with trillions of rows of data.

Spanner uses Colossus (the next-generation GFS) as its storage layer and the Paxos consensus algorithm. In turn, the distributed Google F1 database is built on top of Spanner. Spanner is used in the Google+ social network and in the Gmail mail service. F1, built on top of Spanner, was used at the time of publication of [8] in Google's advertising service.
Basic principles
Data in Spanner is stored in semi-relational tables with a schema. All data is versioned: every version carries the timestamp of the commit that wrote it. Spanner offers a SQL-like query language, the ability to configure the number of replicas, and a garbage-collection policy responsible for deleting versions with "old" timestamps.
In addition to the capabilities "usual" for the NoSQL world, Spanner has a number of properties that are difficult to implement in distributed systems, such as:
- support for distributed transactions;
- global consistency of read operations across geographically distributed data centers: reads served from different data centers always return mutually consistent data.
In addition, Spanner has capabilities that are more typical of traditional DBMSs, such as:
- non-blocking reads of data "from the past" (at an earlier timestamp);
- lock-free read-only transactions;
- atomic changes to table schemas;
- synchronous replication;
- automatic handling of failures of both compute nodes and entire data centers;
- automatic data migration both between compute nodes and between data centers.
Architecture
Each deployed Spanner instance (deployment) is called a Universe and contains:
- Universe master: a master process that coordinates the work of multiple zones (Zone in Spanner terminology);
- Placement driver: a process (like the universe master, a singleton) that controls the automatic movement of data between zones;
- Zones: (generally) geographically distributed Spanner zones. A zone is the unit of both physical and logical isolation.

Each zone, in turn, contains the following components (sketched in code below):
- Zonemaster: the zone's master process (a singleton);
- Spanservers: many of them, from hundreds to several thousand per zone;
- Location proxies: tell clients which spanservers are responsible for the data they need.
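As a visual summary, below is a minimal structural sketch in Go of the Universe/Zone hierarchy just described. It is an illustration only: the type and field names are assumptions made for this article and do not correspond to Spanner's real internal interfaces.

```go
// Illustrative sketch of a Spanner deployment as summarized above.
// None of these types exist in Spanner's actual codebase.
package universesketch

// Spanserver serves the data itself; its internals are described below.
type Spanserver struct {
	ID string
}

// Zone is the unit of both physical and logical isolation.
type Zone struct {
	Zonemaster      string       // singleton master process of the zone
	Spanservers     []Spanserver // from hundreds to several thousand per zone
	LocationProxies []string     // tell clients which spanservers hold which data
}

// Universe is a single deployed Spanner instance.
type Universe struct {
	UniverseMaster  string // singleton; coordinates the zones
	PlacementDriver string // singleton; moves data between zones automatically
	Zones           []Zone
}
```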

Of these components, the research paper [8] describes in detail only the spanserver's functions and internal structure.
Each spanserver is responsible for between 100 and 1000 data structures called tablets. A tablet implements a mapping of the form:
(key: string, timestamp: int64) -> string
Unlike Bigtable, Spanner makes the timestamp at which the data was written part of the stored data itself, an important addition needed to support multi-versioning of the data.
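To make the mapping concrete, here is a minimal sketch in Go of a multi-version tablet: writes record a new version stamped with its commit timestamp, reads at a past timestamp never block, and a garbage-collection pass drops versions with old timestamps. This only illustrates the idea under assumed names; it is not Spanner's actual tablet implementation.

```go
// Illustrative sketch of the (key: string, timestamp: int64) -> string
// mapping. Not Spanner's real implementation; all names are hypothetical.
package tabletsketch

import "sort"

type version struct {
	ts    int64  // commit timestamp of this version
	value string
}

// Tablet keeps every committed version of every key.
type Tablet struct {
	data map[string][]version // versions sorted by ascending timestamp
}

func NewTablet() *Tablet {
	return &Tablet{data: make(map[string][]version)}
}

// Write records a new version of key stamped with commit timestamp ts.
func (t *Tablet) Write(key, value string, ts int64) {
	vs := append(t.data[key], version{ts: ts, value: value})
	sort.Slice(vs, func(i, j int) bool { return vs[i].ts < vs[j].ts })
	t.data[key] = vs
}

// ReadAt returns the latest version of key with timestamp <= ts.
// A read "from the past" simply picks an older version and never blocks writers.
func (t *Tablet) ReadAt(key string, ts int64) (string, bool) {
	vs := t.data[key]
	for i := len(vs) - 1; i >= 0; i-- {
		if vs[i].ts <= ts {
			return vs[i].value, true
		}
	}
	return "", false
}

// GC drops versions older than keepAfter (always keeping the newest one),
// mimicking the garbage-collection policy for "old" timestamps.
func (t *Tablet) GC(keepAfter int64) {
	for key, vs := range t.data {
		kept := vs[:0]
		for i, v := range vs {
			if v.ts >= keepAfter || i == len(vs)-1 {
				kept = append(kept, v)
			}
		}
		t.data[key] = kept
	}
}
```

In this simplified model, a read-only transaction at timestamp t just calls ReadAt with that timestamp on the relevant tablets, which is why such reads need no locks.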
Spanner's data model consists of semi-relational, schematized tables, a SQL-like query language, and distributed transactions. The implementation of the last of these (transactions) became possible thanks to one of the most notable innovations in this class of software systems: the TrueTime API.
TrueTime
A common goal for systems that provide global time (in particular, those based on atomic clocks) is to report the most accurate time possible. The TrueTime API instead gives clients the global time together with an explicit measure of its uncertainty: a TTinterval. This is necessary because in a distributed system it is very difficult to guarantee instantaneous responses from nodes, which matters for ensuring consistency of data in distributed storage.
With this approach, instead of comparing exact times, deciding the order of two concurrent transactions reduces (simplified) to comparing the TTintervals of those transactions. If the transactions' TTintervals do not overlap, you can know for certain which of the two comes first. If the TTintervals intersect, the order can only be stated with some degree of probability. (The TrueTime hardware is described in more detail in [8].)
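Here is a minimal sketch in Go of the interval comparison described above. The TTInterval type and the function names are assumptions made for illustration; they only mimic the idea of TrueTime's uncertainty interval, not its actual API.

```go
// Illustrative sketch of ordering two transactions by their uncertainty
// intervals. The names below are hypothetical, not the real TrueTime API.
package truetimesketch

// TTInterval bounds the moment an event happened: the true absolute time
// is guaranteed to lie somewhere within [Earliest, Latest].
type TTInterval struct {
	Earliest int64 // e.g. microseconds since the epoch
	Latest   int64
}

// DefinitelyBefore reports whether interval a ends before interval b begins,
// i.e. the event in a is certainly earlier than the event in b.
func DefinitelyBefore(a, b TTInterval) bool {
	return a.Latest < b.Earliest
}

// Overlaps reports whether the two intervals intersect; in that case the
// relative order of the two events cannot be determined with certainty.
func Overlaps(a, b TTInterval) bool {
	return !DefinitelyBefore(a, b) && !DefinitelyBefore(b, a)
}
```

For example, the intervals [100, 105] and [107, 112] do not overlap, so the first transaction is known to have committed strictly before the second; the intervals [100, 105] and [104, 110] intersect, so their order cannot be decided with certainty.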
In Spanner itself, the consistency of data during transactions is ensured by the two-phase commit protocol (2PC) implemented on top of the Paxos algorithm.
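Below is a simplified sketch in Go of a two-phase commit coordinator in which each participant stands in for a Paxos-replicated group. It models the general 2PC idea under assumed interface names; it is not Spanner's actual commit protocol.

```go
// Illustrative sketch of two-phase commit across participants, where each
// participant represents a Paxos-replicated group. Simplified model only;
// the interface and function names are hypothetical.
package twopcsketch

// Participant is the interface a Paxos group is assumed to expose to the
// transaction coordinator.
type Participant interface {
	Prepare(txID string) bool // phase 1: durably log the prepare vote
	Commit(txID string)       // phase 2: apply the transaction
	Abort(txID string)        // phase 2: roll the transaction back
}

// TwoPhaseCommit returns true if the transaction committed on all participants.
func TwoPhaseCommit(txID string, participants []Participant) bool {
	// Phase 1: ask every participant to prepare and vote.
	for _, p := range participants {
		if !p.Prepare(txID) {
			// Any "no" vote aborts the whole transaction
			// (for simplicity, every participant is told to abort).
			for _, q := range participants {
				q.Abort(txID)
			}
			return false
		}
	}
	// Phase 2: all participants voted yes, so commit everywhere.
	for _, p := range participants {
		p.Commit(txID)
	}
	return true
}
```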
Limitations and CAP
At the time of publication of the research paper [8], Spanner did not support secondary indexes or automatic resharding for load balancing. In addition, the authors of [8] note that Spanner is not able to execute complex SQL queries efficiently.
Spanner is also not a "refutation" of the CAP theorem. Spanner is not an AP system, despite its NoSQL roots; nor is it a CA system, despite its commitment to the ACID principles. Spanner "sacrifices" availability in order to preserve data integrity (consistency) and is therefore a CP system.
Summary
Spanner took the best ideas from two worlds, relational DBMSs and NoSQL, and is a DBMS of the NewSQL generation.
Support for distributed transactions across data centers at petabyte data volumes, combined with this degree of scalability, is certainly an extremely impressive feature for any system storing structured or semi-structured data. This capability is largely a consequence of the symbiosis of two approaches: a storage model in which data is immutable and carries a commit timestamp, and an innovative concept for obtaining global time, TrueTime.
List of sources*
[8] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, et al.
Spanner: Google's Globally-Distributed Database. Proceedings of OSDI, 2012.
* The complete list of sources used to prepare this series of articles.
Dmitry Petukhov
MCP, PhD student, IT zombie, a man with caffeine instead of red blood cells.