We continue our series of articles on the technologies used in the data warehouse (Data Warehouse, DWH) of our bank. In this article I will try to give a brief and somewhat superficial overview of Greenplum, a DBMS based on PostgreSQL that is the core of our DWH. The article contains no installation logs, configs, and the like; even without them, the note turned out to be quite long. Instead, I will describe the general DBMS architecture, the ways data is stored and loaded, backups, and also list several problems we encountered in operation.

A little bit about our installations:
- the project has been in production for a little over two years;
- 4 environments of 10 to 26 servers each;
- database size is about 30 TB;
- in the database about 10,000 tables;
- up to 700 queries per second.
For details on how it all works, welcome under the cut!
1. General Architecture
So, Greenplum (GP) is a relational DBMS with massively parallel processing and a shared-nothing architecture. For a detailed understanding of how GP works, a few basic terms need to be defined:
Master instance (also known as the “master”) - the Postgres instance that is both the coordinator and the entry point for users into the cluster;
Master host (“master server”) - the server on which the master instance runs;
Secondary master instance - a Postgres instance that is the standby master; it is brought into service if the primary master becomes unavailable (the switchover is manual);
Primary segment instance (“segment”) - a Postgres instance that is one of the segments. It is the segments that actually store the data, perform operations on it, and return the results to the master (in the general case). In fact, a segment is an ordinary PostgreSQL 8.2.15 instance with WAL replication configured to its mirror on another server:
/app/greenplum/greenplum-db-4.3.5.2/bin/postgres -D /data1/primary/gpseg76 -p 50004 -b 126 -z 96 --silent-mode=true -i -M quiescent -C 76
Mirror segment instance (“mirror”) - a Postgres instance that mirrors one of the primary segments and automatically takes over the primary role if the primary fails:
/app/greenplum/greenplum-db-4.3.5.2/bin/postgres -D /data1/mirror/gpseg76 -p 51004 -b 186 -z 96 --silent-mode=true -i -M quiescent -C 76
GP supports only 1-to-1 segment replication: for each primary there can be only one mirror.
Segment host ("server-segment") - a server that runs one or more segments and / or mirrors.
In general, a GP cluster consists of several segment servers, one master server, and one standby master server, interconnected by one or more fast networks (10G Ethernet, InfiniBand), usually called the interconnect:
Fig. 1. Cluster composition and network interaction of its elements. The green and red lines are separate interconnect networks; the blue line is the external, client network. Using several interconnect networks, firstly, increases the bandwidth of the channel over which segments communicate with each other and, secondly, ensures cluster fault tolerance (if one of the networks fails, all traffic is redistributed among the remaining ones).
When choosing the number of segment servers, it is important to pick the cluster's "number of processor cores / TB of data" ratio according to the planned load profile: the more CPU cores per unit of data, the faster the cluster performs "heavy" operations and works with compressed tables.
When choosing the number of segments in a cluster (which is generally not tied to the number of servers), remember the following:
- all server resources are shared between all segments on that server (the load created by mirrors, if they live on the same servers, can roughly be ignored);
- on a single segment, one query cannot use more than one CPU core. This means, for example, that if a cluster consists of 32-core servers with 4 GP segments each and is used, on average, to serve 3-4 simultaneous heavy, CPU-hungry queries, the CPUs will on average be underutilized. In this situation it is better to increase the number of segments per server to 6-8;
- the regular backup and restore process works "out of the box" only between clusters with the same number of segments. Data dumped from a 96-segment cluster cannot be restored into a 100-segment cluster without extra effort.
2. Data storage
Greenplum implements the classic data sharding scheme. Each table is represented by N+1 tables on all the segments of the cluster, where N is the number of segments (the +1 is the table on the master, which contains no data). Each segment stores 1/N of the table's rows. How a table is split across segments is defined by the distribution key - a field (or set of fields) by which any row can be assigned to one of the segments.
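For illustration, here is a minimal sketch of a table definition with an explicit distribution key (the table and column names are made up for this example):

CREATE TABLE customers (
    customer_id bigint,
    full_name   text,
    created_at  timestamp
) DISTRIBUTED BY (customer_id);  -- the distribution key: rows are sharded by customer_id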
The distribution key (a field or set of fields) is a very important concept in GP. A cluster works at the speed of its slowest segment, which means that any skew in data volume between segments (both within one table and across the whole database) degrades cluster performance and causes other problems as well. That is why the distribution field must be chosen carefully: the distribution of values in it should be as uniform as possible. The service column gp_segment_id, which exists in every table, tells you whether you have chosen the key well - it contains the number of the segment on which a particular row is stored.
An important caveat: GP does not support UPDATE of the field(s) a table is distributed by. Consider example 1 (here and below, the example cluster consists of 96 segments). In both cases we distributed the table by the num_field column. In the first case we inserted 20 unique values into this column and, as you can see, GP spread the rows across different segments. In the second case 20 identical values were inserted, and all the rows ended up on a single segment.
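The original listing of example 1 was published as an image, so here is a hedged sketch of what such a test might look like (the table name distrib_test is an assumption; num_field comes from the text above):

CREATE TABLE distrib_test (num_field int) DISTRIBUTED BY (num_field);

-- Case 1: 20 distinct values - the rows end up on 20 different segments.
INSERT INTO distrib_test SELECT generate_series(1, 20);

-- Case 2: 20 identical values - all the rows land on a single segment.
-- TRUNCATE distrib_test;
-- INSERT INTO distrib_test SELECT 1 FROM generate_series(1, 20);

-- The service column gp_segment_id shows which segment stores each row:
SELECT gp_segment_id, count(*) FROM distrib_test GROUP BY gp_segment_id ORDER BY 1;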
If a table has no field suitable for use as the distribution key, you can use random distribution (DISTRIBUTED RANDOMLY). The distribution key can be changed on an already created table, but after that the table must be redistributed.
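A short sketch of both options, assuming a hypothetical table raw_events:

-- No suitable key: let GP spread the rows (pseudo)randomly across segments.
CREATE TABLE raw_events (event_id bigint, payload text) DISTRIBUTED RANDOMLY;

-- Changing the distribution key later forces a full redistribution of the table:
ALTER TABLE raw_events SET DISTRIBUTED BY (event_id);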
It is via the distribution key that Greenplum performs the most efficient JOINs: if the fields being joined are the distribution keys of both tables, the JOIN is executed locally on each segment. If this condition does not hold, GP has to either redistribute both tables by the join field or send one of the tables in full to every segment (the BROADCAST operation), and only then join the tables locally on the segments.
Example 2: a JOIN on the distribution key vs. a JOIN not on the distribution key. As you can see, in the second case two additional steps appear in the query plan (one for each table participating in the query):
Redistribute Motion: before executing the query, GP redistributes both tables across the segments by the num_field_2 column instead of the original distribution key, num_field.
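Since the original listings of example 2 were images, here is a hedged sketch showing how the difference can be seen in the plans (table names are illustrative; num_field and num_field_2 follow the text above):

CREATE TABLE t1 (num_field int, num_field_2 int) DISTRIBUTED BY (num_field);
CREATE TABLE t2 (num_field int, num_field_2 int) DISTRIBUTED BY (num_field);

-- Join on the distribution key: executed locally on each segment.
EXPLAIN SELECT * FROM t1 JOIN t2 USING (num_field);

-- Join on a non-distribution field: the plan gains Redistribute Motion
-- (or Broadcast Motion) steps before the local join on the segments.
EXPLAIN SELECT * FROM t1 JOIN t2 USING (num_field_2);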
3. Client interaction
In general, all client interaction with the cluster goes only through the master: it is the master that answers clients, returns query results, and so on. Regular client users have no network access to the segment servers.
To speed up data loading into the cluster, bulk load is used - parallel loading of data from/to a client simultaneously through several segments. Bulk load is only possible from clients that have access to the interconnect networks. Usually such clients are ETL servers and other systems that need to load large volumes of data (in Fig. 1 they are labeled ETL/Pro client).
For parallel data loading into the segments, the gpfdist utility is used. In essence, the utility starts a web server on a remote machine that exposes the specified folder via the gpfdist and http protocols:
/usr/local/greenplum-loaders-4.3.0.0-build-2/bin/gpfdist -d /tmp/work/gpfdist_home -p 8081 -l /tmp/work/gpfdist_home/gpfdist8081.log
After launch, the directory and all files in it become accessible with ordinary wget. As an example, let's create a file in the directory served by gpfdist and read it as a regular table (a hedged sketch is given below).
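A sketch of such an external table; the host name etl_host, the file name test_file.csv, and the column list are assumptions, while the port matches the gpfdist command above:

CREATE EXTERNAL TABLE ext_test_file (
    id   int,
    name text
)
LOCATION ('gpfdist://etl_host:8081/test_file.csv')
FORMAT 'CSV' (DELIMITER ';');

-- The external table is queried like a regular one; segments pull the data in parallel.
SELECT count(*) FROM ext_test_file;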
External web tables are created in the same way, with slightly different syntax. Their peculiarity is that they refer to the http protocol and can work with data served by third-party web servers (apache, etc.).
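A minimal sketch of an external web table; the URL and the column list are assumptions:

CREATE EXTERNAL WEB TABLE ext_web_report (
    id   int,
    name text
)
LOCATION ('http://some-web-server/reports/report.csv')
FORMAT 'CSV';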
Separately, it is possible to create external tables over data residing on a distributed Hadoop file system (hdfs); a dedicated GP component, gphdfs, is responsible for this. To make it work, the Hadoop libraries must be installed on every server of the GP cluster and the path to them specified in one of the database's system variables. Creating an external table that accesses data on hdfs looks roughly like the sketch below, where hadoop_name_node is the address of the NameNode host and /tmp/test_file.csv is the path to the desired file on hdfs.
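A hedged sketch of such a table; hadoop_name_node and /tmp/test_file.csv come from the text above, while the NameNode port 8020 and the column list are assumptions:

CREATE EXTERNAL TABLE ext_hdfs_file (
    id   int,
    name text
)
LOCATION ('gphdfs://hadoop_name_node:8020/tmp/test_file.csv')
FORMAT 'CSV';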
When such a table is queried, Greenplum asks the Hadoop NameNode for the location of the required data blocks on the DataNodes, which are then read from the segment servers in parallel. Naturally, all nodes of the Hadoop cluster must be reachable over the interconnect networks of the Greenplum cluster. This scheme gives a significant speedup even compared with gpfdist. Interestingly, the logic of choosing which segments read from which DataNodes is quite nontrivial: for example, GP may start pulling data from all DataNodes using only two segment servers, and on a repeated, similar query the interaction scheme may change.
There is also a type of external table that refers to files on segment servers or a file on the master, as well as to the result of executing a command on a segment server or on the master. By the way, the good old COPY FROM has not gone anywhere and can also be used, although it works more slowly than the approaches described above.
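Two hedged sketches: an external web table over the output of a command executed on the master, and the classic COPY (the file paths and table names are made up):

-- External table over a command executed on the master:
CREATE EXTERNAL WEB TABLE ext_master_log (line text)
EXECUTE 'cat /data/some_file.log' ON MASTER
FORMAT 'TEXT';

-- Plain COPY still works, but the data flows through the master only:
-- COPY my_table FROM '/data/some_file.csv' WITH CSV;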
4. Reliability and redundancy
4.1. Master redundancy
As mentioned earlier, the GP cluster has full master redundancy based on replication of transaction logs, controlled by a special agent (gpsyncagent). Automatic switchover of the master role to the standby instance is not supported. To switch to the standby master you need to:
- make sure that the primary master is stopped (the process is killed and the postmaster.pid file is absent from the master instance's working directory);
- run the gpactivatestandby -d /master_instance_directory command on the standby master server;
- switch the virtual IP address to the new master server (Greenplum has no virtual IP mechanism of its own; third-party tools have to be used).
As you can see, the switchover is not difficult at all and, if you accept certain risks, can be automated.
4.2. Segment redundancy
The redundancy scheme for segments is similar to that for the master; the differences are quite small. If one of the segments fails (the postgres instance stops responding to the master within the timeout), the segment is marked as failed and its mirror (essentially a fully equivalent postgres instance) is automatically started in its place. Segment data is replicated to its mirror based on WAL (Write-Ahead Log).
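The current state of segments and mirrors can be inspected in the system catalog; a small example, assuming the standard gp_segment_configuration view:

-- role 'p'/'m' = primary/mirror, status 'd' = down, mode 'c' = change tracking.
SELECT content, role, preferred_role, mode, status, hostname, port
FROM gp_segment_configuration
ORDER BY content, role;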
It is worth noting that the placement of segment mirrors across servers occupies an important place in planning the GP cluster architecture, since GP gives complete freedom in choosing the locations of segments and their mirrors: with a special segment map they can be placed on different servers, in different directories, and on different ports. Consider two extreme options:
Option 1: all mirrors of the segments located on host N are placed on host N+1. In this case, if one of the servers fails, its neighbor ends up running twice as many segments. As mentioned above, cluster performance equals the performance of the slowest segment, so after a single server failure the performance of the database drops at least by half.
However, this scheme also has a positive side: when operating with a failed server, only one server becomes the weak point of the cluster - the one the segments moved to.
Option 2: all mirrors of the segments located on host N are evenly "smeared" across hosts N+1, N+2, ..., N+M, where M is the number of segments per server. In this case, if a server fails, the extra load is distributed evenly across several servers without affecting overall cluster performance much. However, the risk of losing the whole cluster rises significantly: it is enough for one of the M servers adjacent to the one that failed first to go down as well.
The truth, as often happens, is somewhere in the middle: you can place several mirrors of one server's segments on several other servers, you can combine servers into fault-tolerance groups, and so on. The optimal mirror configuration should be chosen based on the specific cluster hardware, the cost of downtime, and so on.
There is one more nuance in the segment redundancy mechanism that affects cluster performance. If the mirror of one of the segments fails, the segment goes into change tracking mode: it logs all changes so that, when the fallen mirror is recovered, they can be applied to it to obtain a fresh, consistent copy of the data. In other words, when a mirror goes down, the load that the mirrorless segment puts on the server's disk subsystem increases significantly.
Once the cause of a segment failure is eliminated (hardware problems, running out of disk space, etc.), the segment must be returned to work manually with the special gprecoverseg utility (no DBMS downtime is required). In essence, this utility copies the WAL accumulated on the segment to the mirror and brings the fallen segment/mirror back up. If we are talking about a primary segment, it initially joins as a mirror for its former mirror, which has become primary (the mirror and the primary work with their roles swapped). To return everything to normal, a rebalance procedure (swapping the roles back) is required. It also does not require DBMS downtime, but all sessions in the database are suspended while the rebalance runs.
If a fallen segment is damaged so badly that simply copying the WAL deltas is not enough, a full recovery of the segment can be used: in this case the postgresql instance is created from scratch, and because the recovery is not incremental, it can take a long time.
5. Performance
Evaluating the performance of a Greenplum cluster is a rather loose concept. I decided to start with the tests performed in this article: habrahabr.ru/post/253017, since the systems under consideration are similar in many ways. Since the cluster being tested here is considerably more powerful (8 times as many servers alone) than the one in that article, we will take 10 times more test data. If you would like to see other cases tested in this article, write in the comments; if possible, I will try to run them.
Setup: a cluster of 24 segment servers, each with 192 GB of memory and 40 cores. Number of primary segments in the cluster: 96.
So, in the first example we create a table with 4 fields plus a primary key on one of them. Then we fill the table with data (10,000,000 rows) and try a simple SELECT with several conditions. I remind you that the whole test is taken from the Postgres-XL article.
Test 1: SELECT with conditions (a hedged sketch of this test is given below). As you can see, the query execution time was 175 ms.
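The original test listing was published as an image; below is a sketch of what such a test might look like. The table layout and the predicates are assumptions loosely following the Postgres-XL article:

CREATE TABLE test_select (
    id          bigint PRIMARY KEY,
    profile_id  bigint,
    status      int,
    switch_date timestamp
) DISTRIBUTED BY (id);

INSERT INTO test_select
SELECT i,
       i % 10000,
       i % 4,
       now() - (i % 1000) * interval '1 day'
FROM generate_series(1, 10000000) i;

SELECT count(*)
FROM test_select
WHERE status = 2
  AND switch_date > now() - interval '30 days';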
Now let's try a join on the distribution key of one table and a regular field of the other. Test 2: JOIN on the distribution key of one table and a regular field of another table - the query completed in about 4.6 seconds.
6. Cluster expansion
Sooner or later the capacity of the existing cluster stops being enough: the amount of data grows, the load profile changes, new projects appear, and the available resources (processors, memory, disks, network) no longer cope. When planning an expansion it is worth keeping in mind the "number of processor cores / TB of data" ratio mentioned in section 1, so that the new configuration matches the expected load.
Greenplum allows the cluster to be expanded on the fly. In general terms, the procedure looks like this:
- plan the expansion and make sure the new servers meet GP's requirements;
- purchase and install the new servers (taking care of network connectivity);
- install the software on the new servers;
- prepare a configuration file describing the new segments (the segment map);
- perform the expansion with the gpexpand utility (takes roughly 5 to 10 minutes, depending on the number of segments);
- make sure the new segments have come up and are working (check their status);
- redistribute (redistribute) every table;
- recollect statistics (analyze) on all tables.
In our case, the whole procedure, including the table redistribution, took about 20-30 hours.
7. Drawbacks
It would be unfair not to mention the drawbacks we ran into while working with GP. Some of them are inherited from postgresql (the need for vacuum, growing WAL logs), others are specific to Greenplum:
- master failover is not 100% automatic: switching to the standby master is a manual procedure (see section 4.1) and has to be automated with external tools if needed;
- Greenplum is not meant for OLTP workloads. GP is an analytical DBMS designed for a relatively small number of simultaneous heavy queries. A large stream of light read/write queries (on the order of 600 queries per second and more) degrades performance, because each query in one way or another touches all N segments; frequent small update/insert operations are better grouped into batches. On the other hand, since Greenplum is based on postgresql DB, the usual SQL and most postgresql development techniques are available;
- some familiar features are missing or appeared only recently: tablespace support, for example, only became available starting with GP 4.3.6.0 (as did, for instance, gphdfs support for newer Hadoop distributions). In addition, Greenplum is based on Postgresql 8.2.15, so many niceties of modern postgres versions are simply not there.
8. Conclusion
Greenplum is a powerful and flexible tool for analytical processing of large volumes of data. In our experience it copes well with the role of the core of an enterprise-level Data Warehouse (the core of our DWH is GP), and familiarity with postgresql makes Greenplum a fairly easy Data Warehouse DB to get started with.
On top of that, the technology has a future: on February 17, 2015, Pivotal announced that Greenplum would be released as open source as part of its Big Data Product Suite.
UPD 28.10.2015. The source code is now available on github: github.com/greenplum-db/gpdb