Configure replication between PostgreSQL and PipelineDB

This article may be of interest to those who already have a superficial understanding of the types and problems of replication within a PostgreSQL cluster, as well as those who decide to use the PipelineDB streaming database engine as a replica in such a cluster.

PipelineDB is one of the implementations of the now-growing streaming DBMS. You can easily read about the advantages of streaming database management systems in various cases today on a variety of resources. Very simply, the principle of their work is visualized on the website www.pipelinedb.com in the “How It Works” section.

Specifically, PipelineDB is a PostgreSQL fork with additional functionality that allows you to store only aggregated data, calculating the delta from the incoming stream (hence the name of this type of DBMS) on the fly. This data is stored in special PipelineDB objects, called continuous views. In the simplest case, the stream itself is formed from ordinary tables stored in the same database. Using this tool allows us to get rid of the need to create and maintain an ETL layer when preparing data for reporting systems, and can save you a lot of time and nerves. But I believe that since you are reading this, then you already know something about it in a volume sufficient for the appearance of interest in the events described here.
')
We will consider a case in which PostgreSQL 9.4+ DBMS is already working on our product environment, and we need to get its realtime (well, or almost realtime) replica in order to unload the main database from the multiple and heavy SELECT queries received from, for example, reporting systems, DWH or our data marts. And after studying the issue, you can decide that the streaming DBMS is very well suited for such a task.

But bad luck - what kind of replication mechanism to use? After additional study of the issue, we conclude that PostgreSQL's wonderful built-in asynchronous (physical) replication streaming mechanism, which appeared in PostgreSQL version 9.0 and is constantly evolving, does not fit due to its limitations, namely:

a) the master and the replica should have the same major version of PostgreSQL, and if possible, spin on identical hardware.
b) the replica works in “hot standby” mode, in which it is read-only
c) when using physical replication, we can only have a full "one-to-one" replica of the master

In my case, the first restriction would prevent me from raising a replica of the master server running Postgre 9.6 to PipelineDB. the Postgre version, from which the latest version of PipelineDB used by me, is only 9.5. If your wizard is running Postgre 9.5, then you can try this trick, but there is a high probability that the master server simply does not recognize PipelineDB as a complete and equal PostgreSQL - the mechanism of physical replication is very capricious in this regard.

The second limitation is more significant. As we found out, PipelineDB writes its data to the database. At least it is continuous views, for which we are confused with it. But the second restriction allows our cue to be only complete - one to one - a copy of the wizard database without the ability to write to it. That does not suit us at all.

Thus, since physical replication does not suit us, we understand that we need to look towards logical replication. Not without its drawbacks, but completely eliminating these two limitations, namely:

a) logical replication allows us to work with different versions of the DBMS on the master and the slave
b) logical replication allows you to make a replica of only the data we need, and not all master data, one-to-one, and does not block the write slave

And here before us the whole ocean of possibilities opens.

At the first acquaintance with the list of various tools for creating logical replication and a variety of methods of replication itself, the first desire that arises is the desire to change the type of activity. But the first shock passes, and we begin to catch from this ocean worthy candidates for the post of the instrument of our dreams.

A one-year article that discusses issues and replications and utilities that support it well, including: postgresql.leopard.in.ua/html/# replication

One of the most popular tools used for this is slony (trigger-based) and pgpool / pgpool-II (middleware).

I’ll say right away that the attempt to solve this problem with the help of the very well-known and popular Slony version 2.2.5 utility did not succeed in two weeks - even in the case when for proof-of-concept and master and replica tasks were running the same PostgreSQL version . The slony daemon stubbornly did not want to start and rebooted immediately upon startup due to the segmentation fault, the cause of which could not be found. Yes, and a thankless task to look for the causes of segmentation faults in third-party software. Moreover, the same picture was observed when compiling this utility from source codes and installing it from the native Alpine Linux repository.

This experiment was conducted with such starting data:
- docker containers
- both master and slave: postgre 9.6 on Alpine Linux

It is possible that the initial conditions themselves became the cause of failure - the use of the docker or specifically this Linux distribution - but in my case these were the rules of the game. I also admit that the problem could be hiding in the instability of the latest version of Slony itself, which I used. In any case, this decision did not work, and Slony went to bed. Perhaps in another system configuration or with another version of Slony, you will succeed.

However, after reading the article further, you may not want to pop in this ancient utility. And do not forget about it: howfuckedismydatabase.com/postgres/slony.php

I never got to the second pgpool utility, because on the way I found what eventually became my solution: the pglogical utility from 2ndQuadrant.

Reading the utility documentation and realizing who the 2ndQuadrant was immediately got me to this decision. Looking ahead, I would say that apparently this solution can even go into the upcoming 10 version of PostgreSQL as a standard solution for logical replication. So it was decided to play with him, moving in the queue to the study pgpool.

I decided to build the system right away on the components that I was supposed to send to production later, bypassing the proof-of-concept stage:
Master: docker-container from the same image as in the first case
Replica: PipelineDB docker-container from another image , which seems to be the official image of this project, although it is decorated in some strange way. The image is compiled based on the Debian distribution, not Alpine.

So, I started digging into pglogical. Almost immediately, I was bitterly disappointed: in the APT repository, this utility exists only for PostgreSQL versions 9.4, 9.5 and 9.6, and there are no PipelineDBs in it there. The utility flatly refused to be installed on the host with PipelineDB, reporting unmet dependency postgresql-9.5. So A wonderful experiment ended in essence and did not begin.

But the awareness of the fact that PipelineDB is still the same PostgreSQL - the structure of the base directories, configuration files, built-in commands and service utilities clearly demonstrated this - and that this should lead me to something positive, did not leave me. And I decided on a little trick.

To the host with PipelineDB, the pglogical utility is installed as follows (everything was done in a docker container under the root):

Add a repository and download utility packages:

echo "deb [arch=amd64] http://packages.2ndquadrant.com/pglogical/apt/ jessie-2ndquadrant main" > /etc/apt/sources.list.d/2ndquadrant.list wget --quiet -O - http://packages.2ndquadrant.com/pglogical/apt/AA7A6805.asc | apt-key add - apt-get update && apt-get download libpq5 postgresql-9.5-pglogical

Install the necessary libraries and the package itself with ignoring dependencies (!), Solving our problem of the utility’s reluctance to install on anything other than Postgre:

 dpkg -i --ignore-depends=postgresql-9.5 libpq5_9.4.10-0+deb8u1_amd64.deb dpkg -i --ignore-depends=postgresql-9.5 postgresql-9.5-pglogical_1.2.2-1jessie_amd64.deb

We delete the dependency record from the / var / lib / dpkg / status file so that when apt-get continues to work, it does not swear at the non-found dependency and does not suggest that we remove pglogical:

 sed 's/, postgresql-9.5//g' /var/lib/dpkg/status > /var/lib/dpkg/status-new && \ mv /var/lib/dpkg/status /var/lib/dpkg/status.bkp && \ mv /var/lib/dpkg/status-new /var/lib/dpkg/status

Everything! The utility is installed on the host, with PipelineDB. But again, bad luck - the utility is installed in folders named postgresql, and PipelineDB has a similar folder structure, but named pipelinedb. Well, let's not be discouraged about this and move the utility files to the appropriate folders already PipelineDB:

 mv /usr/lib/postgresql/9.5/lib/* /usr/lib/pipelinedb/lib/pipelinedb/ mv /usr/lib/postgresql/9.5/bin/* /usr/lib/pipelinedb/bin/ mv /usr/share/postgresql/9.5/extension/* /usr/lib/pipelinedb/share/pipelinedb/extension/

That's all. We got a working server with PipelineDB with the pglogical utility installed, which we can start using.

After a brief cluster setup, the master slave (configure the PipelineDB replica in the same way as regular PostgreSQL), described in millions of resources, including Postgre documentation, and after completing the simple configuration steps of the utility itself, we can verify that replication works.

UPD: Sources of configurations and initialization scripts for the docker can be found here: https://github.com/akrymets/pg-replication

I will be glad to hear comments on the essence and suggestions on the narration. The best offers will be implemented as edits to the article.

Source: https://habr.com/ru/post/322744/

All Articles

Configure replication between PostgreSQL and PipelineDB

More articles: