
A highly reliable PostgreSQL cluster based on Patroni, HAProxy and Keepalived

Hi, Habr! I recently got a task: to configure the most reliable possible cluster of PostgreSQL 9.6 servers.

The idea was to get a cluster that survives the loss of any server, or even several servers, and can automatically bring servers back into service after crashes.

While planning the cluster, I studied many articles, from the official PostgreSQL documentation to various howtos, including ones from Habr; I tried to set up a standard cluster with RepMgr and experimented with pgpool.

In general it worked, but I occasionally had problems with failover and needed manual intervention to recover from failures, so I decided to look for other options.

Eventually, somewhere (I no longer remember exactly where) I found a link to the wonderful Zalando project Patroni, and it all came together...

Introduction


Patroni is a Python daemon that lets you automatically maintain PostgreSQL clusters with various types of replication and automatic role switching.

Its particular beauty, in my opinion, is that it uses distributed DCS stores (ZooKeeper, etcd and Consul are supported) to keep the cluster state consistent and to perform leader election.

Thus, the cluster is easily integrated into almost any system: you can always find out who the current master is, and the status of all servers, either by querying the DCS or by querying Patroni directly over HTTP.

Well, it's just beautiful :)

I tested how Patroni works: I tried killing the master and other servers, and loading databases of various sizes (a ~25 GB database is automatically brought up from scratch over a 10 Gb network in a few minutes); overall I really liked the Patroni project. After fully implementing the scheme described below, I ran tests with a simple benchmark that accessed the database at a single address while I crashed every element of the cluster (the master server, haproxy, keepalived).

The delay in transferring the role to a new master was a couple of seconds. When the former master returns to the cluster, or a new server is added, no role switch occurs.

To automate cluster deployment and adding new servers, I decided to use the familiar Ansible (links to the resulting roles are at the end of the article). As the DCS I use Consul, which we already run in our infrastructure.

The article has two main goals: to show PostgreSQL users that there is such a wonderful thing as Patroni (there are practically no mentions of it on the Russian-language web in general and on Habr in particular), and at the same time to share some experience of using Ansible on a simple example, for those who are just starting to work with it.

I will try to explain everything as we go, by walking through the Ansible roles and playbooks. Those who do not use Ansible can translate these actions to their favorite server management tool, or perform them manually.

Since most of the YAML scripts are long, I wrap them in spoilers.
The story is divided into two parts: preparing the servers and deploying the cluster itself.

For those who are familiar with Ansible, the first part will not be interesting, so I recommend going straight to the second.

Part I


For this example I use virtual machines based on CentOS 7. The VMs are deployed from a template that is periodically updated (kernel, system packages), but that topic is beyond the scope of this article.

I will only note that no application or server software is pre-installed on the virtual machines. Any cloud resources will also do, for example from AWS, DO, vScale, etc. For those there are dynamic inventory scripts and Ansible integrations, or you can bolt on Terraform, so the whole process of creating and deleting servers can be automated from scratch.

First you need to create an Ansible inventory of the resources in use. Ansible (in my case, and by default) lives in /etc/ansible. Create an inventory in the /etc/ansible/hosts file:

[pgsql]
cluster-pgsql-01.local
cluster-pgsql-02.local
cluster-pgsql-03.local

We use the internal domain zone .local, hence the server names.

Next, each server needs to be prepared: install all the necessary components and working tools.

For this purpose we create a playbook in /etc/ansible/tasks:

/etc/ansible/tasks/essentialsoftware.yml
---
- name: Install essential software
  yum: name={{ item }} state=latest
  tags: software
  with_items:
    - ntpdate
    - bzip2
    - zip
    - unzip
    - openssl-devel
    - mc
    - vim
    - atop
    - wget
    - mytop
    - screen
    - net-tools
    - rsync
    - psmisc
    - gdb
    - subversion
    - htop
    - bind-utils
    - sysstat
    - nano
    - iptraf
    - nethogs
    - ngrep
    - tcpdump
    - lm_sensors
    - mtr
    - s3cmd
    - psmisc
    - gcc
    - git
    - python2-pip
    - python-devel

- name: install the 'Development tools' package group
  yum:
    name: "@Development tools"
    state: present


The Essential package set is used to create a familiar working environment on any server.
The Development tools package group, some -devel libraries and Python pip are needed to build Python modules for PostgreSQL.

We use virtual machines on VMware ESXi, and for ease of administration they need to run a VMware guest agent.

For this we will run the open agent vmtoolsd, and describe its installation in a separate playbook (since not all servers are virtual, for some of them this task is not needed):

/etc/ansible/tasks/open-vm-tools.yml
---
- name: Install open VM tools for VMWARE
  yum: name={{ item }} state=latest
  tags: open-vm-tools
  with_items:
    - open-vm-tools

- name: VmWare service start and enabling
  service: name=vmtoolsd.service state=started enabled=yes
  tags: open-vm-tools


To finish preparing the servers for the main software, in our case the following steps are needed:

1) set up time synchronization using ntp
2) install and run zabbix agent for monitoring
3) roll out the required ssh keys and authorized_keys.

So as not to bloat the article with details unrelated to the cluster itself, I will just briefly quote the Ansible playbooks that perform these tasks:

NTP:

/etc/ansible/tasks/ntpd.yml
---
- name: setting default timezone
  set_fact:
    timezone: name=Europe/Moscow
  when: timezone is not defined

- name: setting TZ
  timezone: name={{ timezone }}
  when: timezone is defined
  tags:
    - tz
    - tweaks
    - ntp
    - ntpd

- name: Configurating cron for ntpdate
  cron: name="ntpdate" minute="*/5" job="/usr/sbin/ntpdate pool.ntp.org"
  tags:
    - tz
    - tweaks
    - ntp
    - ntpd

- name: ntpd stop and disable
  service: name=ntpd state=stopped enabled=no
  tags:
    - tz
    - tweaks
    - ntp
    - ntpd
  ignore_errors: yes

- name: crond restart and enabled
  service: name=crond state=restarted enabled=yes
  tags:
    - tz
    - tweaks
    - ntp
    - ntpd


First it checks whether the server has its own time zone defined, and if not, sets Moscow time (most of our servers use it).

We do not use ntpd because of problems with the clock drifting on ESXi virtual machines, after which ntpd refuses to synchronize time (and tinker panic 0 does not help). So we simply run the ntp client from cron every 5 minutes.

Zabbix-agent:

/etc/ansible/tasks/zabbix.yml
---
- name: set zabbix ip external
  set_fact:
    zabbix_ip: 132.xx.xx.98
  tags: zabbix

- name: set zabbix ip internal
  set_fact:
    zabbix_ip: 192.168.xx.98
  when: ansible_all_ipv4_addresses | ipaddr('192.168.0.0/16')
  tags: zabbix

- name: Import Zabbix3 repo
  yum: name=http://repo.zabbix.com/zabbix/3.0/rhel/7/x86_64/zabbix-release-3.0-1.el7.noarch.rpm state=present
  tags: zabbix

- name: Remove old zabbix
  yum: name=zabbix2* state=absent
  tags: zabbix

- name: Install zabbix-agent software
  yum: name={{ item }} state=latest
  tags: zabbix
  with_items:
    - zabbix-agent
    - zabbix-release

- name: Creates directories
  file: path={{ item }} state=directory
  tags:
    - zabbix
    - zabbix-mysql
  with_items:
    - /etc/zabbix/externalscripts
    - /etc/zabbix/zabbix_agentd.d
    - /var/lib/zabbix

- name: Copy scripts
  copy: src=/etc/ansible/templates/zabbix/{{ item }} dest=/etc/zabbix/externalscripts/{{ item }} owner=zabbix group=zabbix mode=0755
  tags: zabbix
  with_items:
    - netstat.sh
    - iostat.sh
    - iostat2.sh
    - iostat_collect.sh
    - iostat_parse.sh
    - php_workers_discovery.sh

- name: Copy .my.cnf
  copy: src=/etc/ansible/files/mysql/.my.cnf dest=/var/lib/zabbix/.my.cnf owner=zabbix group=zabbix mode=0700
  tags:
    - zabbix
    - zabbix-mysql

- name: remove default configs
  file: path={{ item }} state=absent
  tags: zabbix
  with_items:
    - /etc/zabbix_agentd.conf
    - /etc/zabbix/zabbix_agentd.conf

- name: put zabbix-agentd.conf to default place
  template: src=/etc/ansible/templates/zabbix/zabbix_agentd.tpl dest=/etc/zabbix_agentd.conf owner=zabbix group=zabbix force=yes
  tags: zabbix

- name: link zabbix-agentd.conf to /etc/zabbix
  file: src=/etc/zabbix_agentd.conf dest=/etc/zabbix/zabbix_agentd.conf state=link
  tags: zabbix

- name: zabbix-agent start and enable
  service: name=zabbix-agent state=restarted enabled=yes
  tags: zabbix


When installing Zabbix, the agent config is rolled out from a template; only the server address needs to change.

Servers inside our network point to 192.168.x.98, and servers without access to that network point to the external address of the same server.

Rolling out ssh keys and ssh settings is done by a separate role, which can be found, for example, on Ansible Galaxy.

There are many such roles, and the changes are quite trivial, so I see no point in quoting it all here.

Time to roll the configuration onto the servers. In practice I install all the components and the cluster itself in one pass, with the full config, but for this tutorial it seems better to split it into two steps, matching the two parts of the article.

Create a playbook for a group of servers:

/etc/ansible/cluster-pgsql.yml
---
- hosts: pgsql
  pre_tasks:
    - name: Setting system hostname
      hostname: name="{{ ansible_host }}"
    - include: tasks/essentialsoftware.yml
    - include: tasks/open-vm-tools.yml
    - include: tasks/ntpd.yml
  post_tasks:
    - include: tasks/zabbix.yml
  roles:
    - ssh.role
    - ansible-role-patroni


We start processing all servers:

~# ansible-playbook cluster-pgsql.yml --skip-tags patroni

If you have downloaded my complete example from the GitHub repository, you will also have the Patroni role, which we do not need to run yet.

The --skip-tags argument tells Ansible to skip steps marked with this tag, so the ansible-role-patroni role will not be executed now.

If it is not on disk, nothing terrible happens; Ansible will simply ignore this flag.

Ansible connects to my servers directly as root; if you need to run Ansible as an unprivileged user, add the "become: true" flag to the steps that require root rights, which makes Ansible use sudo for those steps.
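
A minimal sketch of what that can look like (the user name here is hypothetical and is not part of my playbooks):

 ---
 - hosts: pgsql
   remote_user: deploy      # hypothetical unprivileged user
   become: true             # escalate to root via sudo for the whole play
   tasks:
     - name: Install essential software
       yum: name=htop state=latest
       # become: true could also be set only on the individual tasks that need root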

Preparation is over.

Part II


We proceed to the deployment of the cluster itself.

Since setting up the cluster takes a fair amount of work (installing PostgreSQL and all the components, rolling out individual configs for them), I put the whole process into a separate role.

Roles in Ansible let you group sets of related tasks, which simplifies writing scripts and keeping them maintainable.

I took the role template for installing Patroni from here: https://github.com/gitinsky/ansible-role-patroni, for which I thank its author.
For my purposes I reworked the existing playbook and added my own haproxy and keepalived playbooks.

My roles live in the /etc/ansible/roles directory. Create a directory for the new role, and subdirectories for its components:

~# mkdir /etc/ansible/roles/ansible-role-patroni/tasks
~# mkdir /etc/ansible/roles/ansible-role-patroni/templates

In addition to PostgreSQL, our cluster will consist of the following components:

1) haproxy, to monitor the server status and redirect requests to the master server.
2) keepalived, to provide a single entry point into the cluster: a virtual IP.

All playbooks executed by this role are listed in the file that Ansible runs by default:

/etc/ansible/roles/ansible-role-patroni/tasks/main.yml
- include: postgres.yml
- include: haproxy.yml
- include: keepalived.yml


Next, we describe the individual tasks.

The first playbook installs PostgreSQL 9.6 from the upstream repository plus the additional packages Patroni needs, and then clones Patroni itself from GitHub:

/etc/ansible/roles/ansible-role-patroni/tasks/postgres.yml
---
- name: Import Postgresql96 repo
  yum: name=https://download.postgresql.org/pub/repos/yum/9.6/redhat/rhel-7-x86_64/pgdg-centos96-9.6-3.noarch.rpm state=present
  tags: patroni
  when: install is defined

- name: Install PGsql96
  yum: name={{ item }} state=latest
  tags: patroni
  with_items:
    - postgresql96
    - postgresql96-contrib
    - postgresql96-server
    - python-psycopg2
    - repmgr96
  when: install is defined

- name: checkout patroni
  git: repo=https://github.com/zalando/patroni.git dest=/opt/patroni
  tags: patroni
  when: install is defined

- name: create /etc/patroni
  file: state=directory dest=/etc/patroni
  tags: patroni
  when: install is defined

- name: put postgres.yml
  template: src=postgres0.yml dest=/etc/patroni/postgres.yml backup=yes
  tags: patroni
  when: install is defined

- name: install python packages
  pip: name={{ item }}
  tags: patroni
  with_items:
    - python-etcd
    - python-consul
    - dnspython
    - boto
    - mock
    - requests
    - six
    - kazoo
    - click
    - tzlocal
    - prettytable
    - PyYAML
  when: install is defined

- name: put patroni.service systemd unit
  template: src=patroni.service dest=/etc/systemd/system/patroni.service backup=yes
  tags: patroni
  when: install is defined

- name: Reload daemon definitions
  command: /usr/bin/systemctl daemon-reload
  tags: patroni

- name: restart
  service: name=patroni state=restarted enabled=yes
  tags: patroni


Besides installing the software, this playbook also renders the configuration for the current Patroni server and the systemd unit that starts the daemon, and then starts Patroni. Config templates and the systemd unit must be in the templates directory inside the role.

Patroni config template:

/etc/ansible/roles/ansible-role-patroni/templates/postgres.yml.j2
name: {{ patroni_node_name }}
scope: &scope {{ patroni_scope }}

consul:
  host: consul.services.local:8500

restapi:
  listen: 0.0.0.0:8008
  connect_address: {{ ansible_default_ipv4.address }}:8008
  auth: 'username:{{ patroni_rest_password }}'

bootstrap:
  dcs:
    ttl: &ttl 30
    loop_wait: &loop_wait 10
    maximum_lag_on_failover: 1048576 # 1 megabyte in bytes
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        archive_mode: "on"
        wal_level: hot_standby
        archive_command: mkdir -p ../wal_archive && cp %p ../wal_archive/%f
        max_wal_senders: 10
        wal_keep_segments: 8
        archive_timeout: 1800s
        max_replication_slots: 5
        hot_standby: "on"
        wal_log_hints: "on"
  pg_hba: # Add following lines to pg_hba.conf after running 'initdb'
    - host replication replicator 192.168.0.0/16 md5
    - host all all 0.0.0.0/0 md5

postgresql:
  listen: 0.0.0.0:5432
  connect_address: {{ ansible_default_ipv4.address }}:5432
  data_dir: /var/lib/pgsql/9.6/data
  pg_rewind:
    username: superuser
    password: {{ patroni_postgres_password }}
  pg_hba:
    - host all all 0.0.0.0/0 md5
    - hostssl all all 0.0.0.0/0 md5
  replication:
    username: replicator
    password: {{ patroni_replicator_password }}
    network: 192.168.0.0/16
  superuser:
    username: superuser
    password: {{ patroni_postgres_password }}
  admin:
    username: admin
    password: {{ patroni_postgres_password }}
  restore: /opt/patroni/patroni/scripts/restore.py


Since each server in the cluster needs its own Patroni configuration, the config is a jinja2 template (the postgres0.yml.j2 file), and the template step makes Ansible render it, substituting variables whose values are taken from a separate per-server description.

Variables that are common to the entire cluster will be specified in the inventory, which now takes the following form:

/etc/ansible/hosts
[pgsql]
cluster-pgsql-01.local
cluster-pgsql-02.local
cluster-pgsql-03.local

[pgsql:vars]
patroni_scope: "cluster-pgsql"
patroni_rest_password: flsdjkfasdjhfsd
patroni_postgres_password: flsdjkfasdjhfsd
patroni_replicator_password: flsdjkfasdjhfsd
cluster_virtual_ip: 192.xx.xx.125


Variables specific to each server are kept in host_vars/<server name>:

/etc/ansible/host_vars/pgsql-cluster-01.local/main.yml
 patroni_node_name: cluster_pgsql_01
 keepalived_priority: 99


Let me explain what some of these variables are for:

patroni_scope - the cluster name used when registering with Consul
patroni_node_name - the server name used when registering with Consul
patroni_rest_password - password for the Patroni HTTP interface (needed to send cluster-changing commands)
patroni_postgres_password - password for the postgres user. It is set when Patroni creates a new database.
patroni_replicator_password - password for the replicator user; the slaves replicate under this account.

This file also holds some other variables used in other playbooks or roles, in particular ssh settings (keys, users), the server time zone, the server priority in the keepalived cluster, etc.

The configuration for the other servers is similar; only the server name and priority change (for example, 99-100-101 for three servers).
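
For instance, the host_vars for the second node would differ only in the name and priority (a hypothetical sample following the same pattern):

/etc/ansible/host_vars/<second server name>/main.yml
 patroni_node_name: cluster_pgsql_02
 keepalived_priority: 100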

Install and configure haproxy:

/etc/ansible/roles/ansible-role-patroni/tasks/haproxy.yml
---
- name: Install haproxy
  yum: name={{ item }} state=latest
  tags:
    - patroni
    - haproxy
  with_items:
    - haproxy
  when: install is defined

- name: put config
  template: src=haproxy.cfg.j2 dest=/etc/haproxy/haproxy.cfg backup=yes
  tags:
    - patroni
    - haproxy

- name: restart and enable
  service: name=haproxy state=restarted enabled=yes
  tags:
    - patroni
    - haproxy


Haproxy is installed on every host; its config lists all the PostgreSQL servers, checks which one is currently the master, and sends requests to it.
For this check, a beautiful feature of Patroni is used: the REST interface.

When you access the URL server:8008 (8008 is the default port), Patroni returns a JSON report on the node status, and the HTTP response code also reflects whether this server is the master: if it is, the response code is 200; if not, it is 503.

I strongly advise looking at the Patroni documentation; the HTTP interface is quite interesting, as it also allows forcing role switches and managing the cluster.
The same can be done with the patronictl.py console utility shipped with Patroni.
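
A quick manual check might look like this (a sketch; the host name is from this example, and the exact JSON fields depend on the Patroni version):

 ~# curl -s -o /dev/null -w '%{http_code}\n' http://cluster-pgsql-01.local:8008   # 200 if this node is the master, 503 otherwise
 ~# curl -s http://cluster-pgsql-01.local:8008/patroni                            # full node status report in JSON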

The haproxy configuration is fairly simple:

/etc/ansible/roles/ansible-role-patroni/templates/haproxy.cfg
global
    maxconn 800

defaults
    log global
    mode tcp
    retries 2
    timeout client 30m
    timeout connect 4s
    timeout server 30m
    timeout check 5s

frontend ft_postgresql
    bind *:5000
    default_backend postgres-patroni

backend postgres-patroni
    option httpchk
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server {{ patroni_node_name }} {{ patroni_node_name }}.local:5432 maxconn 300 check port 8008
    server {{ patroni_node_name }} {{ patroni_node_name }}.local:5432 maxconn 300 check port 8008
    server {{ patroni_node_name }} {{ patroni_node_name }}.local:5432 maxconn 300 check port 8008


With this configuration, haproxy listens on port 5000 and forwards traffic to the master server.

The status check runs every 3 seconds; 3 failed responses (code 503) are needed to mark a server as Down, and 2 successful responses (code 200) to bring it back.
At any moment you can connect to any of the haproxy instances, and it will correctly route traffic to the master server.
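
For example, a client can simply connect to port 5000 on any node (or to the virtual IP described below) and will always land on the master; a sketch, using the admin user from the Patroni template (any database user will do) and the virtual IP from the inventory:

 ~# psql -h 192.xx.xx.125 -p 5000 -U admin -d postgres -c 'SELECT pg_is_in_recovery();'
 # pg_is_in_recovery() returns 'f' only on the master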

Patroni also ships a template for configuring the confd daemon and an example of its integration with etcd, which lets you change the haproxy config dynamically when servers are removed or added.

For now I keep the cluster fairly static; too much automation here, IMHO, can lead to unforeseen problems.

We did not want clients to need any special logic changes, live server tracking and so on, so we create a single entry point to the cluster with keepalived.

The keepalived daemon talks to its neighbours via the VRRP protocol; one of the daemons is elected as the main one (the priority is set in the config and templated via the keepalived_priority variable in host_vars for each server), and it brings up the virtual IP address.
The other daemons wait patiently. If the current main keepalived server dies for some reason or signals a failure to its neighbours, a re-election takes place, and the next server by priority takes over the virtual IP address.

To protect against haproxy crashes, the keepalived daemons run the check "killall -0 haproxy" once a second. It returns code 0 if a haproxy process exists, and 1 if it does not.
If haproxy disappears, the keepalived daemon signals a VRRP fault and drops the virtual IP.
The virtual IP is then immediately picked up by the next server by priority that has a live haproxy.
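
The same check is easy to try by hand (a quick sketch):

 ~# killall -0 haproxy; echo $?    # prints 0 while haproxy is running
 ~# systemctl stop haproxy
 ~# killall -0 haproxy; echo $?    # prints 1 once the process is gone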

Install and configure keepalived:

/etc/ansible/roles/ansible-role-patroni/tasks/keepalived.yml
---
- name: Install keepalived
  yum: name={{ item }} state=latest
  tags:
    - patroni
    - keepalived
  with_items:
    - keepalived
  when: install is defined

- name: put alert script
  template: src=alert.sh.j2 dest=/usr/local/sbin/alert.sh backup=yes mode=755
  tags:
    - patroni
    - keepalived
  when: install is defined

- name: put config
  template: src=keepalived.conf.j2 dest=/etc/keepalived/keepalived.conf backup=yes
  tags:
    - patroni
    - keepalived

- name: restart and enable
  service: name=keepalived state=restarted enabled=yes
  tags:
    - patroni
    - keepalived


Besides installing keepalived, this playbook also copies a simple script for sending alerts to Telegram. The script takes the message as an argument and simply calls the Telegram API with curl.

In that script you only need to set your token and Telegram group ID to receive alerts.

The keepalived configuration is described as a jinja2 template:

/etc/ansible/roles/ansible-role-patroni/templates/keepalived.conf.j2
global_defs {
    router_id {{ patroni_node_name }}
}

vrrp_script chk_haproxy {
    script "killall -0 haproxy"
    interval 1
    weight -20
    debug
    fall 2
    rise 2
}

vrrp_instance {{ patroni_node_name }} {
    interface ens160
    state BACKUP
    virtual_router_id 150
    priority {{ keepalived_priority }}
    authentication {
        auth_type PASS
        auth_pass secret_for_vrrp_auth
    }
    track_script {
        chk_haproxy weight 20
    }
    virtual_ipaddress {
        {{ cluster_virtual_ip }}/32 dev ens160
    }
    notify_master "/usr/bin/sh /usr/local/sbin/alert.sh '{{ patroni_node_name }} became MASTER'"
    notify_backup "/usr/bin/sh /usr/local/sbin/alert.sh '{{ patroni_node_name }} became BACKUP'"
    notify_fault "/usr/bin/sh /usr/local/sbin/alert.sh '{{ patroni_node_name }} became FAULT'"
}


The patroni_node_name, cluster_virtual_ip and keepalived_priority variables pull the corresponding data from host_vars and the inventory.

The keepalived config also calls the script that sends status change messages to the Telegram channel.

We roll the whole cluster configuration onto the servers:

 ~# ansible-playbook cluster-pgsql.yml 

Since Ansible is idempotent, that is, it performs steps only if they have not already been applied, you can run the playbook without extra parameters.

If you do not want to wait, or you are sure the servers are fully prepared, you can run ansible-playbook with the -t patroni flag.

Then only the steps tagged patroni will be executed.
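
That is, something like:

 ~# ansible-playbook cluster-pgsql.yml -t patroni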

Note that I do not specify server roles (master or slave) anywhere. This configuration creates an empty database, and whichever server is configured first simply becomes the master.

When new servers are added, Patroni sees through the DCS that the cluster already has a master, automatically copies the database from the current master, and attaches the slave to it.
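
You can peek at this state in the DCS yourself; a sketch, assuming the default key layout Patroni uses in Consul (keys under service/<scope>/) and the Consul address from the template above:

 ~# curl -s http://consul.services.local:8500/v1/kv/service/cluster-pgsql/leader?raw    # prints the name of the current leader node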

If a slave that has lagged behind the master for some time is started, Patroni automatically catches it up with the help of pg_rewind.

Let's make sure all servers have started and picked their roles:

 ~# journalctl -f -u patroni 

Messages from the slave (server cluster-pgsql-01):

Feb 17 23:50:32 cluster-pgsql-01.local patroni.py[100626]: 2017-02-17 23:50:32,254 INFO: Lock owner: cluster_pgsql_02; I am cluster_pgsql_01
Feb 17 23:50:32 cluster-pgsql-01.local patroni.py[100626]: 2017-02-17 23:50:32,255 INFO: Lock owner: cluster_pgsql_02; I am cluster_pgsql_01
Feb 17 23:50:32 cluster-pgsql-01.local patroni.py[100626]: 2017-02-17 23:50:32,255 INFO: does not have lock
Feb 17 23:50:32 cluster-pgsql-01.local patroni.py[100626]: 2017-02-17 23:50:32,255 INFO: no action. i am a secondary and i am following a leader


Messages from the master (in this case cluster-pgsql-02):
Feb 17 23:52:23 cluster-pgsql-02.local patroni.py[4913]: 2017-02-17 23:52:23,457 INFO: Lock owner: cluster_pgsql_02; I am cluster_pgsql_02
Feb 17 23:52:23 cluster-pgsql-02.local patroni.py[4913]: 2017-02-17 23:52:23,874 INFO: Lock owner: cluster_pgsql_02; I am cluster_pgsql_02
Feb 17 23:52:24 cluster-pgsql-02.local patroni.py[4913]: 2017-02-17 23:52:24,082 INFO: no action. i am the leader with the lock
Feb 17 23:52:33 cluster-pgsql-02.local patroni.py[4913]: 2017-02-17 23:52:33,458 INFO: Lock owner: cluster_pgsql_02; I am cluster_pgsql_02
Feb 17 23:52:33 cluster-pgsql-02.local patroni.py[4913]: 2017-02-17 23:52:33,884 INFO: Lock owner: cluster_pgsql_02; I am cluster_pgsql_02
Feb 17 23:52:34 cluster-pgsql-02.local patroni.py[4913]: 2017-02-17 23:52:34,094 INFO: no action. i am the leader with the lock


The logs clearly show that each server constantly monitors its own status and the status of the master.
Let's try to stop the master:

 ~# systemctl stop patroni 

Feb 17 23:54:03 cluster-pgsql-02.local patroni.py[4913]: 2017-02-17 23:54:03,457 INFO: Lock owner: cluster_pgsql_02; I am cluster_pgsql_02
Feb 17 23:54:03 cluster-pgsql-02.local patroni.py[4913]: 2017-02-17 23:54:03,880 INFO: Lock owner: cluster_pgsql_02; I am cluster_pgsql_02
Feb 17 23:54:04 cluster-pgsql-02.local patroni.py[4913]: 2017-02-17 23:54:04,092 INFO: no action. i am the leader with the lock
Feb 17 23:54:11 cluster-pgsql-02.local systemd[1]: Stopping Runners to orchestrate a high-availability PostgreSQL...
Feb 17 23:54:13 cluster-pgsql-02.local patroni.py[4913]: waiting for server to shut down.... done
Feb 17 23:54:13 cluster-pgsql-02.local patroni.py[4913]: server stopped


And here is what happened at that moment on the slave:

Feb 17 19:54:12 cluster-pgsql-01 patroni.py: 2017-02-17 23:54:12,353 INFO: does not have lock
Feb 17 19:54:12 cluster-pgsql-01 patroni.py: 2017-02-17 23:54:12,776 INFO: no action. i am a secondary and i am following a leader
Feb 17 19:54:13 cluster-pgsql-01 patroni.py: 2017-02-17 23:54:13,440 WARNING: request failed: GET http://192.xx.xx.121:8008/patroni (HTTPConnectionPool(host='192.xx.xx.121', port=8008
): Max retries exceeded with url: /patroni (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x1f12750>: Failed to establish a new connection: [Er
rno 111] Connection refused',)))
Feb 17 19:54:13 cluster-pgsql-01 patroni.py: 2017-02-17 23:54:13,444 INFO: Got response from cluster_pgsql_03 http://192.xx.xx.122:8008/patroni: {"database_system_identifier": "63847
30077944883705", "postmaster_start_time": "2017-02-17 05:36:52.388 MSK", "xlog": {"received_location": 34997272728, "replayed_timestamp": null, "paused": false, "replayed_location": 34997272
728}, "patroni": {"scope": "clusters-pgsql", "version": "1.2.3"}, "state": "running", "role": "replica", "server_version": 90601}
Feb 17 19:54:13 cluster-pgsql-01 patroni.py: server promoting
Feb 17 19:54:13 cluster-pgsql-01 patroni.py: 2017-02-17 23:54:13,961 INFO: cleared rewind flag after becoming the leader
Feb 17 19:54:14 cluster-pgsql-01 patroni.py: 2017-02-17 23:54:14,179 INFO: promoted self to leader by acquiring session lock
Feb 17 19:54:23 cluster-pgsql-01 patroni.py: 2017-02-17 23:54:23,436 INFO: Lock owner: cluster_pgsql_01; I am cluster_pgsql_01
Feb 17 19:54:23 cluster-pgsql-01 patroni.py: 2017-02-17 23:54:23,857 INFO: Lock owner: cluster_pgsql_01; I am cluster_pgsql_01
Feb 17 19:54:24 cluster-pgsql-01 patroni.py: 2017-02-17 23:54:24,485 INFO: no action. i am the leader with the lock


This server has taken over the master role.

Now let's return server 2 to the cluster:

 ~# systemctl start patroni 

Feb 18 00:02:11 cluster-pgsql-02.local systemd[1]: Started Runners to orchestrate a high-availability PostgreSQL.
Feb 18 00:02:11 cluster-pgsql-02.local systemd[1]: Starting Runners to orchestrate a high-availability PostgreSQL...
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,186 INFO: Lock owner: cluster_pgsql_01; I am cluster_pgsql_02
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,190 WARNING: Postgresql is not running.
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,190 INFO: Lock owner: cluster_pgsql_01; I am cluster_pgsql_02
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,398 INFO: Lock owner: cluster_pgsql_01; I am cluster_pgsql_02
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,400 INFO: starting as a secondary
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,412 INFO: rewind flag is set
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,609 INFO: Lock owner: cluster_pgsql_01; I am cluster_pgsql_02
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,609 INFO: Lock owner: cluster_pgsql_01; I am cluster_pgsql_02
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,609 INFO: changing primary_conninfo and restarting in progress
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,631 INFO: running pg_rewind from user=superuser host=192.xx.xx.120 port=5432 dbname=postgres sslmode=prefer sslcompression=1
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: servers diverged at WAL position 8/26000098 on timeline 25
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: rewinding from last common checkpoint at 8/26000028 on timeline 25
Feb 18 00:02:14 cluster-pgsql-02.local patroni.py[56855]: Done!
Feb 18 00:02:14 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:14,535 INFO: postmaster pid=56893
Feb 18 00:02:14 cluster-pgsql-02.local patroni.py[56855]: < 2017-02-18 00:02:14.554 MSK > LOG: redirecting log output to logging collector process
Feb 18 00:02:14 cluster-pgsql-02.local patroni.py[56855]: < 2017-02-18 00:02:14.554 MSK > HINT: Future log output will appear in directory "pg_log".
Feb 18 00:02:15 cluster-pgsql-02.local patroni.py[56855]: localhost:5432 - accepting connections
Feb 18 00:02:15 cluster-pgsql-02.local patroni.py[56855]: localhost:5432 - accepting connections
Feb 18 00:02:15 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:15,790 INFO: Lock owner: cluster_pgsql_01; I am cluster_pgsql_02
Feb 18 00:02:15 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:15,791 INFO: Lock owner: cluster_pgsql_01; I am cluster_pgsql_02
Feb 18 00:02:15 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:15,791 INFO: does not have lock
Feb 18 00:02:15 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:15,791 INFO: establishing a new patroni connection to the postgres cluster
Feb 18 00:02:16 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:16,014 INFO: no action. i am a secondary and i am following a leader


Patroni detected that it was joining a cluster with an existing master, caught the database up to the current state, and correctly assumed the slave role.

Let's try to break another layer of the cluster by stopping haproxy on the server where keepalived currently holds the master role.

By priority, that role is currently held by the second server:

[root@cluster-pgsql-02 ~]# ip a
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:50:56:a9:b8:7b brd ff:ff:ff:ff:ff:ff
inet 192.xx.xx.121/24 brd 192.168.142.255 scope global ens160
valid_lft forever preferred_lft forever
inet 192.xx.xx.125/32 scope global ens160 <----
valid_lft forever preferred_lft forever
inet6 fe80::xxx::4895:6d90/64 scope link
valid_lft forever preferred_lft forever


Stop haproxy:

 ~# systemctl stop haproxy ; journalctl -fl 

Feb 18 00:18:54 cluster-pgsql-02.local Keepalived_vrrp[25018]: VRRP_Script(chk_haproxy) failed
Feb 18 00:18:56 cluster-pgsql-02.local Keepalived_vrrp[25018]: VRRP_Instance(cluster_pgsql_02) Received higher prio advert
Feb 18 00:18:56 cluster-pgsql-02.local Keepalived_vrrp[25018]: VRRP_Instance(cluster_pgsql_02) Entering BACKUP STATE
Feb 18 00:18:56 cluster-pgsql-02.local Keepalived_vrrp[25018]: VRRP_Instance(cluster_pgsql_02) removing protocol VIPs.
Feb 18 00:18:56 cluster-pgsql-02.local Keepalived_vrrp[25018]: Opening script file /usr/bin/sh
Feb 18 00:18:56 cluster-pgsql-02.local Keepalived_healthcheckers[25017]: Netlink reflector reports IP 192.xx.xx.125 removed

Keepalived caught the problem, removed the virtual address from itself, and signaled its neighbors about it.

Let's see what happened on the neighbouring server (cluster-pgsql-01):
Feb 18 00:18:56 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) forcing a new MASTER election
Feb 18 00:18:56 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) forcing a new MASTER election
Feb 18 00:18:56 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) forcing a new MASTER election
Feb 18 00:18:56 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) forcing a new MASTER election
Feb 18 00:18:57 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) Transition to MASTER STATE
Feb 18 00:18:58 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) Entering MASTER STATE
Feb 18 00:18:58 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) setting protocol VIPs.
Feb 18 00:18:58 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) Sending gratuitous ARPs on ens160 for 192.xx.xx.125
Feb 18 00:18:58 cluster-pgsql-01.local Keepalived_vrrp[41190]: Opening script file /usr/bin/sh
Feb 18 00:18:58 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) Received lower prio advert, forcing new election
Feb 18 00:18:58 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) Sending gratuitous ARPs on ens160 for 192.xx.xx.125
Feb 18 00:18:58 cluster-pgsql-01.local Keepalived_healthcheckers[41189]: Netlink reflector reports IP 192.xx.xx.125 added
Feb 18 00:18:58 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) Received lower prio advert, forcing new election
Feb 18 00:18:58 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) Sending gratuitous ARPs on ens160 for 192.xx.xx.125
Feb 18 00:19:03 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) Sending gratuitous ARPs on ens160 for 192.xx.xx.125

The election ran twice (because the third cluster server managed to send its advertisement before the first election finished); server 1 took over the MASTER role and brought up the virtual IP.

Let's confirm:
[root@cluster-pgsql-01 log]# ip a
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:50:56:a9:f0:90 brd ff:ff:ff:ff:ff:ff
inet 192.xx.xx.120/24 brd 192.xx.xx.255 scope global ens160
valid_lft forever preferred_lft forever
inet 192.xx.xx.125/32 scope global ens160 <---- !
valid_lft forever preferred_lft forever
inet6 fe80::1d75:40f6:a14e:5e27/64 scope link
valid_lft forever preferred_lft forever

Now the virtual IP sits on a server that is not the replication master. That does not matter, though, because we access the database through haproxy, which monitors the cluster state on its own and always sends requests to the master.

When haproxy comes back on the second server, another election takes place (the keepalived with the higher priority comes back into play), and the virtual IP returns to its place.

In rare cases a slave cannot catch up with the master (for example, it went down long ago and the WAL segments it needs have already been rotated away). In that case you can completely wipe the data directory on the slave:

"rm -rf /var/lib/pgsql/9.6/data", and restart Patroni. It will copy the whole database from the master again.
(Be careful when wiping "unneeded" databases: double-check which server you are running this command on!)


A better option in this case is the patronictl utility. Its reinit command lets you safely reinitialize a specific cluster node; it will not run on the master.
Thanks to CyberDemon for the addition.
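
A sketch of how that might look with the cluster and node names used in this article:

 ~# /opt/patroni/patronictl.py -c /etc/patroni/postgres.yml reinit cluster-pgsql cluster_pgsql_03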

The patronictl utility also lets you see the current cluster state from the command line, without querying the DCS by hand, and manage the cluster.

Example cluster status report:
/opt/patroni/patronictl.py -c /etc/patroni/postgres.yml list cluster-pgsql
+---------------+------------------+-----------------+--------------+------------------+-----------+
|    Cluster    |      Member      |       Host      |     Role     |      State       | Lag in MB |
+---------------+------------------+-----------------+--------------+------------------+-----------+
| cluster-pgsql | cluster_pgsql_01 | 192.xxx.xxx.120 |    Leader    |     running      |    0.0    |
| cluster-pgsql | cluster_pgsql_02 | 192.xxx.xxx.121 | Sync standby |     running      |    0.0    |
| cluster-pgsql | cluster_pgsql_03 | 192.xxx.xxx.122 |              | creating replica |  33712.0  |
+---------------+------------------+-----------------+--------------+------------------+-----------+



Here the third node is being populated; its lag behind the master is 33 GB.
Once that process completes, it too enters the running state with zero lag.
You may also notice that its Role field is empty. That is because my cluster runs in synchronous replication mode: to reduce the overhead of synchronous replication, one slave works synchronously and the other in the usual asynchronous mode. If the master is lost, the roles shift: the synchronous slave becomes the new master, and the second slave switches to synchronous mode.

Afterword


The only things this cluster lacks for complete happiness, in my opinion, are connection pooling and proxying of read requests to all the slaves (to improve read performance), while sending inserts and updates only to the master.

With asynchronous replication, spreading the read load this way can produce unexpected answers if a slave lags behind the master; this must be taken into account.

Streaming (asynchronous) replication does not guarantee cluster-wide consistency at every moment; that requires synchronous replication.

In that mode the master waits for confirmation that transactions have been copied and applied on the slaves, which slows the database down. However, if losing transactions is unacceptable (for example, in some financial applications), synchronous replication is your choice.

Patroni supports all the options, and if synchronous replication suits you better, you only need to change a few fields in the Patroni configs.
The different replication approaches are covered very well in the Patroni documentation.
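
As a sketch of what such a change could look like (the synchronous_mode option; check the documentation for the exact fields in your Patroni version), the dcs section of the bootstrap config would gain:

 bootstrap:
   dcs:
     synchronous_mode: true   # Patroni keeps one standby in synchronous replication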

Someone will probably suggest pgpool, which by itself covers essentially all the functionality of this setup: it can monitor databases, proxy requests, bring up a virtual IP, and pool client connections.

Yes, it can. But in my opinion the Patroni scheme is much more transparent (that is just my opinion, of course), and during my experiments with pgpool I hit strange behaviour around master detection and virtual addresses, which I did not debug too deeply, deciding instead to look for another solution.

Of course, the problem may simply be my own hands, and I plan to return to testing pgpool later.

In any case, though, pgpool cannot fully automatically manage the cluster, introduce new servers and (especially) return failed ones, or work with a DCS. In my opinion, that is the most interesting part of Patroni's functionality.

Thank you for your attention. I will be glad to hear suggestions for further improving this solution, and will answer questions in the comments.

Many thanks to Zalando for Patroni, to the authors of the original Governor project, which served as the basis for Patroni, and to Alex Chistyakov for the Ansible role template.

The full code of the playbooks and Ansible templates described in the article is here. I would be grateful for improvements from Ansible and PostgreSQL gurus. :)

Main articles and sources used:

Several PgSQL clusters:

→ https://habrahabr.ru/post/301370/
→ https://habrahabr.ru/post/213409/
→ https://habrahabr.ru/company/etagi/blog/314000/

→ a post about Patroni on the Zalando blog
→ Patroni project
→ ansible-role-patroni Alex Chistyakov
→ Governor - unfortunately the development has long been frozen.
→ The book Ansible for DevOps is a great tutorial with lots of examples of using Ansible.

Source: https://habr.com/ru/post/322036/

