Good day to all!
As you know, PostgreSQL 9.2 was recently released with lots of interesting and useful features. Without thinking twice, we decided to upgrade our streaming replication cluster from 9.0 to 9.2. Everything would have been simple if not for a couple of circumstances:
- this is a production system with heavy daily traffic;
- downtime is not an option.
So it gets even more interesting... How we did it and what came of it, read on.
Why all this?
- the desire to get the goodies that appeared in PostgreSQL 9.2;
- moving the PostgreSQL master to a server with flashcache.
Given:
- three physical servers, two of which host the 9.0 master + slave pair, and one more, as yet unused, server with flashcache;
- the pair is replicated with the built-in streaming replication;
- four application backends and a machine with Sphinx constantly work with the database servers.
Difficulties:
- the cluster will not come up if you simply upgrade the package from 9.0 to 9.2 (the data directory has to be reinitialized or upgraded);
- pg_upgrade cannot be run without stopping the cluster;
- reinitializing followed by pg_restore is also out of the question because of the downtime;
- you cannot upgrade the master first and the slave later: streaming replication does not work between major versions.
So what is the way out?
The way out is Londiste from the Skytools 3 package; we already used it once to move from 8.4 to 9.0, so we have experience with it. Replication with Londiste is convenient because it can replicate individual tables and databases within a cluster (streaming replication, by contrast, replicates the whole cluster). Compared with the previous move there is one extra piece, streaming replication, but that is not a problem either: data replicated via Londiste to the 9.2 master is immediately passed on to the freshly set-up 9.2 slave via streaming replication. The scheme works out nicely: while catching up with the 9.2 master, we transparently populate the 9.2 slave as well. So, the setup and the plan:

1. Admin part:
- bring up master and slave 9.2; slave 9.2 is started on port 6543, since the standard port is already taken;
- set up streaming replication between them (see the config sketch after this list);
- install Skytools on the freshly configured master 9.2;
- configure Londiste: master 9.0 becomes the provider, master 9.2 the subscriber;
- start londiste and pgqd on master 9.2, then check that the whole chain works using the tools built into londiste (see the command sketch after this list);
- on the provider side, add all tables and sequences to replication (note that only tables with primary keys can be replicated; if there are tables without keys, you either have to create keys or transfer those tables by hand; we had a number of schemas where it was cheaper to transfer them by hand than to add keys);
- make a list of the schemas and tables that will have to be transferred by hand;
- on the subscriber, start replication of a test table and make sure that data from the 9.0 provider reaches the 9.2 subscriber and then, via streaming replication, the 9.2 slave.
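For reference, a rough sketch of what the streaming replication settings for the new 9.2 pair look like; the host names, the replication user and the paths here are illustrative rather than our real ones, and the exact values depend on the hardware:

# master.pg-9.2: postgresql.conf
wal_level = hot_standby
max_wal_senders = 3
wal_keep_segments = 256

# master.pg-9.2: pg_hba.conf, let the standby connect for replication
host  replication  replicator  10.0.0.3/32  md5

# slave.pg-9.2: take a base backup from the new master
pg_basebackup -h master.pg-9.2 -U replicator -D /var/lib/pgsql/9.2-slave/data -P --xlog-method=stream

# slave.pg-9.2: postgresql.conf (the standard port is occupied by 9.0)
port = 6543
hot_standby = on

# slave.pg-9.2: recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=master.pg-9.2 port=5432 user=replicator'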
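And an outline of the Londiste side of things, assuming Skytools 3 is already installed and the ini files describe the queue and the connection strings; all node names and paths below are made up, so treat this as a sketch rather than copy-paste material:

# on master.pg-9.0: register it as the root (provider) node of the queue
londiste3 /etc/skytools/provider.ini create-root node_pg90 "dbname=appdb host=master.pg-9.0"

# on master.pg-9.2: register it as a leaf (subscriber) fed from the provider
londiste3 /etc/skytools/subscriber.ini create-leaf node_pg92 "dbname=appdb host=master.pg-9.2" --provider="dbname=appdb host=master.pg-9.0"

# start the ticker and the replication worker (in our case both live on master.pg-9.2)
pgqd -d /etc/skytools/pgqd.ini
londiste3 -d /etc/skytools/subscriber.ini worker

# check that the nodes see each other and the lag looks sane
londiste3 /etc/skytools/subscriber.ini status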
So, from the technical side everything is ready; it remains to plan the replication rollout and the moment of the switch. Saturday was chosen as the switch day: if something goes wrong, we still have Sunday. We split the work into two stages, preparation and switching. How will the switch itself be done? We introduced two new DNS names for the new 9.2 pair, db-master and db-slave; at the right moment we will put these names into the backend configs and restart the applications.
Some of the preparatory activities have already been described above, but for the sake of completeness I will still list them briefly.
Until Friday:
- bring up the new pg-9.2 cluster;
- configure londiste between pg-9.0 and pg-9.2;
- bring up the new slave.pg-9.2 on the neighboring port and set up streaming replication from master.pg-9.2;
- prepare fresh pgFouine reports from master 9.0 and slave 9.0 for a typical weekday; the reports will be needed to compare the plans of the top queries;
- add to the londiste provider all schemas that do not require creating primary keys (see the sketch after this list);
- check that all backends can connect to the new PostgreSQL instances;
- double-check the configs of the new databases: connection limits, autovacuum settings;
- set up monitoring for the new databases (we use Zabbix together with home-grown bash scripts pulling the pg_stat* tables);
- create the DNS names db-master and db-slave for the new databases;
- warn the editorial staff about the Saturday work (just a heads-up to the higher-ups so that they are prepared and do not ask questions if anything happens).
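As a sketch of that step: a query along these lines lists the tables Londiste cannot replicate (no primary key), and everything else gets added on the provider; the database, schema and ini names are, again, illustrative:

# find tables without a primary key (candidates for manual transfer)
psql -h master.pg-9.0 appdb <<'SQL'
SELECT n.nspname || '.' || c.relname AS table_without_pk
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'pgq', 'londiste')
  AND NOT EXISTS (SELECT 1 FROM pg_constraint p
                  WHERE p.conrelid = c.oid AND p.contype = 'p');
SQL

# everything that does have a primary key goes into replication on the provider side
londiste3 /etc/skytools/provider.ini add-table schema1.accounts
londiste3 /etc/skytools/provider.ini add-seq schema1.accounts_id_seq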
Friday:
- disable the nightly import (this is a heavy data import whose mechanism is guaranteed to break londiste replication; it is an internal detail of our project, but I mention it because any project may have a similar component, so take into account everything that touches the target of the operation, the database);
- start the data transfer through londiste (when a table is added on the subscriber, replication starts with an initial COPY, after which a consistent state of the table is fixed and the table is considered replicated; see the sketch after this list);
- prepare the list of schemas for manual transfer.
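What the transfer looks like on the subscriber side, with the same illustrative names as above:

# adding a table on the subscriber triggers the initial COPY for it
londiste3 /etc/skytools/subscriber.ini add-table schema1.accounts
londiste3 /etc/skytools/subscriber.ini add-seq schema1.accounts_id_seq

# watch the catch-up lag while the copy runs
londiste3 /etc/skytools/subscriber.ini status

# after the initial copy is done, row counts and checksums can be verified
londiste3 /etc/skytools/subscriber.ini compare schema1.accounts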
Saturday, the day of the switch:
- check the top 10 queries on the 9.2 master and slave (maybe the whole thing is simply not worth it?);
- prepare the commands for the manual schema transfer (so that we can simply paste them into the console and hit Enter at the right moment);
11:00-12:00, suspending editing:
- stop the cron jobs and the background-task daemon and wait for active tasks to finish;
- disable editing (from this point on the application must not write to the database; this way we avoid writes to the database from the site's users and the risk of getting inconsistent data while the applications on the backends are restarted);
- dump the remaining schemas into the new database;
12:00-12:30, the switch:
- tear down the londiste replication (remove the tables, sequences and nodes, stop londiste and pgqd);
- update the configs on the backends;
- restart the applications on the backends (nginx + Passenger);
- update the Sphinx config and restart it;
That's it: from this moment replication via londiste is no longer consistent, since all external writes (the source of which is the site's clients) now go to the 9.2 cluster;
- fix the configuration of the background-task daemon and start it; start cron;
- re-enable editing;
- open all the monitoring and look for possible problems.
After switching:
- re-enable the nightly import;
- check the cron logs, the background-task daemon logs, and the replication lag.
After the move:
- move db-slave to the standard port; to do this:
- switch the backends that work with the slave over to the master;
- shut down pg-9.0;
- reconfigure the new pg-9.2 to use the full amount of memory (remember that there were two PostgreSQL instances on this host, so the memory had to be split between them; see the snippet after this list);
- start db-slave on port 5432, check connectivity from the backends and from Sphinx;
- check replication integrity and lag;
- switch the backends back to using the slave.
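The relevant part of that reconfiguration, as a sketch (the memory values are made up and depend on how much RAM the box has):

# former slave.pg-9.2 postgresql.conf: back to the standard port,
# with memory settings raised now that pg-9.0 no longer shares the host
port = 5432
shared_buffers = 8GB
effective_cache_size = 48GB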
Rollback plan, in case something goes wrong:
- disable editing;
- stop the background-task daemon and the cron jobs, wait for active tasks to finish;
- revert the database server names in the configs to the original ones, restart the backends, start the background-task daemon;
- roll back the Sphinx config;
- re-enable editing.
That is the whole plan. During the event, of course, not everything went according to it, but fortunately there was no need to resort to the rollback plan.
As for what went wrong, there are only a couple of points.
The first concerns a recently introduced service and the mechanism of manual schema transfer (which is best avoided). A few words about the service: a service built on top of pgq broke, because it was not entirely clear how to replicate the pgq schema (pgq being itself part of the replication machinery). Manual transfer did not fix the situation either, so the schema had to be reinitialized and the service restarted (not critical, but still a flaw).
As for the schema transfer: practice has shown that it does not always go the way you would like. Since the schema of the entire database is created at an early stage of the replication setup, later on you end up transferring either a schema on top of already existing objects or individual pieces of data, and during the transfer you can run into errors like:
ERROR: insert or update on table violates foreign key constraint
DETAIL: Key is not present in table.
Hence the conclusion that it is better to transfer schemas like this: rename the existing (empty) schema in the destination database, transfer the entire schema from the source, then drop the old renamed schema in the destination database. Whether the schemas match can be checked from bash: run the same command on both hosts and compare the output (use diff).
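In practice that boils down to something like the following (the schema, database and host names are illustrative):

# on the destination: move the empty schema out of the way
psql -h db-master appdb -c 'ALTER SCHEMA stats RENAME TO stats_old;'

# transfer the whole schema (definitions and data) from the source
pg_dump -h master.pg-9.0 -n stats appdb | psql -h db-master appdb

# drop the renamed leftover
psql -h db-master appdb -c 'DROP SCHEMA stats_old CASCADE;'

# sanity check: run on both hosts, then diff the two files
pg_dump --schema-only -n stats appdb | grep -E '^(CREATE|ALTER)' | sort > /tmp/stats_schema.txt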
Finally, I would like to note that you should check, several times over, every place that can write to the database and rule out any writes during the switch, when some services/backends have already moved to the new database while others have not yet. Taking the idea further, in theory you could even flip the volume to read-only for the duration of the switch (mount / dmsetup / blockdev).
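Just to illustrate that idea (we did not actually do this, and doing it under a running PostgreSQL would need care):

# filesystem level
mount -o remount,ro /var/lib/pgsql
# or block-device level
blockdev --setro /dev/mapper/vg0-pgdata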
And now, a few graphs.
1. NewRelic. Backend switching process

2. Zabbix. Daily server work with PG 9.0 (Monday September 10)

3. Zabbix. Daily server work with PG 9.0 (Monday September 10)

4. Zabbix. Daily server work with PG 9.2 + FlashCache (Monday September 17)

5. Zabbix. Daily server work with PG 9.2 + FlashCache (Monday September 17)

The biggest evil on the Zabbix charts is the black line showing iowait. As you can see, using flashcache significantly reduced the load on the hard drives.
For those interested in the technical details:
how streaming replication is configured in PostgreSQL, see here;
how table-level replication between PostgreSQL clusters is set up with Skytools 3, see here.
This is the story of one Saturday of work. Thanks for your attention!