
Docker container with Postgres data for integration testing and easy extension

A lot has been written lately about using Docker and docker-compose; if you have not dug into the topic yet, I can recommend a recent article on Habré. It really is very convenient, especially in combination with Ansible, and I use it everywhere: from development to automatic integration testing on CI. There have also been articles about its use in testing. It is great and comfortable. However, for local development, for troubleshooting data "as in production", or for performance testing on data volumes close to production, I would like to have an image with a database "like in production" at hand!


Accordingly, it would be desirable for every developer starting work on the project to be able to launch it with a single command, for example:


 ./gradlew dockerRun 

and have the application come up at once together with all the necessary associated containers. And, most importantly, there would already be data for most development and bug-fixing cases, standard users and most of the working services, so that you could start working immediately without wasting time on exporting and importing images or demo data!


And as a nice bonus, isn't it great to have a database of several gigabytes and the ability to roll it back to its original state (or that of any other commit) within a couple of seconds?
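
The rollback itself is nothing more than throwing the container away and starting a fresh one from the same image. A minimal sketch, assuming the data image is tagged registry.example.com/somedb-data (all names here are hypothetical):

 # Roll the database back to the state baked into the image (container/image names are examples)
 docker rm -f somedb
 docker run -d --name somedb -p 5432:5432 registry.example.com/somedb-data:latest
 # A few seconds later the database is serving the committed data set again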


Of course, we'll talk about writing a Dockerfile for such an image with data, and some of the pitfalls of this process.


That is what we will focus on here. In our application we actively use Postgres, so the story and examples will be about a container with it, but that applies only to the examples: the essence of the approach applies to any other relational or fashionable NoSQL database.


What we want


First, let's define the problem in more detail: we are preparing an image with data for everyone who works with our application, whether for local development and bug fixing, automated integration tests on CI, or performance testing on near-production data volumes.



Getting started


I will not start with what a Dockerfile is; I hope you are already familiar with that. Those who want to get an idea can refer to an introductory article or the official documentation.


It is worth noting that the official Docker image of Postgres already has several extension points: the /docker-entrypoint-initdb.d/ directory, whose *.sql and *.sh scripts are executed on the first start of a container with an empty data directory, and environment variables such as POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB and PGDATA.
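
For reference, a minimal sketch of how that standard extension point is normally used (the directory and the SQL file here are just examples); scripts placed in /docker-entrypoint-initdb.d/ run once, when a container starts with an empty data directory:

 mkdir -p ./initdb
 echo "CREATE ROLE tester LOGIN PASSWORD 'tester';" > ./initdb/10-create-test-user.sql
 docker run -d --name pg \
     -e POSTGRES_DB=somedb \
     -v "$PWD/initdb":/docker-entrypoint-initdb.d \
     postgres:9.6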



However, for our purposes this is not enough: those scripts are executed when a container starts, not when the image is built, the data ends up in a volume rather than in the image itself, and every developer would have to wait for the full import on the first start.



I will show a prototype of the file right away (only some insignificant parts are cut out to make it shorter, for example the installation of the pg_hint_plan extension, which on Debian has to be repacked from an RPM because it is not available as a Deb in the official repositories):


Dockerfile
 FROM postgres:9.6
 MAINTAINER Pavel Alexeev

 # Do NOT use /var/lib/postgresql/data/ because it's declared as a volume in the base image and can't be undeclared, but we want to persist data in the image
 ENV PGDATA /var/lib/pgsql/data/
 ENV pgsql 'psql -U postgres -nxq -v ON_ERROR_STOP=on --dbname somedb'
 ENV DB_DUMP_URL 'ftp://user:password@ftp.somehost.com/desired_db_backup/somedb_dump-2017-02-21-16_55_01.sql.gz'

 COPY docker-entrypoint-initdb.d/* /docker-entrypoint-initdb.d/
 COPY init.sql/* /init.sql/
 # Later in RUN we hack config to include conf.d parts.
 COPY postgres.conf.d/* /etc/postgres/conf.d/

 # Unfortunately Debian /bin/sh is the dash shell instead of bash (https://wiki.ubuntu.com/DashAsBinSh) and some handy options like pipefail are unavailable
 # Separate RUN so the next one will be in bash instead of dash. Change /bin/sh symlink as it is hardcoded https://github.com/docker/docker/issues/8100
 RUN ln -sb /bin/bash /bin/sh

 RUN set -euo pipefail \
     && echo '1) Install required packages' `# https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#apt-get` \
     && apt-get update \
     && apt-get install -y \
         curl \
         postgresql-plperl-9.6 \
     && echo '3) Run postgres DB internally for init cluster:' `# Example how to run instance of service: http://stackoverflow.com/questions/25920029/setting-up-mysql-and-importing-dump-within-dockerfile` \
     && bash -c '/docker-entrypoint.sh postgres --autovacuum=off &' \
     && sleep 10 \
     && echo '4.1) Configure postgres: use conf.d directory:' \
     && sed -i "s@#include_dir = 'conf.d'@include_dir = '/etc/postgres/conf.d/'@" "$PGDATA/postgresql.conf" \
     && echo '4.2) Configure postgres: Do NOT chown and chmod each time on start PGDATA directory (speedup on start especially on Windows):' \
     && sed -i 's@chmod 700 "$PGDATA"@#chmod 700 "$PGDATA"@g;s@chown -R postgres "$PGDATA"@#chown -R postgres "$PGDATA"@g' /docker-entrypoint.sh \
     && echo '4.3) ReRun postgres DB for work in new configuration:' \
     && gosu postgres pg_ctl -D "$PGDATA" -m fast -w stop \
     && sleep 10 \
     && bash -c '/docker-entrypoint.sh postgres --autovacuum=off --max_wal_size=3GB &' \
     && sleep 10 \
     && echo '5) Populate DB data: Restore DB backup:' \
     && time curl "$DB_DUMP_URL" \
         | gzip --decompress \
         | grep -Pv '^((DROP|CREATE|ALTER) DATABASE|\\connect)' \
         | $pgsql \
     && echo '6) Execute build-time sql scripts:' \
     && for f in /init.sql/*; do echo "Process [$f]"; $pgsql -f "$f"; rm -f "$f"; done \
     && echo '7) Update DB to current migrations state:' \
     && time java -jar target/db-updater-*.jar -f flyway.url=jdbc:postgresql://localhost:5432/somedb -f flyway.user=postgres -f flyway.password=postgres \
     && echo '8) Vacuum full and analyze (no reindex need then):' \
     && time vacuumdb -U postgres --full --all --analyze --freeze \
     && echo '9) Stop postgres:' \
     && gosu postgres pg_ctl -D "$PGDATA" -m fast -w stop \
     && sleep 10 \
     && echo '10) Cleanup pg_xlog required to do not include it in image!:' `# Command inspired by http://www.hivelogik.com/blog/?p=513` \
     && gosu postgres pg_resetxlog -o $( LANG=C pg_controldata $PGDATA | grep -oP '(?<=NextOID:\s{10})\d+' ) -x $( LANG=C pg_controldata $PGDATA | grep -oP '(?<=NextXID:\s{10}0[/:])\d+' ) -f $PGDATA \
     && echo '11(pair to 1)) Apt clean:' \
     && apt-get autoremove -y \
         curl \
     && rm -rf /var/lib/apt/lists/*

As you can see, I tried to put comments directly into the file, and perhaps they are already exhaustive, but let's nevertheless dwell on a few points in more detail.


What is worth paying attention to


  1. We override ENV PGDATA /var/lib/pgsql/data/ . This is the key point: since we want the data populated during the build to be included in the image , we must not put it in the standard location, which is declared as a volume in the base image.
  2. The variable DB_DUMP_URL is defined simply for ease of later editing. If desired, it can be passed in from the outside at build time (one way to do this is sketched after this list).
  3. Next, we start Postgres right during the build: bash -c '/docker-entrypoint.sh postgres --autovacuum=off &' , in order to perform a few simple configuration steps:
    • Using sed , we mainly enable include_dir in the main postgresql.conf . We need this to keep such manipulations of the config to a minimum (otherwise they become very hard to maintain), while still providing unlimited extensibility of the configuration. Note that a little higher up we use COPY postgres.conf.d/* /etc/postgres/conf.d/ to put in the pieces of config specific to our build.
    • I proposed this mechanism to the community as an issue for inclusion in the base image, and although I had already been asked how I did it (which suggests it might be useful to someone and was part of the reason for writing this article), the request has been closed for now; I have not lost hope that it will be reopened.
    • I also remove (comment out) the chown and chmod instructions from the entrypoint script: since the database is initialized at build time, the files in the image already have the correct owner and permissions, and it was found empirically that on Docker for Windows this operation can for some reason take a very long time, up to tens of minutes.
    • Please also note that we must first start Postgres and only then try to configure it! Otherwise we would get an error at startup saying that the directory for cluster initialization is not empty.
    • Next, I restart Postgres so that it re-reads the configs we put in place and enabled. Strictly speaking, this step is not mandatory at all. However, by default Postgres has very conservative memory settings such as shared_buffers = 128MB , and working with any significant amount of data drags on for hours.
  4. The next step should be clear: we just restore the dump. But after it, the /init.sql/* construct applies all SQL scripts from that directory during image creation (as opposed to the standard extension scripts, which run at container start). This is where we do the necessary data obfuscation, sampling, cleanup, adding of test users, and so on (a hypothetical example is sketched after this list).
    • Executing all scripts from this directory means that next time you do not have to touch the build procedure at all: just drop another couple of files into the directory to do something else with your data!
  5. To reduce the image size and also make it a bit more efficient, we perform a full vacuum with analyze.
    • It is worth noting that this is exactly what allows us to start Postgres with autovacuum turned off ( --autovacuum=off ) to speed up the import.
    • Also, to reduce the image size, I then use pg_resetxlog so that the accumulated WAL is not included in the image. And when starting Postgres I use --max_wal_size=3GB to raise the WAL size limit and avoid needless rotation and checkpoints.
    • Cleaning the APT cache is standard, following the best-practice guidelines.
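
On point 2: in the Dockerfile above, DB_DUMP_URL is simply hard-coded in ENV. One way to pass it in from the outside (a sketch, assuming you additionally declare ARG DB_DUMP_URL and forward it into ENV in the Dockerfile) is a build argument:

 # Sketch: with `ARG DB_DUMP_URL` (and `ENV DB_DUMP_URL=$DB_DUMP_URL`) added to the Dockerfile,
 # the dump location can be overridden per build:
 docker build \
     --build-arg DB_DUMP_URL='ftp://user:password@ftp.somehost.com/desired_db_backup/other_dump.sql.gz' \
     -t somedb-data .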
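
On point 4: what exactly goes into /init.sql/ is project-specific. As a purely hypothetical illustration (table and column names are made up), an obfuscation script could be dropped in like this:

 # Hypothetical build-time script, e.g. init.sql/20-obfuscate.sql (table/columns are examples):
 cat > init.sql/20-obfuscate.sql <<'SQL'
 -- Obfuscate personal data so the image can be shared freely within the team
 UPDATE users SET email = 'user_' || id || '@example.com', phone = NULL;
 -- Add a standard test user that everyone on the project knows about
 INSERT INTO users (login, password_hash) VALUES ('test', md5('test'));
 SQL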

All that remains is to tag the finished image and push it to a repository. Most often, of course, it will be a private repository, unless you are working with some kind of public data sample.
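
In shell terms it boils down to something like this (registry and tag names are illustrative):

 docker build -t somedb-data .
 docker tag somedb-data registry.example.com/somedb-data:2017-02-21
 docker push registry.example.com/somedb-data:2017-02-21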


I will be very glad if this helps someone make their process of preparing test images with data even a little easier.



Source: https://habr.com/ru/post/328226/

