
PostgreSQL: how to keep cats, or the story of one migration

The story comes from a real project. But since the real project is too boring (and under NDA), this article uses a simplified example.

Once upon a time there was a project. And it had a database. And in the database there was a table for storing, let's say, cats. Here it is:
    CREATE TABLE cats (
        id serial,
        cname varchar(20),
        ctype varchar(20),
        primary key(id)
    );

It's pretty simple: every cat has an id, a name, and a type.

Of course, we had business requirements for cats and their types. For example, we knew for sure that we had the types big furry, neko and sudden danger. We assumed that the types long tail and sleeper-eater might appear. But we expected the requirements to change, and it was not yet known which types would be needed in the end. That is why the varchar(20) data type was used.

After a long time and several releases, we finally settled on a definitive list of cat types. By then there were already tens of millions of cats in the table, with very different types, many of them obsolete. It was time to restore order and bring all the values in the table in line with the new requirements.
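A quick way to see the whole zoo of accumulated types is a simple aggregate (a sketch against the table above):

    SELECT ctype, count(*)
    FROM cats
    GROUP BY ctype
    ORDER BY count(*) DESC;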


So let's apply the engineering approach:

We build a theory



Create an ENUM data type listing the valid values, then run the migration:
    CREATE TYPE cat_type AS ENUM ('big furry', 'small red', 'long tail',
        'crafty hunter', 'sudden danger', 'sleeper-eater');
    ALTER TABLE cats ALTER COLUMN ctype TYPE cat_type USING ctype::cat_type;

At this point we do not yet know that the migration will not work in this form: we forgot about the invalid values already sitting in the table. We will find out later, when we try to apply it =)
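Had we checked first, a query like this one (a sketch; the IN-list simply repeats the enum values) would have shown the offenders:

    SELECT DISTINCT ctype
    FROM cats
    WHERE ctype IS NOT NULL
      AND ctype NOT IN ('big furry', 'small red', 'long tail',
                        'crafty hunter', 'sudden danger', 'sleeper-eater');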

The migration forbids creating cats with an invalid type. It also shrinks the table and the index on the ctype field. The table size is not that important, but a smaller index is good news. We have already dealt with indexes that did not fit in RAM, and such indexes are, to put it mildly, not very useful.

Let's estimate what memory gains can be expected.

Storing a varchar value takes 1-4 bytes per character (depending on the encoding) plus 1 or 4 bytes for the string length (for more, see www.postgresql.org/docs/current/static/datatype-character.html ). In our case it is 1 byte per character (utf8, Latin letters) and 1 byte for the length. The values are 9-13 characters long, so they occupy 10-14 bytes; we assume 12 bytes per value on average.
    > select pg_column_size('big furry');
    10
    > select pg_column_size('sleeper-eater');
    14

It is known that enum values occupy 4 bytes regardless of their length.
    > select pg_column_size('big furry'::cat_type);
    4
    > select pg_column_size('sleeper-eater'::cat_type);
    4

One row in the table is roughly: a 27-byte row header, 8 bytes for id (4 bytes of data plus alignment), about 21 bytes for cname and about 12 bytes for ctype.

Total: 68 bytes.

After the migration a row will take 27 + 8 + 21 + 4 = 60 bytes. The difference is small, but over 50 million rows the total gain should be noticeable.
We have two indexes, on id and on ctype. The index on id will not change. The index on ctype should shrink. We do not know exactly how index memory is laid out, but we expect that if a value has become 3 times smaller, the index will shrink 2-3 times.

Experiment number 1


For the experiment, create two tables:
    CREATE TABLE cats1 (
        id serial,
        name varchar(20),
        type varchar(20),
        primary key(id)
    );

    CREATE TYPE cat_type AS ENUM ('big furry', 'small red', 'long tail',
        'crafty hunter', 'sudden danger', 'sleeper eater');
    CREATE TABLE cats2 (
        id serial,
        name varchar(20),
        type cat_type,
        primary key(id)
    );

Fill them with test data:
    CREATE SEQUENCE ss;
    -- the sequence cycles us through the array of type values
    INSERT INTO cats1 (name, type)
    SELECT substring(md5(random()::text), 0, 20),
           (ARRAY['big furry', 'small red', 'long tail', 'crafty hunter',
                  'sudden danger', 'sleeper eater'])[nextval('ss') % 5 + 1]
    FROM generate_series(1, 500000);
    INSERT INTO cats2 (name, type)
    SELECT substring(md5(random()::text), 0, 20),
           ((ARRAY['big furry', 'small red', 'long tail', 'crafty hunter',
                   'sudden danger', 'sleeper eater'])[nextval('ss') % 5 + 1])::cat_type
    FROM generate_series(1, 500000);

Yes, our cats have rather strange names, but they will do for the experiment.

Create indices:
    CREATE INDEX cats1_index ON cats1(type);
    CREATE INDEX cats2_index ON cats2(type);

And let's see how much memory they occupied:
    SELECT pg_relation_size('cats1') AS table_size,
           pg_indexes_size('cats1') AS indexes_size;
    SELECT pg_relation_size('cats2') AS table_size,
           pg_indexes_size('cats2') AS indexes_size;

Theoretically, the rows of the first table should occupy 68 * 500,000 = 34,000,000 bytes, and those of the second 60 * 500,000 = 30,000,000 bytes. In practice we see 34,136,064 and 30,121,984 bytes. The figures are close.
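As an aside, pg_size_pretty can make such numbers easier to read (a small convenience, not used for the measurements here):

    SELECT pg_size_pretty(pg_relation_size('cats1')) AS table_size,
           pg_size_pretty(pg_indexes_size('cats1')) AS indexes_size;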

Of course, a table is more complex than 500,000 rows laid out evenly one after another. Memory is allocated in 8 KB pages, pages have their own headers and other meta-information, and the values within rows are aligned (for more, see www.postgresql.org/docs/9.5/static/storage-page-layout.html ).

But what about the indexes?
The pg_indexes_size function shows the total memory consumption of all indexes associated with a table, not of each one separately. That is not a problem: we can call it before and after creating the index on type. This shows that the index on id takes 11,255,808 bytes, while the index on type takes 15,794,176 bytes for the first table and 11,255,808 bytes for the second.
Noticeably less, but not the 2-3 times we expected. Why is that?
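Alternatively, pg_relation_size works on an individual index as well, so each one can be measured directly (same numbers, assuming the index names from above):

    > SELECT pg_relation_size('cats1_index');
    15794176
    > SELECT pg_relation_size('cats2_index');
    11255808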

Experiment number 2


Create several simple tables containing only one column:
    CREATE TABLE t_bool (f bool);
    CREATE TABLE t_sint (f smallint);
    CREATE TABLE t_int (f int);
    CREATE TABLE t_bint (f bigint);
    CREATE TABLE t_c7 (f char(7));
    CREATE TABLE t_c8 (f char(8));
    CREATE TABLE t_c9 (f char(9));
    CREATE TABLE t_c15 (f char(15));
    CREATE TABLE t_c16 (f char(16));
    CREATE TABLE t_c20 (f char(20));

Fill them with data:
    INSERT INTO t_bool (f) SELECT true FROM generate_series(1, 500000);
    INSERT INTO t_sint (f) SELECT 1 FROM generate_series(1, 500000);
    ...
    INSERT INTO t_c7 (f) SELECT 'abcdefg' FROM generate_series(1, 500000);
    ...
    INSERT INTO t_c20 (f) SELECT 'abcd efgh abcd efgh ' FROM generate_series(1, 500000);

Create indices:
    CREATE INDEX ON t_bool(f);
    CREATE INDEX ON t_sint(f);
    ...
    CREATE INDEX ON t_c20(f);

And let's see how much space the tables and indexes take:

    Data type   Bytes per value   Table size   Index size
    bool        1                 18,128,896   11,255,808
    smallint    2                 18,128,896   11,255,808
    int         4                 18,128,896   11,255,808
    bigint      8                 18,128,896   11,255,808
    char(7)     8                 18,128,896   11,255,808
    char(8)     9                 22,142,976   15,794,176
    char(9)     10                22,142,976   15,794,176
    char(15)    16                22,142,976   15,794,176
    char(16)    17                26,091,520   20,332,544
    char(20)    21                26,091,520   20,332,544

We see that table and index sizes are the same within the ranges of 1-8 bytes, 9-16 bytes and more than 16 bytes per value: row data is padded to 8-byte boundaries, so sizes grow in 8-byte steps.
It seems that small optimizations, such as replacing int with smallint, bring little benefit, except perhaps when a table has many columns that can be optimized this way.
Replacing varchar with enum gives a win when the varchar values are on average longer than 8 bytes (longer than 7 characters).

We develop a practical solution


Now we know what to expect in practice and are ready to implement our migration.
Back to our cats:
    CREATE TABLE cats (
        id serial,
        cname varchar(20),
        ctype varchar(20),
        primary key(id)
    );
    CREATE INDEX c1 ON cats(ctype);

We fill the table with data so that it contains invalid and NULL values.
    CREATE SEQUENCE ss;
    INSERT INTO cats (cname, ctype)
    SELECT substring(md5(random()::text), 0, 20),
           (ARRAY['big furry', 'small red', 'long tail', 'crafty hunter',
                  'sudden danger', 'sleeper-eater', 'black eye', 'sharp claw',
                  'neko', NULL])[nextval('ss') % 10 + 1]
    FROM generate_series(1, 500000);

Trying to migrate:
    CREATE TYPE cat_type AS ENUM ('big furry', 'small red', 'long tail',
        'crafty hunter', 'sudden danger', 'sleeper-eater');
    ALTER TABLE cats ALTER COLUMN ctype TYPE cat_type USING ctype::cat_type;

And we find out that our naive ALTER TABLE does not work:
    ERROR: invalid input value for enum cat_type: "black eye"

So we need a function that converts the type:
    CREATE OR REPLACE FUNCTION cast_to_cat_type(ctype varchar) RETURNS cat_type AS $$
    DECLARE
        res cat_type;
    BEGIN
        CASE ctype
            WHEN 'big furry' THEN res := 'big furry';
            WHEN 'small red' THEN res := 'small red';
            WHEN 'long tail' THEN res := 'long tail';
            WHEN 'crafty hunter' THEN res := 'crafty hunter';
            WHEN 'sudden danger' THEN res := 'sudden danger';
            WHEN 'sleeper-eater' THEN res := 'sleeper-eater';
            ELSE res := NULL;
        END CASE;
        RETURN res;
    END
    $$ LANGUAGE plpgsql;

And try again:
    ALTER TABLE cats ALTER COLUMN ctype TYPE cat_type USING cast_to_cat_type(ctype);

This time it worked. But showing such a function to anyone would be embarrassing. Oops, I think I just revealed my secret fondness for copy-paste =) Shh, let's pretend I did not write that and you did not see it, ok? Let me write it differently:
    CREATE OR REPLACE FUNCTION cast_to_cat_type(ctype varchar) RETURNS cat_type AS $$
    DECLARE
        res cat_type;
    BEGIN
        BEGIN
            res := ctype::cat_type;
        EXCEPTION WHEN others THEN
            res := NULL;
        END;
        RETURN res;
    END
    $$ LANGUAGE plpgsql;
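A quick sanity check (a sketch; psql output abbreviated) that the function maps known values through and turns everything else into NULL:

    > SELECT cast_to_cat_type('sudden danger');
    sudden danger
    > SELECT cast_to_cat_type('black eye');
    NULL

One thing worth knowing about this version: a plpgsql EXCEPTION block starts a subtransaction on every call, which makes the migration somewhat slower. For a one-off migration that is usually an acceptable price.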

This one can safely be sent to code review.

Evaluate the result


What did we get in the end? Table and index sizes before the migration: 33,038,336 and 26,140,672 bytes. After the migration: 28,581,888 and 22,511,616 bytes. Considering that the real table holds not 500 thousand rows but 50 million, the gain will be substantial.
But under some conditions we can win even more. Suppose the business is not interested in cats of an invalid or unknown type and queries always exclude them. Then we can exclude them from the index as well.

Use a partial index:
    CREATE INDEX CONCURRENTLY c2 ON cats(ctype) WHERE ctype IS NOT NULL;
    DROP INDEX c1;

Now the index size is 18,014,208 bytes. Here, of course, everything depends on how many cats turned out to be invalid.
A curious question is what to do with the invalid cats next, but that is a question for the business, not for the developer.

It remains to make sure that valid values can be inserted into the table and invalid ones cannot:
    > INSERT INTO cats (cname, ctype) VALUES ('Murzik', 'crafty hunter'), ('Vasjka', 'sleeper-eater');
    INSERT 0 2
    > INSERT INTO cats (cname, ctype) VALUES ('Sharik', 'big dog');
    ERROR: invalid input value for enum cat_type: "big dog"

Everything works as it should.
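And if the business one day approves a new type, the enum can be extended in place without rewriting the table (the value name here is made up for illustration):

    ALTER TYPE cat_type ADD VALUE 'window watcher' AFTER 'sudden danger';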

We have a couple more interesting stories about migrations and about how hard life can be with large tables. We'll tell them next time.

Yura Zhloba,
Web developer.

Source: https://habr.com/ru/post/306728/
