
In previous articles we looked at the PostgreSQL development process, as well as examples of real patches recently accepted into this RDBMS. Admittedly, the patches we considered were rather trivial: fixing typos, fixing minor defects found with static analysis, and so on.
Today we will look at more serious patches: ones that eliminate bottlenecks in the code, fix fairly serious bugs, perform relatively large refactorings, and so on. As before, the main goal of the article is not so much to highlight the changes accepted into PostgreSQL 9.6 as to show that developing open source projects, and PostgreSQL in particular, is interesting and not as difficult as it may seem.
If this topic interests you, welcome under the cut!
6. Speeding up ResourceOwner for a large number of resources
ResourceOwner is an object (to the extent that the word "object" applies to the procedural language C) that manages resources during the execution of SQL queries. A separate ResourceOwner is created for every transaction and subtransaction. A ResourceOwner has many paired methods like RememberLock / ForgetLock, RememberFile / ForgetFile and so on. In addition, ResourceOwners can be arranged into a hierarchy. When a transaction is rolled back for any reason (the user said ROLLBACK, an exception occurred, etc.), we simply release its ResourceOwner, and this release in turn frees all the resources tracked by that ResourceOwner and its children. Details can be found in the corresponding README file.
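As a rough illustration of the idea (not the actual PostgreSQL API: the type and function names below are invented for the example), releasing an owner first releases its children and then its own resources:

#include <stdio.h>
#include <stdlib.h>

/* A toy "resource owner": a node in a hierarchy that, when released,
 * first releases all of its children and then its own resources. */
typedef struct ToyOwner
{
    const char      *name;
    struct ToyOwner *firstchild;
    struct ToyOwner *nextsibling;
    /* the real ResourceOwner also tracks locks, files, buffers, ... */
} ToyOwner;

static void
release_owner(ToyOwner *owner)
{
    ToyOwner   *child = owner->firstchild;

    while (child != NULL)
    {
        ToyOwner   *next = child->nextsibling;

        release_owner(child);       /* children are released recursively */
        child = next;
    }
    printf("releasing resources owned by %s\n", owner->name);
    free(owner);
}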
In 9.5, ResourceOwner stored resources in arrays. The assumption was that resources are usually released in the reverse of the order in which they were allocated, so the Forget* methods searched the array from the end. In practice, however, this approach does not always work well: profiling showed that for the simplest SELECT queries against a table with a large number of partitions, PostgreSQL spends about 30% of the time in these Forget* methods.
The bottleneck was eliminated by replacing the arrays with hash tables. However, while the number of resources in a ResourceOwner is small, arrays are still used, as before:
typedef struct ResourceArray
{
    Datum      *itemsarr;
    Datum       invalidval;
    uint32      capacity;
    uint32      nitems;
    uint32      maxitems;
    uint32      lastidx;
} ResourceArray;
This patch also includes a refactoring of ResourceOwner. Previously, a separate array was kept for each resource type: files, heap tuples, and so on. All of these types are either pointers or integers, and therefore can be stored in a Datum (the local equivalent of uintptr_t). A new ResourceArray entity was introduced that can store resources of any kind, which removed a significant amount of duplicated code.
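To give a feel for the idea, here is a simplified sketch (the names are invented for the example; the real resowner.c code differs, in particular in how it grows the array and switches to a hash table):

#include <stdint.h>

typedef uintptr_t Datum;            /* same idea as PostgreSQL's Datum */

/* A toy version of the ResourceArray idea: any resource, whether it is a
 * pointer or an integer, is packed into a Datum and kept in one array. */
typedef struct ToyResourceArray
{
    Datum      *itemsarr;           /* buffer holding the values */
    uint32_t    capacity;           /* allocated length of itemsarr */
    uint32_t    nitems;             /* number of stored items */
} ToyResourceArray;

/* Remember a resource. */
static int
toy_remember(ToyResourceArray *ra, Datum value)
{
    if (ra->nitems >= ra->capacity)
        return 0;                   /* no room; the real code would grow */
    ra->itemsarr[ra->nitems++] = value;
    return 1;
}

/* Forget a resource by scanning from the end: the pre-9.6 behaviour that
 * becomes a bottleneck when a single owner tracks many resources. */
static int
toy_forget(ToyResourceArray *ra, Datum value)
{
    for (uint32_t i = ra->nitems; i > 0; i--)
    {
        if (ra->itemsarr[i - 1] == value)
        {
            ra->itemsarr[i - 1] = ra->itemsarr[--ra->nitems];
            return 1;
        }
    }
    return 0;                       /* not found */
}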
Commit: cc988fbb0bf60a83b628b5615e6bade5ae9ae6f4
Discussion: 20151204151504.5c7e4278@fujitsu

7. Freelist partitioning for a shared dynahash
Dynahash (see the file dynahash.c) is the local implementation of hash tables. Hash tables in PostgreSQL can behave quite differently depending on the flags they were created with. For example, they can live in the local memory of a process or in shared memory. In the latter case, the shared memory is mapped to the same virtual addresses in all PostgreSQL processes. Shared memory is allocated once, and its size cannot be changed while the RDBMS is running.
For these reasons, free memory in shared hash tables is tracked using a so-called freelist: a linked list of small free chunks of memory. Freed memory is added to the freelist; when memory needs to be allocated, it is taken from the freelist. Since a shared hash table is accessed by several processes at once, access to the freelist is synchronized with a spinlock. It turned out that under certain workloads this spinlock becomes a point of serious lock contention.
The patch that was eventually accepted solves the problem as follows: instead of a single freelist, 32 of them are used, each protected by its own spinlock.
Before:

struct HASHHDR
{
    slock_t         mutex;
    long            nentries;
    HASHELEMENT    *freeList;
After:

#define NUM_FREELISTS 32

typedef struct
{
    slock_t         mutex;
    long            nentries;
    HASHELEMENT    *freeList;
} FreeListData;

struct HASHHDR
{
    FreeListData    freeList[NUM_FREELISTS];
By default, memory is allocated from the freelist whose number is determined by the lower bits of the key's hash value:
#define FREELIST_IDX(hctl, hashcode) \
    (IS_PARTITIONED(hctl) ? hashcode % NUM_FREELISTS : 0)
However, if "our" freelist runs out of memory, it is "borrowed" from the other freelists.
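Roughly, the allocation path then looks like this (a simplified sketch: the real code in dynahash.c also handles entry accounting and can allocate new elements; the function and variable names here are illustrative):

static HASHELEMENT *
get_free_element(HASHHDR *hctl, uint32 hashcode)
{
    int         start = FREELIST_IDX(hctl, hashcode);
    int         i;

    /* start with "our" freelist, then try to borrow from the others */
    for (i = 0; i < NUM_FREELISTS; i++)
    {
        FreeListData *fl = &hctl->freeList[(start + i) % NUM_FREELISTS];
        HASHELEMENT *elem;

        SpinLockAcquire(&fl->mutex);
        elem = fl->freeList;
        if (elem != NULL)
        {
            fl->freeList = elem->link;      /* unlink the element we took */
            SpinLockRelease(&fl->mutex);
            return elem;
        }
        SpinLockRelease(&fl->mutex);
    }
    return NULL;                            /* all freelists are empty */
}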
Among other things, this patch is interesting because before it was accepted I had to write about 15 versions of it, essentially trying out every possible strategy for partitioning the freelists, their number, and other parameters, and picking the variant that showed the best performance. For example, instead of the 32 spinlocks used in the final implementation, one could use a single RW lock, taken in shared mode when allocating from "our" freelist and in exclusive mode when borrowing from the others. The spinlocks themselves can also be laid out in memory in different ways, aligned to the cache line size or not, and so on.
Commit: 44ca4022f3f9297bab5cbffdd97973dbba1879ed
Discussion: 20151211170001.78ded9d7@fujitsu

8. Support for multiple iterators in RB trees
While working on another feature, I noticed that the interface for iterating over red-black trees (currently used only in GIN indexes) in PostgreSQL looked like this:
void rb_begin_iterate(RBTree *rb, RBOrderControl ctrl);
RBNode *rb_iterate(RBTree *rb);
As you can see, this interface does not allow more than one iterator over the same tree, which is rather inconvenient. Moreover, the implementation was quite strange: for example, it kept the iteration state in the nodes of the tree.
After thinking a little, I rewrote all of this, and the interface became the following:
void rb_begin_iterate(RBTree *rb, RBOrderControl ctrl, RBTreeIterator *iter);
RBNode *rb_iterate(RBTreeIterator *iter);
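Now several independent traversals of the same tree can run side by side. A sketch of the calling pattern (tree creation is elided; I am assuming LeftRightWalk, the in-order traversal constant from rbtree.h):

RBTreeIterator iter1;
RBTreeIterator iter2;
RBNode        *n1;
RBNode        *n2;

rb_begin_iterate(tree, LeftRightWalk, &iter1);
rb_begin_iterate(tree, LeftRightWalk, &iter2);

n1 = rb_iterate(&iter1);    /* advance the first iterator ...        */
n2 = rb_iterate(&iter2);    /* ... without disturbing the second one */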
You can learn more about the various containers used in PostgreSQL from the article "Not a dull post about lists and search trees in the C language". You may also be interested in the GitHub repository I created while working on this task. It contains implementations of singly and doubly linked lists, red-black trees and hash tables in C. The library is thoroughly covered by tests and is distributed under the MIT / BSD license.
Commit: 9f85784cae4d057f307b83b0d33edede33434f04
Discussion: 20160727172645.3180b2e0@fujitsu

9. Fixing checksum verification in pg_filedump for tables with multiple segments
PostgreSQL stores table and index data in so-called pages; the default page size is 8 KB. Pages are stored in files on disk called segments; the default segment size is 1 GB. Splitting relations and indexes into segments allows PostgreSQL to work even on file systems that do not support files larger than 1 GB. Pages are also the unit used by the so-called buffer manager to cache frequently used data in memory, which significantly reduces the number of disk accesses.
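With these defaults, the mapping from a block (page) number to a segment file is simple arithmetic (BLCKSZ and RELSEG_SIZE are PostgreSQL's compile-time constants; the concrete block number below is just for illustration):

#define BLCKSZ      8192        /* page size in bytes (default)             */
#define RELSEG_SIZE 131072      /* pages per segment: 1 GB / 8 KB           */

/* Block 200000 of a relation lives in segment file "<relfilenode>.1": */
unsigned int segno  = 200000 / RELSEG_SIZE;     /* = 1                      */
unsigned int pageno = 200000 % RELSEG_SIZE;     /* = 68928, page within it  */
unsigned int offset = pageno * BLCKSZ;          /* byte offset in that file */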
The pg_filedump utility lets you do various useful things with segments and pages. For example, it can verify the checksums of all pages in a segment. Checksums are written to pages if the database cluster was created by running initdb with the -k flag:
-k, --data-checksums use data page checksums
Interestingly, the pg_checksum_page procedure, which calculates a page's checksum, depends not only on the page contents but also on the block number:
uint16 pg_checksum_page(char *page, BlockNumber blkno)
This makes it possible to verify not only that the page stores correct data, but also that it is stored at the correct offset.
Recently, the following bug was discovered in pg_filedump. Checksums were verified correctly for segment zero, but for the first, second and subsequent segments the checksums computed by pg_filedump did not match the ones computed by PostgreSQL itself. As it turned out, pg_filedump started counting block numbers from zero in every segment, whereas the correct approach is to take all preceding segments into account and use "absolute" block numbers rather than "relative" ones.
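The essence of the fix as a sketch (variable names are illustrative; RELSEG_SIZE is the number of pages per segment):

/* When verifying segment number segno, pass the absolute block number to
 * pg_checksum_page(), counting the pages of all preceding segments: */
BlockNumber blkno = (BlockNumber) segno * RELSEG_SIZE + block_in_segment;

if (pg_checksum_page(page_buffer, blkno) != stored_checksum)
    printf("checksum error in block %u of segment %u\n",
           block_in_segment, segno);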
Since the check needs to know the segment size and number, the same patch also added support for two previously missing pg_filedump flags:
-s  Force segment size to [segsize]
-n  Force segment number to [segnumber]
Commit: 052ed0112967dd1e9b0e2cbe54821c04475f1a3a
Discussion: (offlist only)

10. Checking the values returned by malloc(), realloc() and other procedures
I have saved for last a patch that was written not by me, but for which I acted as a reviewer; during the code review I suggested quite a few improvements to it.
Michael Paquier noticed that in some places PostgreSQL does not check the values returned by malloc(), realloc() and strdup(). While the patch was being worked on, the list was extended with calloc() and the procedures for working with shared memory.
As a result, wherever possible the calls were replaced with the corresponding safe PostgreSQL equivalents (pg_strdup, pg_malloc and others):
- steps = malloc(sizeof(Step *) * nsteps);
+ steps = pg_malloc(sizeof(Step *) * nsteps);
In other places, checks were simply added:
  new_environ = (char **) malloc((i + 1) * sizeof(char *));
+ if (!new_environ)
+ {
+     write_stderr("out of memory\n");
+     exit(1);
+ }
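Such a "safe" wrapper boils down to roughly the following (a simplified sketch, not the exact PostgreSQL code):

#include <stdio.h>
#include <stdlib.h>

/* Allocate memory, check the result, and bail out on failure, so the
 * caller never has to deal with a NULL return value. */
static void *
xmalloc(size_t size)
{
    void       *p = malloc(size);

    if (p == NULL)
    {
        fprintf(stderr, "out of memory\n");
        exit(1);
    }
    return p;
}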
See also Michael's own post on the subject: Postgres 10 highlight - ShmemAlloc and ShmemAllocNoError.
Commits: 052cc223, 6c03d981
Discussion: CAB7nPqRu07Ot6iht9i9KRfYLpDaF2ZuUv5y_+72uP23ZAGysRg@mail.gmail.com

To be continued...
Provided, of course, that such posts are of interest to anyone :) Perhaps I will also manage to persuade some of my colleagues to write about the patches they have been working on lately. After all, who can tell the story of a patch better than its developer?
As always, I look forward to your questions, and I will be happy to answer them in the comments. And in general, feel free to leave any comments and additions!
Continued:
Examples of real patches in PostgreSQL: part 3 of N