
PostgreSQL on multicore Power 8 servers



Abstract



With the help of IBM’s Moscow office, we tested the performance of recent versions of the PostgreSQL DBMS on Power 8 servers, studied how performance scales with the number of concurrent requests, found the performance bottlenecks, proposed new technical solutions, and achieved record-breaking performance.



Introduction



For a number of workloads, virtually unlimited scaling of transaction throughput can be achieved with distributed systems, in which the transaction flow is spread, in one way or another, across a large number of servers. Such scaling is commonly called “horizontal”. However, there is no universal distributed solution, and distribution comes at a price: the system architecture must be designed as distributed from the start. Distributed systems are less flexible than monolithic ones, they are harder to operate, and they require more highly qualified staff. Some tasks are easy to parallelize, others are not. So the demand for high-performance monolithic systems is real, and achieving the best possible performance on a single server was and remains an important task. This is often called “vertical scaling”.



The essence of the problems arising in the parallel processing of a large number of transactions is the same in monolithic and distributed systems: competition between transactions for access to the same resources. Simply put, individual processes work in parallel and independently until they queue up for some shared resource (which may be a hardware resource or a piece of information stored in the database) and begin to wait for each other.


To manage this, there are mechanisms for controlling access to resources: locks, as well as lock-free approaches that are applicable in some cases. Making these mechanisms more efficient, and making locks more fine-grained, reduces the overhead of simultaneous (concurrent) access.



Note that while in distributed systems the network is usually the bottleneck, in monolithic systems running close to peak performance, throughput is limited by the aforementioned concurrency-control mechanisms.



PostgreSQL Scaling Studies



In 2012, at the PgCon conference, Robert Haas of EnterpriseDB presented measurements of PostgreSQL performance (read-only queries per second) as a function of the number of concurrent client sessions (Fig. 1). The measurements were made on the amd64 platform using the pgbench tool shipped with PostgreSQL.




Fig.1. Graph from R. Haas's report at the PgCon conference, 2012: the 32-core case



Before analyzing this graph, let us make some important remarks. The PostgreSQL architecture is such that each client is served by a single single-threaded server process, and the disk is accessed through a buffer cache located in RAM shared by all such processes. Thus, on the one hand, a number of client sessions can work simultaneously and independently; on the other hand, the data itself, living in shared memory, is a shared resource accessed concurrently. The disk subsystem is also a shared resource, but it was not exercised in this study: all data was already in the buffers, so the measurements show the maximum achievable performance.



In this graph, the growth of transactions per second (TPS) with the number of clients is initially close to linear; this means that the cost of accessing shared resources is negligible, and each added client fully utilizes another processor core. However, as contention increases, the linear growth stops. When linear growth stops at a number of parallel sessions smaller than the number of processor cores, it means the CPU is being used inefficiently and the cost of scaling is high.



Fig. 1 shows that in the upgrade from version 9.1 to 9.2, thanks to improved locking mechanisms, performance grew steadily up to at least 32 concurrently working clients (the test machine had 32 cores, so performance could not keep increasing with more clients). A little later, R. Haas carried out similar measurements on 64 cores (Fig. 2). It can be seen that near-linear vertical scalability is preserved even with that number of cores.




Fig.2. Graph from R. Haas’s report at the PgCon conference, 2012: the 64-core case



These graphs, and the research behind them, inspired us to see how modern versions of PostgreSQL would behave on modern multiprocessor servers. IBM, which supplies such servers on the Power platform, took an interest in this (IBM ceded its x86 server business to Lenovo in 2014, as it had earlier done with PCs). Here is what came of it.



Our research



In February 2015, the best-known Russian PostgreSQL developers founded Postgres Professional, a company whose goal is to develop the PostgreSQL DBMS and provide a full range of related services. The creation of the company made it possible to scale up PostgreSQL development work in Russia severalfold, to conduct applied research, and to enter into partnerships with software and hardware manufacturers. From the very beginning, Postgres Professional showed interest in partnering with IBM and joined the IBM PartnerWorld partner network. In May 2015, Terri Virnig, IBM Vice President for Power Ecosystem and Strategy, visited Moscow, confirming IBM’s strong interest in PostgreSQL and in adapting it to achieve maximum performance on the Power platform.



As part of this partnership, in August 2015 an IBM E880 server with 2 nodes was set aside for PostgreSQL testing at the IBM Client Center in Poughkeepsie, NY. Each node contained 4 eight-core SMT (simultaneous multithreading) processors, each core supporting up to 8 hardware threads. This configuration thus provides up to 2 * 4 * 8 * 8 = 512 execution threads (in IBM terminology, “logical CPUs”, LCPU). It is important to note that the SMT implementation in IBM Power 8 differs from its counterpart in Intel processors (known as Hyper-Threading) and, unlike the latter, gives a noticeable, up to twofold, performance boost on PostgreSQL workloads. The number of threads (LCPU) can be controlled by setting 1, 2, 4, or 8 threads per CPU core. The test methodology was first tried out on less powerful servers in Moscow. The tests, like those of R. Haas, were conducted using the pgbench utility shipped with PostgreSQL.



The test script was a lookup query against a B-tree index that fit entirely in memory. The PostgreSQL server was launched in a large LPAR and, using the NUMA facilities built into the Linux kernel, was pinned to a given number of NUMA nodes. The goal of the test was to achieve as many transactions per second (TPS) as possible.



The first run showed that adding more cores did not increase performance and even reduced it.




Fig.3. Performance depending on the number of clients, for different numbers of LCPUs



To understand why performance fell as LCPUs were added, one has to picture how an IBM E880 server is built. It has two motherboards, each carrying 4 processors. The boards are connected to each other by a hyperconnect. More details on the internals of IBM Power 8 servers can be found in a series of articles from IBM. The hyperconnect, though very fast, is predictably slower than the interaction of processors on the same board, and this is noticeable not only in PostgreSQL but in any system whose processes interact through shared memory. Therefore, in the subsequent bottleneck hunting we excluded the influence of the hyperconnect and used a single compute node, i.e. 256 LCPUs with SMT=8.



We first tested the latest official PostgreSQL release at the time (version 9.4.5), which scaled linearly only up to 32 clients; we then tested the then-current versions from the 9.5 and 9.6 branches, on which near-linear growth continued up to 64 clients.





Fig.4. Performance depending on the number of threads



Version 9.5 shows an improvement over version 9.4, which can be explained by the optimization of the LWLock mechanism (discussed below).



So, the measurements were done. But we decided not to stop there and to go further: to find the bottleneck in the code, eliminate it, and achieve a better result. To search for bottlenecks we used the perf utility, which showed that about 32% of CPU time was spent in the s_lock function (Fig. 5).



    32.10%  postgres      [.] s_lock
     7.77%  postgres      [.] GetSnapshotData
     2.64%  postgres      [.] AllocSetAlloc
     1.40%  postgres      [.] hash_search_with_hash_value
     1.37%  postgres      [.] base_yyparse
     1.36%  postgres      [.] SearchCatCache
     1.32%  postgres      [.] PinBuffer
     1.23%  postgres      [.] LWLockAcquire
     1.05%  postgres      [.] palloc
     1.01%  postgres      [.] ReadBuffer_common
     0.99%  postgres      [.] LWLockRelease
     0.94%  libc-2.17.so  [.] __memset_power7


Fig.5. Sample perf top output during testing



To identify the places in the PostgreSQL code responsible for the s_lock calls, we used the gdb debugger (single-threaded perf was not fast enough to build the call graph). Sampling backtraces with gdb showed that in most cases s_lock is called from the PinBuffer and UnpinBuffer functions, which increment and decrement the reference count of the buffer holding a given data block. The reference counter is protected by the BufferDesc.buf_hdr_lock spinlock, so to work with it one must first acquire that spinlock. To understand this spot better, let us consider what types of locks exist in PostgreSQL and why they are needed.
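
To make this concrete, below is a minimal sketch in C11 atomics, with invented type and function names, of the pattern perf was pointing at: every pin and unpin takes the per-buffer header spinlock just to adjust one counter, so under contention backends pile up in s_lock. This is an illustration, not the actual PostgreSQL source.

    #include <stdatomic.h>

    /* Illustrative stand-in for BufferDesc: a header "spinlock" plus the
     * reference count it protects. Initialize hdr_lock with ATOMIC_FLAG_INIT. */
    typedef struct
    {
        atomic_flag hdr_lock;   /* plays the role of BufferDesc.buf_hdr_lock */
        int         refcount;   /* how many backends have the buffer pinned */
    } BufDescSketch;

    static void
    pin_buffer_locked(BufDescSketch *buf)
    {
        /* Test-and-set loop: spin until the header lock is free.
         * This is where s_lock burns CPU under contention. */
        while (atomic_flag_test_and_set(&buf->hdr_lock))
            ;
        buf->refcount++;        /* the protected work is a single increment */
        atomic_flag_clear(&buf->hdr_lock);
    }

    static void
    unpin_buffer_locked(BufDescSketch *buf)
    {
        while (atomic_flag_test_and_set(&buf->hdr_lock))
            ;
        buf->refcount--;
        atomic_flag_clear(&buf->hdr_lock);
    }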



PostgreSQL locks: what kinds are there, and why so many?



PostgreSQL implements several types of locks, each of which solves its own problems.

  1. Spinlock is the simplest type of lock, with only two states, “busy” and “free”. Taking a spinlock consists of repeated attempts to change its state from “free” to “busy”. To avoid race conditions, spinlocks use the atomic test-and-set (TAS) operation. Processes wanting to take a spinlock do not queue. The DBMS provides neither monitoring of spinlocks nor automatic detection of deadlocks on them; all of that has been sacrificed for high performance in very short operations. PostgreSQL contains assembler implementations of spinlocks for various hardware platforms; where no assembler implementation exists, spinlocks are emulated via UNIX semaphores, which are far slower.
  2. Lightweight lock (LWLock) is a more complex type of lock with two levels: shared and exclusive. Either an unlimited number of shared locks or a single exclusive lock can be held at any time. The LWLock implementation changed significantly between PostgreSQL 9.4 and 9.5: in versions up to 9.4, an LWLock was a set of fields protected by a spinlock, while starting with 9.5 most LWLock operations bypass the spinlock and use atomic operations on the LWLock “state variable” (a sketch of this approach follows the list). Processes wanting to take an LWLock queue up, so that a long wait for an LWLock does not burn CPU resources. Automatic deadlock detection for LWLocks is not supported; the DBMS algorithms are designed so that such deadlocks never occur. Support for LWLock monitoring is being developed by Postgres Professional employee Ildus Kurbangaliev (aka ildus) as part of the more general task of monitoring wait events.
  3. Heavyweight lock (HWLock) is the most “advanced” type of lock. It has many levels and a non-trivial conflict matrix. HWLocks are taken on tables, indexes and other database objects, both implicitly by the SQL commands that use them and explicitly by the LOCK command; the pg_locks system view exists to monitor them. As the name implies, HWLocks are a rather heavyweight mechanism, so a special fast-path locking mechanism is provided for acquiring them in DML commands. Also, since the order in which HWLocks are taken depends on user actions, the absence of deadlocks cannot be guaranteed, so PostgreSQL has an automatic deadlock detection mechanism.
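
To illustrate the 9.5-style technique from item 2, here is a hedged sketch of taking a lock in shared mode with a CAS loop over a packed “state variable”. The bit layout and names are our assumptions for illustration; PostgreSQL's actual definitions differ.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed layout of the "state variable": the top bit marks an
     * exclusive holder, the remaining bits count shared holders. */
    #define EXCLUSIVE_BIT 0x80000000u

    /* Try to take the lock in shared mode without touching any spinlock. */
    static bool
    lwlock_try_shared_cas(_Atomic uint32_t *state)
    {
        uint32_t oldval = atomic_load(state);

        for (;;)
        {
            if (oldval & EXCLUSIVE_BIT)
                return false;   /* held exclusively: the caller must queue */
            /* The CAS succeeds only if nobody changed the state since we
             * read it; on failure it reloads the current state into oldval
             * and we retry. */
            if (atomic_compare_exchange_weak(state, &oldval, oldval + 1))
                return true;    /* one more shared holder is recorded */
        }
    }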




In addition to the locks listed above, PostgreSQL has row-level locks, predicate locks, advisory locks and others, which, in turn, are implemented using the three basic types of locks listed above.



How we sped up PostgreSQL



To find out whether we could improve performance at the bottleneck we had found, we developed, together with leading PostgreSQL developer Andres Freund, a patch that replaces the buf_hdr_lock spinlock with atomic operations on a buffer “state” variable. The patch thus applies to buffers an optimization similar to the one done for LWLocks in version 9.5: PinBuffer now performs a compare-and-swap (CAS) operation in a loop, and UnpinBuffer performs an atomic decrement. This extended the region of near-linear growth about 2 times and achieved a performance of more than 800 thousand TPS (Fig. 6).
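
Schematically, in C11 atomics, the patched behavior looks roughly like this; the packing of the reference count into the state word and the function names are illustrative assumptions, and the real patch also keeps flag bits in the same word:

    #include <stdatomic.h>
    #include <stdint.h>

    /* Pinning becomes a CAS loop over the buffer "state" word,
     * with no spinlock at all. */
    static void
    pin_buffer_cas(_Atomic uint32_t *buf_state)
    {
        uint32_t oldval = atomic_load(buf_state);
        uint32_t newval;

        do
        {
            newval = oldval + 1;    /* bump the refcount part of the state */
            /* On failure, the current state is reloaded into oldval and
             * newval is recomputed from it on the next iteration. */
        } while (!atomic_compare_exchange_weak(buf_state, &oldval, newval));
    }

    /* Unpinning is unconditional, so no CAS loop is needed:
     * a single atomic decrement suffices. */
    static void
    unpin_buffer_atomic(_Atomic uint32_t *buf_state)
    {
        atomic_fetch_sub(buf_state, 1);
    }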





Fig.6. The results of measuring the performance of the developed patch.



However, we did not stop there either. Looking at the assembler implementation of the CAS operation on Power 8, one can see that it is itself a loop (Fig. 7).



    # Inputs: r3 – expected (old) value, r4 – new value,
    #         r5 – address of the variable; r9 is used as scratch
    .L1: lwarx  9,0,5     # load the current value, set a reservation
         cmpw   0,9,3     # compare it with the expected value
         bne-   0,.L2     # not the expected value: give up
         stwcx. 4,0,5     # store the new value if the reservation holds
         bne-   0,.L1     # reservation lost: retry
    .L2: isync


Fig.7. Implementation of an atomic CAS operation in Power 8 assembler.



Thus, a CAS operation performed in a loop is a loop within a loop. Under heavy contention for the variable being changed, this can be noticeably slower than executing a single loop.



If you look at the assembler implementation of the atomic increment operation, you can see that it is a similar loop, but with the add instruction already inside it.



    # Inputs: r3 – value to add,
    #         r5 – address of the variable; r9 is used as scratch
    .L1: lwarx  9,0,5     # load the current value, set a reservation
         add    9,9,3     # add the increment
         stwcx. 9,0,5     # store the result if the reservation holds
         bne-   0,.L1     # reservation lost: retry
         isync


Fig.8. Implementation of the atomic increment operation in Power 8 assembler.



Obviously, more complex computations can be placed between the lwarx and stwcx. instructions.



This gave rise to two ideas for further optimization.

  1. Implement performance-critical functions such as PinBuffer and LWLockAttemptLock with assembler inserts. The disadvantage of this option is that we abandon the abstraction layer of atomic operations that the compiler provides.
  2. An alternative idea is to use an atomic increment to implement a so-called “optimistic approach”. In PinBuffer and LWLockAttemptLock, one can atomically increment the “state variable” and only then, from the value actually observed, check whether we had the right to make such a state change. If we did not, the change is rolled back with an atomic decrement. The bet is that the change will have to be rolled back only in a very small fraction of cases; that is the “optimism” of the approach. Naturally, all other functions working with the “state variable” must take this into account (a sketch of the idea follows this list).
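
A minimal sketch of the optimistic variant, again with an assumed bit layout and invented names, might look like this:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define EXCLUSIVE_BIT 0x80000000u   /* assumed flag: exclusive holder */

    static bool
    lwlock_try_shared_optimistic(_Atomic uint32_t *state)
    {
        /* Optimistically record a shared holder with one atomic increment:
         * on Power 8 this is a single lwarx/stwcx. loop, instead of a CAS
         * loop each iteration of which is itself such a loop. */
        uint32_t oldval = atomic_fetch_add(state, 1);

        if (oldval & EXCLUSIVE_BIT)
        {
            /* We had no right to take the lock: it is held exclusively.
             * Roll the increment back and fail; the bet is that this
             * branch is taken only rarely. */
            atomic_fetch_sub(state, 1);
            return false;
        }
        return true;
    }

Note that every other reader of the state variable must tolerate the transient extra count that exists between the increment and the rollback.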


Both ideas were implemented as patches. The atomic-increment version was prepared in several variants, so that the effect of applying this approach to PinBuffer and to LWLockAttemptLock could be evaluated separately.





Fig.9. The results of measuring the performance of various patches.



The graph (Fig. 9) shows the performance comparison of the following patches:

  1. pinunpin-cas - PinBuffer is implemented with a CAS loop.
  2. pinunpin-increment - PinBuffer is implemented with an “optimistic” atomic increment.
  3. pinunpin-cas-lwlock-increment - PinBuffer is implemented with a CAS loop, LWLockAttemptLock with an “optimistic” atomic increment.
  4. pinunpin-lwlock-increment - PinBuffer and LWLockAttemptLock are both implemented with an “optimistic” atomic increment.
  5. pinunpin-lwlock-asm - PinBuffer and LWLockAttemptLock are implemented in assembler.


It can be seen that the “optimistic” atomic increment in PinBuffer has almost no effect, while applying the same approach to LWLockAttemptLock gives a very large one; about the same effect can be obtained with the assembler optimizations.



Figure 10 shows how the perf top picture changed after the patch was applied. The bottleneck has moved to the GetSnapshotData function, which could probably be accelerated as well, but the time allotted to us for testing ran out before we could demonstrate better results. Still, the results achieved are valuable in themselves, since they represent a significant performance increase.



    13.75%  postgres      [.] GetSnapshotData
     4.88%  postgres      [.] AllocSetAlloc
     2.47%  postgres      [.] LWLockAcquire
     2.11%  postgres      [.] hash_search_with_hash_value
     2.02%  postgres      [.] SearchCatCache
     2.00%  postgres      [.] palloc
     1.81%  postgres      [.] base_yyparse
     1.69%  libc-2.17.so  [.] __memset_power7
     1.63%  postgres      [.] LWLockRelease
     1.56%  libc-2.17.so  [.] __memcpy_power7
     1.33%  postgres      [.] _bt_compare
     0.99%  postgres      [.] core_yylex
     0.99%  postgres      [.] expression_tree_walker


Fig.10. Sample perf top output during testing after applying the experimental pinunpin-lwlock-asm patch.



It should be noted that ours is not the only team working on vertical Postgres scalability. There is a general trend towards more economical use of locks, described in particular in Andres Freund's talk at the October pgconf.eu conference. Among other things, as already mentioned, version 9.5 includes LWLock optimizations.



Findings



As an experiment, we replaced PostgreSQL's low-level locking machinery, which roughly doubled its scalability with the number of processor cores and thereby made PostgreSQL more suitable for efficient use on very large Power 8 servers.



We tested patches that optimize the critical code sections used in locks and the buffer manager, implemented both in assembler and with the atomic operations provided by the compiler. The results showed that atomic operations, while staying at a higher level of abstraction, achieve almost the same performance as assembler inserts.



IBM's open-source Advanced Toolchain (essentially a version of gcc with PPC-optimized runtime libraries) showed better performance than vanilla gcc and is recommended for building PostgreSQL. XLC under Linux still has unresolved issues with PostgreSQL and is not recommended.



PostgreSQL showed solid performance inside an LPAR. Hardware partitioning with PowerVM turned out to be a powerful and convenient virtualization system that lets you grow the capacity of a single LPAR, and of the PostgreSQL server inside it.



On tasks similar to those we tested, SMT in Power 8 can give a more than twofold performance boost.



The latest assembler patch made it possible to reach a total of over 2 million transactions per second on 48 Power 8 cores, achieved with two PostgreSQL instances running simultaneously on different compute nodes.



In practice, this means that you need not worry about how well PostgreSQL scales across cores until you approach a throughput of 1 million transactions per second on top-end hardware, or hit contention on one heavily used memory page (for example, the top of a B-tree). As a rule, our clients with >100 physical cores have no problems scaling PostgreSQL and do not encounter the effects described here.



Acknowledgments



This study was conducted as part of the activities of Postgres Professional, the Russian PostgreSQL vendor and an IBM PartnerWorld participant. The authors thank IBM’s Moscow office, and Denis Sosnovtsev and Alexander Lunyakov personally, for their interest in Postgres Professional's research and for help in gaining access to test equipment; Ivan Goncharov for valuable advice on its use; and IBM Vice President Terri Virnig for her support, as well as the IBM Client Center in Poughkeepsie and the IBM Cambridge Innovation Center for providing equipment for testing.



I would also like to thank my co-authors, Postgres Professional employees Dmitry Vasilyev (aka vadv), who ran the tests and wrote a large part of this article, and Yury Zhuravlev (aka stalkerg), who developed the initial version of the patch converting PinBuffer/UnpinBuffer to atomic operations.

Source: https://habr.com/ru/post/270827/


