PostgreSQL 8.3

Josh Berkus announced the release of PostgreSQL 8.3beta1 (see the official Changelog ). It took more than half a year for developers to complete the patch processing work (recall, feature freeze took place on April 1, 2007). So it's time to tell what will please us this year, the most developed of the open databases in the world.

I will break the entire list into four parts. In the first, for many of the most important, part I will list the changes that somehow relate to performance. In the second, I’ll give you a list of new features for database programmers to further expand the already non-weak set of PostgreSQL features. The third part is devoted to innovations designed for database administrators. And finally, at the end I will mention some Open Source projects that are Postgres satellite projects (in other words, they have their own development cycle).

Performance

Let's start with the fact that today (at the moment a stable branch is 8.2, the current version is 8.2.5), PostgreSQL successfully in terms of performance not only with OpenSource alternatives, but also with leading commercial DBMS. Such as Oracle. This is no longer an empty sound - take a look at the results of testing conducted at Sun. Slow elephants are no more! The richest set of index types, the broadest possible tuning of the system, working with very large volumes and loads, a good choice of replication and scaling systems - all this is in the teeth of modern elephants. Even the development speed favorably distinguishes Postgres as compared to other DBMS: every year we always get a significant step forward.
')
What is new in PostgreSQL version 8.3 in terms of performance? Many changes are nontrivial. According to PostgreSQL development coordinator Bruce Momjian , who recently visited Moscow at the invitation of Postgresmen, the work on optimizing the performance of the system used by Postgres developers in recent years is extremely difficult. Each step requires more and more substantial labor costs, takes more and more time and effort of developers.

One of these really non-trivial changes can be safely considered a “chip” called HOT (Heap Only Tuples) . This is probably one of the most significant changes in performance. To understand what this change is about, you need to remember that PostgreSQL implements the so-called Multi-Version Concurrency Control ( MVCC) model.

The essence of HOT is as follows. Previously, before the implementation of this approach, when a row in the table was updated, each new version of it led to the appearance of new versions of all indexes, regardless of whether the changes affected the indexed columns or not (see fig. “Update without HOT”). Now, if the new version of the row falls into the same page of memory as the previous one, and the columns for which the index was created did not change, the index remains the same. But that is not all. If there is such an opportunity, there is an "instant" reuse of space in the Heap page. Which, of course, reduces the amount of work produced by the operation VACUUM. In fig. The HOT Update schematically shows how the row is now updated.

The next novelty will appeal to, above all, a large number of web developers. Starting from version 8.3, any transaction in PostgreSQL can be made "asynchronous" .

This means that when performing a commit transaction (COMMIT), the PostgreSQL server will not wait for the expensive transaction log synchronization operation (WAL fsync) to complete. In other words, the transaction will be considered successfully completed as soon as all the logical conditions are met (all necessary integrity constraints are checked). Physically, transaction logging will occur after a very short period of time (as a rule, for normally functioning systems, this is a maximum of 200-1000 ms). The environment variable synchronous_commit responds to the transaction state (synchronous / asynchronous). Going into asynchronous mode is simple:

SET synchronous_commit TO OFF;

It should be noted that asynchronous transactions are not an alternative to server operation with disabled fsync operations. The fact is that the fsync = off mode can lead to an inconsistent base state (for example, in case of an unexpected equipment failure or loss of power) and is recommended only in cases when high reliability equipment is used (for example, a disk controller with a battery). Using the same opportunity in no way can lead to a mismatch of data. The maximum that is possible is the loss of a small portion of the data (again, in the event of a hard server failure, OS errors, hardware, power failure). A typical example for asynchronous transactions can be the task of storing large amounts of information in a table-log (for example, a log of user actions) when the loss of several rows is not critical. However, all important transactions can still be synchronous.

Another performance improvement relates to situations where, when executing queries, PostgreSQL sequentially scans tables (SeqScan operation). If before version 8.3 in such cases, situations often arose where different Postgres processes did the same work at the same time — they looked at the same table — now, thanks to the implementation of Synchronized Scans (“synchronized scans”) , at the same moment time for one table can be spent no more than one operation of viewing. This is achieved as follows. If, within the framework of a session, SeqScan is required for a certain table for which SeqScan is already running (for another session), then there will be a “jump on the move” to the results of the already running SeqScan. Upon completion of this process, if necessary, the results will be “completed” with the help of another incomplete SeqScan (see figure).

The work on reducing the stress effect produced by the system's execution of the checkpoint process (“control point”) continues: now checkpoints are not performed immediately, but gradually: the process is “smeared” in time. Hence the name of this change - checkpoint smoothing . It is worth noting that with a regular shutdown of the server and the execution of an “explicit” checkpoint operation (the CHECKPOINT command), the data will be written to disk as fast as possible.

At the end of the performance talk, here’s a brief list of other changes to improve the performance of systems using PostgreSQL:

autovacuum is now enabled by default ;
in some situations, it is now possible to start several autovacuum processes at once (for example, during a long process of cleaning a large table, small tables are no longer forced to wait for the completion of this process);
a noticeable decrease in disk space occupied by databases: 1) due to varlena-type headers (data types of variable length: text, arrays, etc.), which previously occupied strictly 4 bytes, and now in some situations only 1 byte; 2) by saving the table row headers (previously 27 bytes, now - 24);
performing transactions that do not modify the data does not increase the value of the transaction counter (xid) , which significantly reduces the likelihood of the transaction counter overlapping (xid wraparound); In addition, this change is a significant step forward towards the implementation of the integrated Master-Slave replication based on the transfer of the transaction log - now there will be no desynchronization of the xid counter between the Master and Slave nodes;
The bgwriter process parameters auto-tuning mechanism is implemented (background writer is a special process that writes “dirty” shared buffers to a disk);
optimized the mechanism for obtaining results for queries using “ …ORDER BY … LIMIT… " (so-called, Top-N sorting ): in some cases, the system holds the necessary data pages in RAM, which provides a very high rate of issue results;
Now you can set (so far only in the form of two constants) the estimated cost of running the function and estimating the number of rows it returns, which allows the PostgreSQL planner to select the best query plan (example: ALTER FUNCTION log_data(text) COST 100000 ROWS 1 ).

Database developers

The most notable and significant change that should be noted here is the migration of the module for full-text search (contrib / tsearch2) to the core of the system . Developed by Russian developers Oleg Bartunov and Fyodor Sigaev , tsearch2 has long been the most popular contrib-module of Postgres. The patch for the migration of full-text search to the kernel, which was adopted this summer as a result of painstaking and long-lasting work (the adopted patch version is 58!) By several key developers of the PostgreSQL team, is the largest in the entire history of the project.

In addition, all the features of the tsearch2 module will now be available by default and the migration processes to the new PostgreSQL version will become much simpler, it will now become easier to configure dictionaries and word processing rules: all basic configuration operations are performed using SQL commands. So, for example, you can create a simple thesaurus dictionary:

  REATE TEXT SEARCH DICTIONARY thesaurus_astro (
     TEMPLATE = thesaurus,
     DictFile = thesaurus_astro,
     Dictionary = english_stem
 );
 ALTER TEXT SEARCH CONFIGURATION russian
     ADD MAPPING FOR lword, lhword, lpart_hword
         WITH thesaurus_astro, english_stem;

Simplified index creation processes. An example of creating a GIN index over a plain text column (without creating additional columns and triggers):

  CREATE INDEX pgweb_idx ON pgweb
     USING gin (to_tsvector ('russian', title || body));

Here is an example of a query with ranking by relevance, which also uses the special function plainto_tsquery to get tsquery (allows you to forget about escaping characters and quickly and simply convert plain text to tsquery):

  SELECT
     ts_rank_cd (textsearch_index, q) AS rank, title
 FROM
     pgweb, plainto_tsquery ('supernova star') q
 WHERE
     q @@ textsearch_index
 ORDER BY
     rank DESC LIMIT 10;

Another noticeable change is support for XML , in which the author of this article participated. This functionality is implemented in accordance with the SQL: 2003 standard (14th part of the standard, SQL / XML).

First of all, a special xml data type has been xml , embedded in the kernel. When using this type, the server checks whether the data is correctly formed (checking for well-formedness ). And there are possible use cases in which work with parts of the document is allowed (this allows us to provide the property of “closedness” of functions for working with XML on the xml data type).

In accordance with the SQL: 2003 standard, a set of functions for converting relational data to XML (so-called SQL / XML publishing functions) has been implemented. Here is a simple example of a request to generate XML data:

  SELECT XMLROOT (
    XMLELEMENT (
       NAME 'some',
       XMLATTRIBUTES (
          'val' AS 'name',
          1 + 1 AS 'num'
       ),
       XMLELEMENT (
          NAME 'more',
          'foo'
       )
    ),
    VERSION '1.0',
    STANDALONE YES
 );

In addition, support for DTD validation ( xmlvalidatedtd() function), support for evaluating XPath expressions ( xpath() function, returning an array of xml data), and alternative functions for simplified publication of relational data as XML ( tabletoxml() functions tabletoxml() , querytoxml() and others).

To speed up the execution of a request for XML data, it is possible to use functional btree-indexes and GIN-indexes, as well as the use of full-text search for XML data. Here is an example of creating a btree-index based on the results of evaluating an XPath expression:

  CREATE INDEX i_table1_xdata ON table1 USING btree (
    ((xpath ('// person / @ name', xdata)) [1])
 );

As for data types, PostgreSQL 8.3 introduces a number of innovations: in addition to the tsquery/tsvector and xml types built into the kernel, the following types tsquery/tsvector appeared:

enum (user-defined enumerated data types) for the convenience of some users, including those migrating from TheirSQL;
GUID / UUID data types (as a contrib module);
arrays of composite types (for example, user-defined types).

And finally, a short list of other changes:

automatic invalidation of query plan cache for PL / pgSQL functions;
the CREATE FUNCTION … RETURNS TABLE and RETURN TABLE… constructs for creating functions that result in a table;
support for update operation for cursors ;
standard (ISO / ANSI SQL) ORDER BY … NULLS FIRST/LAST construction to simplify setting the order of NULL values (also helps when migrating from other DBMS);
indexing of NULL values in GiST indexes.

Database administrators

This section turned out to be short, because much of what is meant to improve the life of a DBA is described above :-) However, we will briefly tell you about what remains.

The query plans ( EXPLAIN ANALYZE command) now show which sorting algorithm was selected and how much memory was spent :

  QUERY PLAN
 -------------------------------------------------- -----
  Sort (cost = 34.38..34.42 rows = 13 width = 176) (actual time = 0.946..0.948 rows = 6 loops = 1)
    Sort Key: obj2tag.o2t_tag_name
    Sort Method: quicksort Memory: 18kB <- see here!
    -> Hash Join (cost = 19.19..34.14 rows = 13 width = 176) (actual time = 0.812..0.835 rows = 6 loops = 1)
 [...]

The special pg_standby contrib module written by Simon Riggs will simplify the work of administrators setting up a Warm Standby server based on WAL transfer logging. The module is written in pure C, therefore it is easily extensible and portable to new platforms (the performance has been tested already, at least on Linux and Win32).

When defining a function, it is now possible to override environment variables that will act only within the framework of the execution of this function ( binding the values of variables to functions ). For example, this is how you can specify that executing the function log _data() switches the transaction to asynchronous mode:

  ALTER FUNCTION log_data (text)
         SET synchronous_commit TO OFF;

Well, according to tradition, a short list of other new products in this section:

GSSAPI interface support;
improved build on the Win32 platform (now MinGW is not required, build is carried out in MS VC ++, which among other things leads to improved performance in Windows);
creating tables in a similar way with indexes (example: CREATE TABLE dict2 (LIKE dictionary INCLUDING INDEXES )).

Additional projects

EnterpriseDB (whose employees are active developers of PostgreSQL, many changes to version 8.3 in the area of performance are precisely their merit) released the debugger pldebugger , which is a contrib-module that allows you to debug PL / pgSQL functions in the standard pgAdminIII administration tool and perform profiling.

The project currently exists in the form of an independent contrib module (presented at PgFoundry ) and runs on a large number of platforms (including Linux and Win32). It is worth noting that this module works with version 8.2 of Postgres.

As we told not so long ago , Skype (which uses PostgreSQL in the well-known project of the same name) released several products in Open Source that could be useful to a large circle of developers. Among them, first of all, it is worth noting the pseudo-language PL / Proxy , which allows you to organize horizontal scaling with almost no restrictions (provided that all the business logic of the application is implemented as stored procedures), an extremely easy connection manager PgBouncer . Take a look at the Skype Developers Zone page, you will find a lot of interesting things!

At the turn of spring and summer of 2007, version 1.0 of a simple and convenient tool for analyzing pgFouine logs came out . This program will help you to know what your processor (s) of the database server did. pgFoiune analyzes Postgres request logs (when enabling query logging, it is recommended to introduce a time limit from below, see the description of the log_min_duration_statement parameter), providing reports on the slowest requests, errors and general statistics (see examples ). Thus, this tool allows the database developer to understand which queries can be improved in order to speed up the work of an application using PostgreSQL.

And finally, briefly about the other products:

The pgSNMP project is an SNMP agent implementation for PostgreSQL (server status monitoring);
SEPostgres is an extension based on the SELinux enhanced security model;
A tool has been created that provides recommendations to the database administrator for creating indexes and shows a possible plan for fulfilling a query subject to the availability of such indices ( Index Advisor );
In the well-known web-administration tool phppgadmin , the options for setting up a Slony cluster, full-text search, and auto- vacuum parameters have appeared (or are about to appear).

Conclusion

Version 8.3 is another step towards a complete database management system for corporate use. Nontrivial improvements in performance, the emergence of opportunities that are dictated by user needs, the expansion of many satellite projects - all this demonstrates the confident and rapid development of PostgreSQL.

When writing this review, the author used the following sources:

pgwiki / WhatsNew83
pgwiki / Feature_Matrix
pgwiki / 8.3_Changelog
pgwiki / Todo: PatchStatus
Bruce Momdjan's report at the Highload-2007 conference, Moscow ( pdf ).
PostgreSQL Official Documentation 8.3 .
Archive mailing pgsql-hackers .

Source: https://habr.com/ru/post/14933/

All Articles