Million lines per second from Postgres using Python

asyncpg is a new Python open-source library for working with PostgreSQL. It was written using asyncio and Python 3.5. asyncpg is the fastest PostgreSQL driver among similar implementations in Python, NodeJS and Go.

Why asyncpg?

We are creating EdgeDB - a new generation database, with PostgreSQL on the backend. We need high performance, low access latency and additional features of PostgreSQL itself.
')
The most obvious option is to use psycopg2 - the most popular Python driver for working with PostgreSQL. He has a great community, he is stable and time-tested. There is also aiopg, which implements an asynchronous interface, on top of psycopg2. Then the question is obvious - why write your bike? The short answer is PostgreSQL performance and support. Below we will look at this in more detail.

Special features

Data Type Support

Our main discontent with psycopg2 was mediocre processing of various types of data, incl. arrays and compound types. Many different data types are one of the distinguishing features of PostgreSQL. And yet, out of the box psycopg2 only supports simple data types - integers, strings, time and dates. It makes writing your types for something else.

The main reason for this lies in the fact that psycopg2 communicates with the database server data in text format, which is why a lot of data have to be parsed, this is especially true for composite data types.

Unlike psycopg2, asyncpg uses PostgreSQL binary I / O protocol, which, in addition to performance advantages, also has native support for container data types (arrays, range and composite types).

Prepare requests

Asyncpg also uses prepared PostgreSQL statements. This is a great opportunity for optimization, as it allows you to avoid re-parsing, analyzing and building a query plan. In addition, asyncpg caches I / O data for each prepared operator.

Prepared statements in asyncpg can be created and used directly. They provide an API for retrieving and analyzing query results. Most query methods create a connection directly, and asyncpg creates and caches prepared queries.

Ease of deployment

Another important feature of asyncpg is the absence of dependencies. Directly implementing the PostgreSQL protocol means that you do not need to install libpq. Just run pip install asyncpg . In addition, we also provide a package for manual build on Linux and macOS (Windows is planned for the future).

Performance

It soon became clear to us that by implementing the PostgreSQL protocol directly, we can achieve a significant increase in speed. Our past experience with uvloop has shown that you can create efficient and productive libraries with Cython. asyncpg is completely written in Cython with memory management and high optimization. As a result, asyncpg was on average 3 times faster than psycopg2 (or aiopg).

Testing

As for uvloop , we have created a separate utility for testing pgbench and creating reports for asyncpg and other implementations of the PostgreSQL driver. We measured the speed of the query (lines per second) and the delay. The main purpose of these tests is to find out the overhead for this driver.

To be honest, all the tests were run in one thread (GOMAXPROCS = 1 in the case of Golang) in asynchronous mode. Python drivers were run using uvloop.

This test took place on a clean server with this configuration:

CPU: Intel Xeon E5-1620 v2 @ 3.70GHz, 64GiB DDR3
Gentoo Linux, GCC 4.9.3
Go 1.6.3, Python 3.5.2, NodeJS 6.3.0
PostgreSQL 9.5.2

Used drivers:

Python: asyncpg-0.5.2, psycopg2-2.6.2, aiopg-0.10.0, uvloop-0.5.0. aiopg is a tiny wrapper over psycopg2, for asynchronous operation
NodeJS: pg-6.0.0, pg-native-1.10.0
Golang: github.com/lib/pg@4dd446efc1, github.com/jackc/pgx@b3eed3cce0

The graphs show the average values of the results obtained by running 4 different types of queries:

- Direct query on the selection of all rows from the pg_type table (about 350 rows). It is quite close to the total number of application requests. In this test, asyncpg shows the header performance of 1 million lines per second. More details.

- The query generates 1000 lines consisting of a single integer. This test is designed to view performance when creating records and getting results. More details.

- The query returns 100 rows, each of which contains 1 KB of binary data (blob). This is a stressful I / O test. More details.

- The query returns 100 lines, each of which contains an array of 100 integers. This test is designed to test the decoding speed of arrays. Here, asyncpg turned out to be slower than a faster implementation on golang. This is related to the cost of creating and deleting tuples in Python. More details.

Conclusion

We are sure that using Python it is possible to create high-performance and scalable systems. To do this, we need to make efforts to create fast, high-quality drivers, event cycles, and frameworks.

asyncpg is one of the steps in this direction. This is the result of meaningful design, powered by our experience in creating uvloop, as well as the effective use of Cython and asyncio.

Translation of article 1M rows / s from Postgres to Python

Source: https://habr.com/ru/post/317394/

All Articles