How do we catch Deadlock `s on PostgreSQL and repair them

Foreword

Situation: there is a high-loaded meta-game for our tanks called the Global Map . Such a step by step nastolka for teams where battles take place in a real tank client. At peak hours on the map, several thousand clan leaders produce game actions: they attack each other, move divisions, buy, sell, ~~rob korovanov~~ . In addition, there are a dozen services that can also make changes to the game situation: they throw up money, fine, add players to the clan, and so on.

All this inevitably leads to deadlocks. So, I want to tell you a story about how we keep these periodic problems within acceptable limits.

A little about the internal structure of the backend

The main database is PostgreSQL 9.5.
The transaction isolation level is the standard default READ COMMITED.
ORM - SQLAlchemy.

Part 1: Monitoring

How Deadlock Appears

When Deadlock occurs, an exception of the following type falls:
')

ERROR: deadlock detected
DETAIL: Process 18293 waits for ShareLock on transaction 639; blocked by process 18254.
Process 18254 waits for ShareLock on transaction 640; blocked by process 18293.
HINT: See server log for query details.
CONTEXT: while updating tuple (0,9) in relation "users"

The first thing you should pay attention to is the line:
HINT: See server log for query details.

Indeed, if we look at the server logs, we will see the following for the same place:

ERROR: deadlock detected

And further specifics:

DETAIL: Process 18293 waits for ShareLock on transaction 639; blocked by process 18254.
Process 18254 waits for ShareLock on transaction 640; blocked by process 18293.
Process 18293: update users set balance = balance + 10 where id = 2;
Process 18254: update users set balance = balance + 10 where id = 3;

HINT: See server log for query details.
CONTEXT: while updating tuple (0,9) in relation "users"

And finally, the request on which the error occurred:

STATEMENT: update users set balance = balance + 10 where id = 2;

Query logging is not required to be enabled.

Cool. But the first global problem for any more or less serious project is that you do not have access to server logs due to security policy. Sometimes there is no access at all. And sometimes you can ask for a site, but you have to wait. Sometimes it's 30 minutes, sometimes a day.

And I would like to receive such information immediately. In particular, if you have a Sentry in the project, and the development team receives most of the errors right away.

It is impossible to twist the server somehow so that it gives out such information to regular customers. Due to the security policy of the developers database. But, if your user has normal access to the database, without any restrictions on the performance of service functions there and without Row-Level security policies , then you can still organize access to such information.

Hand hold

We can convert our classes to manually receive similar information. And even more. To do this, after catching the exception about deadlock, we need:

Temporarily do not roll back anything in the current connection with the database and do not touch anything there at all.
Create another connection and execute the simplest query:
```
 SELECT * FROM pg_stat_activity; 
```
The results are put in a repository accessible for developers. For example, send by Sentry as an error.

Example: implementation of collecting information on SQLAlchemy deadlock

db.py

 from contextlib import contextmanager from sqlalchemy import create_engine from sqlalchemy.exc import OperationalError engine = create_engine('postgresql+psycopg2://postgres:12345678@localhost/postgres') def log_pg_stat_activity(): '''Log, write or send through Sentry pg_stat_activity''' debug_conn = engine.connect() for process in debug_conn.execute(''' SELECT pid, application_name, state, query FROM pg_stat_activity; ''').fetchall(): print(process) @contextmanager def connection(): conn = engine.connect() try: yield conn except OperationalError as ex: log_pg_stat_activity() raise @contextmanager def transaction(): with connection() as conn: with conn.begin() as trans: try: yield conn except OperationalError as ex: if 'deadlock detected' in ex.args[0]: log_pg_stat_activity() # Log exception print(ex) trans.rollback() else: raise

deadlock_test.py

 from multiprocessing import Process from time import sleep from threading import Thread from sqlalchemy.orm import sessionmaker from db import transaction def process1(): with transaction() as tran: tran.execute('UPDATE users SET balance = balance + 10 WHERE id = 3;') sleep(1) tran.execute('UPDATE users SET balance = balance + 10 WHERE id = 1 RETURNING pg_sleep(1);') def process2(): with transaction() as tran: tran.execute('UPDATE users SET balance = balance + 10 WHERE id = 1;') sleep(1) tran.execute('UPDATE users SET balance = balance + 10 WHERE id = 3 RETURNING pg_sleep(1);') if __name__ == '__main__': p1 = Thread(target=process1) p2 = Thread(target=process2) p1.start() p2.start() sleep(4)

Or on github .

In this case, we have a high probability that we will see which particular query broke our transaction by calculating it from the PID and seeing the current query .

But it also happens that by calculating the connection using the PID and looking at the query you can see at all the wrong query that gave us the deadlock, but some logic following it. After all, while you were catching an exception and opening a connection, the request we needed for catching could have ended. All we can do here is work through pgBouncer or its analogs to minimize connection time and use application_name .

application_name

Even if you received the request that caused the deadlock, you may still have difficulty understanding where in the logic it was called. And here the application_name field comes to the rescue. By default, it is initialized with not very useful information, but it can be changed. And what if we write there the place where we started the transaction?

No sooner said than done!

Example: implementation of application_name setting via SQLAlchemy

db.py

 from traceback import extract_stack @contextmanager def transaction(application_name=None): with connection() as conn: if application_name is None: caller = extract_stack()[-3] conn.execution_options(autocommit=True).execute("SET application_name = %s", '%s:%s' % (caller[0], caller[1])) with conn.begin() as trans: try: yield conn except OperationalError as ex: if 'deadlock detected' in ex.args[0]: log_pg_stat_activity() # Log exception print(ex) trans.rollback() else: raise

Or on github .

Voila Now you can quickly open files in the right places and watch the code.

	pid	application_name	state	query
one	8613	deadlock_test.py:10	idle in transaction (aborted)	UPDATE users SET balance = balance + 10 WHERE id = 1 RETURNING pg_sleep (1);
2	8614	deadlock_test.py:17	active	UPDATE users SET balance = balance + 10 WHERE id = 3 RETURNING pg_sleep (1);
3	8617		active	SELECT pid, application_name, state, query FROM pg_stat_activity;

We think about server logs

All this magic described above is good, but theoretically it might not work. Sometimes you still can not do without server logs, so you need to take two more steps:

Discuss a clear procedure for obtaining the necessary parts of server logs at a reasonable time with interested parties.
Do them in the format you want by changing the log_line_prefix in postgresql.conf . On a development machine, for example, you can: log_line_prefix = 'APP:%a PID:%p TR:%x ' .

Part 2: How to deal with deadlocks

Typically, these sections provide links to documentation about transaction isolation levels, types of locks, and offer to think and analyze in the context of your application. I'll do it too, but later. But first, I’ll just describe how we do it most often, because it’s already happened that the deadlocks are very similar to each other.

Several practices of avoiding deadlock `s

Frequent case number 1: Classic deadlock

Our most frequent case is as follows:

A game action occurs, affecting several players at once.
In this case, players can often perform several actions at the same time. For example, we have: large clans have many units involved in the battles.
When you win or lose points are added to the clan. And in parallel such charges can be very much.

That is, we have a situation when the heels of the results of battles came, their processing was parallelized, while in parallel processes, it happens that the same players are encountered.

All that can be done here is either to build charges in a chain, but this is slow, or to let charges fall and try to charge them a bit later.

Frequent case number 2: Himself angry Pinocchio (SSZB)

We have a hiking game. Once in a turn, a recalculation of the balance of players takes place, given the large number of game actions they have committed. At the time of the balance change, we blocked other changes through SELECT ... FOR UPDATE . Although we didn’t block everyone at once, but chunks of 100, anyway, we sometimes went into deadlock with a process that accrues bonuses for a fight that does not stop for the duration of the course calculation.

So, it turned out that we were wrong. SELECT ... FOR UPDATE is too powerful a lock, necessary only if 2 conditions are met:

The conditional id of the current table is used as a foreign key in another.
The same conditional id can be changed / deleted as a result of further actions.

Take an example:

 --P1: BEGIN; --P2: BEGIN; --P1: SELECT id FROM clans WHERE id=1 FOR UPDATE; --P2: INSERT INTO users(clan_id, name) VALUES(1, 'Alpha');

P2 in this situation will hang, because we give the DBMS to understand that the record with id = 1 may cease to exist. However, in P1 we do not do anything like that, but we want to protect the balance of the clan from changes. Therefore, when we changed FOR UPDATE to FOR NO KEY UPDATE , we stopped catching deadlocks.

Bonus number 1

SELECT ... FOR UPDATE is explicitly called in the example above. But you will get a similar effect if you affect your changes with the unique key referenced by the foreign key from other tables. And any UPDATE that doesn’t affect such changes with such keys will cause a lock similar to SELECT ... FOR NO KEY UPDATE . I recommend that you familiarize yourself with these features in the article “Explicit locks” in the list of references below.

Bonus number 2

Let us return to another interesting detail from the original error:

CONTEXT: while updating tuple (0,9) in relation "users"

What kind of stupid you ask? This is the physical address of the line in the table that caused the conflict. The fact is that in each table there are service fields that are not selected by the SELECT * query. However, it is necessary to clearly indicate for example ctid among the fields, as we will see this very stupid:

 SELECT ctid, id, nickname, balance FROM public.users;

There is little benefit from it in the case of deadlock, because an unlocked process will most likely update the conflicting line and this ctid will change (since any UPDATE in PostgreSQL is actually INSERT, and the old line is marked as invisible and later removed with an auto-vacuum). But it is worth knowing that it will ever come in handy.

What to read

Deadlocks in PostgreSQL - description of all types of locks
Explicit locking or explicit blocking (in Russian). More on blocking.
Lock Monitoring - scripts showing how to monitor locks
System Columns - about hidden service fields in the tables
Introduction to PostgreSQL physical storage - about physical data storage in PostgreSQL

Source: https://habr.com/ru/post/323354/

All Articles