📜 ⬆️ ⬇️

How do we catch Deadlock `s on PostgreSQL and repair them

Foreword


Situation: there is a high-loaded meta-game for our tanks called the Global Map . Such a step by step nastolka for teams where battles take place in a real tank client. At peak hours on the map, several thousand clan leaders produce game actions: they attack each other, move divisions, buy, sell, rob korovanov . In addition, there are a dozen services that can also make changes to the game situation: they throw up money, fine, add players to the clan, and so on.

All this inevitably leads to deadlocks. So, I want to tell you a story about how we keep these periodic problems within acceptable limits.



A little about the internal structure of the backend



Part 1: Monitoring


How Deadlock Appears


When Deadlock occurs, an exception of the following type falls:
')
ERROR: deadlock detected
DETAIL: Process 18293 waits for ShareLock on transaction 639; blocked by process 18254.
Process 18254 waits for ShareLock on transaction 640; blocked by process 18293.
HINT: See server log for query details.
CONTEXT: while updating tuple (0,9) in relation "users"

The first thing you should pay attention to is the line:
HINT: See server log for query details.

Indeed, if we look at the server logs, we will see the following for the same place:
ERROR: deadlock detected

And further specifics:

DETAIL: Process 18293 waits for ShareLock on transaction 639; blocked by process 18254.
Process 18254 waits for ShareLock on transaction 640; blocked by process 18293.
Process 18293: update users set balance = balance + 10 where id = 2;
Process 18254: update users set balance = balance + 10 where id = 3;

HINT: See server log for query details.
CONTEXT: while updating tuple (0,9) in relation "users"

And finally, the request on which the error occurred:

STATEMENT: update users set balance = balance + 10 where id = 2;


Query logging is not required to be enabled.

Cool. But the first global problem for any more or less serious project is that you do not have access to server logs due to security policy. Sometimes there is no access at all. And sometimes you can ask for a site, but you have to wait. Sometimes it's 30 minutes, sometimes a day.

And I would like to receive such information immediately. In particular, if you have a Sentry in the project, and the development team receives most of the errors right away.

It is impossible to twist the server somehow so that it gives out such information to regular customers. Due to the security policy of the developers database. But, if your user has normal access to the database, without any restrictions on the performance of service functions there and without Row-Level security policies , then you can still organize access to such information.

Hand hold


We can convert our classes to manually receive similar information. And even more. To do this, after catching the exception about deadlock, we need:


Example: implementation of collecting information on SQLAlchemy deadlock
db.py
 from contextlib import contextmanager from sqlalchemy import create_engine from sqlalchemy.exc import OperationalError engine = create_engine('postgresql+psycopg2://postgres:12345678@localhost/postgres') def log_pg_stat_activity(): '''Log, write or send through Sentry pg_stat_activity''' debug_conn = engine.connect() for process in debug_conn.execute(''' SELECT pid, application_name, state, query FROM pg_stat_activity; ''').fetchall(): print(process) @contextmanager def connection(): conn = engine.connect() try: yield conn except OperationalError as ex: log_pg_stat_activity() raise @contextmanager def transaction(): with connection() as conn: with conn.begin() as trans: try: yield conn except OperationalError as ex: if 'deadlock detected' in ex.args[0]: log_pg_stat_activity() # Log exception print(ex) trans.rollback() else: raise 


deadlock_test.py
 from multiprocessing import Process from time import sleep from threading import Thread from sqlalchemy.orm import sessionmaker from db import transaction def process1(): with transaction() as tran: tran.execute('UPDATE users SET balance = balance + 10 WHERE id = 3;') sleep(1) tran.execute('UPDATE users SET balance = balance + 10 WHERE id = 1 RETURNING pg_sleep(1);') def process2(): with transaction() as tran: tran.execute('UPDATE users SET balance = balance + 10 WHERE id = 1;') sleep(1) tran.execute('UPDATE users SET balance = balance + 10 WHERE id = 3 RETURNING pg_sleep(1);') if __name__ == '__main__': p1 = Thread(target=process1) p2 = Thread(target=process2) p1.start() p2.start() sleep(4) 


Or on github .

In this case, we have a high probability that we will see which particular query broke our transaction by calculating it from the PID and seeing the current query .

But it also happens that by calculating the connection using the PID and looking at the query you can see at all the wrong query that gave us the deadlock, but some logic following it. After all, while you were catching an exception and opening a connection, the request we needed for catching could have ended. All we can do here is work through pgBouncer or its analogs to minimize connection time and use application_name .

application_name


Even if you received the request that caused the deadlock, you may still have difficulty understanding where in the logic it was called. And here the application_name field comes to the rescue. By default, it is initialized with not very useful information, but it can be changed. And what if we write there the place where we started the transaction?

No sooner said than done!

Example: implementation of application_name setting via SQLAlchemy
db.py
 from traceback import extract_stack @contextmanager def transaction(application_name=None): with connection() as conn: if application_name is None: caller = extract_stack()[-3] conn.execution_options(autocommit=True).execute("SET application_name = %s", '%s:%s' % (caller[0], caller[1])) with conn.begin() as trans: try: yield conn except OperationalError as ex: if 'deadlock detected' in ex.args[0]: log_pg_stat_activity() # Log exception print(ex) trans.rollback() else: raise 


Or on github .

Voila Now you can quickly open files in the right places and watch the code.
pidapplication_namestatequery
one8613deadlock_test.py:10idle in transaction (aborted)UPDATE users SET balance = balance + 10 WHERE id = 1 RETURNING pg_sleep (1);
28614deadlock_test.py:17activeUPDATE users SET balance = balance + 10 WHERE id = 3 RETURNING pg_sleep (1);
38617activeSELECT pid, application_name, state, query FROM pg_stat_activity;

We think about server logs


All this magic described above is good, but theoretically it might not work. Sometimes you still can not do without server logs, so you need to take two more steps:


Part 2: How to deal with deadlocks




Typically, these sections provide links to documentation about transaction isolation levels, types of locks, and offer to think and analyze in the context of your application. I'll do it too, but later. But first, I’ll just describe how we do it most often, because it’s already happened that the deadlocks are very similar to each other.

Several practices of avoiding deadlock `s


Frequent case number 1: Classic deadlock


Our most frequent case is as follows:


That is, we have a situation when the heels of the results of battles came, their processing was parallelized, while in parallel processes, it happens that the same players are encountered.

All that can be done here is either to build charges in a chain, but this is slow, or to let charges fall and try to charge them a bit later.

Frequent case number 2: Himself angry Pinocchio (SSZB)


We have a hiking game. Once in a turn, a recalculation of the balance of players takes place, given the large number of game actions they have committed. At the time of the balance change, we blocked other changes through SELECT ... FOR UPDATE . Although we didn’t block everyone at once, but chunks of 100, anyway, we sometimes went into deadlock with a process that accrues bonuses for a fight that does not stop for the duration of the course calculation.

So, it turned out that we were wrong. SELECT ... FOR UPDATE is too powerful a lock, necessary only if 2 conditions are met:


Take an example:

 --P1: BEGIN; --P2: BEGIN; --P1: SELECT id FROM clans WHERE id=1 FOR UPDATE; --P2: INSERT INTO users(clan_id, name) VALUES(1, 'Alpha'); 

P2 in this situation will hang, because we give the DBMS to understand that the record with id = 1 may cease to exist. However, in P1 we do not do anything like that, but we want to protect the balance of the clan from changes. Therefore, when we changed FOR UPDATE to FOR NO KEY UPDATE , we stopped catching deadlocks.

Bonus number 1


SELECT ... FOR UPDATE is explicitly called in the example above. But you will get a similar effect if you affect your changes with the unique key referenced by the foreign key from other tables. And any UPDATE that doesn’t affect such changes with such keys will cause a lock similar to SELECT ... FOR NO KEY UPDATE . I recommend that you familiarize yourself with these features in the article “Explicit locks” in the list of references below.

Bonus number 2


Let us return to another interesting detail from the original error:

CONTEXT: while updating tuple (0,9) in relation "users"

What kind of stupid you ask? This is the physical address of the line in the table that caused the conflict. The fact is that in each table there are service fields that are not selected by the SELECT * query. However, it is necessary to clearly indicate for example ctid among the fields, as we will see this very stupid:

 SELECT ctid, id, nickname, balance FROM public.users; 

There is little benefit from it in the case of deadlock, because an unlocked process will most likely update the conflicting line and this ctid will change (since any UPDATE in PostgreSQL is actually INSERT, and the old line is marked as invisible and later removed with an auto-vacuum). But it is worth knowing that it will ever come in handy.

What to read


Source: https://habr.com/ru/post/323354/


All Articles