This is a story about why you should never silence errors inside a database transaction. Learn how to use transactions properly and what to do when using them is not an option. Spoiler: it's about advisory locks in PostgreSQL!
I worked on a project where users can import a large number of heavy entities (let's call them products) from an external service into our application. Each product pulls in even more related data from the external API. Users often need to load hundreds of products along with all of their dependencies, so importing a single product takes considerable time (30-60 seconds) and the whole process can drag on. A user may get tired of waiting for the result, so they can press the "Cancel" button at any moment, and the application should remain usable with however many products it managed to load by that time.
"Interruptible import" is implemented as follows: at the start, a temporary task record is created in a database table for each product. Then a background import job is launched for each product; it downloads the product, saves it to the database along with all its dependencies (it does pretty much everything), and deletes its task record at the very end. If there is no task record in the database when the background job starts, the job simply exits quietly. So to cancel the import, it is enough to delete all the task records - that's it.
It does not matter whether the import was cancelled by the user or completed on its own - in either case, the absence of task records means that everything is over and the user can start working with the application.
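A minimal sketch of this scheme, assuming a Rails-style background job (the class and method names here are illustrative, not the actual project code):

class ProductImportJob < ApplicationJob
  def perform(task_id)
    task = Import::Task.find_by(id: task_id)
    return unless task # the task record is gone - the import was cancelled, exit quietly

    import_product_with_all_dependencies(task) # downloads the product and saves everything
    task.destroy # the very last step: this product is done
  end
end

# Cancelling the import boils down to deleting the remaining task records:
user.import_tasks.delete_all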
The design is simple and reliable, but it had one small bug. A typical bug report read: "After cancelling the import, the user is shown the list of their products. However, if you refresh the page, a few more entries appear in the list." The reason is simple: when the user presses the Cancel button, they are immediately redirected to the list of all products, but the already started imports of individual products are still finishing up in the background.
This is a minor thing, of course, but it confused users, so it would be nice to fix it. I had two options: either somehow find and "kill" the already running jobs, or, when the Cancel button is pressed, wait until they finish and "die a natural death" before sending the user onward. I chose the second option - wait.
For anyone working with (relational) databases, the answer is obvious: use transactions!
It is important to remember that in most RDBMSs, rows updated within a transaction are locked and unavailable for modification by other processes until that transaction completes. Rows selected with SELECT … FOR UPDATE are locked as well.
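For example, in two parallel psql sessions (the import_tasks table name is only for illustration):

-- Session A
BEGIN;
SELECT * FROM import_tasks WHERE id = 1 FOR UPDATE; -- the row is now locked

-- Session B
DELETE FROM import_tasks WHERE id = 1; -- blocks here until session A commits or rolls back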
Exactly our case! I wrapped each individual product import in a transaction and locked its task record at the very beginning:
ActiveRecord::Base.transaction do
  task = Import::Task.lock.find_by(id: id) # SELECT … FOR UPDATE - lock the task record
  return unless task # no task record? The import was cancelled, quietly exit!

  # import the product with all its dependencies here

  task.destroy
end
Now, when the user wants to cancel the import, the cancellation operation deletes the tasks of imports that have not started yet and has to wait for the running ones to complete:
user.import_tasks.delete_all # waits for the row locks held by running imports
Simple and elegant! I ran the tests, checked the import locally and on staging, and shipped it to production.
Satisfied with my work, I was quite surprised to soon discover new bug reports and tons of errors in the logs. Many products were not imported at all. In some cases only a single product remained after the whole import had finished.
The errors in the logs were not encouraging either: PG::InFailedSqlTransaction with backtraces leading to code that executed perfectly innocent SELECTs. What was going on?
After a day of grueling debugging, I identified three main causes of the problems:
Problem one: concurrent insertion of conflicting records
Since importing each product takes up to a minute and there are many such tasks, we run them in parallel to save time. The dependent records of different products can overlap, to the point where all of a user's products may reference one single record that is created once and then reused.
The application code has checks for finding and reusing identical dependencies, but now that we use transactions, these checks are useless: if transaction A has created a dependent record but has not committed yet, transaction B cannot know about its existence and will try to create a duplicate.
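Schematically (Dep and chars stand in for the real model and attributes):

# Transaction A (importing product 1):
Dep.find_by(user_id: 1, characteristics: chars) # => nil, so A creates the record
Dep.create!(user_id: 1, characteristics: chars)

# Transaction B (importing product 2), before A commits:
Dep.find_by(user_id: 1, characteristics: chars) # => nil as well: A's row is not visible yet
Dep.create!(user_id: 1, characteristics: chars) # blocks on the unique index,
                                                # then fails once A commits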
Problem two: PostgreSQL automatically aborts transactions after errors
We, of course, prevent the creation of duplicate dependent records at the database level with the following DDL:
ALTER TABLE product_deps ADD UNIQUE (user_id, characteristics);
If transaction A, which is still in progress, has inserted a new record, and in parallel transaction B tries to insert a record with the same values of the user_id and characteristics fields, transaction B gets an error:
BEGIN;
INSERT INTO product_deps (user_id, characteristics)
VALUES (1, '{"same": "value"}');
-- Blocks until the first transaction finishes...

ERROR: duplicate key value violates unique constraint "product_deps_user_id_characteristics_key"
DETAIL: Key (user_id, characteristics)=(1, {"same": "value"}) already exists.
-- ...and raises an error once the first transaction has committed
-- and it becomes clear that there is a conflict
But there is one detail we must not forget: transaction B is automatically aborted once the error occurs, and all the work done within it is lost. The transaction remains open in this "failed" state, however, and any attempt to execute even the most innocent query in it will only return errors:
SELECT * FROM products;
ERROR: current transaction is aborted, commands ignored until end of transaction block
Needless to say, nothing written to the database within this transaction will be saved:
COMMIT; -- actually performs a ROLLBACK
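The only way out of this state is to end the transaction with ROLLBACK, or to roll back to a savepoint taken before the failed statement:

BEGIN;
SAVEPOINT before_insert;
INSERT INTO product_deps (user_id, characteristics) VALUES (1, '{"same": "value"}');
-- ERROR: duplicate key value violates unique constraint ...
ROLLBACK TO SAVEPOINT before_insert; -- clears the aborted state, work done before the savepoint survives
SELECT * FROM products;              -- works again
COMMIT;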
Problem three: silencing errors
By this point it had become clear that simply adding transactions had broken the application. There was no choice: I had to dig into the import code. There, the following pattern caught my eye quite often:
def process_stuff(data)
  # do something not too important
rescue StandardError
  nil # it did not work out - oh well, carry on without it
end
The author of such code seems to be telling us: "We tried, it did not work out, but that's okay, we'll carry on without it." And although the reasons for this choice can be perfectly understandable (not everything can be handled at the application level), this is exactly what makes any logic built on transactions impossible: a "swallowed" exception cannot bubble up to the transaction block and trigger a proper rollback (ActiveRecord catches all exceptions inside this block, rolls back the transaction, and re-raises them).
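Roughly, the difference looks like this (create_dependency! is a hypothetical method that hits the unique constraint):

# The exception reaches the transaction block: ActiveRecord rolls back and re-raises it.
ActiveRecord::Base.transaction do
  create_dependency! # raises ActiveRecord::RecordNotUnique
end                  # => clean rollback, the error is visible to the caller

# The exception is swallowed inside the block: ActiveRecord never sees it,
# but PostgreSQL has already marked the transaction as failed...
ActiveRecord::Base.transaction do
  begin
    create_dependency!
  rescue StandardError
    nil
  end
  Product.count # ...so every following statement fails with PG::InFailedSqlTransaction
end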
And this is how all three of these factors combined into one perfect-storm bug: imports running in parallel create conflicting dependent records; PostgreSQL aborts one of the transactions on the unique constraint violation; the error is silenced by some rescue deep inside the import code, and the task carries on as if nothing had happened. From that moment every query in the aborted transaction fails with PG::InFailedSqlTransaction, until execution reaches the first statement that is not wrapped in a rescue - and that is where the error finally surfaces, gets written to the log, registered in the error tracker, whatever. But this place is very far from the one that was the root cause, and that alone turns debugging into a nightmare.
Hunting down every rescue in the application code and rewriting the whole import logic was not an option - it would take far too long. I needed a quick fix, and PostgreSQL had one: a built-in alternative to locking rows inside transactions - meet session-level advisory locks. I used them as follows:
First of all, I removed the wrapping transaction. It is a bad idea in any case to call external APIs (or produce any other side effects) from application code with an open transaction: even if you roll back the transaction along with all the changes in your database, the changes in the external systems will remain, and the application as a whole may end up in a strange and inconsistent state. The isolator gem can help you make sure that side effects are properly isolated from transactions.
Then, in each import operation, I take a shared lock on a key that is unique to the whole import (for example, built from the user ID and a hash of the operation class name):
SELECT pg_advisory_lock_shared(42, user.id);
Shared locks on the same key can be taken simultaneously by any number of sessions.
The import-cancellation operation, in turn, deletes all the task records from the database and tries to take an exclusive lock on the same key; to acquire it, it has to wait until all the shared locks are released:
SELECT pg_advisory_lock(42, user.id)
And that's it! Now the cancellation waits until all the already running imports of individual products have finished.
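Put together, the whole scheme looks roughly like this (session-level advisory locks are not released on their own, so the import has to unlock explicitly; the Ruby around the SQL is a sketch):

LOCK_NAMESPACE = 42 # an arbitrary number identifying "import" locks for this application

# Inside every single-product import operation:
conn = ActiveRecord::Base.connection
conn.execute("SELECT pg_advisory_lock_shared(#{LOCK_NAMESPACE}, #{user.id})")
begin
  import_product_with_all_dependencies
ensure
  conn.execute("SELECT pg_advisory_unlock_shared(#{LOCK_NAMESPACE}, #{user.id})")
end

# Inside the "cancel import" operation:
user.import_tasks.delete_all # imports that have not started yet will just exit
conn.execute("SELECT pg_advisory_lock(#{LOCK_NAMESPACE}, #{user.id})")   # waits for all shared locks
conn.execute("SELECT pg_advisory_unlock(#{LOCK_NAMESPACE}, #{user.id})") # nothing left to protect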
Moreover, now that we are not bound by a transaction, we can use a small trick to limit how long the cancellation waits (in case some import gets stuck), because it is not nice to block a web server thread for a long time (and to make the user wait):
begin
  transaction do
    execute("SET LOCAL lock_timeout = '30s'")
    execute("SELECT pg_advisory_lock(42, user.id)")
  end
rescue ActiveRecord::LockWaitTimeout
  nil # no big deal: we just stop waiting (the running imports will finish on their own)
end
It is safe to catch the error outside the transaction block, since ActiveRecord will already have rolled the transaction back.
Unfortunately, I do not know of a solution that plays well with concurrent inserts. There are the following approaches, but they all block the parallel insert until the first transaction completes:

- INSERT … ON CONFLICT UPDATE (available since PostgreSQL 9.5): in the second transaction it blocks until the first one completes and then returns the record inserted by the first transaction (see the SQL sketch after the Ruby snippet below).
- If you are not afraid of working with database-level errors, you can simply catch the uniqueness error and retry:
def import_all_the_things
  # ...
  Dep.create!(user_id: user_id, characteristics: chars)
rescue ActiveRecord::RecordNotUnique
  retry
end
Just make sure that this code is not already wrapped in a transaction.
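For completeness, here is what the first approach from the list above might look like in SQL, reusing the unique constraint on (user_id, characteristics) declared earlier:

INSERT INTO product_deps (user_id, characteristics)
VALUES (1, '{"same": "value"}')
ON CONFLICT (user_id, characteristics)
DO UPDATE SET characteristics = EXCLUDED.characteristics -- an effectively no-op update
RETURNING id;
-- In a concurrent session this statement waits for the conflicting transaction to finish
-- and then returns the id of the row that transaction inserted.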
Why are they blocked?
UNIQUE and EXCLUDE constraints block potential conflicts by not letting conflicting rows be written at the same time. For example, if you have a unique constraint on an integer column and one transaction inserts a row with the value 5, other transactions that also try to insert 5 will block, while transactions inserting 6 or 4 will complete immediately, without waiting. Since the minimum real transaction isolation level in PostgreSQL is READ COMMITTED, a transaction cannot see uncommitted changes made by other transactions. Therefore an INSERT with a conflicting value can be neither accepted nor rejected until the first transaction either commits its changes (then the second one gets a uniqueness error) or rolls back (then the insert in the second transaction succeeds). You can read more about this in an article by the author of EXCLUDE constraints.
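A two-session sketch of the 5-versus-6 example above (the things table is purely illustrative):

-- Session A
BEGIN;
INSERT INTO things (n) VALUES (5); -- n has a UNIQUE constraint

-- Session B
INSERT INTO things (n) VALUES (6); -- no potential conflict: completes immediately
INSERT INTO things (n) VALUES (5); -- blocks; succeeds if A rolls back, fails if A commits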
Now you know that not all code can be wrapped in a transaction. It would be nice to make sure that no one else wraps similar code in a transaction in the future, repeating my mistake.
To do this, you can wrap all of your operations in a small helper module that checks whether a transaction is open before running the wrapped operation code (it is assumed here that all your operations share the same interface - the call method).
# The guard module
module NoTransactionAllowed
  class InTransactionError < RuntimeError; end

  def call(*)
    return super unless in_transaction?

    raise InTransactionError,
          "#{self.class.name} doesn't work reliably within a DB transaction"
  end

  def in_transaction?
    connection = ApplicationRecord.connection
    # service transactions (tests and database_cleaner) are not joinable
    connection.transaction_open? && connection.current_transaction.joinable?
  end
end

# Usage
class Deps::Import < BaseService
  prepend NoTransactionAllowed

  def call
    do_import
  rescue ActiveRecord::RecordNotUnique
    retry
  end
end
Now if someone tries to wrap one of these dangerous services in a transaction, they will immediately get an error (unless, of course, they silence it).
The main lesson to learn: be careful with exceptions. Do not handle everything; catch only the exceptions you know how to handle and let the rest reach the logs. Never suppress exceptions unless you are 100% sure why you are doing it. The earlier an error is noticed, the easier it is to debug.
And do not put too much faith in database transactions. They are not a panacea. Use our isolator and after_commit_everywhere gems - they will help make your transactions truly foolproof.
Exceptional Ruby by Avdi Grimm. This little book will teach you how to work with exceptions in Ruby and how to properly design an exception system for your application.
Using Atomic Transactions to Power an Idempotent API by @brandur. His blog has many useful articles about application reliability, Ruby, and PostgreSQL.
Source: https://habr.com/ru/post/418147/