
On the issue of cache invalidation

Cache invalidation is probably one of the most difficult problems in programming. The subtlety lies in the trade-off between the completeness, redundancy and complexity of the invalidation procedure. So what is this article about? I would like to think about how to build an invalidation system without tying the discussion to any particular platform, language or framework. And, to avoid writing about everything and nothing at once, let's concentrate on caching the results of SQL queries built with an ORM, a setup that is very common these days.

Completeness and redundancy

Let's start with general considerations that are specific neither to SQL queries nor to ORMs. I define completeness and redundancy as follows. The completeness of invalidation is the characteristic that determines how often and in which cases the cache may end up holding stale data, and how long it will stay there. Redundancy, in turn, is how often the cache gets invalidated when there was no need to.

Consider, for example, the common method of time-based invalidation. On the one hand, it almost guarantees that immediately after a change the cached data is stale. On the other hand, the time the cache stays stale can easily be limited by reducing the lifetime (which in turn lowers the hit rate). That is, as the cache lifetime decreases, the completeness of invalidation improves and the redundancy gets worse. As a result, to achieve perfect completeness (no stale data at all), we would have to set the timeout to 0, or, in other words, disable the cache. In many cases, though, temporarily stale data in the cache is acceptable. For example, it is usually not a big deal if a news item shows up in the latest news block a few minutes late, or if the total number of users of your social network is off by a couple of thousand.
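
As an illustration only (the class and parameter names here are mine, not the article's), a minimal sketch of time-based invalidation: every entry expires after a fixed lifetime, and shortening that lifetime trades hit rate for freshness.

import time

class TimedCache:
    def __init__(self, timeout=60):
        self.timeout = timeout           # lifetime of every entry, in seconds
        self._data = {}                  # key -> (expires_at, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.time() >= expires_at:    # stale entry: drop it and report a miss
            del self._data[key]
            return None
        return value

    def set(self, key, value):
        self._data[key] = (time.time() + self.timeout, value)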

Event-based invalidation

Time-based invalidation is good in its simplicity, but it is not always applicable. Well then, we can reset the cache when the data changes. One problem with this approach is that whenever you add a new query that you cache, you have to add code that invalidates it when the data changes. If we use an ORM, then the data changes (in the good case) in one place: when the model is saved. Having one central place where data changes makes the task easier, but with a large number of different queries you keep adding more and more lines that reset various pieces of the cache. Thus we end up with excessive code coupling on our hands. Time to loosen it.
Let's use events: when a model is saved or deleted, the ORM generates an event, and whenever we cache something, we immediately attach a handler to the corresponding event that deletes that something from the cache. Everything is fine, except that writing a large number of similar handlers is tiresome, and the application logic gets overgrown with caching/invalidation logic like a pig with fat.
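
A sketch of what this looks like with a Django-style ORM; the signals API is Django's, but the Post model, the key scheme and the cached query are assumptions made up for illustration.

from django.core.cache import cache
from django.db.models.signals import post_save, post_delete
from django.dispatch import receiver

from myapp.models import Post   # hypothetical model

def published_posts(category_id):
    key = f"posts:published:{category_id}"
    posts = cache.get(key)
    if posts is None:
        posts = list(Post.objects.filter(category_id=category_id, published=True))
        cache.set(key, posts)
    return posts

# One handler per cached query: tiresome to write, and the caching logic
# starts leaking into the application code.
@receiver(post_save, sender=Post)
@receiver(post_delete, sender=Post)
def invalidate_published_posts(sender, instance, **kwargs):
    cache.delete(f"posts:published:{instance.category_id}")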

Automatic invalidation of ORM queries

Recall that we have an ORM, so for it each query is not just text but a certain structure: models, a tree of conditions, and so on. So, in theory, the ORM can both cache and register the needed invalidation handlers right at caching time. A damn attractive solution for lazy folks like me.

A small example. Suppose we execute the query:
select * from post where category_id = 2 and published
and cache it. Obviously, this query must be dropped from the cache if, when a post is added, updated or deleted, the condition category_id = 2 and published = true holds for either its old or its new version. After a while, for each model, lists of invalidators build up, each of which stores the list of queries that should be dropped:
post:
    category_id = 2 and published = true:
        select * from post where category_id = 2 and published
        select count(*) from post where category_id = 2 and published
        select * from post where category_id = 2 and published limit 20
    category_id = 3 and published = true:
        select * from post where category_id = 3 and published limit 20 offset 20
    category_id = 3 and published = false:
        select count(*) from post where category_id = 3 and not published
foo:
    a = 1 or b = 10:
        or_sql
    a in (2, 3) and b = 10:
        in_sql
    a > 1 and b = 10:
        gt_sql
etc.
In reality, it is more convenient for invalidators to store lists of cache keys rather than query texts; the texts are shown here for clarity.
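
To make the storage concrete, here is a minimal sketch of keeping the invalidators in the same place as the cache, for example in Redis: the cached result lives under its own key, and a set of dependent cache keys lives under a key derived from the condition. The conj:… key layout and the use of redis-py are my assumptions, not necessarily how a real implementation does it.

import pickle
import redis

r = redis.Redis()

def cache_query_result(model, conditions, cache_key, result, timeout=600):
    """conditions is a dict like {"category_id": 2, "published": True}."""
    # The cached query result itself (with a safety-net TTL).
    r.setex(cache_key, timeout, pickle.dumps(result))
    # The invalidator: a set of cache keys that depend on this condition,
    # stored right next to the cache.
    cond = ",".join(f"{field}={value}" for field, value in sorted(conditions.items()))
    r.sadd(f"conj:{model}:{cond}", cache_key)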

Let's see what happens when an object is added. We have to walk the whole list of invalidators and erase the cache keys for every condition the added object satisfies. But there can be a lot of invalidators, and they should be stored in the same place as the cache itself, i.e. most likely not in the process memory, so loading them all every time is undesirable, and checking all the conditions one by one would take too long.

Obviously, we need to somehow group the invalidators and filter them out without checking each one. Notice the pattern: many conditions differ only in their values. For example, the invalidators of the post model all look like category_id = ? and published = ?. Let's group the invalidators from the example by the following schemes:
post:
    category_id = ? and published = ?:
        2, true:
            select * from post where category_id = 2 and published
            select count(*) from post where category_id = 2 and published
            select * from post where category_id = 2 and published limit 20
        3, true:
            select * from post where category_id = 3 and published limit 20 offset 20
        3, false:
            select count(*) from post where category_id = 3 and not published
foo:
    a = ? or b = ?:
        1, 10:
            or_sql
    a in ? and b = ?:
        (2, 3), 10:
            in_sql
    a > ? and b = ?:
        1, 10:
            gt_sql

Look at the condition category_id = ? and published = ?: knowing the field values of the post being added, we can fill in the "?" placeholders unambiguously. If the object is:
{id: 42, title: "…", content: "…", category_id: 2, published: true}
then the only matching invalidator of that family is category_id = 2 and published = true, and therefore the corresponding 3 cache keys must be erased. That is, we don't need to check the conditions one by one; we obtain the needed invalidator directly from the scheme and the object's data.
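
A small sketch of that lookup: given a scheme (a list of field names) and the saved object's data, the invalidator key is built directly, with no scanning of conditions. The conj:… layout continues the assumption from the storage sketch above.

def invalidator_key(model, scheme, obj):
    """scheme is a tuple of field names, obj is a dict of the object's field values."""
    cond = ",".join(f"{field}={obj[field]}" for field in sorted(scheme))
    return f"conj:{model}:{cond}"

post = {"id": 42, "title": "…", "content": "…", "category_id": 2, "published": True}
invalidator_key("post", ("category_id", "published"), post)
# -> "conj:post:category_id=2,published=True", the set holding the 3 cache keys to erase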

But what about more complex conditions? In some cases something can be done: an OR can be decomposed into two invalidators, an IN can be expanded into ORs. In other cases you either have to complicate things, or make the invalidation redundant by simply discarding such conditions. Here is what the invalidators for foo look like after such transformations (a sketch of this normalization step follows the list):
foo:
    a = ?:
        1: or_sql
    b = ?:
        10: or_sql, gt_sql
    a = ? and b = ?:
        2, 10: in_sql
        3, 10: in_sql
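
As a sketch of how such a transformation might work (my own take on the normalization step, not the article's exact algorithm): a condition tree is flattened into a list of simple conjunctions of equalities, an OR branches into several conjunctions, an IN expands into ORs, and anything else is dropped, trading precision for redundancy.

def normalize(condition):
    """Turn a condition tree into a list of conjunctions (dicts of field -> value)."""
    op = condition[0]
    if op == "eq":                       # ("eq", field, value)
        return [{condition[1]: condition[2]}]
    if op == "in":                       # ("in", field, values) -> expand into ORs
        return [{condition[1]: value} for value in condition[2]]
    if op == "or":                       # ("or", cond1, cond2, ...) -> separate invalidators
        result = []
        for sub in condition[1:]:
            result.extend(normalize(sub))
        return result
    if op == "and":                      # ("and", cond1, cond2, ...) -> cross product
        result = [{}]
        for sub in condition[1:]:
            result = [{**acc, **extra} for acc in result for extra in normalize(sub)]
        return result
    return [{}]                          # >, <, like, ... are discarded: redundant but safe

normalize(("or", ("eq", "a", 1), ("eq", "b", 10)))        # [{'a': 1}, {'b': 10}]
normalize(("and", ("in", "a", (2, 3)), ("eq", "b", 10)))  # [{'a': 2, 'b': 10}, {'a': 3, 'b': 10}]
normalize(("and", ("gt", "a", 1), ("eq", "b", 10)))       # [{'b': 10}]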

Thus, for each model we only need to store its schemes (simply lists of fields), from which, when necessary, we build the invalidators and fetch the lists of keys to be erased.
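
Putting it together, a sketch of the invalidation step itself, reusing invalidator_key and the Redis layout from the sketches above: for every scheme registered for the model, build the invalidator from the object's data, fetch the set of dependent cache keys and erase them.

def invalidate_obj(r, model, schemes, obj):
    for scheme in schemes:
        conj_key = invalidator_key(model, scheme, obj)
        cache_keys = r.smembers(conj_key)
        if cache_keys:
            r.delete(*cache_keys)    # erase the cached query results
        r.delete(conj_key)           # the invalidator itself is empty now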

Let me walk through the invalidation procedure for foo. Suppose we fetched the object {id: 42, a: 1, b: 10} from the database, changed the value of a to 2 and wrote it back. On an update, the invalidation procedure must be run for both the old and the new state of the object. So, the invalidators for the old state are a = 1, b = 10 and a = 1 and b = 10, with the corresponding keys or_sql and gt_sql (the last invalidator does not exist and can be considered empty). For the new state we get the invalidators a = 2, b = 10 and a = 2 and b = 10, which adds the key in_sql. As a result, all 3 queries get erased.
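
The same walk-through as a usage example of the sketch above (the scheme list and the object dicts mirror the foo example; the expected effect is in the comments):

schemes = [("a",), ("b",), ("a", "b")]

old_state = {"id": 42, "a": 1, "b": 10}
new_state = {"id": 42, "a": 2, "b": 10}

invalidate_obj(r, "foo", schemes, old_state)   # a = 1, b = 10 -> erases or_sql and gt_sql
invalidate_obj(r, "foo", schemes, new_state)   # a = 2 and b = 10 -> erases in_sql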

Implementation

Where possible, I tried to abstract away from language and platform; nevertheless, a working implementation, running in a fairly loaded project, does exist. More about it, and about implementation tricks in general, in the next article.

Source: https://habr.com/ru/post/120471/

