
The problem of simultaneous cache rebuilding

A series of posts about “Web, caching and memcached” continues. Start here: 1, 2 and 3.

In these posts we talked about memcached, its architecture, possible uses, cache key selection, clustering, atomic operations, and the implementation of counters in memcached.



Today we consider the problem of simultaneous cache rebuilding, which arises when a large number of clients simultaneously access a cache entry that has just expired or been evicted, and which can lead to database overload.



The next post will be devoted to cache tagging.



Simultaneous cache rebuilding



This problem is typical primarily of high-load projects. Consider the following situation: we have a database query whose result is used on many pages, or on especially popular ones (for example, the main page). The result is cached with a certain “shelf life”, i.e. the cache expires after some time interval. The query itself is relatively complex and noticeably loads the backend (the DB). At some moment the memcached key is deleted because its lifetime has expired. At that moment several frontends (several, because the result is used often) request this key from memcached, discover that it is missing, and each tries to rebuild the cache by querying the DB. That is, the database simultaneously receives several identical requests, each of which noticeably loads it; once a certain threshold is exceeded, the query no longer completes in a reasonable time, even more frontends access the cache, discover that it is missing, and send even more queries to the database, which the database cannot cope with. As a result, the database server gets a critical load and goes down. This situation is called the dog-pile effect (see also, for example: korchasa.blogspot.com/2008/04/dog-pile.html). What to do, how to avoid such a situation?


Rebuilding the cache becomes a real problem only when two factors combine: many cache accesses per unit of time and an expensive query. One factor can compensate for the other: a relatively unpopular but very complex and slow query (which generally should not exist at all) can produce the same situation. So what to do?



We can suggest the following scheme: we no longer limit the lifetime of the cached key in memcached; it will stay there until it is evicted by other keys. But along with the cached data, we also record its real expiry time, for example:



{ "valid_to": "2008-11-03 11:53", "data": { ... } }
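
As an illustration, here is a minimal sketch of the write side of this scheme in Python, assuming the pymemcache client and JSON-serializable data; the server address, the helper name cache_store and the 10-minute logical lifetime are illustrative, not from the original post.

import json
from datetime import datetime, timedelta

from pymemcache.client.base import Client

mc = Client(("127.0.0.1", 11211))  # assumed local memcached instance

def cache_store(key, data, logical_ttl=600):
    """Store data together with its own 'valid to' timestamp."""
    wrapped = {
        "valid_to": (datetime.now() + timedelta(seconds=logical_ttl)).isoformat(),
        "data": data,
    }
    # No expire argument: memcached keeps the key until it is evicted,
    # so the stale value remains available as a fallback.
    mc.set(key, json.dumps(wrapped))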


Now, when fetching the key from memcached, we can check whether the cache has expired using the "valid to" field. If it has, the cache needs to be rebuilt, but we do it under a lock (locks are discussed in the next section). If we fail to acquire the lock, we can either wait a while (since the lock exists, someone is already rebuilding the cache) or return the old cache value. If we do acquire the lock, we build the cache ourselves, and other frontends will not rebuild the same cache, since they will see our lock. The main advantage of storing in memcached without an expiration time is the ability to return the old cache value while the cache is being rebuilt by someone else. What exactly to do, wait until someone else builds the cache and fetch the new value from memcached, or return the old value, depends on the task: on how acceptable a stale value is and how much time can be spent waiting. Most often you can afford a 2-3 second wait, polling for the lock to be removed, and if the cache is still not built (which is unlikely; it would mean the query takes more than 2-3 seconds), return the old value, freeing the frontend for other work.



An example of such an algorithm



  1. We access the cache by the key cache and find that its lifetime has expired.
  2. We try to acquire a lock on the key cache_lock.
    • Failed to acquire the lock:
      • wait for the lock to be released;
      • didn't wait it out: return the old cache data;
      • waited it out: fetch the key value again and return the new data (the cache was built by another process).
    • Managed to acquire the lock:
      • build the cache ourselves.
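
Continuing the sketch above (it reuses mc and cache_store), the whole flow might look roughly like this; the _lock key suffix, the 10-second lock lifetime and the 2-3 second polling wait are illustrative assumptions. The lock itself uses the add command, which is justified in the next section.

import json
import time
from datetime import datetime

def cache_fetch(key, rebuild, lock_ttl=10, wait=3.0, step=0.2):
    """Return cached data, rebuilding it under a memcached lock if stale."""
    raw = mc.get(key)
    wrapped = json.loads(raw) if raw else None
    if wrapped and datetime.fromisoformat(wrapped["valid_to"]) > datetime.now():
        return wrapped["data"]  # cache is still fresh

    lock_key = key + "_lock"
    # add is atomic: only one process gets True (see the next section);
    # noreply=False is needed to actually see the result.
    if mc.add(lock_key, "1", expire=lock_ttl, noreply=False):
        try:
            data = rebuild()  # the heavy DB query, executed only once
            cache_store(key, data)
            return data
        finally:
            mc.delete(lock_key)

    # Someone else is rebuilding: poll briefly for the lock to disappear.
    deadline = time.time() + wait
    while time.time() < deadline:
        time.sleep(step)
        if mc.get(lock_key) is None:  # lock released, re-read the cache
            raw = mc.get(key)
            if raw:
                return json.loads(raw)["data"]
            break
    # Didn't wait it out: fall back to the stale value if we have one.
    return wrapped["data"] if wrapped else rebuild()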


Such a scheme makes it possible to exclude, or at least minimize, situations where the backend "collapses" under identical "heavy" queries, when it is really enough to execute the query only once. One last question remains: how do we ensure correct locking? Obviously, since the problem of simultaneous rebuilding arises on different frontends, the lock must live in a place accessible to all of them, that is, in memcached.





Locks in memcached



Consider two options for implementing a lock (a mutex, a binary semaphore) on top of memcached. The first one is incorrect: it cannot guarantee mutual exclusion of parallel processes, but it is obvious. The second one is completely correct, though not as obvious.



Suppose we want to take a lock on the key 'lock': we try to fetch its value with the get operation. If the key is not found, there is no lock, and we use the set operation to set the key's value, for example, to one, with a lifetime set to a small interval that exceeds the maximum lock lifetime, for example, 10 seconds. Now, if the frontend terminates abnormally and does not release the lock, the lock will be destroyed automatically after 10 seconds. So: with set we take the lock, perform all the necessary actions, then release the lock simply by deleting the corresponding key with the del command. If the first get operation returned a value for the key, the lock has already been taken by another process and our lock attempt fails.
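
In the same sketch, this naive variant might look like the following (the function names lock_naive and unlock are illustrative):

def lock_naive(name, ttl=10):
    """Incorrect lock: get-then-set is not atomic."""
    if mc.get(name) is not None:
        return False  # someone else already holds the lock
    # Race window: another process can run the same get between our
    # get and set, and then both believe they own the lock.
    mc.set(name, "1", expire=ttl)  # TTL guards against a crashed owner
    return True

def unlock(name):
    mc.delete(name)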



The described method has a drawback: a race condition. Two processes can execute get simultaneously, both can find that the key is absent, both will execute set, and both will assume they took the lock successfully. In situations like simultaneous cache rebuilding this may be acceptable, because the goal here is not to exclude all other processes but to drastically reduce the number of simultaneous queries to the database, and even this simple, incorrect option can provide that.



The second option is correct and even simpler than the first. To acquire the lock it is enough to execute a single command: add, specifying the key name and a lifetime (as small as in the first variant). The add command succeeds only if the key does not yet exist in memcached, that is, only if our process is the one that managed to acquire the lock. We then perform the necessary actions and release the lock with the del command. If add returns the error "such key already exists", the lock was taken earlier by some other process.
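
The correct variant, in the same sketch, is a one-liner; noreply=False is a pymemcache detail needed to observe whether add actually succeeded:

def lock(name, ttl=10):
    """Correct lock: add is atomic, so it succeeds for exactly one process."""
    return mc.add(name, "1", expire=ttl, noreply=False)

# Usage: take the lock, do the work, always release it.
if lock("lock"):
    try:
        pass  # rebuild the cache here
    finally:
        unlock("lock")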

Source: https://habr.com/ru/post/43540/


