
Lock-Free Algorithms: An Unreliable Cache

(The fact that the literature has not yet settled on a Russian translation of "lock-free" does not at all convince me that there should not be one.)

Suppose profiling your application revealed that a significant share of processor time is spent in some computational function, and moreover that this function is called over and over with the same parameters, repeating the same calculations again and again. A simple optimization suggests itself: a one-entry cache that stores the input and the result of the most recent computation.

 BOOL IsPrime(int n)
 {
  static int nLast = 1;
  static BOOL fLastIsPrime = FALSE;

  // if the parameter has not changed since last time,
  // use the ready-made result
  if (n == nLast) return fLastIsPrime;

  // compute and remember the new result
  nLast = n;
  fLastIsPrime = slow_IsPrime(n);
  return fLastIsPrime;
 }

Of course, this code is not thread-safe: if one thread is inside the slow_IsPrime() call, another thread calling IsPrime() may find nLast and fLastIsPrime inconsistent with each other.
A simple solution is to wrap the code in a critical section; but this simplicity comes at the expense of performance: if, say, nLast = 5 and fLastIsPrime = TRUE, and two threads call IsPrime(5) simultaneously, there is no need at all for them to queue up: nothing prevents them from using the cached value at the same time.
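For comparison, the straightforward locked version might look like the sketch below. It is not the article's code: a POSIX mutex stands in for a Windows critical section, and slow_IsPrime is a stub trial-division check, both assumptions for the sake of a self-contained example.

```c
#include <pthread.h>
#include <stdbool.h>

/* Stub for the expensive computation: naive trial division. */
static bool slow_IsPrime(int n)
{
    if (n < 2) return false;
    for (int d = 2; (long long)d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;

bool IsPrime(int n)
{
    static int nLast = 1;
    static bool fLastIsPrime = false;
    bool fResult;

    pthread_mutex_lock(&g_mutex);   /* every caller, reader or writer, queues here */
    if (n != nLast) {
        nLast = n;
        fLastIsPrime = slow_IsPrime(n);
    }
    fResult = fLastIsPrime;
    pthread_mutex_unlock(&g_mutex);
    return fResult;
}
```

Note that even two threads that merely want to read the same cached value serialize on the mutex, which is exactly the cost the lock-free version avoids.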

Let's see how our cache can be implemented without a lock. We will use two tricks from the previous post at once: bumping a version number on every data change, and the "try/commit/(abandon)" pattern to synchronize writes to the cache.

 #define IsLocked(l) ((l) & 1)

 BOOL IsPrime(int n)
 {
  static int nLast = 1;
  static BOOL fLastIsPrime = FALSE;
  static LONG lCounter = 0;

  // try to take the value from the cache
  LONG lCounterStart = InterlockedReadAcquire(&lCounter, -1);
  if (!IsLocked(lCounterStart) && n == nLast) {
   BOOL fResult = fLastIsPrime;
   // did anybody touch the cache behind our back?
   if (InterlockedReadRelease(&lCounter, -1) == lCounterStart)
    return fResult;
  }

  // either the read from the cache failed, or the value did not match:
  // compute the usual way
  BOOL fIsPrime = slow_IsPrime(n);

  // try to save the value in the cache
  lCounterStart = lCounter;
  if (!IsLocked(lCounterStart) &&
      InterlockedCompareExchangeAcquire(&lCounter,
               lCounterStart + 1, lCounterStart) == lCounterStart) {
   nLast = n;
   fLastIsPrime = fIsPrime;
   InterlockedCompareExchangeRelease(&lCounter,
               lCounterStart + 2, lCounterStart + 1);
  }
  return fIsPrime;
 }

The low-order bit of lCounter means that the cache is "locked" for writing; the remaining bits store the version number. This layout lets us combine unlocking and bumping the version number into one simple operation.
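The arithmetic of this layout can be sketched with a pair of helper macros (IsLocked is from the listing above; Version is a hypothetical companion added here for illustration): adding 1 to an even counter sets the lock bit without changing the version, and adding 1 again clears the bit while bumping the version, so the writer's final +2 over the starting value unlocks and publishes a new version in one step.

```c
/* Low-order bit of the counter: "locked for writing" flag.
   Remaining bits: the version number.

   even counter + 1 -> odd:  same version, now locked
   odd counter  + 1 -> even: unlocked, version bumped by one */
#define IsLocked(l)  ((l) & 1)
#define Version(l)   ((l) >> 1)
```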

The function consists of two parts: reading from the cache and writing to it. When reading, we first read lCounter with an Acquire operation, to make sure the values of nLast and fLastIsPrime we are about to read were written no earlier than the version number. If, judging by the number we read, the cache is not locked, we read from it the last parameter value and result. If the result fits, we take it; but first we must make sure the version number did not change while we were reading from the cache. If it did change, then the values we read may not match each other, and the "suitable" result is in fact unreliable.

In both cases, whether the cached result did not fit or the data we read turned out to be unreliable, we simply give up on the cache that proved useless and perform the computation in full.

When writing to the cache, we not only check that the cache is not locked, but also lock it ourselves at the same time, by setting the low bit. (If it turns out that the cache was already locked, that is no tragedy either: nobody will be offended if we fail to write the computed result into it. The goal is to save time where we can, not to avoid recomputation at any cost.) Having locked the cache, we update its data, and then with a single InterlockedCompareExchangeRelease operation remove the lock and bump the version number. The operation must be a Release, so that the data changes are committed to memory before the lock is released.

We exploit the fact that operations on the cache do not have to succeed: on both reads and writes, if the cache is locked, it is better to forgo the benefits of caching, "to hell with it all, I'll do everything myself!", than to queue up for the right to access the cache. At the cost of a redundant computation, we avoid problems like priority inversion (a high-priority thread waiting for a lock held by a low-priority thread).
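The same scheme can be sketched in portable C11 atomics, as an assumption-laden stand-in for the Windows listing above: atomic_load_explicit with memory_order_acquire plays the role of InterlockedReadAcquire, an acquire fence plus relaxed re-read replaces InterlockedReadRelease, and a compare-exchange plus release store replaces the two Interlocked compare-exchanges. slow_IsPrime is again a stub; strictly speaking, a fully conforming version would also make nLast and fLastIsPrime atomic, but they are left plain here to mirror the article's listing.

```c
#include <stdatomic.h>
#include <stdbool.h>

static bool slow_IsPrime(int n)
{
    if (n < 2) return false;
    for (int d = 2; (long long)d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

#define IsLocked(l) ((l) & 1)

bool IsPrime(int n)
{
    static int nLast = 1;
    static bool fLastIsPrime = false;
    static atomic_long lCounter = 0;

    /* try to take the value from the cache */
    long lCounterStart = atomic_load_explicit(&lCounter, memory_order_acquire);
    if (!IsLocked(lCounterStart) && n == nLast) {
        bool fResult = fLastIsPrime;
        /* did anybody touch the cache behind our back? */
        atomic_thread_fence(memory_order_acquire);
        if (atomic_load_explicit(&lCounter, memory_order_relaxed) == lCounterStart)
            return fResult;
    }

    /* cache miss or unreliable read: compute the usual way */
    bool fIsPrime = slow_IsPrime(n);

    /* try to save the value: lock the cache by setting the low bit */
    long lExpected = atomic_load_explicit(&lCounter, memory_order_relaxed);
    if (!IsLocked(lExpected) &&
        atomic_compare_exchange_strong_explicit(&lCounter, &lExpected,
                lExpected + 1, memory_order_acquire, memory_order_relaxed)) {
        nLast = n;
        fLastIsPrime = fIsPrime;
        /* unlock and bump the version in a single release store */
        atomic_store_explicit(&lCounter, lExpected + 2, memory_order_release);
    }
    return fIsPrime;
}
```

The reader side is the classic seqlock shape: load the counter with acquire, read the payload, then re-check the counter; the writer side is the try/commit/(abandon) shape, where a failed compare-exchange simply abandons the update.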

It is important to note that the resulting system is not completely lock-free: using lock-free primitives, we have in effect implemented a lock. If the thread that locked the cache for writing gets "stuck", all the other threads, though they keep running, will no longer be able to use the cache.

A similar system can be implemented with TryEnterCriticalSection:

 // g_cs is a global CRITICAL_SECTION, initialized elsewhere
 BOOL IsPrime(int n)
 {
  static int nLast = 1;
  static BOOL fLastIsPrime = FALSE;
  BOOL fHaveAnswer = FALSE;
  BOOL fIsPrime;

  // try to take the value from the cache
  if (TryEnterCriticalSection(&g_cs)) {
   if (n == nLast) {
    fHaveAnswer = TRUE;
    fIsPrime = fLastIsPrime;
   }
   LeaveCriticalSection(&g_cs);
  }
  if (fHaveAnswer) return fIsPrime;

  // either the critical section is busy, or the value did not match:
  // compute the usual way
  fIsPrime = slow_IsPrime(n);

  // try to save the value in the cache
  if (TryEnterCriticalSection(&g_cs)) {
   nLast = n;
   fLastIsPrime = fIsPrime;
   LeaveCriticalSection(&g_cs);
  }
  return fIsPrime;
 }

The first approach is better because threads reading from the cache do not get in each other's way, whereas with the critical section, simultaneous reads are impossible.

Starting with Windows 7, we can also employ slim reader-writer locks:

 // g_lock is a global SRWLOCK, initialized elsewhere
 BOOL IsPrime(int n)
 {
  static int nLast = 1;
  static BOOL fLastIsPrime = FALSE;
  BOOL fHaveAnswer = FALSE;
  BOOL fIsPrime;

  // try to take the value from the cache
  if (TryAcquireSRWLockShared(&g_lock)) {
   if (n == nLast) {
    fHaveAnswer = TRUE;
    fIsPrime = fLastIsPrime;
   }
   ReleaseSRWLockShared(&g_lock);
  }
  if (fHaveAnswer) return fIsPrime;

  // either the cache is locked for writing, or the value did not match:
  // compute the usual way
  fIsPrime = slow_IsPrime(n);

  // try to save the value in the cache
  if (TryAcquireSRWLockExclusive(&g_lock)) {
   nLast = n;
   fLastIsPrime = fIsPrime;
   ReleaseSRWLockExclusive(&g_lock);
  }
  return fIsPrime;
 }

But in this implementation, threads reading from the cache can interfere with writing to it, whereas in our first, "lock-free in spirit" implementation, only simultaneous write attempts conflict. If IsPrime() is called a few times with one value (say, 13) and then many times in a row with another (say, 17), the crowd of threads checking whether the result for 17 is already cached will simply never let any thread through to write the result for 17 into the cache! It turns out that under a very heavy load on the cache (the very reason we added it!), the implementation based on slim reader-writer locks renders the cache nearly useless.

Source: https://habr.com/ru/post/118515/
