Quite often we have to store data in memcached or MemcacheDB. It can be relatively simple data, for example, cached database query results, and sometimes it is necessary to store and process more complex data structures that are updated from several processes simultaneously, provide fast reads, and so on. The implementation of such data structures no longer fits into a combination of the memcached get/set commands. This article describes how to store some data structures in memcached, with code examples and a description of the main ideas.
Memcached and MemcacheDB are considered together in this article because they share a common access interface and the logic of most of the data structures is the same for both; below we will simply call them “memcached”. Why store data structures in memcached at all? Most often for distributed access to data from different processes, from different servers, and so on. And sometimes the interface provided by MemcacheDB is sufficient to solve the data storage problem, and a DBMS is no longer necessary.
Sometimes a project is initially developed for the non-distributed case (running within a single server), but if future scaling is anticipated, it is better to choose from the start algorithms and data structures that scale easily. For example, even if the data is stored simply in the memory of the process, as long as the access interface follows the semantics of memcached, moving to a distributed and scalable architecture only requires replacing calls to the internal storage with calls to a memcached server (or server cluster).
A total of 6 data structures will be considered:
- lock;
- counter;
- visitor counter;
- event log;
- array;
- table.
The first three are covered in this part, the fourth through the sixth in the next.
Code samples will be written in Python; they will use the following memcached access interface:

class Memcache(object):
    def get(self, key):
        """
        Get the value of a key.

        @param key: key
        @type key: C{str}
        @return: value of the key
        @raises KeyError: if the key does not exist
        """

    def add(self, key, value, expire=0):
        """
        Set the value of a key, if the key does not exist yet.

        @param key: key
        @type key: C{str}
        @param value: value
        @param expire: key "time to live" in seconds (0 — unlimited)
        @type expire: C{int}
        @raises KeyError: if the key already exists
        """

    def incr(self, key, value=1):
        """
        Atomically increase the value of a key.

        @param key: key
        @type key: C{str}
        @param value: increment
        @type value: C{int}
        @return: new value of the key (after the increase)
        @rtype: C{int}
        @raises KeyError: if the key does not exist
        """

    def append(self, key, value):
        """
        Append a value to an existing key.

        @param key: key
        @type key: C{str}
        @param value: value to append
        @raises KeyError: if the key does not exist
        """

    def delete(self, key):
        """
        Delete a key.

        @param key: key
        @type key: C{str}
        """
All our data structures will be derived from the following base class:

class MemcacheObject(object):
    def __init__(self, mc):
        """
        Constructor.

        @param mc: memcached access object
        @type mc: L{Memcache}
        """
        self.mc = mc
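To try the examples locally (and to illustrate the point above that a process-local store behind the same interface can later be swapped for a real memcached cluster), here is a minimal in-process implementation of this interface. It is a sketch for experimentation, not part of the original article; the class name and the internal representation are my own choices.

```python
import time as _time


class LocalMemcache(object):
    """In-process stand-in for the Memcache interface above.

    A testing sketch: it mimics the semantics of get/add/incr/
    append/delete, including key expiration, without any network.
    """
    def __init__(self):
        self._data = {}  # key -> (value, expire_at or None)

    def _alive(self, key):
        if key not in self._data:
            return False
        _value, expire_at = self._data[key]
        if expire_at is not None and _time.time() >= expire_at:
            del self._data[key]  # an expired key behaves as missing
            return False
        return True

    def get(self, key):
        if not self._alive(key):
            raise KeyError(key)
        return self._data[key][0]

    def add(self, key, value, expire=0):
        if self._alive(key):
            raise KeyError(key)  # add fails if the key already exists
        expire_at = _time.time() + expire if expire else None
        self._data[key] = (value, expire_at)

    def incr(self, key, value=1):
        if not self._alive(key):
            raise KeyError(key)
        old, expire_at = self._data[key]
        self._data[key] = (old + value, expire_at)
        return old + value

    def append(self, key, value):
        if not self._alive(key):
            raise KeyError(key)
        old, expire_at = self._data[key]
        self._data[key] = (old + value, expire_at)

    def delete(self, key):
        self._data.pop(key, None)
```

Any of the classes below can then be constructed as, for example, `MCCounter(LocalMemcache(), 'hits')` and later pointed at a real memcached client with the same interface.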
Lock
Task
A lock (mutex, or rather a spinlock in this situation) is not exactly a data structure, but it can be used as a “brick” when building complex data structures in memcached. A lock is used to exclude simultaneous access to certain keys; there is no locking primitive in the memcached API, so it is emulated using other operations.
So, a lock is identified by its name, it works distributedly (that is, from any process that has access to memcached), and it has an automatic expiration (in case the process that acquired the lock cannot release it, for example, because of a crash).
Lock operations:
- try to lock (try-lock);
- unlock (unlock).
Solution
class MCLock(MemcacheObject):
    def __init__(self, mc, name):
        """
        Constructor.

        @param name: lock name
        @type name: C{str}
        """
        super(MCLock, self).__init__(mc)
        self.key = 'lock' + name

    def try_lock(self):
        """
        Try to acquire the lock.

        @return: whether the lock was acquired
        @rtype: C{bool}
        """
        try:
            # add() succeeds for exactly one process: the first one
            # to create the key gets the lock; the expiration time
            # provides the automatic release in case of a crash
            self.mc.add(self.key, 1, 10)
            return True
        except KeyError:
            return False

    def unlock(self):
        """
        Release the lock (it must be held by the current process).
        """
        self.mc.delete(self.key)
Discussion
The correctness of the lock is based on the add operation, which guarantees that exactly one process will be the first to set the key value; all other processes will get an error. The class above can be wrapped more conveniently, for example, as a Python context manager (with statement). Locks were discussed in more detail in a post about simultaneous rebuilding of caches.
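The context-manager wrapper mentioned above might look like the following sketch. It assumes only a lock object exposing the try-lock/unlock operations described earlier (the method names `try_lock()`/`unlock()` and the retry delay are my assumptions):

```python
import time
from contextlib import contextmanager


@contextmanager
def locked(lock, retry_delay=0.05):
    """Spin until the lock is acquired, then release it on exit.

    `lock` is assumed to expose try_lock() -> bool and unlock(),
    matching the try-lock/unlock operations described above.
    """
    while not lock.try_lock():
        time.sleep(retry_delay)  # wait a little between attempts
    try:
        yield
    finally:
        lock.unlock()
```

Usage would then be simply `with locked(my_lock): ...`, and the lock is released even if the body raises an exception.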
Atomic counter
Task
It is necessary to count events that occur in a distributed fashion on different servers, and to get the current value of the counter. The counter must not “lose” events or count “extra” ones.
Data type: integer.
Initial value: zero.
Operations:
- increase the counter value by the specified value;
- get the current value of the counter.
Solution
class MCCounter(MemcacheObject):
    def __init__(self, mc, name):
        """
        Constructor.

        @param name: counter name
        @type name: C{str}
        """
        super(MCCounter, self).__init__(mc)
        self.key = 'counter' + name

    def increment(self, value=1):
        """
        Increase the counter by the given value.

        @param value: increment
        @type value: C{int}
        """
        while True:
            try:
                self.mc.incr(self.key, value)
                return
            except KeyError:
                # the key does not exist yet, try to create it
                pass
            try:
                self.mc.add(self.key, value, 0)
                return
            except KeyError:
                # somebody created the key meanwhile, retry incr
                pass

    def value(self):
        """
        Get the current value of the counter.

        @return: current value
        @rtype: C{int}
        """
        try:
            return self.mc.get(self.key)
        except KeyError:
            return 0
Discussion
The implementation of the counter is quite obvious; the trickiest moment is the “eternal” loop in the increment method. It is needed to correctly initialize the counter value under concurrent access. If the incr operation fails because the key does not exist, the counter key must be created, but several processes may try to do this simultaneously: only one of them will succeed with add, and the others will get an error and increment the already created counter using incr. An infinite loop is impossible, because one of the incr or add operations must succeed: once the key exists in memcached, incr will succeed for this key for as long as the key lives.
How to deal with the situation when the counter key disappears from memcached? The key can disappear in two situations:
- memcached did not have enough memory to store new keys;
- the memcached process terminated abnormally (the server “crashed”).
(In the case of MemcacheDB with replication configured, a server crash should not lead to loss of keys.) In the first case, memcached just has to be provisioned properly: there must be enough memory for the counters, otherwise we start losing values, which is unacceptable in this situation (but perfectly normal, for example, for storing caches in memcached). The second case is always possible; if it happens often, the counters can be duplicated on several memcached servers.
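The article does not spell out how such duplication would work; one possible sketch (the class name and the max-merge read strategy are my assumptions) is to send every increment to all replicas and, when reading, take the maximum across them, since a replica that lost its key can only under-count:

```python
class ReplicatedCounter(object):
    """Sketch: keep the same counter on several memcached servers.

    Each element of `counters` is assumed to behave like MCCounter
    above (increment()/value()). Reading takes the maximum across
    replicas, because a replica that lost its key (eviction, crash)
    can only have a smaller value than the true count.
    """
    def __init__(self, counters):
        self.counters = counters

    def increment(self, value=1):
        # best effort: update every replica
        for c in self.counters:
            c.increment(value)

    def value(self):
        return max(c.value() for c in self.counters)
```

Note that this is only a mitigation: increments that arrive at some replicas while another server is down still leave the replicas slightly divergent, which the max-merge hides but does not repair.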
Visitor counter
Task
It is necessary to count the number of unique visitors to the site during a certain period T. For example, T may be equal to 5 minutes. Suppose that for each hit we can tell whether the visitor’s previous visit to the site was more or less than T minutes ago (that is, whether the visitor is unique within the period T).
Operations on the counter:
- increase the counter value by one (a hit from a visitor unique within the period T);
- get the current number of visitors.
Solution
def time():
    """
    Current time in seconds (e.g., UNIX Epoch).

    @return: current time
    @rtype: C{int}
    """

class MCVisitorCounter(MemcacheObject):
    def __init__(self, mc, name, T):
        """
        Constructor.

        @param name: counter name
        @type name: C{str}
        @param T: period of the counter, in seconds
        @type T: C{int}
        """
        super(MCVisitorCounter, self).__init__(mc)
        self.T = T
        self.keys = ('counter' + name + '_0', 'counter' + name + '_1')

    def _active(self):
        """
        Index of the currently read (active) counter.

        @return: 0 or 1
        """
        return 0 if (time() % (2*self.T)) < self.T else 1

    def increment(self):
        """
        Register a unique visitor (increase the shadow counter).
        """
        active = self._active()
        while True:
            try:
                self.mc.incr(self.keys[1-active], 1)
                return
            except KeyError:
                pass
            try:
                # create the shadow counter with a lifetime of 2*T
                self.mc.add(self.keys[1-active], 1, 2*self.T)
                return
            except KeyError:
                pass

    def value(self):
        """
        Get the current number of visitors.

        @return: current value
        @rtype: C{int}
        """
        try:
            return self.mc.get(self.keys[self._active()])
        except KeyError:
            return 0
Discussion

The basic structure of working with a counter in memcached in this task repeats the solution of the atomic counter problem. The main idea is to use a shadow and a current counter: the shadow counter receives updates (increments), and the current one is used to read the number of visitors (its value was accumulated while it was the shadow one). Every period T the current and the shadow counters swap roles. The key of the shadow counter is created with a lifetime of 2*T seconds, that is, it lives T seconds while it is the shadow counter (and is being increased) and T seconds while its value is being read. By the time this key becomes the shadow counter again, it should already have been destroyed by memcached, since its lifetime has expired. The idea of a shadow and a current counter resembles, for example, double buffering when drawing on the screen in games.
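The role swap can be illustrated by restating the _active() expression as a standalone function (a small illustration of the arithmetic, not code from the original article):

```python
def active_index(now, T):
    """Which of the two counters is currently read (0 or 1).

    Mirrors the _active() expression: during the first T seconds of
    each 2*T-second window the index is 0, during the next T seconds
    it is 1; the shadow (written) counter is always `1 - active`.
    """
    return 0 if (now % (2 * T)) < T else 1
```

For T = 300 seconds the index is 0 during seconds 0..299 of each 600-second window and 1 during seconds 300..599, which is how the two keys trade the “current” and “shadow” roles every period.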
Pay attention to the _active() function: it is implemented via the remainder of dividing the current time in seconds, and not, for example, via a periodic (once every T seconds) flip of an active flag. If the time on all servers is synchronized with reasonable accuracy, the result of _active() will be the same on all of them, which is important for the correct distributed operation of this data structure.
The described approach works correctly only with a sufficiently large number of visitors, when the interval between a change of the _active() value and the next increment() call is as short as possible, that is, when increment() is called frequently: for example, 1000 visitors with a period T = 3 minutes. Otherwise the creation and destruction of keys drifts out of sync with the moments when the _active() value changes, and the visitors accumulated in a shadow counter that gets destroyed while values are still being accumulated (and then created anew) are lost. In this situation the key’s lifetime of 2*T is “shifted” relative to the moments 0, T of the time() % (2*T) function. Because memcached tracks key lifetimes discretely, with a resolution of one second, the shift of the key lifetime can be 1, 2, ..., T-1 seconds (no shift is 0 seconds). The resolution means that if a key was created at second 25 and millisecond 31 with a lifetime of 10 seconds, it will be destroyed at second 35 (or 36) and millisecond 0 (not at millisecond 31). To compensate for a shift of a whole number of seconds, the add() call in increment() should be corrected as follows:
self.mc.add(self.keys[1-active], 1, 2*T - time() % T)
The value of the visitor counter does not change during a period T; to make the display look livelier, at output time you can add to the counter value a normally distributed random value with a small variance and an expected value of about 5% of the counter value.
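The display smoothing suggested above might look like this sketch (the 5% mean comes from the text; the exact standard deviation, the function name, and the clamping to non-negative integers are my assumptions):

```python
import random


def displayed_value(counter_value, rng=random):
    """Value to show to the user instead of the raw counter.

    Adds normally distributed noise (mean about 5% of the value,
    small variance) so the displayed number varies between page
    renders even though the counter is constant within a period T.
    """
    if counter_value == 0:
        return 0
    noise = rng.gauss(0.05 * counter_value, 0.01 * counter_value)
    return max(0, int(counter_value + noise))
```

Passing an explicit `rng` (for example `random.Random(seed)`) keeps the jitter reproducible in tests; in production the module-level default is fine.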
A slightly more complicated version of such a counter (with interpolation in time), built on exactly the same idea, was described in a post about the atomicity of operations and counters in memcached.
UPD: The second part of the article.