Cache theory

Introduction

Caching is an integral part of all complex systems. It can significantly speed up an application by storing hard-to-compute data for later reuse. I will not go into definitions, though; most developers know perfectly well what a cache is.

In this article I will try to sort out the problem of caching, focusing primarily on websites and content management systems. A warning up front: these are my personal views, and they do not claim to be the ultimate truth. The terminology is also my own; use it as you see fit. Constructive criticism is welcome.

Core caching issues

Storing and reusing cached data is rarely a problem in itself. The main problem is determining when the cache has gone stale: whether the data in the cache can still be used, or whether it must be computed again because something may have changed. I will call this the problem of cache relevance (my terminology).

The second caching problem is performance. What if the algorithm runs faster without caching than with it? Absurd as it sounds at first glance, this does happen. Caching algorithms consume resources too, and the resources spent on the cache can easily exceed those needed to compute the data itself. This typically happens in 2 cases. The first is when a 10-page site is built on a powerful framework: there is so little source data that computing it is very fast, and caching only slows things down. The second is the opposite: the site is so large that the cache grows to a size at which looking up entries in it becomes significantly slower.

It is impossible to solve both problems completely. One can only choose the method that gives the best performance, and better still, have that method chosen automatically.

The performance problem is less pressing. It is solved either by disabling caching in the configuration, or by periodically physically deleting the cached data and moving to faster hardware.

The problem of relevance, however, is much more complicated...

Cache theory

I distinguish 4 main types of caching:

1. Independent, or static: used when there is no need to check whether the object has changed. The algorithm is simple: if the data is in the cache, return it; otherwise compute it. This is effective within a single process, when the results of complex intermediate calculations are needed repeatedly. When the process ends, the static cache dies, and the next process computes all the data again. In a sense, every temporary variable in any programming language is a static cache. To cache the results of a function, you can use static variables, as in this PHP example:

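function getData($id)
{
    // Static cache: keeps its contents between calls within one process
    static $data = array();

    // Cache hit: return the stored result
    if (isset($data[$id])) return $data[$id];

    // Cache miss: compute $newData here (expensive calculation omitted)
    $data[$id] = $newData;

    return $data[$id];
}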

2. Explicitly dependent: used when the decision to update the cache is based on some easily computed criterion. This covers, first of all, caching data read from a specific file and time-based caching. In both cases it is easy to determine whether the file has changed or whether the cache has expired (the cache explicitly depends on these factors). The algorithms are again quite simple and come down to condition checks, but unlike the static cache, this cache can be shared by different processes.
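Both checks can be sketched in a few lines of PHP. This is only an illustration under my own assumptions: the function name getCached, its parameters, and the idea of storing the result serialized in a separate cache file are hypothetical, and the source file is assumed to exist.

```php
<?php
// Hypothetical helper: returns data derived from $sourceFile, rebuilding it
// when the source file is newer than the cache file or when the cache
// is older than $ttl seconds.
function getCached(string $sourceFile, string $cacheFile, int $ttl, callable $build)
{
    clearstatcache(); // make sure filemtime() does not return remembered values

    if (is_file($cacheFile)) {
        $cachedAt = filemtime($cacheFile);
        $fresh = $cachedAt >= filemtime($sourceFile) // source not changed after caching
              && (time() - $cachedAt) < $ttl;        // cache has not expired
        if ($fresh) {
            return unserialize(file_get_contents($cacheFile));
        }
    }

    // Cache is missing or stale: recalculate and store the result
    $data = $build($sourceFile);
    file_put_contents($cacheFile, serialize($data), LOCK_EX);

    return $data;
}
```

Since the cache lives in a file, any process that knows the path can reuse it, which is exactly what distinguishes this type from the static cache.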

3. Implicitly dependent: used when a change in the cached object depends on many factors. For example, the object may contain many other objects that can change as well. That is, the object depends implicitly (indirectly) on others, which may in turn depend on others, and so on. Together these factors form the object's implicit dependencies, and to decide whether the object has changed you have to "interrogate" all of them, which is not always possible or sensible. Take a function that displays information from a database. We can cache its result, but we need to know when the data changes in order to refresh it. Finding that out means querying the table, which costs about as much as running the function itself. And what if there are many tables? The cache loses its point.

This is solved with a so-called "dependency table". The table has 2 columns: objects and the times they were last changed. Whenever an object changes, its time in the table must be updated. Access to the table itself is very fast (as a rule it lives in a static or explicitly dependent cache), so the change times of all the objects we need can be obtained very quickly. If the change time of at least one of those objects is later than the time the cache was created, the cache must be reset. Implicit dependencies thus become explicit: the object depends on a specific set of times in the dependency table.

Here is an example. A CMS module displays comments. The database has 2 tables, users and comments, linked by a one-to-many relationship. When a user's data changes, we write the current time to the "users" row of the dependency table. When comments change, we do the same with the "comments" row. The module knows that it depends on "users" and "comments"; if at least one of those times is later than the time the cache was created, the cache is reset, otherwise it is used. Note that "users" and "comments" do not necessarily record when a table changed, but rather when a certain entity changed, and that entity may consist of several tables and other parameters.
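The dependency table and the check against it might look roughly like this in PHP. It is a sketch, not a finished implementation: the class DependencyTable, its methods touch and lastChange, and the function isCacheValid are names I made up for illustration, and the table is kept in memory here, whereas a real one would sit in a fast store shared between processes.

```php
<?php
// Hypothetical in-memory dependency table: entity name => time of last change.
class DependencyTable
{
    private array $changedAt = [];

    // Record that an entity (e.g. "users" or "comments") has changed
    public function touch(string $entity, ?int $time = null): void
    {
        $this->changedAt[$entity] = $time ?? time();
    }

    // Latest change time among the given entities (0 if none was ever changed)
    public function lastChange(array $entities): int
    {
        $latest = 0;
        foreach ($entities as $entity) {
            $latest = max($latest, $this->changedAt[$entity] ?? 0);
        }
        return $latest;
    }
}

// The cache is valid only if none of its dependencies
// changed after the moment the cache was created.
function isCacheValid(DependencyTable $table, int $cachedAt, array $dependsOn): bool
{
    return $table->lastChange($dependsOn) <= $cachedAt;
}
```

For the comments module, the code that saves a user would call touch('users'), the code that saves a comment would call touch('comments'), and the module would call isCacheValid with its cache creation time and the list ['users', 'comments'] before deciding whether to reuse the cached output.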

4. Conditionally dependent: used when an implicitly dependent cache can be reduced to an explicitly dependent or static one if some condition is met. For example, one can argue that the time of the last change to the dependency table is the time of the last change to the site's data. If the page contains no user-specific data (read: the user is not logged in), the whole page depends only on the dependency table. So, on that condition, the whole page can be cached. This category also includes a cache that is simply physically deleted at the right moment.
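The whole-page case can be sketched as follows. Everything here is assumed for illustration: the function renderPage, the $pageCache array (URL => HTML plus creation time), the $render callback, and $lastSiteChange, which stands for the time of the last change to the dependency table.

```php
<?php
// Hypothetical whole-page cache that is used only when the condition
// "the user is not logged in" is met.
function renderPage(
    string $url,
    bool $loggedIn,
    int $lastSiteChange,
    array &$pageCache,
    callable $render,
    ?int $now = null
): string {
    $now = $now ?? time();

    // Condition not met: the page has user-specific data, build it every time
    if ($loggedIn) {
        return $render($url);
    }

    // Condition met: the page depends only on the time of the last site change
    if (isset($pageCache[$url]) && $pageCache[$url]['cachedAt'] >= $lastSiteChange) {
        return $pageCache[$url]['html'];
    }

    // No cached copy, or the site changed after it was made: rebuild and store
    $html = $render($url);
    $pageCache[$url] = ['html' => $html, 'cachedAt' => $now];

    return $html;
}
```

For an anonymous visitor the implicit dependencies collapse into a single comparison of two times, which is what makes this cache explicitly dependent under the condition.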

Conclusion

As you can see, data caching is a very complex matter, and developing it takes a lot of resources. Keep in mind that these days hardware can cost less than developing complex algorithms. So if a site has speed problems, it may be worth simply buying a more powerful server; in some cases that will be cheaper.

What I have described is only the basics. Much more could be added, but I do not want to make the article too long. If the article proves interesting, I will try to write a sequel.

Source: https://habr.com/ru/post/38771/
