This is the second, supplementary part of my article on caching in web development. The first part is called
Cache Theory.
UPD: After numerous comments I have substantially reworked the article, added more specifics and examples, and removed the controversial points (for example, regarding memcached). Thank you all for the constructive criticism.
In this article I will try to cover the practical aspects of caching, focusing primarily on websites and content management systems.
Fair warning: this is my personal opinion, and it does not claim to be the ultimate truth. Most of the terminology is my own; use it at your own discretion if you find it fitting. Constructive criticism is welcome. So.
What to cache
Strange as it may seem, not everything should be cached. Caching mechanisms consume resources themselves, and if information changes as often as it is served, there is no point in caching it (a statistics system, for example). Nor is it worth caching processes that are already fast.
For example: the data changes every 1 second, a request for it arrives every 2 seconds, caching the data takes 0.1 seconds, and serving it from the cache takes 0.2 seconds. Then every request finds the cache stale and has to rebuild the data, which takes 1 + 0.1 seconds, and data is practically never served from the cache. That 0.1 seconds is a pure loss caused by the cache.

The first candidates for caching are pieces of information that are extremely slow and resource-intensive to compute and are used very often. For sites, these are typically the results of modules (or applications) built on complex database queries. Resource-intensive processes also include calls to external resources that require connection setup (sockets, curl, etc.) and work with large amounts of complex data (template parsing, image processing, etc.).
Where to cache
Now, where to keep the cache, in descending order of access speed:
Memory. This means memcached, APC, XCache and other technologies in the same vein. As a rule, they provide the highest access speed, but the volume is very limited. Memory is not suitable for large data sets, but it works well for relatively small amounts of the most frequently used data, such as templates.
For example, suppose we use a SAX parser, and parsing templates with it is slow. We do the following:
When a template is requested, we first check whether it is already in memory. If it is, we pull it out, unserialize() it and return it. If not, we create the object (parse the template), serialize() it and store it in memory, with a hash of the file path as the key. It remains to decide when the cached data should be refreshed. There are two ways:
1. Do not check for changes at all and let the cache live for a fixed time, say an hour. A physical change to the template will then take effect at most an hour later.
2. Track template changes by the time the file was last saved. For this we need one more variable in memory: the time the data was written to the cache (it can be named after the template's key with a time prefix). We set it to the current time whenever the template object is stored. Then, on each template request, we first compare the template file's modification time (filemtime) with the time cached in memory, and if the file is newer than the cache, we rebuild it. With this approach the cache can live forever as long as the template itself never changes; but as soon as it changes, the cache is rebuilt. A sketch of this approach follows the list.
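Here is a minimal sketch of the second option, using APCu as the in-memory store (memcached or XCache would work the same way); parseTemplate() is a hypothetical stand-in for the slow SAX parsing:
<?php
// Sketch: cache parsed templates in memory, invalidated via filemtime().
// parseTemplate() is hypothetical and stands for the slow SAX parsing.
function getTemplate($path)
{
    $key     = sha1($path);      // template key in memory
    $timeKey = 'time_'.$key;     // when we cached it

    // The cache is valid while the file has not changed since we stored it.
    $cachedAt = apcu_fetch($timeKey);
    if ($cachedAt !== false && filemtime($path) <= $cachedAt) {
        $cached = apcu_fetch($key);
        if ($cached !== false) {
            return unserialize($cached);
        }
    }

    // Cache miss or stale: parse the template and store the object.
    $tpl = parseTemplate($path);
    apcu_store($key, serialize($tpl));
    apcu_store($timeKey, time());
    return $tpl;
}
?>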
File system. The most commonly used method, but it has pitfalls too. Access to files slows down noticeably when there are many of them in one directory (the more files, the slower the access), and some file systems impose outright limits (ext2 allows roughly 32,000 entries per directory). This must be watched closely. For example, you cannot just dump tabular data into a single directory with file names equal to the primary keys: such a scheme will sooner or later overflow.
This is how it can be done in PHP:
<?php
// Store $data in the cache under the key $name. The hash is split into
// chunks, and the first two chunks become nested directories, so no
// single directory ever accumulates too many files.
function saveCache($name, $data)
{
    $hash   = sha1($name);
    $chunks = str_split($hash, 4);
    $cache_dir = CACHE_DIR.'/'.$chunks[0].'/'.$chunks[1];
    if (!is_dir($cache_dir)) {
        mkdir($cache_dir, 0775, true);
    }
    return file_put_contents($cache_dir.'/'.$hash, serialize($data));
}
?>
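For completeness, a matching read function could look like this (a sketch under the same assumptions; null means a cache miss):
<?php
// Fetch cached data stored by saveCache(); null means "not cached".
function loadCache($name)
{
    $hash   = sha1($name);
    $chunks = str_split($hash, 4);
    $file   = CACHE_DIR.'/'.$chunks[0].'/'.$chunks[1].'/'.$hash;
    if (!is_file($file)) {
        return null;
    }
    return unserialize(file_get_contents($file));
}
?>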
Database. It can also be used as a cache, and it has one strong advantage: selection via SELECT. If there is little data but it depends on a huge number of conditions, using the database is justified, especially with properly built indexes. For example, a database table can store the result of a complex query that joins many large tables under many conditions: the result is materialized into a temporary table, and data is then selected from it. The selection conditions may still be complicated, but the numerous JOINs are gone, which raises the speed (especially with MySQL's MEMORY engine).
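As an illustration (a sketch using PDO and MySQL; $pdo is assumed to be a connected PDO instance, and the table and column names are invented), the heavy multi-JOIN query is materialized once, and further selections run against the small in-memory table:
<?php
// Hypothetical sketch: materialize an expensive multi-JOIN query into a
// MEMORY temporary table, then run cheap filtered SELECTs against it.
$pdo->exec("
    CREATE TEMPORARY TABLE cache_report ENGINE=MEMORY AS
    SELECT a.id, a.title, u.name AS author, COUNT(c.id) AS comments
    FROM articles a
    JOIN users u ON u.id = a.user_id
    LEFT JOIN comments c ON c.article_id = a.id
    GROUP BY a.id, a.title, u.name
");
// No more JOINs: every further query hits the small in-memory table.
$stmt = $pdo->prepare('SELECT * FROM cache_report WHERE comments > ? ORDER BY comments DESC');
$stmt->execute([10]);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
?>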
Another advantage of caching in the database is that the cache can be prepared proactively: for example, the first and only query can pull out all the data needed for a particular page, which is then used as required. Caching in the database is slow, of course, but a competent organization can seriously improve its efficiency. There is also SQLite, which suits this well; a particularly exotic option is creating the SQLite database itself directly in memory.
Using a database as a cache seems to me a rare option; it is more a possibility than common practice. Just do not lose sight of it.
How to cache
The usual approach is to hash a string containing all the parameters the cached data depends on: if even one parameter changes, the hash changes too, and with it the cache entry. For storage in the file system, the first few characters are "cut off" the hash and turned into nested directories so that the file system does not overflow. As for the modification time, for the file system it is the file's mtime, for the database it is a separate field, and for memory it is a separate key (see the example above).
Without a hash, the dependency string can swell considerably, especially when there are many parameters.
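A minimal sketch of key construction ($page, $perPage and $lang are invented parameters for illustration):
<?php
// The cache key is a hash of everything the cached data depends on.
// If any parameter changes, the hash (and thus the cache entry) changes.
$key = sha1(serialize([
    'article_list',   // what is being cached
    $page,            // pagination parameters
    $perPage,
    $lang,            // site language
]));
?>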
An expanded example. Suppose the database holds 50,000 articles. A query for a single article takes a long time, which is not surprising at such volume, and that is the lucky case: we are not even JOINing other tables.
We do the following: we record the relevant dependencies in a dependency table. The "table" here can be a simple array that is serialized and written to a file. It is needed to check the relevance of the cache: to decide whether the cached data must be rebuilt because something has changed, or whether what is in the cache can be used. On any change to our articles table in the database, we update the corresponding dependency entry, i.e. set its time to the current time. The volumes are large here, so caching in the file system fits best.
We put the query result into the cache. On a repeat request, we compare the cache file's modification time with the time from the dependency table. If the cache file is older, we rebuild; otherwise we serve what is in the cache.
Now, if we change a single letter in one of the 50,000 articles, the cache is invalidated for all of them, which is not efficient. Let's try to avoid this:
Suppose each article has a page where it is displayed in full. There is also a feed of articles that displays summary information for all of them and is cached as described above. If one article changes, then both its full page and the feed change (since the article appears in the feed). So we create a separate cache for each article and a separate cache for the feed.

The feed cache obviously depends on the dependency table (and implicitly on every individual article), i.e. on the time of the last change to the articles table in the database. A single article, however, depends on that time only conditionally: it depends on it only if this particular article has changed. Therefore, when showing an individual article, we do not use that time at all: if the article's cache exists, we show it; if not, we build the article and put it in the cache. The important condition here is that when a single article is edited, we must delete its cache.
To summarize: when an article is edited, we update the time in the dependency table and destroy that article's cache. When the feed is shown, the decision to rebuild it is made from the dependency table. When an individual article is shown, its cache is served if it exists. Thus, when an article changes, the feed cache and that article's cache are rebuilt, while the caches of the other articles are untouched.
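The whole scheme can be sketched like this (saveCache() and loadCache() are from the example above; all the other helper functions are invented for illustration):
<?php
// Hypothetical sketch of the article/feed invalidation scheme.
// touchDependency()/dependencyTime() keep "last change" timestamps in a
// serialized array on disk; cacheTime() returns when a cache entry was
// written (its file's mtime); deleteCache() removes one cache file.

function onArticleEdit($id, $newData)
{
    updateArticleInDb($id, $newData);   // hypothetical DB update
    touchDependency('articles');        // marks the feed cache as stale
    deleteCache('article_'.$id);        // only this article's cache dies
}

function showFeed()
{
    $feed = loadCache('feed');
    if ($feed === null || cacheTime('feed') < dependencyTime('articles')) {
        $feed = buildFeedFromDb();      // heavy query, now runs rarely
        saveCache('feed', $feed);
    }
    return $feed;
}

function showArticle($id)
{
    $article = loadCache('article_'.$id);
    if ($article === null) {            // no timestamp check needed here
        $article = buildArticleFromDb($id);
        saveCache('article_'.$id, $article);
    }
    return $article;
}
?>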
What else to pay attention to
When caching to the file system, do not push the entire cache into one place: it is better to give each kind of object its own directory, e.g. /cache/pages for pages, /cache/users for users, and so on. This keeps unrelated data from colliding by accident. Suppose you have two different entities with the same id, and it so happens that both are cached by that id alone: the hash will be identical in both cases, causing a conflict. If each entity has its own place, it will not. For example:
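A sketch of such namespacing (saveCacheNs() is an invented variant of saveCache() from above):
<?php
// Sketch: each entity type gets its own subtree under CACHE_DIR,
// so "page 5" and "user 5" can never collide.
function saveCacheNs($type, $name, $data)
{
    $hash = sha1($name);
    $dir  = CACHE_DIR.'/'.$type.'/'.substr($hash, 0, 2);
    if (!is_dir($dir)) {
        mkdir($dir, 0775, true);
    }
    return file_put_contents($dir.'/'.$hash, serialize($data));
}

saveCacheNs('pages', $id, $pageData);   // lands under /cache/pages/...
saveCacheNs('users', $id, $userData);   // lands under /cache/users/...
?>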
When an item is deleted, do not forget to delete its cache as well; otherwise the cache will swell over time with stale data. An alternative is periodic physical removal of the entire cache.
There is also such a thing as a "fast" cache (FastCache, my terminology). The idea behind FastCache is to cache the most frequently used objects in the fastest possible way, sweeping everything else aside. For example, you can keep a fully built home page in memory and serve it, if nothing has changed, to non-logged-in users. The home page takes the main load, so this can greatly relieve resources.
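A minimal sketch of this idea (renderHomePage() and the session_id cookie are assumptions for illustration; APCu again stands in for any in-memory store):
<?php
// Hypothetical FastCache sketch: serve a fully rendered home page from
// memory to anonymous visitors, bypassing all the other machinery.
if ($_SERVER['REQUEST_URI'] === '/' && empty($_COOKIE['session_id'])) {
    $html = apcu_fetch('fastcache_home');
    if ($html === false) {
        $html = renderHomePage();                 // hypothetical full render
        apcu_store('fastcache_home', $html, 60);  // keep for 60 seconds
    }
    echo $html;                                   // no DB, no templates
    exit;
}
?>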
Conclusion
As was rightly noted in one of the comments on the previous article, caching is one part of website optimization, and it is especially effective on high-load systems. Site performance does not consist of cache efficiency alone. Moreover, sometimes a cache is simply unnecessary, so think carefully before "bolting on" a cache. If, for example, site traffic is under 1,000 visitors a day, you may not need to think about caching at all, although of course it depends on the site.
Thanks for reading!