
"Memory Features": What does the processor cache and office clerk have in common?

At IT-GRAD we write in our blog about interesting technologies in the field of virtualization; for example, we recently covered cloud ERP systems and virtual GPUs.

Today we dig into the hardware side and talk about the processor cache.


Photo by Isaac Bowen, CC

When the first computers appeared, RAM was expensive and very slow, and the processors, frankly, did not shine with performance either. But in the 1980s the gap between CPU and DRAM speed began to widen dramatically: processor performance took off, while memory access times improved at a far more leisurely pace. The larger this difference became, the more obvious it was that only a new kind of memory could close the gap.



CPU and DRAM performance differences

The main measure of cache efficiency is the hit rate. A hit is the situation in which the cache already contains an entry whose identifier matches that of the requested data item. Larger caches achieve higher hit rates, but they also have higher access latency.
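One standard way to quantify this tradeoff (a textbook formula, not taken from the original article) is the average memory access time:

```latex
% Average memory access time (AMAT): with hit rate h, cache access
% latency T_hit and miss penalty T_miss (the cost of going to the
% next, slower level of the hierarchy),
\[
  \text{AMAT} = T_{\text{hit}} + (1 - h)\,T_{\text{miss}}
\]
% A larger cache raises h, but it typically also raises T_hit.
```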

To soften this tradeoff, most computers use several cache levels: small, fast caches are backed by slower but more capacious ones. Modern architectures commonly use up to three levels of cache hierarchy.

To make this easier to picture, Fabian Giesen compared the cache to office work. Imagine a clerk sitting at a desk and leafing through documents in folders. The clerk's desk is the L1 data cache: it holds the folders currently being worked on (cache lines), each page of which can be examined (the bytes within a cache line).

The filing cabinet in the office is the second-level cache, L2. It stores the folders the clerk has finished working with. To fetch a file from the cabinet, you have to walk over to it and look it up in the index. Other employees, who play the role of the other CPU cores, also have access to this cabinet. Obviously, working with files retrieved from the cabinet is slower.

Even more files are kept in the archive in the basement, which is the L3 cache. The archive has far more room, but finding the right file there takes much longer. And no matter how tightly the files are packed, they still do not all fit: the bulk of the paperwork lives in a warehouse, to which a truck with unused files is sent once a day. That warehouse is our DRAM.

So why not settle for a single cache level? Suppose the entire cache lived at L1 and every CPU core had direct access to it. Following our analogy, instead of giving each clerk a separate workplace, we would seat the whole staff at one hundred-meter-long table. You can imagine how "efficient" their work would be.

So even if you built one large cache shared by all the processor cores, it would be significantly slower than a cache with a multi-level structure. The hierarchy is more complex, but it is what makes high performance possible.

How to "speed up"


Even with an efficient organization of cache levels, this type of memory can be made to work faster still. In other words, a number of techniques increase the number of cache hits, or, more precisely, reduce the number of misses. A miss is the situation in which the cache contains no entry for the requested data item, and that item has to be read into the cache from main memory.

The important point is that data gets into the cache in one of two ways. The first: the data is requested in the course of the program's work and fetched from main memory, at the cost of a delay.

The second way is to preload the data before it is actually accessed, which requires the developer to anticipate what data will be needed in the future. This technique is known as prefetching and can dramatically reduce the time spent waiting on loads from RAM. Prefetching is implemented both in software and in hardware: most modern CPUs include dedicated prefetch units (prefetchers).
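As a rough illustration of software prefetching (a minimal sketch, not from the original article), GCC and Clang expose the `__builtin_prefetch` intrinsic, which hints to the CPU that a given address will be needed soon; the look-ahead distance of 16 elements below is an arbitrary value chosen for the example:

```c
#include <stddef.h>

/* Sum an array while prefetching data a few iterations ahead.
 * __builtin_prefetch is a GCC/Clang intrinsic; the distance of 16
 * elements is illustrative and would be tuned in practice. */
long sum_with_prefetch(const long *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16]); /* hint: this line soon */
        sum += data[i];
    }
    return sum;
}
```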

The cache, especially at the lower levels, is small, so loading new data into it often means freeing up space by evicting some of the existing data (by analogy with the office: carrying old files off to the cabinet to put new ones on the desk). There are several eviction (replacement) policies for choosing which data to displace, one widely used example being LRU, least recently used.
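As an illustration only (this sketch is not from the original article), here is how an LRU policy might look for a tiny, fully associative software cache; a zero-initialized `struct cache` starts out empty:

```c
#include <stdint.h>

#define WAYS 4  /* tiny fully associative cache: 4 entries */

struct cache {
    uint64_t tag[WAYS];   /* identifiers of the cached items */
    unsigned age[WAYS];   /* 0 = most recently used */
    int      valid[WAYS];
};

/* Look up `tag`; on a miss, evict the least recently used entry.
 * Returns 1 on a hit, 0 on a miss. Initialize with: struct cache c = {0}; */
int cache_access(struct cache *c, uint64_t tag)
{
    /* 1. Check for a hit and refresh that entry's recency. */
    for (int i = 0; i < WAYS; i++) {
        if (c->valid[i] && c->tag[i] == tag) {
            for (int j = 0; j < WAYS; j++)
                if (c->valid[j] && c->age[j] < c->age[i])
                    c->age[j]++;          /* more recent entries age by one */
            c->age[i] = 0;
            return 1;
        }
    }
    /* 2. Miss: take a free slot if there is one, else the oldest entry. */
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!c->valid[i]) { victim = i; break; }
        if (c->age[i] > c->age[victim])
            victim = i;
    }
    /* 3. Displace the victim, as the clerk carries an old folder off the
     *    desk to make room, and mark the new entry most recent. */
    for (int i = 0; i < WAYS; i++)
        if (c->valid[i] && i != victim)
            c->age[i]++;
    c->tag[victim]   = tag;
    c->valid[victim] = 1;
    c->age[victim]   = 0;
    return 0;
}
```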


A useful technique is suggested by a user of the Q&A service Quora, who describes a method for improving cache performance when writing programs.

The technique is called loop blocking, or splitting a loop into blocks. Its goal is to reduce the number of cache misses and thereby improve performance. The idea is to partition the data into smaller blocks that fit entirely in the cache, so that each block can be reused many times before it is evicted. A detailed example is provided on the Intel website.
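A minimal sketch of the idea (not the Intel example itself; the matrix size and the block size of 64 are assumptions and would be tuned to the actual cache) using a matrix transpose:

```c
#include <stddef.h>

#define N     1024   /* matrix dimension, assumed for illustration */
#define BLOCK 64     /* block edge; in practice tuned to the cache size */

/* Naive transpose: dst is walked column-wise with stride N, so almost
 * every write touches a different cache line, which is soon evicted. */
void transpose_naive(double dst[N][N], const double src[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            dst[j][i] = src[i][j];
}

/* Blocked transpose: work on BLOCK x BLOCK tiles that fit in the cache,
 * so every loaded line is reused before it gets displaced. */
void transpose_blocked(double dst[N][N], const double src[N][N])
{
    for (size_t ii = 0; ii < N; ii += BLOCK)
        for (size_t jj = 0; jj < N; jj += BLOCK)
            for (size_t i = ii; i < ii + BLOCK; i++)
                for (size_t j = jj; j < jj + BLOCK; j++)
                    dst[j][i] = src[i][j];
}
```

Both functions do the same work; only the order of the memory accesses changes, which is exactly what loop blocking is about.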

You should also be wary of non-linear data structures such as linked lists: they increase the number of cache misses. Linear structures, such as arrays, let the cache work much more efficiently.
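To illustrate (a sketch, not from the original article): both functions below compute the same sum, but the linked list chases pointers scattered across the heap, while the array walks memory sequentially, letting the hardware prefetcher stream the data in:

```c
#include <stddef.h>

struct node {
    long         value;
    struct node *next;   /* nodes may live anywhere on the heap */
};

/* Pointer chasing: each `cur->next` can land on a different cache
 * line, so the prefetcher has little to work with. */
long sum_list(const struct node *cur)
{
    long sum = 0;
    for (; cur != NULL; cur = cur->next)
        sum += cur->value;
    return sum;
}

/* Sequential access: neighbouring elements share cache lines. */
long sum_array(const long *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}
```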

To keep the caches of different cores from interfering with one another, each core should process its own separate set of data throughout the computation. To achieve this, the application has to be divided into threads, and those threads have to be managed.
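A minimal sketch of that idea using POSIX threads (the function names, the chunking scheme and the thread count are illustrative assumptions, not from the original article): each thread works on its own contiguous slice of the array and keeps its running total in a local variable, so the cores do not compete for the same cache lines.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4   /* illustrative; usually the number of cores */

struct slice {
    const long *data;   /* this thread's own contiguous chunk */
    size_t      len;
    long        sum;    /* per-thread result, written once at the end */
};

static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    long sum = 0;                       /* accumulate locally, not in shared memory */
    for (size_t i = 0; i < s->len; i++)
        sum += s->data[i];
    s->sum = sum;
    return NULL;
}

/* Split `data` into NUM_THREADS contiguous slices, one per thread. */
long parallel_sum(const long *data, size_t n)
{
    pthread_t    tid[NUM_THREADS];
    struct slice part[NUM_THREADS];
    size_t chunk = n / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        part[t].data = data + (size_t)t * chunk;
        part[t].len  = (t == NUM_THREADS - 1) ? n - (size_t)t * chunk : chunk;
        pthread_create(&tid[t], NULL, sum_slice, &part[t]);
    }

    long total = 0;
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(tid[t], NULL);
        total += part[t].sum;
    }
    return total;
}
```

Compile with `-pthread`; because each slice is contiguous and each thread writes its result only once, the cores mostly read their own cache lines.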

Among other things, engineers keep looking for new ways to optimize the cache structure. The first thing that comes to mind is adding another cache level; as practice shows, the number of cache levels has grown by one roughly every ten years. Continuing this tradition, Intel's Haswell and Broadwell chips already ship with a fourth-level cache (L4).


Source: https://habr.com/ru/post/307488/
