
Experimental Determination of Cache Memory Characteristics

In some cases (for example, to fine-tune a program for a specific computer) it is useful to know the characteristics of the cache subsystem: the number of levels, the access time of each level, their sizes and associativities, and so on.
For a one-time optimization the required values can be looked up in the computer's specification, but when the optimization must be automatic (for example, when the program is built and installed), the characteristics have to be determined indirectly, from the results of running a special test suite.
A convenient test program for Linux is lat_mem_rd from the lmbench suite. It works by allocating an array in memory and reading its elements with a given stride, looping over the array again and again; then a larger array is allocated, and so on. For each stride and array size, the average access time is computed.
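For illustration, here is a minimal sketch in the spirit of lat_mem_rd (not the actual lmbench code): it links the array elements into a pointer chain with the given stride and chases it, so that every load depends on the previous one and neither prefetching nor out-of-order execution can hide the latency. All names and parameters here are my own, chosen just for the sketch.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void *sink;  /* keeps the compiler from optimizing the chase away */

/* Build with: cc -O2 stride.c */
static double measure_ns(size_t array_bytes, size_t stride_bytes, long iters)
{
    size_t n = array_bytes / sizeof(char *);
    size_t step = stride_bytes / sizeof(char *);
    char **arr = malloc(n * sizeof(char *));

    n -= n % step;                                /* keep the chain closed */
    for (size_t i = 0; i < n; i += step)          /* element i -> i+step,  */
        arr[i] = (char *)&arr[(i + step) % n];    /* wrapping to the start */

    char **p = (char **)arr[0];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long k = 0; k < iters; k++)
        p = (char **)*p;                          /* one dependent load    */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    sink = p;
    free(arr);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / iters;
}

int main(void)
{
    /* Double the array size at each iteration, 4 KB through 64 MB. */
    for (size_t size = 4 * 1024; size <= 64 * 1024 * 1024; size *= 2)
        printf("%6zu KB: %.2f ns/access\n", size / 1024,
               measure_ns(size, 64, 10 * 1000 * 1000));
    return 0;
}
```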
An example of a graph that was obtained by this program on a real system:


Recall that a cache is divided into blocks (lines): each block is stored as an indivisible unit and is read from the next memory level in its entirety. The low-order bits of an address determine the offset of the desired byte within a cache block. The blocks are usually organized in two dimensions: a number of sets, from which the necessary one is selected by the next low-order bits of the address; and a number of ways, from which the desired one can be chosen freely, most often according to the LRU rule.
For each block, a tag is stored: the remaining high-order bits of the main-memory address of the data held in the block. On a memory access, the tags of all ways of the selected set are compared with the tag of the requested address. For example, for a 4-way cache of 32 blocks:
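As a sketch of this address decomposition (the parameters below match the example: 64-byte blocks, 32 blocks, 4 ways, hence 32/4 = 8 sets; the address itself is an arbitrary illustration):

```c
#include <stdint.h>
#include <stdio.h>

/* Example parameters: 32 blocks of 64 bytes, 4 ways -> 8 sets. */
enum { BLOCK = 64, WAYS = 4, BLOCKS = 32, SETS = BLOCKS / WAYS };

int main(void)
{
    uintptr_t addr   = 0x12345;               /* arbitrary example address */
    uintptr_t offset = addr % BLOCK;          /* low bits: byte in block   */
    uintptr_t set    = (addr / BLOCK) % SETS; /* next bits: set index      */
    uintptr_t tag    = (addr / BLOCK) / SETS; /* remaining high bits: tag  */
    printf("addr %#lx -> tag %#lx, set %lu, offset %lu\n",
           (unsigned long)addr, (unsigned long)tag,
           (unsigned long)set, (unsigned long)offset);
    return 0;
}
```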

Because the set is selected by the low-order bits of the address, an array that is contiguous in memory is also laid out contiguously in the cache. For example, if the block size is 64 bytes and the first block of the array maps to set 0, then blocks 0 through 7 of the array fall into sets 0 through 7, block 8 again into set 0, and so on.

If the array fits in the cache as a whole, there are no misses when accessing it: each successive block of the array falls into the next free way of its set. On subsequent passes over the array, all of it is read from the cache, and no new accesses to the next memory level are required.
If the array size exceeds the cache size, then in accordance with the LRU rule the first blocks of the array are evicted by its last blocks. For example, if the array consists of 35 blocks, then blocks 32 through 34 map to sets 0 through 2 and evict the least recently used blocks there, i.e. the first three blocks of the array.

Now, during the second pass over the array, the first three sets are filled again from the next memory level, while the remaining five sets are read directly from the cache. At the end of the second pass, the last three blocks once more evict the first blocks of the array.

Thus, on each subsequent pass, 20 blocks are read from the cache, and the remaining 15 blocks from the next memory level. Once the array size exceeds the cache size by a whole way (in our example, by 8 blocks), no data survives in the cache between passes: every access results in a miss and a read of a new block from the next memory level.
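This behavior is easy to reproduce with a toy model. Below is a minimal sketch of a set-associative LRU cache with the parameters of our example (8 sets of 4 ways), reading a 35-block array with a stride of one block; after the first pass it settles at exactly the 15 misses per pass described above:

```c
#include <stdio.h>
#include <string.h>

/* Toy LRU set-associative cache: 8 sets x 4 ways, 35-block array. */
enum { SETS = 8, WAYS = 4, ARRAY = 35, PASSES = 5 };

int main(void)
{
    int tags[SETS][WAYS];            /* tags[s][0] is most recently used */
    memset(tags, -1, sizeof tags);   /* start with an empty cache        */

    for (int pass = 0; pass < PASSES; pass++) {
        int misses = 0;
        for (int b = 0; b < ARRAY; b++) {
            int s = b % SETS, tag = b / SETS, hit = -1;
            for (int w = 0; w < WAYS; w++)
                if (tags[s][w] == tag) { hit = w; break; }
            if (hit < 0) { hit = WAYS - 1; misses++; }  /* evict LRU way */
            /* Move the accessed block to the MRU position. */
            for (int w = hit; w > 0; w--) tags[s][w] = tags[s][w - 1];
            tags[s][0] = tag;
        }
        printf("pass %d: %d/%d misses\n", pass + 1, misses, ARRAY);
    }
    return 0;
}
```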

What does this give us in the end? Each cache level corresponds to one "step" on the graph, and the cache size is the left edge of the step. On the graph shown at the beginning there are two steps, at 32 KB and at about 4 MB, which means the cache has two levels.
Further, the steepness of a step tells us the associativity of the cache. If the cache is direct-mapped (just one way), then the way size equals the cache size, and 100% misses begin at twice the cache size. If the cache is two-way, the way size is half the cache size, and the right edge of the step lies at one and a half times the cache size. And so on: if the cache is fully associative (just one set), the way size equals the block size, and the step is vertical: exceeding the cache size by even a single block yields 100% misses.
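To summarize the pattern (assuming pure LRU replacement): for a W-way cache of size C, misses begin at an array size of C and reach 100% at C + C/W = C·(1 + 1/W); this gives 2C for a direct-mapped cache, 1.5C for a two-way one, and C itself in the fully associative limit.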

On our graph, the first step is clearly vertical, while the second is rather blurred. We can conclude that the first-level cache is split (data separate from code), so each pass over the array causes exactly the same events in it; the second-level cache is unified, so it also holds the code of the test program and the code and data of the OS, which is why the second step begins even before the array size reaches the cache size. The cache size is the array size at which reproducible linear growth of the access time begins; on our graph this is 6 MB, which is confirmed by the processor's specifications.
So far we have analyzed how the access time depends on the array size. What about the stride? The first important fact: for a given array size, the stride does not affect the percentage of misses. If the stride is larger than the block size, the cache will hold not the entire array but only the blocks actually touched; some of the sets remain free. Either way, we get 0% misses while the array is smaller than the cache, and 100% misses once the array is a whole way larger. For example, take an array of 36 blocks and a stride of 128 bytes (2 blocks): only the 18 even-numbered blocks are accessed, and they map to sets 0, 2, 4 and 6.

On each pass, there are misses in sets 0 and 2, where 10 blocks are read anew from memory. The contents of sets 4 and 6 (8 blocks) survive between passes and are read directly from the cache. We get 10/18 = 56% misses. Now take a stride of 256 bytes (4 blocks).

On each pass, there is a miss in set 0, where 5 blocks are read anew from memory. The contents of set 4 (4 blocks) survive between passes and are read directly from the cache. We get 5/9 = 56% misses.

The situation changes somewhat when the stride exceeds the way size: in that case the "skipped" sets remain free, and all the blocks of the array that are actually read fit in the cache, even if the array itself is larger than the cache. For example, for an array of 64 blocks and a stride of one kilobyte (16 blocks), only four blocks are read, and they all fall into the same set.

A similar situation always occurs with a fully associative cache, whose way size equals the block size. A distinctive feature of such a pass is that the left edge of the step shifts with the stride: doubling the stride doubles the "apparent" cache size, because every second way is "skipped" during reading.
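In other words (again assuming LRU, with a stride that is a multiple of the way size), the apparent cache size is C_apparent = C · (stride / way size) for strides of at least one way. In the example above, the way spans 8 blocks (512 bytes), so the 1 KB stride doubles the apparent size: an array of 64 blocks, twice the cache size, still produces no misses.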

On the graph shown at the beginning, the left edges of the steps coincide for all stride values; this means that the way size of each cache is larger than the largest stride tried, i.e. at least 4 KB. For the L1 cache we see a vertical step: 0% misses for a 32 KB array and 100% misses for a 36 KB array. As we saw above, this means that a 4 KB excess fills an entire way; therefore the way size is exactly 4 KB, i.e. the L1 cache is eight-way (32 KB / 4 KB = 8 ways).

For the L2 cache, we see linear growth up to about 12 MB, i.e. up to twice the cache size, which corresponds to a direct-mapped cache. However, the processor specification states that the L2 cache is 24-way, i.e. with a way size of 256 KB. We can conclude that either the cache implementation does not match the specification (would anyone have noticed?), or the way-selection algorithm differs from LRU. In any case, it is worth noting that the actual access time when cycling through a large array is better (the step is flatter) than it would be if the cache really were 24-way LRU.

And now a couple of interesting phenomena that can be observed on the graph. Starting with an array size of 1 MB, the memory access time depends on the stride; as we saw above, this cannot be explained by cache misses. A similar picture would be observed if the stride were smaller than the block size: then, after the cache overflows, the number of misses reaches not 100% but a lower level that depends on the stride. With a stride of half a block we would get 50% misses (after each block read from the next memory level, the next array element is read from the cache); with a stride of a quarter block, 25% misses (after each block read from memory, three reads from the cache), and so on. But where would cache blocks of several kilobytes come from?
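In general, for a stride s no larger than the block size B, the steady-state miss fraction on an array that overflows the cache is simply s/B: each block fill serves B/s consecutive reads, so a half-block stride gives 50% misses, a quarter-block stride 25%, and so on.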

The point, as you might guess, lies in paged virtual memory and DTLB overflow. 1 MB is 256 four-kilobyte pages; hence the capacity of our DTLB is 256 entries. When the array exceeds 1 MB, each read of an element requires, in addition to the actual array data, reading a page-table entry (PTE) from memory. With a 512-byte stride, each PTE read is followed by seven DTLB hits, and 12.5% DTLB misses increase the access time only barely noticeably. With a 4 KB stride, on the other hand, every access reads a PTE from memory (100% DTLB misses), and the access time grows significantly. We also see that 100% DTLB misses are reached not immediately, but at an array size of 1280 KB, i.e. exceeding the DTLB capacity by a quarter. From this we conclude that the DTLB is four-way.
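A quick sketch of the resulting DTLB miss rates for different strides (assuming 4 KB pages and an array large enough that every newly touched page misses the DTLB, so one PTE fetch serves page/stride consecutive reads):

```c
#include <stdio.h>

int main(void)
{
    const int page = 4096;
    const int strides[] = { 64, 128, 256, 512, 1024, 2048, 4096 };

    for (size_t i = 0; i < sizeof strides / sizeof strides[0]; i++) {
        int reads_per_page = page / strides[i];  /* loads per PTE fetch */
        printf("stride %4d B: %6.2f%% DTLB misses\n",
               strides[i], 100.0 / reads_per_page);
    }
    return 0;
}
```

For the 512-byte stride this gives the 12.5% mentioned above; at a 4 KB stride every access misses the DTLB.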

Another interesting phenomenon: the access time to an array that exceeds the L2 cache size (i.e. when every access goes to the last memory level, RAM) also depends on the stride, and the difference is larger than DTLB misses could explain. The additional difference is explained by the SDRAM protocol, in which "opening a row" and "reading from an open row" are separate operations. As with DTLB misses, a smaller stride lets the next array element be read from an already open row, whereas a larger stride requires opening a new row on every access. I did not numerically verify that the observed delay matches the row-open latency, since I could not find the specification for the memory modules used.

Happy New Year, Habr readers!

Source: https://habr.com/ru/post/111011/

