
CPU Cache Myths Programmers Believe

As a computer engineer who spent five years working on cache problems at Intel and Sun, I know a thing or two about cache coherence. It was one of the hardest concepts I had to learn in college, but once you have truly mastered it, you come away with a much better understanding of how systems are designed.

You might wonder why a software developer should care about the CPU's caching mechanism at all. For one thing, many concepts from cache coherence apply directly to distributed systems and to DBMS isolation levels. For example, understanding how coherence is implemented in hardware caches makes it easier to grasp the difference between consistency models, say, between strict consistency and eventual consistency. It may also give you new ideas on how best to enforce consistency in distributed systems, borrowing research and principles from hardware.

For another, misconceptions about caches often lead to false claims, especially when it comes to concurrency and race conditions. For example, people often say that parallel programming is hard because "different cores may have different/stale values in their caches". Or that the volatile qualifier in languages like Java is needed to "prevent local caching of shared data" and to force "all reads and writes to go to main memory".

Such misconceptions are mostly harmless (and can even be useful), but they also lead to poor design decisions. For example, a developer may assume that the concurrency bugs mentioned above cannot happen on a single-core system. In reality, even single-core systems are at risk of concurrency bugs unless appropriate concurrency constructs are used.
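
To make this concrete, here is a minimal Java sketch (my own illustration, not from the original article): the unsynchronized counter below loses updates even when the two threads end up time-sharing a single core, because the increment is a read-modify-write that a context switch can interrupt in the middle.

```java
// Minimal sketch (illustrative, not from the original article): a lost-update race
// that can occur even when both threads time-share a single core, because count++
// is a non-atomic read-modify-write and a context switch can land between the steps.
public class SingleCoreRace {
    static int count = 0; // no synchronization, no volatile

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                count++; // read, add, write: interruptible in the middle
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // Frequently prints less than 2,000,000, regardless of the core count.
        System.out.println(count);
    }
}
```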
Or another example: if volatile variables really were read from and written to main memory on every access, they would be monstrously slow, since a main-memory access is roughly 200 times slower than an L1 cache hit. In reality, volatile reads in Java are often just as cheap as L1 cache hits, which debunks the myth that volatile forces every read and write to go to main memory. If you have been avoiding volatile out of performance concerns, you may have fallen victim to the misconceptions above.
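
As a rough illustration (my own sketch, not a rigorous benchmark; the class and field names are invented and the numbers are only indicative), a hot loop of volatile reads runs at roughly L1-cache speed rather than main-memory speed:

```java
// Rough sketch (no JMH, so treat the result as indicative only): a hot loop of
// volatile reads costs on the order of a nanosecond per read, i.e. L1-cache speed,
// not the ~100+ ns a trip to main memory would cost on every access.
public class VolatileReadCost {
    static volatile long vField = 1;

    public static void main(String[] args) {
        long sink = 0;
        final int n = 100_000_000;
        for (int warm = 0; warm < 3; warm++) sink += run(n); // warm up the JIT
        long t0 = System.nanoTime();
        sink += run(n);
        long elapsed = System.nanoTime() - t0;
        System.out.printf("~%.2f ns per volatile read (sink=%d)%n",
                (double) elapsed / n, sink);
    }

    static long run(int n) {
        long s = 0;
        for (int i = 0; i < n; i++) s += vField; // each iteration really reads the field
        return s;
    }
}
```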

The importance of cache coherence


But if different cores each have their own cache holding copies of the same data, won't that lead to mismatched values? The answer: hardware caches in modern x86 processors, such as Intel's, are always kept in sync. They are not just dumb blocks of memory, as many developers seem to think. On the contrary, very elaborate protocols and built-in inter-cache communication logic keep the data consistent across all threads. And all of this happens in hardware, meaning that we, as software developers, compiler writers, and system builders, do not have to think about it.

Let me briefly explain what "kept in sync" means. There are many nuances here, but in the most simplified form: if two different threads anywhere in the system read from the same memory address, they should never simultaneously read different values.

For a simple example of how non-coherent caches could violate this rule, just look at the first section of this tutorial. No modern x86 processor behaves the way that tutorial describes, but a buggy processor certainly could. This article focuses on one simple goal: preventing such inconsistencies.

The most common protocol for keeping caches coherent is known as the MESI protocol. Every processor uses its own variation of MESI, and each variation has its own advantages, trade-offs and opportunities for unique bugs. They all share one general principle, however: every line of data in the cache is tagged with one of the following states (a small code sketch follows the list):

  1. Modified (M).
    1. The data has been modified and differs from main memory.
    2. This copy is the source of truth; all other copies are stale.
  2. Exclusive (E).
    1. The data has not been modified and matches main memory.
    2. No other cache at the same level holds this data.
  3. Shared (S).
    1. The data has not been modified and matches main memory.
    2. Other caches at the same level may also hold this data.
  4. Invalid (I).
    1. The data is stale and must not be used.
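
To make the states concrete, here is a small illustrative sketch (my own, not from the original article) expressing the four states and the two questions a cache asks about a line:

```java
// Illustrative sketch only: the four MESI states as an enum, with the two questions
// a cache asks about a line: "can I read it as-is?" and "can I write it as-is?".
enum MesiState {
    MODIFIED,   // dirty; this cache holds the only valid copy
    EXCLUSIVE,  // clean; no other cache at this level holds the line
    SHARED,     // clean; other caches may also hold the line
    INVALID;    // stale; must not be used

    boolean canReadLocally()  { return this != INVALID; }

    // A write needs exclusive ownership; from S or I the cache must first issue a
    // Request-For-Ownership and have the other copies invalidated.
    boolean canWriteLocally() { return this == MODIFIED || this == EXCLUSIVE; }
}
```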

By applying and updating these states, caches can stay coherent. Let's walk through a few examples for a processor with four cores, each with its own L1 cache, plus a global on-chip L2 cache.

Memory write


Suppose a thread on core-1 wants to write to memory at address 0xabcd. Below are some possible sequences of events.

Cache hit


  1. L1-1 has the line in state E or M.
  2. L1-1 performs the write. Done.
    1. No other cache holds the line, so it is safe to write immediately.
    2. The cache line's state changes to M, since it is now modified.

Local cache miss, sibling cache hit


  1. L1-1 has the line in state S.
    1. This means another cache at the same level may also hold the line.
    2. The same sequence applies if L1-1 does not have the line at all.
  2. L1-1 sends a Request-For-Ownership to the L2 cache.
  3. L2 looks up its directory and sees that L1-2 currently holds the line in state S.
  4. L2 sends a snoop-invalidate to L1-2.
  5. L1-2 marks the line as invalid (I).
  6. L1-2 sends an Ack back to L2.
  7. L2 sends an Ack, along with the latest data, to L1-1.
    1. L2 records that L1-1 now holds the line in state E.
  8. L1-1 now has the latest data and permission to move to state E.
  9. L1-1 performs the write and changes the line's state to M (a simplified code sketch of this write path follows below).
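
Here is a deliberately simplified sketch of the write path just described (all names are my own; a real implementation adds buses, queues and store buffers). The L2 cache is modelled as a directory that knows the state of the line in each L1:

```java
import java.util.Map;
import java.util.TreeMap;

// Simplified, illustrative sketch of the MESI write path described above.
public class MesiWriteSketch {
    enum State { M, E, S, I }

    // Core `writer` writes the line; `states` maps L1 index -> MESI state.
    static void write(int writer, Map<Integer, State> states) {
        State mine = states.getOrDefault(writer, State.I);
        if (mine == State.E || mine == State.M) {
            states.put(writer, State.M);     // cache hit: write immediately, line becomes M
            return;
        }
        // State S or I: send a Request-For-Ownership up to L2,
        // which snoop-invalidates every sibling copy and collects the Acks.
        for (Map.Entry<Integer, State> e : states.entrySet()) {
            if (e.getKey() != writer && e.getValue() != State.I) {
                e.setValue(State.I);         // snoop-invalidate -> Ack
            }
        }
        states.put(writer, State.E);         // L2 grants ownership: latest data + state E
        states.put(writer, State.M);         // the write itself moves the line from E to M
    }

    public static void main(String[] args) {
        Map<Integer, State> line = new TreeMap<>(
                Map.of(0, State.S, 1, State.S, 2, State.I, 3, State.I));
        write(0, line);                      // core-1 (index 0) writes to 0xabcd
        System.out.println(line);            // {0=M, 1=I, 2=I, 3=I}
    }
}
```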

Memory read


Now suppose that a thread on core-2 wants to read from the address 0xabcd. Below are some possible sequences of events.

Cache hit


  1. L1-2 has the line in state S, E or M.
  2. L1-2 reads the data and returns it to the thread. Done.

Local cache miss, higher-level cache miss


  1. L1-2 has the line in state I (invalid), so it cannot use it.
  2. L1-2 sends a Request-for-Share to the L2 cache.
  3. L2 does not have the line either, so it reads the data from memory.
  4. L2 receives the data from memory.
  5. L2 sends the data to L1-2 with permission to move to state S.
    1. L2 records that L1-2 now holds the line in state S.
  6. L1-2 receives the data, caches it, and returns it to the thread.

Local cache miss, higher-level cache hit


  1. L1-2 has the line in state I.
  2. L1-2 sends a Request-for-Share to the L2 cache.
  3. L2 sees that L1-1 holds the line in state S.
  4. L2 sends an Ack to L1-2, along with the data and permission to move to state S.
  5. L1-2 receives the data, caches it, and returns it to the thread.

Local cache miss, sibling cache hit


  1. L1-2 has the line in state I.
  2. L1-2 sends a Request-for-Share to the L2 cache.
  3. L2 sees that L1-1 holds the line in state E (or M).
  4. L2 sends a snoop-share to L1-1.
  5. L1-1 downgrades the line to state S.
  6. L1-1 sends an Ack to L2, along with the modified data if applicable.
  7. L2 sends an Ack to L1-2, along with the data and permission to move to state S.
  8. L1-2 receives the data, caches it, and returns it to the thread (a simplified code sketch of this read path follows below).
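
And here is the matching simplified sketch of the read path (again, all names are illustrative, not from a real implementation): a Request-for-Share either hits locally, is served by L2, or forces the current owner to downgrade to S via snoop-share:

```java
import java.util.Map;
import java.util.TreeMap;

// Simplified, illustrative sketch of the MESI read scenarios described above.
// L2 is modelled as a directory that knows which sibling L1 caches hold the line.
public class MesiReadSketch {
    enum State { M, E, S, I }

    // Core `reader` reads the line; `states` maps L1 index -> MESI state.
    static void read(int reader, Map<Integer, State> states) {
        if (states.getOrDefault(reader, State.I) != State.I) {
            return;                          // cache hit: just read and return
        }
        // Miss: a Request-for-Share goes up to L2, which checks its directory.
        for (Map.Entry<Integer, State> e : states.entrySet()) {
            if (e.getKey() != reader && (e.getValue() == State.E || e.getValue() == State.M)) {
                e.setValue(State.S);         // snoop-share: the owner downgrades to S
            }                                // and forwards / writes back its data
        }
        // Otherwise L2 supplies the data itself or fetches it from memory.
        states.put(reader, State.S);         // reader caches the line in state S
    }

    public static void main(String[] args) {
        Map<Integer, State> line = new TreeMap<>(Map.of(0, State.M, 1, State.I));
        read(1, line);                       // core-2 (index 1) reads 0xabcd
        System.out.println(line);            // {0=S, 1=S}
    }
}
```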

Variations


The scenarios above are just some of the possibilities. In fact, there are many variations, and no two implementations of the protocol are identical. For example, some designs use the O/F states. Some use write-back caches, while others use write-through. Some rely on snoop broadcasts, while others use a snoop filter. Some caches are inclusive, others are exclusive. The variations are endless, and we haven't even touched on store buffers!

In addition, the example above assumes a simple processor with only two levels of caching, but note that the same protocol can be applied recursively. You can easily add an L3 cache that, in turn, coordinates several L2 caches using the same protocol described above. You can also have a multiprocessor system with "home agents" that coordinate several L3 caches sitting on entirely different chips.

In every scenario, each cache only needs to talk to the cache above it (to obtain data and permissions) and to the caches below it (to grant or revoke data and permissions). All of this is invisible to the running threads. From the software's point of view, the memory subsystem looks like a single, coherent monolith... with highly variable latency.

Why synchronization is still important


We have discussed how powerful and coherent a computer's memory system is. One question remains: if the caches are so consistent, why do we need volatile at all in languages like Java?

This is a very complicated question that is better answered elsewhere, but let me drop a hint. Data in CPU registers is not kept in sync with data in the cache or memory. The compiler performs all sorts of optimizations when loading data into registers, writing it back to the cache, and even reordering instructions, all under the assumption that the code will run in a single thread. That is why any data at risk of a race condition must be protected manually using concurrent algorithms and language constructs such as atomics and volatile.
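
The classic demonstration is a stop flag without volatile (a sketch of my own, not from the original article): the caches stay perfectly coherent, yet the JIT may keep the flag in a register, so the worker may never notice the update.

```java
// Classic illustration: without volatile, the JIT may hoist the read of `running`
// into a register inside the hot loop, so the worker can spin forever even though
// the hardware caches themselves are perfectly coherent.
public class StopFlagHazard {
    static boolean running = true;   // plain field: reads may be hoisted into a register

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            long spins = 0;
            while (running) {        // may be compiled as if the value never changes
                spins++;
            }
            System.out.println("stopped after " + spins + " spins");
        });
        worker.start();
        Thread.sleep(100);
        running = false;             // the write reaches the cache, but the loop may never re-read it
        worker.join(1000);           // on many JVMs the worker is still spinning here
        System.out.println("worker alive: " + worker.isAlive());
        System.exit(0);              // force exit; the spinning worker would otherwise keep the JVM alive
    }
}
```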

In the case of the volatile qualifier in Java, part of the solution is to force all reads and writes to bypass the local registers and go straight to the cache. As soon as the data is read from or written to the L1 cache, the hardware coherence protocol takes over and guarantees consistency across all threads on all cores. Thus, if several threads read and write the same variable, they all stay in sync with one another. This is how threads can coordinate with each other in roughly a nanosecond.
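
The same sketch with volatile shows the fix: every access now goes through the cache, where the hardware coherence protocol makes the update visible almost immediately.

```java
// The same stop-flag sketch, fixed with volatile: every read of `running` goes to
// the cache (and every write is published to it), so the hardware coherence
// protocol propagates the update to the worker almost immediately.
public class StopFlagFixed {
    static volatile boolean running = true;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (running) { }      // re-reads the field on every iteration
            System.out.println("stopped");
        });
        worker.start();
        Thread.sleep(100);
        running = false;             // becomes visible to the worker via cache coherence
        worker.join();
    }
}
```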

Source: https://habr.com/ru/post/354748/

