
PC Development and Performance - Memory Latency

Herb Sutter (author of Exceptional C++, head of the ISO C++ standards committee, Mr. "The Free Lunch Is Over", and so on and so forth) works at Microsoft and from time to time gives lectures on Wednesdays.

I finally made it to one of them, and was very glad I did. It is always a pleasure to watch and listen to smart people.
For the record: besides Herb, I have seen Alexandrescu live, and Walter Bright live (the man behind D).

The lecture was called "Machine Architecture: Things Your Programming Language Never Told You" (the presentation and video are available for download) and was about one specific part of the abstraction penalty: memory latency.

I will try to briefly retell the key idea of the lecture. It is simple, obvious, and has been said a thousand times. But repeating the alphabet one more time never hurts.


For the very beginners: what latency and bandwidth are


Bandwidth is the width of the channel: how much data can be pumped through per second, how many instructions can be issued to keep the ALUs fully loaded, and so on.
Latency is the length of the channel, that is, how long it takes for the data you asked for to reach you: after how many clock cycles the requested bit arrives from memory, after how many cycles the result of an instruction is ready, when a command makes it to the end of the pipeline, and so on.
And they, of course, affect each other. As soon as a result is needed and there is nothing else left to do, all that bandwidth sits idle because of latency. We requested memory that is not in the cache - we sit and wait for the memory. We wanted to execute an instruction that needs the result of the previous one - we wait for it to complete. This creates "bubbles" in the channel and correspondingly lowers utilization.

In the presentation Herb uses the example of a pipeline, and it is quite illustrative. You can pump a wild number of barrels per minute, yet each individual barrel takes several days to reach its destination. Bandwidth and latency in their pure form.

A practically important point: bandwidth is always easy to buy. Put in two processors, fetch twice as much data from memory, or, in the end, put two computers side by side. Latency is much more expensive - two women will not deliver a child in four and a half months - and it moves only with technological progress: raising clock frequencies, shrinking feature sizes, changing the technology, and so on.

And the last twenty-plus years show that latency improves much more slowly. Memory latency especially.


Hold on, Herb had a table for this...

                                 
                          1980 (VAX-11/750)   Modern desktop   Improvement since 1980
Clock speed (MHz)         6                   3000+            +500x
Memory size (RAM, MB)     2                   2000+            +1000x
Memory bandwidth (MB/s)   13                  7000 (read)      +540x
                                              2000 (write)     +150x
Memory latency (ns)       225                 ~70              +3x
Memory latency (cycles)   1.4                 210              -150x (!!!)


The table clearly shows that processor clock speed has grown nicely, memory size has grown nicely, memory bandwidth is again terrific, but memory latency has improved only threefold since the days of the VAX. Measured in clock cycles (the last row), it has gotten 150 times worse.
Which means that a cache miss costs orders of magnitude more than even the heaviest processor instructions.

In the 80s things were simple and great: the cost of a memory access was quite comparable to, if not lower than, that of computational instructions (let alone floating point).
There is a processor, a disk and memory, and the programmer operates on them directly. The code runs transparently, predictable down to the clock cycle.

Now the hardware is, in fact, completely different. A memory access costs hundreds of cycles. Yes, you get a whole cache line at a time (32 or 64 bytes), but you still wait hundreds of cycles. In one millisecond, for example, you can touch different places in memory roughly 10,000 times. 100 objects of different classes, each calling 10 virtual functions - that is already 20+% of a millisecond. In gamedev these are very real numbers. And memory traffic, generally speaking, is the most important resource we have.
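
A rough back-of-the-envelope check of that figure, using the round numbers from this text (my arithmetic, assuming ~200 cycles per random access and roughly 2,000,000 cycles per millisecond): 100 objects times 10 virtual calls is 1,000 dispatches; if each dispatch has to touch cold memory (the object itself, its vtable, the code of the called function), that is on the order of 1,000-2,000 misses at ~200 cycles each, i.e. 200,000-400,000 cycles - 10-20% of the millisecond spent doing nothing but waiting.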

And all of this is just about memory. If you have to go to disk, that is already beyond good and evil: latency of tens of millions of cycles.

The cure is, of course, caches and a cache hierarchy. L1 - 2 cycles, L2 - 14 cycles, L3 - let's say about 40. Separate caches for data and for instructions.
Complex cache logic, the know-how of the various processor manufacturers, and so on.
On top of that, out-of-order execution is a must, to try to do something that does not depend on whatever is being waited for.
Out-of-order execution, register renaming, necessarily powerful branch prediction, starting memory reads and writes as early as possible. If a branch goes the wrong way, it immediately flushes the out-of-order machinery, and that is a disaster.
And inside there is, again, a long pipeline. On the P4 it was even pathologically long - up to 25 instructions in flight at once, with the out-of-order window looking about a hundred ahead. On the latest processors the pipeline is shorter, but still opaque.

Sutter writes that on Itanium2 the cache occupies 85% of the processor area.
For the Core Duo I could not google a figure; I assume it is about the same.
Another 10 percent or more goes to out-of-order logic, branch prediction and other goodies.
Only a few percent is left for the actual ALUs, the part that actually computes anything.

A modern processor is not a calculator, but a giant hardware emulator of x86 instructions.
All of this exists to hide latency from the programmer. So that you can keep programming as in the 80s - when there is only a processor and memory, and memory can be accessed as often as you like. So that old code keeps running ever faster, and new code can be written the same way.

And mind you, we are trying to hide a 150-fold slowdown! Unnoticed by the programmer! Without changing his data structures! So that he does not even notice the changed order of execution of instructions!

Of course, this exercise will never be optimal.

From the fact that the programmer, in a sense, lives in the land of elves, Sutter draws two practical conclusions.

The first is that it affects the correctness of the programs.


Everywhere assumptions are made about the ordering of reads and writes to memory - above all in Sutter's beloved multithreading.
If, assuming that writing an int to memory is atomic, you start building lock-free interaction between threads, you will get hurt.

For example:

// Shared between the two threads:
// int flag1 = 0, flag2 = 0;

Thread1:
flag1 = 1;                // announce that we want the shared resource
if (flag2 != 0) { … }     // the other thread got there first - back off
// otherwise enter the critical section

Thread2:
flag2 = 1;                // same handshake, mirrored
if (flag1 != 0) { … }
// otherwise enter the critical section



Thread1 first sets flag1 - the flag saying it wants the shared resource - and then checks whether the resource is already taken by the other thread. The assumption is that flag2 will be read only after flag1 has been set (so that we do not enter the critical section while the other thread holds it).
In reality there will be a complete mess: because of out-of-order execution the read of the other thread's flag can be issued very early (formally this read depends on nothing, so it may be hoisted above the preceding write), and there is no synchronization at all - both threads can see 0 and both can enter.
So you have to lock honestly. You cannot rely on memory as something that faithfully reflects the values of your variables.
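
For reference, here is a minimal sketch of how the same handshake looks with C++11 atomics (my addition, not from the lecture): making the flags std::atomic and using the default sequentially consistent ordering forbids both the compiler and the hardware from hoisting the read above the preceding write.

#include <atomic>

std::atomic<int> flag1(0), flag2(0);   // shared between the two threads

void thread1() {
    flag1.store(1);                    // seq_cst store: not reordered past the load below
    if (flag2.load() == 0) {
        // safe to enter the critical section
    }
}

void thread2() {
    flag2.store(1);
    if (flag1.load() == 0) {
        // safe to enter the critical section
    }
}

// With the default memory_order_seq_cst at most one of the two loads can return 0,
// so the threads can never both be inside at once. (They can, however, both see 1
// and both back off - the full Dekker/Peterson algorithm adds a "turn" variable
// for that - or you can simply take an honest std::mutex.)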

The second, and the most fun one, is of course performance.


For a long time now it is mostly memory that slows things down. Mainly because of latency, not bandwidth. A random memory read is far more expensive than a whole cloud of computation. Locality matters, at every scale.

By the way, what counts as "random" in a real program is terribly blurred by the opaque cache hierarchy.
It seems that if you use something a lot, it will end up in the cache. On the other hand, how big the real working set is at any given moment is hard to estimate properly.
And it is different on every processor. And it depends heavily on the data. And, best of all, it keeps drifting as the program runs!
I once reduced an example to a synthetic one - and it started fitting in the cache. Oops.

Fortunately (unfortunately?), the price of a cache miss is so high that serious problems can be measured even through this thick layer.
The speed of random access (which measures latency) versus sequential access (which measures bandwidth) differs by an order of magnitude. That is the difference between std::vector and std::list.
Worse, it can be the difference between std::vector<T> and std::vector<T*> (which, as everyone knows, is also what an array of objects is in Java or .NET).
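
To make the vector-of-values versus vector-of-pointers point concrete, here is a minimal sketch (my illustration in modern C++, not from the talk; the Particle type is made up): the first loop walks memory sequentially and the prefetcher keeps up, while the second chases a pointer per element, and with the objects scattered across the heap every step can pay the full random-access latency.

#include <memory>
#include <vector>

struct Particle { float x, y, z, mass; };

// Contiguous layout: elements sit back to back, streamed at roughly bandwidth speed.
float total_mass(const std::vector<Particle>& ps) {
    float m = 0.f;
    for (const Particle& p : ps) m += p.mass;   // sequential access
    return m;
}

// Indirect layout: one dereference per element; if the Particles were allocated
// separately and ended up scattered, each iteration is a potential cache miss.
// (This is also what an "array of objects" in Java or .NET effectively is.)
float total_mass(const std::vector<std::unique_ptr<Particle>>& ps) {
    float m = 0.f;
    for (const auto& p : ps) m += p->mass;      // pointer chase per element
    return m;
}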

In the end, you should always think about memory: about locality as well as about cost.
Measure whether you are memory-bound. When it is random access, you can think productively and solve it. And when it is the footprint - that happens too.

A good example of such a fight for locality was described over at gamedeff.

But it is still impossible to measure and predict precisely. Everything is very thick, non-linear and opaque. Underneath you runs a big machine with incomprehensible logic and, worse, incomprehensible load. The network will wake up in the background and mess everything up. Or, God forbid, an indexer.

And what to do about this in the PC world, I do not know


Forgive me - of all application domains I have only ever written games, so I will reason and compare platforms using the example of my beloved gamedev.

On the one hand, I want more control. To have a defined place in the cache with guaranteed access time. To have some guarantee that the cache will not be trashed out from under me at the first context switch.
For example, it is easy to talk about how good things are in the console world, where the hardware is completely different. The SPU, with 256 KB of very fast, fully managed local memory and explicit requests to main memory via wide DMA transfers (to hide latency). Or the Xbox 360, where you can lock part of the cache for a while and even ask the GPU to render from it.
Neither of these models will fly on a PC in its pure form.
On one processor there are many threads at once; if each of them manages its own 256 KB of memory, then on a context switch that memory has to be completely unloaded and reloaded. Context switches become heavy and long, and a typical OS has, well, a ton of even semi-active threads.
Cache locking cannot be allowed for the same reasons: it means either spilling the locked lines to memory on every context switch, or taking them away from other applications for good. And if even only the active ones grab them, everything else slows down.

Worse, ordinary applications have no upper bounds at all. The document being loaded may be 10 kilobytes or 100 megabytes. The size of an Excel spreadsheet can vary by a factor of thousands; you cannot impose an upper bound on memory the way you can on a console.

And the hardware configuration and the amount of memory are always different, and the target is vague - "use less memory and run faster". And the hardware spends more effort emulating than doing actual work.

The life of a single application on fixed hardware, with no backward compatibility to worry about, is fundamentally different from the life of a crowd of heterogeneous applications on unspecified hardware, with old code and all the other requirements. The longer I look at it, the more I think these are different worlds.

And this is only a small part of the problems. I would say the fundamental ones are backward compatibility and a balance between performance and development cost completely different from that on consoles. But one could write about that endlessly - some other time.

Finally, a few brief meditative numbers (taken from my home machine):

floating point mul: 0.5-4 cycles (on one core)
L1 access (~ 16-32 kb): ~ 2-3 cycles
L2 access (~ 2-4 mb): ~ 15 cycles
Random Memory Access: ~ 200 cycles
Sequential Access with Prefetch: ~ 2 bytes / cycle
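
If you want to get numbers of this kind on your own machine, one crude way (a sketch of mine, not the tool used for the figures above) is to time a dependent pointer-chasing walk over a buffer much larger than the last-level cache; because every load needs the result of the previous one, the loop measures latency rather than bandwidth.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const size_t N = 1 << 24;          // ~16M slots * 8 bytes = 128 MB, far beyond any cache
    std::vector<size_t> next(N);

    // Build a single random cycle: next[i] holds the index to visit after i.
    std::vector<size_t> order(N);
    std::iota(order.begin(), order.end(), size_t(0));
    std::shuffle(order.begin(), order.end(), std::mt19937_64(42));
    for (size_t i = 0; i + 1 < N; ++i) next[order[i]] = order[i + 1];
    next[order[N - 1]] = order[0];

    // Dependent loads: each iteration waits for the previous one, so this measures latency.
    const auto t0 = std::chrono::steady_clock::now();
    size_t p = 0;
    for (size_t i = 0; i < N; ++i) p = next[p];
    const auto t1 = std::chrono::steady_clock::now();

    const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / double(N);
    std::printf("~%.1f ns per dependent random access (ignore: %zu)\n", ns, p);
    return 0;
}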

So it remains to fight, folks. Understand the price of abstraction at this level too, and do not let your brain relax and live in the eighties.

Source: https://habr.com/ru/post/43905/

