
Worst case execution time on x86

In the last post I described how and why interrupt latency is measured on the Atom platform.

Today I will talk about why the same code, on the same input data, can take a different amount of time to execute each run. For some realtime applications this is a highly undesirable effect that has to be fought.

Just in case: the term "worst-case execution time" in the title is, of course, used loosely, merely to name the phenomenon. The effect itself is in no way specific to x86.

Suppose, for example, we have an interrupt that fires when some piece of hardware has delivered new data into memory via DMA. In the ISR we run the data through a FIR filter and hand the result to, say, another piece of hardware. Don't ask why x86 is used for such a task; it is just an example, and in this example assume that plenty of other necessary code is running on the machine at the same time.
It would seem that every interrupt then executes exactly the same sequence of instructions, so every run should take the same amount of time. Right? Wrong!
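To make the example concrete, here is a minimal C sketch of the kind of FIR filtering such an ISR might do. The tap count and coefficient values are placeholders of my own, not from the original setup:

```c
#include <stddef.h>

#define NTAPS 4

/* Hypothetical coefficients; a real design would compute these
   for the desired frequency response. */
static const float coeffs[NTAPS] = {0.25f, 0.25f, 0.25f, 0.25f};

/* Direct-form FIR: out[i] = sum over k of coeffs[k] * in[i-k],
   treating samples before the start of the buffer as zero. */
static void fir_filter(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < NTAPS && k <= i; k++)
            acc += coeffs[k] * in[i - k];
        out[i] = acc;
    }
}
```

For a fixed buffer size this executes exactly the same instruction sequence on every call, which is what makes the run-to-run timing variation so counterintuitive.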

There are two main traps.
1. Caches. The filter coefficients are fetched from memory. They may be in the L1D cache, the L2 cache, the L3 cache (if there is one), or in DRAM, and you cannot tell in advance which. If the example used virtual memory, a DTLB miss might (or might not) have occurred as well. Finally, the ISR code itself may sit in the L1I cache, in L2, and so on. The main source of unpredictability here is that previously executed code may have filled the cache with its own data. And a cache can be shared between physical or HT cores.
2. Hyper-Threading. Depending on which instructions the second hardware thread is executing, the first thread is slowed down to a varying degree (by a factor of 1–3, on average about 1.5× on the Atom). The slowdown mechanism differs slightly between the Core, with its out-of-order pipeline, and the Atom, with its in-order pipeline. On average, Atom's HT siblings affect each other's performance more strongly, because they have to share scarcer resources, such as access to the cache.

What can be done?
1. Caches. First, use software prefetch: at the very start of the ISR, request the filter coefficients into the cache.
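A minimal sketch of that idea, assuming GCC/Clang's `__builtin_prefetch` intrinsic (which compiles to a PREFETCH instruction on x86); the function name is my own:

```c
#include <stddef.h>

/* Warm the coefficient array at ISR entry so the later loads hit cache.
   Issue one prefetch per 64-byte cache line. The arguments 0 and 3 mean
   "read access" and "high temporal locality" respectively. */
static void prefetch_coeffs(const float *coeffs, size_t n)
{
    for (size_t i = 0; i < n; i += 64 / sizeof(float))
        __builtin_prefetch(&coeffs[i], 0, 3);
}
```

Prefetching is only a hint: it does not change program results, it just moves the cache misses to a point where they overlap with other work.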
If we are talking about the shared L3 cache in the newer Core processors, there is one more option: cache coloring. You can use it to allocate a region of memory whose cache lines are harder for processes running on other cores to evict. If you patch the OS's virtual memory subsystem, you can hand an application chunks of virtual memory that can only map to certain regions of the shared last-level cache. For example, you can divide the Core's L3 cache into 32 regions and, when allocating memory on different cores, pick non-overlapping segments. I have seen this technique change the execution time of a particular piece of code from 400–900 µs to a much tighter 500–600 µs.
Unfortunately, this kind of magic cannot be applied to long contiguous regions of physical memory: a large enough contiguous region inevitably spans every color.
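As a back-of-the-envelope illustration of the arithmetic behind such a 32-way split, here is a sketch. The cache geometry (8 MiB, 16-way, 64-byte lines, hence 8192 sets, with 4 KiB pages) is an assumption of mine, not a figure from the post:

```c
#include <stdint.h>

/* Assumed geometry: 8 MiB L3, 16-way, 64 B lines -> 8192 sets.
   A 4 KiB page covers 64 consecutive lines, so there are
   8192 / 64 = 128 page "colors"; grouping them 4 at a time
   yields the 32 regions mentioned above. */
#define CACHE_SETS   8192
#define LINE_SIZE    64
#define PAGE_SIZE    4096
#define NUM_REGIONS  32

#define LINES_PER_PAGE (PAGE_SIZE / LINE_SIZE)       /* 64  */
#define PAGE_COLORS    (CACHE_SETS / LINES_PER_PAGE) /* 128 */

/* Which coloring region a physical page falls into: the set-index
   bits just above the page offset select the page color. */
static unsigned page_region(uint64_t phys_addr)
{
    unsigned color = (unsigned)((phys_addr / PAGE_SIZE) % PAGE_COLORS);
    return color / (PAGE_COLORS / NUM_REGIONS);
}
```

This also shows why long contiguous physical regions defeat coloring: walking physical pages in order cycles through all 128 colors, and therefore through all 32 regions.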

2. Hyper-Threading. Disable it! Alternatively, at the start of code whose execution time must be predictable, send an IPI to the second (sibling) hardware thread, and write an IPI handler that, on receiving it, waits, for example, on MWAIT. That way you keep the performance gain Hyper-Threading gives while almost eliminating the unwanted side effect.
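A user-space analogue of this scheme can be sketched with a shared flag standing in for the IPI and a spin loop standing in for MWAIT (MONITOR/MWAIT is privileged, so the real handler lives in the kernel). Everything here, names included, is my own illustration:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Flag standing in for the IPI: the kernel version would send an
   inter-processor interrupt and park the sibling in MONITOR/MWAIT. */
static atomic_bool quiesce = false;
static atomic_int  work_done = 0;

/* Background work for the sibling HT thread (run it e.g. via
   pthread_create); it parks itself while quiesce is set. */
static void *sibling_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        while (atomic_load(&quiesce))
            ;  /* the real handler would sit in MWAIT here */
        atomic_fetch_add(&work_done, 1);
    }
    return (void *)0;
}

/* Latency-critical section: fence off the sibling, run, release it. */
static void critical_section(void)
{
    atomic_store(&quiesce, true);   /* "send the IPI" */
    /* ... the time-critical FIR processing would run here ... */
    atomic_store(&quiesce, false);  /* let the sibling resume */
}
```

The point of MWAIT over a plain spin loop is that a waiting thread releases the shared execution resources of the core, so the critical thread runs nearly as if Hyper-Threading were off.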

Both of the phenomena above actually exist to increase overall performance. It is just that for some realtime applications, even small (100–500 µs) but unpredictable delays are highly undesirable. Therefore, future architectures will gain features that help achieve more predictable performance, for example, finer software control over caching.

Source: https://habr.com/ru/post/107075/
