
Tale of the impossible bug: big.LITTLE and caching

When someone says multi-core, we unconsciously think of SMP. That assumption served us well until ARM announced big.LITTLE. The ARM big.LITTLE architecture is the first mass-produced example of an AMP (asymmetric multiprocessing) architecture, and as we will see, it raises the bar for multi-core programming complexity even higher.



Tale of the impossible bug



It all started with a bug report from a phone built on the Exynos chipset, the one used in Samsung phones in Europe. Applications built with our software were crashing with SIGILL in seemingly random places. Nothing could reasonably explain what was happening, and the crashes occurred on perfectly valid processor instructions. This immediately made us suspect that the instruction cache was not being flushed properly.



After reviewing all of the JIT code that flushes the cache, we were confident that we were calling __clear_cache correctly. This made us look at how other virtual machines and compilers flush the cache on ARM64, and we found a list of errata for the Cortex-A53. ARM's descriptions of these problems are vague and difficult to understand, so we tried the workarounds anyway. None of them helped.



Then we approached it from the other side. Could the problem be in the signal handler? No. Clumsy CPU emulation in user space? No. A broken libc implementation? Nice try. Faulty hardware? We reproduced it on several devices. Bad luck or karma? Yes!


Some of us could not sleep with such an amazing puzzle in front of us and kept staring at the crash dumps. One curious thing stood out: the faulting address was always on the third or fourth line of the memory dump.







This was our only clue, and with a bug this hard to understand, there is no such thing as a coincidence. Our memory dumps were 16-byte aligned, while the SIGILL address always ended in the range 0x40-0x7f or 0xc0-0xff. So we aligned the memory dumps in a way that made it easier to check that the allocator was working correctly:



 $ grep SIGILL *.log
 custom_01.log:E/mono (13964): SIGILL at ip=0x0000007f4f15e8d0
 custom_02.log:E/mono (13088): SIGILL at ip=0x0000007f8ff76cc0
 custom_03.log:E/mono (12824): SIGILL at ip=0x0000007f68e93c70
 custom_04.log:E/mono (12876): SIGILL at ip=0x0000007f4b3d55f0
 custom_05.log:E/mono (13008): SIGILL at ip=0x0000007f8df1e8d0
 custom_06.log:E/mono (14093): SIGILL at ip=0x0000007f6c21edf0
 [...]


With this we formulated our first good hypothesis: the failed cache flush always hit the upper 64 bytes of a 128-byte block. If you do low-level programming, these numbers will immediately remind you of cache line sizes. From that moment on, everything started to make sense.
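The hypothesis can be checked against the six addresses in the log excerpt above. Below is a minimal C sketch of that arithmetic. The helper names `in_upper_half_of_128` and `count_upper_half` are hypothetical and were not part of our tooling; only the addresses come from the logs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when addr lies in the upper 64 bytes
   of its 128-byte-aligned block (offsets 0x40..0x7f), otherwise 0. */
static int in_upper_half_of_128(uint64_t addr)
{
    return (addr & 0x7f) >= 0x40;
}

/* The six faulting addresses taken from the grep output. */
static const uint64_t sigill_addresses[] = {
    0x0000007f4f15e8d0ULL,
    0x0000007f8ff76cc0ULL,
    0x0000007f68e93c70ULL,
    0x0000007f4b3d55f0ULL,
    0x0000007f8df1e8d0ULL,
    0x0000007f6c21edf0ULL,
};

/* Hypothetical helper: counts how many of the first n addresses
   fall into the upper half of a 128-byte block. */
static int count_upper_half(const uint64_t *addrs, int n)
{
    assert(addrs != NULL && n >= 0);
    int hits = 0;
    for (int i = 0; i < n; i++)
        hits += in_upper_half_of_128(addrs[i]);
    return hits;
}
```

Calling `count_upper_half(sigill_addresses, 6)` reports that all six addresses sit in the upper half, which matches the 0x40-0x7f and 0xc0-0xff pattern once the low byte is reduced modulo 128.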



Below is pseudo-code for how libgcc flushes the cache on arm64:



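 void __clear_cache (char *address, size_t size)
 {
     /* Queried once on the first call, then reused for every later call. */
     static int cache_line_size = 0;

     if (!cache_line_size)
         cache_line_size = get_current_cpu_cache_line_size ();

     /* Flush one cache line per step across the whole range. */
     for (size_t i = 0; i < size; i += cache_line_size)
         flush_cache_line (address + i);
 }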


In the example above, get_current_cpu_cache_line_size stands for the processor instruction that returns the cache line size, and flush_cache_line flushes the cache line that contains the given address.



At that time we used our own implementation of this function, so we decided to run it on its own and print the cache line size reported by the processor. To our surprise it printed both 128 and 64. We double-checked that this was really the case. Then we consulted the manual for this processor, and it turned out that the big cores have a cache line size of 128 bytes and the LITTLE cores have 64.



So __clear_cache could first be called on a big core with 128-byte instruction cache lines, cache that size, and later run on one of the LITTLE cores with 64-byte lines, where it skipped every other cache line while flushing. It does not get simpler than that. We removed the caching of the line size and it all worked.
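The effect of a stale line size can be reproduced without any ARM hardware. The following C sketch only simulates the flush loop over a small buffer; the function name `count_unflushed_lines` and the 512-byte region are assumptions made for illustration, and no real cache instruction is involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define REGION_SIZE 512 /* bytes of simulated JIT code (arbitrary) */

/* Hypothetical simulation: walks the region in steps of `stride` bytes,
   as the flush loop does with its cached line size, and marks the whole
   hardware line of `hw_line` bytes containing each visited address as
   flushed. Returns the number of hardware lines never flushed. */
static int count_unflushed_lines(size_t hw_line, size_t stride)
{
    unsigned char flushed[REGION_SIZE];

    assert(hw_line > 0 && stride > 0 && REGION_SIZE % hw_line == 0);
    memset(flushed, 0, sizeof flushed);

    for (size_t i = 0; i < REGION_SIZE; i += stride) {
        /* Flushing an address flushes the line that contains it. */
        size_t line_start = i - (i % hw_line);
        memset(flushed + line_start, 1, hw_line);
    }

    int missed = 0;
    for (size_t line = 0; line < REGION_SIZE; line += hw_line)
        if (!flushed[line])
            missed++;
    return missed;
}
```

With `count_unflushed_lines(64, 128)`, a 128-byte stride learned on a big core applied to 64-byte lines on a LITTLE core, every second line is missed: offsets 64, 192, 320 and 448 in this region. The opposite combination, `count_unflushed_lines(128, 64)`, misses nothing and merely flushes each line twice, which is why using the smaller size is the safe direction.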



Findings



Some ARM big.LITTLE processors have cores with different cache line sizes, and almost no existing code is prepared to deal with this, since it assumes that all cores are symmetric.



Worse, even the ARM instruction set is not ready for this. An astute reader may realize that computing the cache line size on each call is still not enough for user-space code: the process can be moved to a different core while it is executing __clear_cache with a certain cache line size, and that size may no longer be valid there. So we have to try to figure out the global minimum cache line size across all cores. Here is our fix for Mono: Pull Request . Other projects have already borrowed our fix: Dolphin and PPSSPP .
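The idea of tracking a global minimum can be sketched in portable C11. This is not the code from the pull request: the function name `record_line_size` is hypothetical, and the observed size is passed in as a parameter instead of being read from the processor, so the sketch shows only the bookkeeping.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Smallest cache line size seen so far on any core; 0 means that
   nothing has been observed yet. Shared by all threads. */
static _Atomic size_t min_line_size = 0;

/* Hypothetical helper: records the line size observed on the current
   core and returns the smallest size seen so far. The flush loop would
   use the returned value as its stride. */
static size_t record_line_size(size_t observed)
{
    assert(observed > 0);

    size_t current = atomic_load(&min_line_size);
    while (current == 0 || observed < current) {
        /* Try to publish the smaller value. On failure `current` is
           reloaded with the latest stored value and the loop re-checks. */
        if (atomic_compare_exchange_weak(&min_line_size, &current, observed))
            return observed;
    }
    return current;
}
```

If a thread first records 128 on a big core and later records 64 on a LITTLE core, every following call returns 64, even when it runs on a big core again. A minimum learned this way is only as good as the cores that have been observed so far, which is why the text speaks of trying to figure out the global minimum.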

Source: https://habr.com/ru/post/320342/


