Some time ago we had occasion to evaluate memory performance in the context of Hyper-Threading technology and concluded that its influence is not always positive. When a free quantum of time appeared, we wanted to continue that research and examine the processes involved down to individual machine clock cycles and bits, using software of our own design.
Platform under investigation
The test subject is an ASUS N750JK laptop with an Intel Core i7-4700HQ processor. The base clock frequency is 2.4 GHz, raised to 3.4 GHz in Intel Turbo Boost mode. It has 16 gigabytes of DDR3-1600 RAM (PC3-12800) installed, operating in dual-channel mode. The operating system is 64-bit Microsoft Windows 8.1.

Fig.1 Configuration of the platform under study.
The processor of the platform under investigation contains 4 cores; with Hyper-Threading technology enabled, this provides hardware support for 8 threads, or logical processors. The platform firmware passes this information to the operating system via the ACPI table MADT (Multiple APIC Description Table). Since the platform contains only one memory controller, there is no SRAT (System Resource Affinity Table), which would declare the proximity of processor cores to memory controllers. Clearly the laptop under study is not a NUMA platform, but the operating system, for the sake of unification, treats it as a NUMA system with a single domain, as indicated by the line NUMA Nodes = 1. The fact that matters most for our experiments is that the first-level data cache is 32 kilobytes for each of the four cores, and the two logical processors sharing one core also share its first- and second-level cache memory.
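As a minimal sketch (not the author's utility), the number of NUMA nodes that Windows reports, matching the NUMA Nodes = 1 line in Fig.1, can be queried with the documented GetNumaHighestNodeNumber API:

#include <windows.h>
#include <cstdio>

int main() {
    ULONG highestNode = 0;
    if (GetNumaHighestNodeNumber(&highestNode)) {
        // Node numbers are zero-based, so the count is highestNode + 1.
        std::printf("NUMA nodes reported by the OS: %lu\n", highestNode + 1);
    }
    return 0;
}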
Operation under investigation
We will investigate how the speed of reading a block of data depends on its size. For this we choose the most efficient method, namely reading 256-bit operands with the AVX instruction VMOVAPD. On the graphs, the block size is plotted along the X axis and the read speed along the Y axis. Near the X value corresponding to the size of the first-level cache we expect to see an inflection point, since performance should drop once the processed block no longer fits in cache memory. In our test, in the case of multi-threaded processing, each of the 16 launched threads works with its own address range. To control Hyper-Threading technology from within the application, each thread calls the API function SetThreadAffinityMask, which sets a mask containing one bit per logical processor. A bit value of one allows the thread to run on the corresponding logical processor; a zero value forbids it. For the 8 logical processors of the platform under study, mask 11111111b allows all processors to be used (Hyper-Threading enabled), while mask 01010101b allows one logical processor in each core (Hyper-Threading disabled). A simplified sketch of this measurement loop is given below.
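The following C++ sketch illustrates the idea under stated assumptions; the author's actual tool is hand-written and uses pure VMOVAPD reads, whereas here an accumulating add is included so the compiler does not optimize the loop away, and names such as kBlockBytes are illustrative.

#include <windows.h>
#include <immintrin.h>
#include <intrin.h>
#include <malloc.h>
#include <cstdio>

constexpr size_t kBlockBytes = 32 * 1024;   // block size swept along the X axis
constexpr size_t kPasses     = 1000000;     // repeated passes over the same block

// One pass over the block with 256-bit aligned loads (_mm256_load_pd compiles
// to VMOVAPD); the volatile sink keeps the loop from being eliminated.
static void ReadBlock(const double* p, size_t bytes) {
    __m256d acc = _mm256_setzero_pd();
    for (size_t i = 0; i < bytes / sizeof(double); i += 4)
        acc = _mm256_add_pd(acc, _mm256_load_pd(p + i));
    volatile double sink = _mm256_cvtsd_f64(acc);
    (void)sink;
}

int main() {
    // Mask 01010101b = 0x55: one logical processor per core (Hyper-Threading "off");
    // mask 11111111b = 0xFF would allow all eight logical processors.
    SetThreadAffinityMask(GetCurrentThread(), 0x55);

    double* block = static_cast<double*>(_aligned_malloc(kBlockBytes, 32));
    for (size_t i = 0; i < kBlockBytes / sizeof(double); ++i) block[i] = 1.0;

    unsigned long long t0 = __rdtsc();
    for (size_t n = 0; n < kPasses; ++n)
        ReadBlock(block, kBlockBytes);
    unsigned long long clocks = __rdtsc() - t0;

    std::printf("Average TSC clocks per pass: %llu\n", clocks / kPasses);
    _aligned_free(block);
    return 0;
}

In the multi-threaded experiments, each of the 16 threads would run such a loop over its own address range, with the affinity mask selecting either four or eight logical processors.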
The following abbreviations are used on the graphs:
MBPS (Megabytes per Second): block read speed in megabytes per second;
CPI (Clocks per Instruction): number of clock cycles per instruction;
TSC (Time Stamp Counter): CPU clock cycle counter.
Note: the frequency at which the TSC register is incremented may not match the processor clock frequency when operating in Turbo Boost mode. This must be taken into account when interpreting the results.
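To make the caveat concrete, here is an illustrative conversion of raw measurements into the plotted quantities; the variable names and numeric values are examples of ours, not taken from the author's tool. Since the TSC runs at the nominal base frequency, a CPI derived this way is slightly pessimistic when the core is actually running at the Turbo Boost frequency.

#include <cstdio>

int main() {
    const double tscHz        = 2.4e9;              // nominal TSC frequency, Hz
    const double bytesRead    = 32.0 * 1024 * 1e6;  // total bytes read during the run (example)
    const double instructions = bytesRead / 32;     // one 256-bit VMOVAPD per 32 bytes
    const double tscClocks    = 5.0e8;              // measured TSC delta (example value)

    const double seconds = tscClocks / tscHz;
    const double mbps    = bytesRead / 1e6 / seconds;  // MBPS as plotted on the Y axis
    const double cpi     = tscClocks / instructions;   // clocks per instruction

    std::printf("MBPS = %.0f, CPI = %.3f\n", mbps, cpi);
    return 0;
}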
On the right side of the graphs is a hexadecimal dump of the instructions making up the loop body of the target operation executed in each program thread, or the first 128 bytes of that code.
Experiment 1. One thread
Fig.2 Reading in one thread.
The maximum speed is 213563 megabytes per second. The inflection point occurs at a block size of about 32 kilobytes.
Experiment 2. 16 threads on 4 processors, Hyper-Threading disabled
Fig.3 Reading in sixteen threads. The number of logical processors used is four; Hyper-Threading is turned off.
The maximum speed is 797598 megabytes per second. The inflection point occurs at a block size of about 32 kilobytes. As expected, compared with single-threaded reading, the speed increased approximately 4 times, in proportion to the number of working cores.
Experiment 3. 16 threads on 8 processors, Hyper-Threading enabled
Fig.4 Reading in sixteen threads. The number of logical processors used is eight; Hyper-Threading is enabled.
The maximum speed is 800722 megabytes per second; enabling Hyper-Threading barely increased it. The big minus is that the inflection point now occurs at a block size of about 16 kilobytes: the speed drop sets in at half the previous block size, so the average speed fell significantly. This is not surprising: each core has its own first-level cache, while the logical processors of one core share it.
Findings
The investigated operation scales quite well on a multi-core processor. The reasons: each core has its own first- and second-level cache memory, the size of the target block is comparable to the cache size, and each thread works with its own address range. We created these conditions deliberately in a synthetic test for academic purposes, realizing that real applications are usually far from such ideal optimization. Yet even under these conditions, enabling Hyper-Threading had a negative effect: for a small gain in peak speed, there is a significant loss in processing speed for blocks whose size lies in the range from 16 to 32 kilobytes.