CPU cores, or what SMP is and how it works

Introduction

Good day. Today I would like to touch on a fairly simple topic that few ordinary programmers know about, although each of you has most likely used it.
It is symmetric multiprocessing (SMP for short), an architecture that every modern multitasking operating system supports and builds on. Everyone knows that the more cores a processor has, the more powerful it is. That is true, but how does the OS use several cores at the same time? Many programmers never descend to this level of abstraction because they simply do not need to, but I think everyone will be interested in how SMP works.

Multitasking and its implementation

Anyone who has studied computer architecture knows that a single processor core cannot perform several tasks at once; multitasking is provided by the operating system, which switches between tasks. There are several kinds of multitasking, but the most practical and widely used is preemptive multitasking (its main aspects are described on Wikipedia). Each process (task) has a priority that affects how much processor time it receives. Each task is given a time quantum in which it runs; when the quantum expires, the OS transfers control to another task. The question arises: how are computer resources such as memory and devices shared between processes? In short, the kernel arbitrates access to them with its own synchronization mechanisms (Linux, for example, uses semaphores among other things). But a single core is not serious, so let's go further.
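
To make the idea of quanta and priorities concrete, here is a minimal sketch of a round-robin picker. The task structure, the 10 ms base quantum and the function names are my own assumptions for illustration; no real kernel is this simple.

```c
#include <stddef.h>

/* Hypothetical task descriptor: only the fields this sketch needs. */
struct task {
    int priority;   /* higher value = longer time slice */
    int runnable;   /* non-zero if the task may be scheduled */
};

#define BASE_QUANTUM_MS 10  /* assumed length of one base time slice */

/* Time slice for a task: the base quantum scaled by its priority (at least 1x). */
int quantum_ms(const struct task *t)
{
    return BASE_QUANTUM_MS * (t->priority > 0 ? t->priority : 1);
}

/* Pick the next runnable task after `current`, wrapping around the array.
 * `current` itself is considered last; returns -1 if nothing is runnable. */
int next_task(const struct task *tasks, size_t count, size_t current)
{
    for (size_t step = 1; step <= count; ++step) {
        size_t candidate = (current + step) % count;
        if (tasks[candidate].runnable)
            return (int)candidate;
    }
    return -1;
}
```

The timer interrupt handler would call something like next_task() when the running task's quantum_ms() has elapsed.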

Interrupts and PICs

It may be news to some and not to others, but the i386 architecture uses interrupts to notify the OS or a program of a particular event. (I will talk about x86 only; I have not studied ARM and have never worked with it, even at the level of writing a service or a resident program. We will also discuss only hardware interrupts, IRQs.) For example, the PIT raises IRQ 0, which arrives as interrupt 0x8 in real mode (in protected and long mode it is usually 0x20, depending on how the PIC is configured; more on this later), and the PIT can be programmed to generate interrupts at almost any required frequency. This reduces the OS's work in distributing time slices to almost nothing: when the interrupt fires, the running program is suspended and control passes to, for example, the kernel, which saves the current program's state (registers, flags, etc.) and gives control to the next process.
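
As an illustration of choosing the PIT frequency: the chip divides its input clock of 1193182 Hz by a 16-bit reload value. The helper below is a sketch of that arithmetic only; the function name and the clamping policy are my own choices, and actually programming the PIT additionally requires port writes that are not shown.

```c
#include <stdint.h>

#define PIT_BASE_HZ 1193182u  /* input clock of the 8253/8254 PIT */

/* Reload value that makes PIT channel 0 fire roughly `hz` times per second.
 * The counter is 16 bits wide, so the result is clamped to 1..65535
 * (the hardware's special value 0, meaning 65536, is not used here). */
uint16_t pit_divisor(uint32_t hz)
{
    if (hz == 0)
        return 65535;
    uint32_t divisor = PIT_BASE_HZ / hz;
    if (divisor < 1)
        divisor = 1;
    if (divisor > 65535)
        divisor = 65535;
    return (uint16_t)divisor;
}
```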

As you have probably understood, interrupts are functions (or procedures) that can be called at any time by the hardware or by the program itself. The legacy PC platform supports 16 hardware interrupt lines on two PICs. The processor has a flags register, and one of the flags is IF, the interrupt flag. When this flag is 0, the processor does not respond to maskable hardware interrupts. There are also NMIs, Non-Maskable Interrupts, which are still delivered even if IF is 0. NMIs can be blocked by programming the platform hardware, and the processor itself holds off further NMIs while one is being handled, but after the handler returns with IRET they are allowed again. Note that a normal program cannot detect an interrupt: it simply stops running and resumes some time later without noticing anything (yes, you could check what caused the interrupt, but why?).

PIC - Programmable Interrupt Controller

From Wikipedia:

As a rule, it is an electronic device, sometimes built into the processor itself or into its support chips, whose inputs are electrically connected to the corresponding outputs of various devices. The input number of the interrupt controller is denoted "IRQ". This number should be distinguished from the interrupt priority and from the entry number in the interrupt vector table (INT). For example, on an IBM PC in real mode (the mode MS-DOS runs in), the interrupt from the standard keyboard uses IRQ 1 and INT 9.

The original IBM PC platform uses a very simple interrupt scheme. The interrupt controller is a simple counter that either cycles through the signals of the different devices in turn or is reset to the beginning when a new interrupt is found. In the first case the devices have equal priority; in the second, devices with a smaller (or larger, when counting down) sequence number have higher priority.

As you can see, this is an electronic circuit that lets devices send interrupt requests; there are usually exactly two of them.

Now let's move on to the topic of the article itself.

SMP

To implement this standard, motherboards gained new circuits: APIC and ACPI. Let's talk about the first.

APIC stands for Advanced Programmable Interrupt Controller, an improved version of the PIC. It is used in multiprocessor systems and is an integral part of all recent Intel (and compatible) processors. The APIC is used for complex interrupt redirection and for sending interrupts between processors. Neither was possible with the older PIC specification.

Local APIC and IO APIC

In an APIC-based system, each processor consists of a "core" and a "local APIC". The local APIC handles the processor-specific interrupt configuration. Among other things, it contains the Local Vector Table (LVT), which translates events such as the "internal clock" and other "local" interrupt sources into an interrupt vector (for example, the LINT1 pin can raise an NMI exception if "2" is stored in the corresponding LVT entry).

More information about local APIC can be found in the “System Programming Guide” for modern Intel processors.

In addition, there is an I/O APIC (for example, the Intel 82093AA), which is part of the chipset and provides multiprocessor interrupt management, including both static and dynamic symmetric distribution of interrupts across all processors. In systems with multiple I/O subsystems, each subsystem can have its own set of interrupts.

Each interrupt pin is individually programmable as either edge or level triggered. The interrupt vector and interrupt steering information can be specified for each interrupt. An indirect register access scheme minimizes the memory space needed to reach the I/O APIC's internal registers. To make the system more flexible in assigning memory, the I/O APIC's two-register window is relocatable, but by default it is at 0xFEC00000.
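
The indirect access scheme can be sketched as follows: a register index is written to the select window (IOREGSEL, offset 0x00), and the data is then read or written through the data window (IOWIN, offset 0x10). The function names are mine, and the sketch assumes that `base` already points to a mapping of the I/O APIC's registers.

```c
#include <stdint.h>

#define IOAPIC_IOREGSEL 0x00  /* byte offset of the register-select window */
#define IOAPIC_IOWIN    0x10  /* byte offset of the data window */

/* Read internal register `reg` through the two-register window. */
uint32_t ioapic_read(volatile uint32_t *base, uint8_t reg)
{
    base[IOAPIC_IOREGSEL / 4] = reg;   /* select the register */
    return base[IOAPIC_IOWIN / 4];     /* fetch its contents */
}

/* Write `value` to internal register `reg` through the same window. */
void ioapic_write(volatile uint32_t *base, uint8_t reg, uint32_t value)
{
    base[IOAPIC_IOREGSEL / 4] = reg;
    base[IOAPIC_IOWIN / 4] = value;
}
```

With the default mapping, `base` would correspond to physical address 0xFEC00000.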

Initialization of the “local” APIC

The local APIC is enabled at boot time and can be disabled by clearing bit 11 of the IA32_APIC_BASE MSR (this only works on processors with family > 5, since the Pentium does not have this MSR). The processor then receives its interrupts directly from an 8259-compatible PIC. However, Intel's Software Developer's Manual states that once you have disabled the local APIC through IA32_APIC_BASE, you cannot enable it again until a complete reset. The I/O APIC can also be configured to work in legacy mode so that it emulates an 8259 device.
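
Here is a small sketch of how the IA32_APIC_BASE value can be decoded once it has been read with RDMSR (MSR number 0x1B). The helper names are my own, and the address mask assumes 36 physical address bits; processors with a wider physical address space use more bits.

```c
#include <stdint.h>
#include <stdbool.h>

#define IA32_APIC_BASE_MSR   0x1Bu
#define APIC_BASE_BSP        (1ull << 8)     /* set on the bootstrap processor */
#define APIC_BASE_GLOBAL_EN  (1ull << 11)    /* APIC global enable */
#define APIC_BASE_ADDR_MASK  0xFFFFFF000ull  /* bits 12..35 (assumed 36-bit physical addresses) */

bool apic_globally_enabled(uint64_t msr)
{
    return (msr & APIC_BASE_GLOBAL_EN) != 0;
}

bool apic_is_bsp(uint64_t msr)
{
    return (msr & APIC_BASE_BSP) != 0;
}

/* Physical base address of the local APIC register page. */
uint64_t apic_base_address(uint64_t msr)
{
    return msr & APIC_BASE_ADDR_MASK;
}

/* Value to write back with WRMSR in order to disable the local APIC. */
uint64_t apic_base_with_enable_cleared(uint64_t msr)
{
    return msr & ~APIC_BASE_GLOBAL_EN;
}
```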

The local APIC registers are mapped to the physical page 0xFEE00xxx (see table 8-1 of the Intel P4 SPG). This address is the same for every local APIC in the configuration, which means that you can directly access only the registers of the local APIC of the core your code is currently running on. Note that there is an MSR that defines the actual APIC base (available only on processors with family > 5). The MADT contains the local APIC base, and on 64-bit systems it may also contain a field with a 64-bit override of the base address, which you should use instead. You can leave the local APIC base where you found it or move it wherever you want. Note: I do not think you can move it above the 4 GB mark.

To enable the local APIC to receive interrupts, you must configure the Spurious Interrupt Vector Register. The correct value for this register is the vector number you want spurious interrupts mapped to in the lower 8 bits, with bit 8 set to 1 to actually enable the APIC (see the specification for details). You should choose a vector whose lower 4 bits are all set; the easiest choice is 0xFF. This matters on some older processors, where the lower 4 bits of this value are hard-wired to 1.
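
The register value described above can be composed like this. It is only a sketch: the helper names are mine, and the actual write to the register at offset 0xF0 of the local APIC page is left out.

```c
#include <stdint.h>

#define LAPIC_SVR_OFFSET  0xF0u   /* Spurious Interrupt Vector Register */
#define LAPIC_SVR_ENABLE  0x100u  /* bit 8: APIC software enable */

/* Value for the SVR: spurious vector in the low 8 bits plus the enable bit. */
uint32_t lapic_svr_value(uint8_t spurious_vector)
{
    return LAPIC_SVR_ENABLE | spurious_vector;
}

/* Non-zero if the vector also works on older CPUs,
 * i.e. its lower four bits are all set. */
int spurious_vector_is_portable(uint8_t vector)
{
    return (vector & 0x0F) == 0x0F;
}
```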

Disable the 8259 PIC properly. This is almost as important as setting up the APIC. You do it in two stages: masking all interrupts and remapping the IRQs. Masking all interrupts disables them in the PIC. Remapping is something you have probably already done when you used the PIC: you want interrupt requests to start at vector 32 instead of 0 to avoid conflicts with exceptions (in protected and long mode the first 32 vectors are exceptions). You should then avoid using these interrupt vectors for anything else. This is necessary because, even though you have masked all PIC interrupts, the PIC can still raise spurious interrupts, which your kernel would otherwise mistake for exceptions.
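
As a sketch of those two stages, the function below produces the usual sequence of port writes for the two cascaded 8259 chips: first the initialization words that remap the IRQs, then the mask bytes. To keep it independent of any particular port I/O routine it only fills an array; the structure and function names are assumptions of mine, and the caller is expected to replay the entries in order with its own outb().

```c
#include <stdint.h>
#include <stddef.h>

/* One pending write: which I/O port and which byte. */
struct port_write {
    uint16_t port;
    uint8_t value;
};

#define PIC1_CMD   0x20
#define PIC1_DATA  0x21
#define PIC2_CMD   0xA0
#define PIC2_DATA  0xA1
#define PIC_DISABLE_STEPS 10

/* Fill `out` with the writes that remap both PICs to the given vector
 * offsets and then mask every IRQ line. Returns the number of entries. */
size_t pic_disable_sequence(uint8_t master_offset, uint8_t slave_offset,
                            struct port_write out[PIC_DISABLE_STEPS])
{
    const struct port_write seq[PIC_DISABLE_STEPS] = {
        { PIC1_CMD,  0x11 },           /* ICW1: begin init, ICW4 follows */
        { PIC2_CMD,  0x11 },
        { PIC1_DATA, master_offset },  /* ICW2: vector base of the master */
        { PIC2_DATA, slave_offset },   /* ICW2: vector base of the slave */
        { PIC1_DATA, 0x04 },           /* ICW3: slave attached to IRQ2 */
        { PIC2_DATA, 0x02 },           /* ICW3: slave cascade identity */
        { PIC1_DATA, 0x01 },           /* ICW4: 8086 mode */
        { PIC2_DATA, 0x01 },
        { PIC1_DATA, 0xFF },           /* mask all master lines */
        { PIC2_DATA, 0xFF },           /* mask all slave lines */
    };
    for (size_t i = 0; i < PIC_DISABLE_STEPS; ++i)
        out[i] = seq[i];
    return PIC_DISABLE_STEPS;
}
```

Calling it with offsets 0x20 and 0x28 places the 16 IRQs at vectors 32 to 47, right after the exceptions.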
Let's go to SMP.

Symmetric multiprocessing: initialization

The startup sequence is different for different CPUs. The Intel Programmer's Guide (Section 7.5.4) contains an initialization protocol for Intel Xeon processors and does not cover older processors. For a generic "all processor types" algorithm, see Intel's Multiprocessor Specification.

For the 80486 (with an external 82489DX APIC), you must use an INIT IPI followed by an "INIT level de-assert" IPI without any SIPI. This means that you cannot tell the APs where to start executing your code (the vector part of the SIPI), and they always start executing BIOS code. In this case you set the BIOS CMOS reset value to "warm start with far jump" (i.e. set CMOS location 0x0F to 10) so that the BIOS performs a "jmp far [0:0x0469]", and then store the segment and offset of the AP entry point at 0x0469.

“INIT level de-assert” IPI is not supported on new processors (Pentium 4 and Intel Xeon), and AFAIK is completely ignored on these processors.

For newer processors (P6, Pentium 4), one SIPI is enough, but I'm not sure whether older Intel processors (Pentium) or processors from other manufacturers need a second SIPI. It is also possible that the second SIPI exists in case delivery of the first one fails (bus noise, etc.).

I usually send the first SIPI and then wait to see whether the AP increments the count of running processors. If it does not increment this counter within a few milliseconds, I send the second SIPI. This differs from Intel's general algorithm (which has a 200 microsecond delay between SIPIs), but finding a time source that can accurately measure 200 microseconds during early boot is not so easy. I have also found that on real hardware, if the delay between SIPIs is too long (and you are not using my method), an AP can run the OS's early AP startup code twice (which in my case would make the OS think there are twice as many processors as there really are).

You can broadcast these signals over the bus to start every processor present. However, this would also turn on processors that were deliberately disabled (because they were "defective").

Finding information using the MP table

Some information for multiprocessing (which may not be available on newer machines) is stored in the MP tables. First you need to find the MP floating pointer structure. It is aligned on a 16-byte boundary and begins with the signature "_MP_", or 0x5F504D5F. The OS should search in the EBDA, in the BIOS ROM space and in the last kilobyte of "base memory"; the size of base memory is given by the 2-byte value at 0x413, in kilobytes, minus 1 KB. Here is the structure:

    struct mp_floating_pointer_structure {
        char signature[4];
        uint32_t configuration_table;
        uint8_t length;                 // In units of 16 bytes (e.g. 1 = 16 bytes, 2 = 32 bytes)
        uint8_t mp_specification_revision;
        uint8_t checksum;               // This value should make all bytes in the table equal 0 when added together
        uint8_t default_configuration;  // If this is not zero then configuration_table should be
                                        // ignored and a default configuration should be loaded instead
        uint32_t features;              // If bit 7 is set then the IMCR is present and PIC mode is being used,
                                        // otherwise virtual wire mode is; all other bits are reserved
    };
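
The search itself can be sketched like this. The function scans a memory region that the caller has already mapped (for example the EBDA) in 16-byte steps, compares the signature and verifies the checksum. The names are mine, the region is assumed to start on a 16-byte boundary, and the offsets 8 and 10 follow from the field order of the structure above.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sum of `len` bytes modulo 256; a valid table sums to 0. */
uint8_t byte_sum(const uint8_t *p, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; ++i)
        sum = (uint8_t)(sum + p[i]);
    return sum;
}

/* Scan `region` (assumed 16-byte aligned) for a valid MP floating pointer.
 * Returns its byte offset inside the region, or -1 if none is found. */
long find_mp_floating_pointer(const uint8_t *region, size_t size)
{
    for (size_t off = 0; off + 16 <= size; off += 16) {
        const uint8_t *p = region + off;
        if (memcmp(p, "_MP_", 4) != 0)
            continue;
        size_t length = (size_t)p[8] * 16;   /* `length` field, in 16-byte units */
        if (length == 0 || off + length > size)
            continue;
        if (byte_sum(p, length) == 0)        /* includes the checksum byte at offset 10 */
            return (long)off;
    }
    return -1;
}
```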

Here is the configuration table that the floating pointer structure points to:

    struct mp_configuration_table {
        char signature[4];              // "PCMP"
        uint16_t length;
        uint8_t mp_specification_revision;
        uint8_t checksum;               // Again, this byte should make all bytes in the table add up to 0
        char oem_id[8];
        char product_id[12];
        uint32_t oem_table;
        uint16_t oem_table_size;
        uint16_t entry_count;           // This value represents how many entries are following this table
        uint32_t lapic_address;         // This is the memory mapped address of the local APICs
        uint16_t extended_table_length;
        uint8_t extended_table_checksum;
        uint8_t reserved;
    };

The configuration table is followed by entry_count entries, which contain more information about the system, and then by an extended table. Entries are either 20 bytes long, describing a processor, or 8 bytes long for everything else. Here is what the processor and I/O APIC entries look like.

    struct entry_processor {
        uint8_t type;                   // Always 0
        uint8_t local_apic_id;
        uint8_t local_apic_version;
        uint8_t flags;                  // If bit 0 is clear then the processor must be ignored
                                        // If bit 1 is set then the processor is the bootstrap processor
        uint32_t signature;
        uint32_t feature_flags;
        uint32_t reserved[2];           // Two 32-bit words rather than one uint64_t, so that the
                                        // compiler adds no padding and the entry stays 20 bytes long
    };

Here is the IO APIC entry.

    struct entry_io_apic {
        uint8_t type;                   // Always 2
        uint8_t id;
        uint8_t version;
        uint8_t flags;                  // If bit 0 is clear then the entry should be ignored
        uint32_t address;               // The memory mapped address of the IO APIC
    };
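
Because the entries have two different sizes, they have to be walked byte by byte. Below is a minimal sketch that counts the usable processors; the function name and the `bytes_available` guard are my own additions, while the sizes and the flag bit come from the description above.

```c
#include <stdint.h>
#include <stddef.h>

#define MP_ENTRY_PROCESSOR       0
#define MP_PROCESSOR_ENTRY_SIZE  20
#define MP_OTHER_ENTRY_SIZE      8
#define MP_CPU_FLAG_ENABLED      0x01  /* bit 0 of the processor entry flags */

/* Walk `entry_count` entries that start at `entries` and count the
 * processor entries whose enabled bit is set. Stops early if an entry
 * would run past `bytes_available`. */
size_t mp_count_usable_cpus(const uint8_t *entries, size_t entry_count,
                            size_t bytes_available)
{
    size_t usable = 0;
    size_t offset = 0;
    for (size_t i = 0; i < entry_count; ++i) {
        if (offset >= bytes_available)
            break;
        uint8_t type = entries[offset];
        size_t size = (type == MP_ENTRY_PROCESSOR) ? MP_PROCESSOR_ENTRY_SIZE
                                                   : MP_OTHER_ENTRY_SIZE;
        if (offset + size > bytes_available)
            break;
        /* The flags byte is the fourth byte of a processor entry. */
        if (type == MP_ENTRY_PROCESSOR && (entries[offset + 3] & MP_CPU_FLAG_ENABLED))
            ++usable;
        offset += size;
    }
    return usable;
}
```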

Finding information using ACPI

You can find the MADT (signature "APIC") among the ACPI tables. It lists the local APICs, whose number should match the number of cores in your processor. The details of this table are not covered here, but you can find them on the Internet.
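
For orientation, here is a sketch of how the local APIC IDs can be pulled out of a MADT that is already mapped into memory. It relies on the layout from the ACPI specification: a 36-byte table header with the total length at offset 4, then 8 bytes of MADT fields, then variable-length entries, where type 0 is a processor local APIC. The function names are mine, and only 8-bit APIC IDs are handled.

```c
#include <stdint.h>
#include <stddef.h>

#define MADT_ENTRIES_OFFSET   44    /* 36-byte header + LAPIC address + flags */
#define MADT_TYPE_LOCAL_APIC  0
#define MADT_LAPIC_ENABLED    0x1u  /* bit 0 of the entry's flags */

/* Read a little-endian 32-bit value without alignment assumptions. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Collect the APIC IDs of all enabled processors listed in the MADT.
 * Stores at most `max_ids` IDs and returns how many were stored. */
size_t madt_collect_apic_ids(const uint8_t *madt, uint8_t *ids, size_t max_ids)
{
    uint32_t total = read_le32(madt + 4);  /* Length field of the table header */
    size_t found = 0;
    size_t off = MADT_ENTRIES_OFFSET;
    while (off + 2 <= total) {
        uint8_t type = madt[off];
        uint8_t len = madt[off + 1];
        if (len < 2 || off + len > total)
            break;                          /* malformed entry, stop walking */
        if (type == MADT_TYPE_LOCAL_APIC && len >= 8 && found < max_ids &&
            (read_le32(madt + off + 4) & MADT_LAPIC_ENABLED))
            ids[found++] = madt[off + 3];   /* APIC ID field */
        off += len;
    }
    return found;
}
```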

Starting the APs

After you have gathered this information, you need to disable the PIC and prepare the I/O APIC. You also need to configure the BSP's local APIC. Then start the APs using SIPIs.

Startup code:

Note that the vector you specify in the SIPI determines the start address: vector 0x8 means address 0x8000, vector 0x9 means address 0x9000, and so on.

    // ------------------------------------------------------------------------------------------------
    static u32 LocalApicIn(uint reg)
    {
        return MmioRead32(*g_localApicAddr + reg);
    }

    // ------------------------------------------------------------------------------------------------
    static void LocalApicOut(uint reg, u32 data)
    {
        MmioWrite32(*g_localApicAddr + reg, data);
    }

    // ------------------------------------------------------------------------------------------------
    void LocalApicInit()
    {
        // Clear task priority to enable all interrupts
        LocalApicOut(LAPIC_TPR, 0);

        // Logical Destination Mode
        LocalApicOut(LAPIC_DFR, 0xffffffff);   // Flat mode
        LocalApicOut(LAPIC_LDR, 0x01000000);   // All cpus use logical id 1

        // Configure Spurious Interrupt Vector Register
        LocalApicOut(LAPIC_SVR, 0x100 | 0xff);
    }

    // ------------------------------------------------------------------------------------------------
    uint LocalApicGetId()
    {
        return LocalApicIn(LAPIC_ID) >> 24;
    }

    // ------------------------------------------------------------------------------------------------
    void LocalApicSendInit(uint apic_id)
    {
        LocalApicOut(LAPIC_ICRHI, apic_id << ICR_DESTINATION_SHIFT);
        LocalApicOut(LAPIC_ICRLO, ICR_INIT | ICR_PHYSICAL
            | ICR_ASSERT | ICR_EDGE | ICR_NO_SHORTHAND);

        while (LocalApicIn(LAPIC_ICRLO) & ICR_SEND_PENDING)
            ;
    }

    // ------------------------------------------------------------------------------------------------
    void LocalApicSendStartup(uint apic_id, uint vector)
    {
        LocalApicOut(LAPIC_ICRHI, apic_id << ICR_DESTINATION_SHIFT);
        LocalApicOut(LAPIC_ICRLO, vector | ICR_STARTUP
            | ICR_PHYSICAL | ICR_ASSERT | ICR_EDGE | ICR_NO_SHORTHAND);

        while (LocalApicIn(LAPIC_ICRLO) & ICR_SEND_PENDING)
            ;
    }

    void SmpInit()
    {
        kprintf("Waking up all CPUs\n");

        *g_activeCpuCount = 1;
        uint localId = LocalApicGetId();

        // Send Init to all cpus except self
        for (uint i = 0; i < g_acpiCpuCount; ++i)
        {
            uint apicId = g_acpiCpuIds[i];
            if (apicId != localId)
            {
                LocalApicSendInit(apicId);
            }
        }

        // wait
        PitWait(200);

        // Send Startup to all cpus except self
        for (uint i = 0; i < g_acpiCpuCount; ++i)
        {
            uint apicId = g_acpiCpuIds[i];
            if (apicId != localId)
                LocalApicSendStartup(apicId, 0x8);
        }

        // Wait for all cpus to be active
        PitWait(10);
        while (*g_activeCpuCount != g_acpiCpuCount)
        {
            kprintf("Waiting... %d\n", *g_activeCpuCount);
            PitWait(10);
        }

        kprintf("All CPUs activated\n");
    }

    [org 0x8000]
    AP:
        jmp short bsp                       ; only the BSP takes this jump
        xor ax, ax
        mov ss, ax
        mov sp, 0x7c00
        xor ax, ax
        mov ds, ax
        ; Mark CPU as active
        lock inc byte [ds:g_activeCpuCount]
        jmp zop
    bsp:
        xor ax, ax
        mov ds, ax
        mov dword [ds:g_activeCpuCount], 0
        mov word [ds:0x8000], 0x9090        ; replace the JMP with two NOPs
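
The relationship between the SIPI vector and the start address used above (vector 0x8 for code at 0x8000) is simply a shift by 12 bits, because the vector is the number of a 4 KiB page below 1 MiB. Here is a small sketch of the conversion in both directions; the function names are my own.

```c
#include <stdint.h>

/* SIPI vector for a real-mode entry point, or -1 if the address cannot be
 * expressed: it must be 4 KiB aligned and lie below 1 MiB. */
int sipi_vector_for(uint32_t address)
{
    if (address & 0xFFFu)
        return -1;
    if (address >= 0x100000u)
        return -1;
    return (int)(address >> 12);
}

/* Physical address at which an AP starts executing for a given vector. */
uint32_t sipi_start_address(uint8_t vector)
{
    return (uint32_t)vector << 12;
}
```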

Now, as you can see, for the OS to use many cores it has to set up a stack for each core, configure each core and its interrupts, and so on. But the most important point is that in symmetric multiprocessing all cores share the same resources: one memory, one PCI bus, etc., and the OS only has to distribute tasks across the cores.

I hope the article was not too tedious and was reasonably informative. Next time, I think, we can talk about how people used to draw on the screen (and still do) without shaders and fancy video cards.
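
As one example of such per-core setup, each core can be given its own slot in a contiguous stack area. This is only a sketch under my own assumptions: the 16 KiB stack size and the function name are invented for illustration, and the trampoline shown earlier does not do this.

```c
#include <stdint.h>

#define AP_STACK_SIZE 0x4000u  /* assumed 16 KiB of stack per core */

/* Stacks grow downward on x86, so each core starts at the top of its slot. */
uintptr_t cpu_stack_top(uintptr_t stacks_base, unsigned cpu_index)
{
    return stacks_base + ((uintptr_t)cpu_index + 1) * AP_STACK_SIZE;
}
```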

Good luck!

Source: https://habr.com/ru/post/426497/
