This article describes bugs in the BIOS/UEFI firmware of laptops that I had to work with and for which our bootloaders had to be adapted. It focuses on bugs that are invisible to the user but can break a loader even when the loader itself does everything correctly. The bugs were found both in the interfaces of the respective execution environments and in the SMM code on Intel processors. The material is based on experience accumulated over a fairly long period, so by the time of writing the list of specific models had been lost. The list of manufacturers whose laptops had problems did survive. The bugs are described in order, from the simplest to the most complex, each together with its workaround.
Before we start
To make clear under what circumstances these problems came up, I will briefly describe the work involved. There is a product that encrypts the system disk. During PC startup, the drives therefore have to be decrypted so that the OS can start, and a loader was developed to do this. After installing all of its hooks, this loader transfers control to the original OS loader. From here on, the term "loader" refers to our loader, and the term "OS loader" refers to the boot loader that we replace.
Bootloader startup issues (Lenovo, UEFI)
UEFI implements global variables. Among them are variables that each describe one boot option (a load option entry), and a global variable BootOrder that describes the order in which these options are tried. The bootloader was written to the EFI system partition, a new load option entry was created for it, and that entry was placed first in BootOrder. Nevertheless, when the PC started, the Windows bootloader was called. It turned out that the firmware completely ignored the value of BootOrder and always loaded Windows if it found a Windows entry.
The problem was worked around by replacing the Windows bootloader file itself. This, of course, adds work, because the replacement file must now be protected from within the operating system.
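For reference, the BootOrder manipulation that the affected firmware ignored can be sketched as follows. BootOrder is an array of 16-bit load option numbers; the function name `boot_order_put_first` and the in-memory array are assumptions made for illustration. In a real loader the array would be read and written through the UEFI variable services, which are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: moves (or inserts) `option` to the front of a
 * BootOrder-style array. `order` holds `count` valid entries and has room
 * for `capacity` entries. Returns the new number of valid entries, or 0 if
 * the option is absent and there is no room to insert it. */
size_t boot_order_put_first(uint16_t *order, size_t count, size_t capacity,
                            uint16_t option)
{
    size_t pos = count;

    assert(order != NULL && count <= capacity);

    /* Look for an existing occurrence of the option. */
    for (size_t i = 0; i < count; i++) {
        if (order[i] == option) {
            pos = i;
            break;
        }
    }
    if (pos == count) {            /* not present: the array grows by one */
        if (count == capacity)
            return 0;
        count++;
    }
    /* Shift entries [0, pos) one slot to the right, then put option first. */
    memmove(&order[1], &order[0], pos * sizeof(order[0]));
    order[0] = option;
    return count;
}
```

On the Lenovo firmware described above, even a correctly updated BootOrder had no effect, which is why the file replacement was needed.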
Problems sending USB commands (HP, UEFI)
The bootloader works with USB devices, namely CCID readers, through the protocol provided for this purpose, EFI_USB_IO_PROTOCOL. The problem was that the running bootloader did not detect a single USB device, although the same bootloader detected them on other PCs. At first glance it looked as if the USB drivers did not work at all, but the laptop booted successfully from a flash drive, which ruled that out. It then turned out that the problem occurred when sending commands over the control pipe (control transfer) with the UsbControlTransfer function of EFI_USB_IO_PROTOCOL. The function prototype is shown below.
    typedef EFI_STATUS (EFIAPI *EFI_USB_IO_CONTROL_TRANSFER) (
        IN EFI_USB_IO_PROTOCOL      *This,
        IN EFI_USB_DEVICE_REQUEST   *Request,
        IN EFI_USB_DATA_DIRECTION   Direction,
        IN UINT32                   Timeout,
        IN OUT VOID                 *Data OPTIONAL,
        IN UINTN                    DataLength OPTIONAL,
        OUT UINT32                  *Status
    );
The function always returned the error EFI_USB_ERR_TIMEOUT. It turned out that the firmware developers had implemented the EFI_USB_DATA_DIRECTION type in a way that does not match the UEFI specification. The definition of the type from the specification is given below.
    typedef enum {
        EfiUsbDataIn,
        EfiUsbDataOut,
        EfiUsbNoData
    } EFI_USB_DATA_DIRECTION;
The error in the implementation was that on the affected laptop EfiUsbDataIn and EfiUsbDataOut were swapped. When the loader called UsbControlTransfer with the third parameter equal to EfiUsbDataOut, the driver did not write to the device but read from it, and vice versa. Since the first transfer the loader issues uses EfiUsbDataOut, the USB driver tried to read data that the device had no reason to send in response to that request, and the function completed with a timeout.
The solution to the problem is extremely ugly. On startup, the loader checked whether the FirmwareVendor field of the EFI_SYSTEM_TABLE structure contained the string "HPQ", and if so, whether the FirmwareRevision field contained the value 0x10000001. If both conditions were met, the loader swapped EfiUsbDataIn and EfiUsbDataOut whenever it called the corresponding functions.
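The check and the swap can be sketched roughly like this. The helper names `firmware_swaps_usb_direction` and `fix_direction` are hypothetical, and the sketch assumes that the vendor string is the UTF-16 FirmwareVendor string of EFI_SYSTEM_TABLE and that the revision is its FirmwareRevision field. The enum is redeclared locally so that the fragment compiles without UEFI headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same enumerator order as in the UEFI specification. */
typedef enum { EfiUsbDataIn, EfiUsbDataOut, EfiUsbNoData } EFI_USB_DATA_DIRECTION;

/* Returns true if the NUL-terminated UTF-16 string contains "HPQ".
 * The && chain stops at the terminator, so it never reads past it. */
static bool contains_hpq(const uint16_t *s)
{
    for (; *s != 0; s++) {
        if (s[0] == 'H' && s[1] == 'P' && s[2] == 'Q')
            return true;
    }
    return false;
}

/* Hypothetical detection of the affected firmware: vendor string contains
 * "HPQ" and the firmware revision equals 0x10000001. */
bool firmware_swaps_usb_direction(const uint16_t *vendor, uint32_t revision)
{
    return vendor != NULL && contains_hpq(vendor) && revision == 0x10000001u;
}

/* Translates the direction the loader means into the value that the
 * firmware actually expects. */
EFI_USB_DATA_DIRECTION fix_direction(EFI_USB_DATA_DIRECTION dir, bool swapped)
{
    assert(dir == EfiUsbDataIn || dir == EfiUsbDataOut || dir == EfiUsbNoData);
    if (!swapped)
        return dir;
    if (dir == EfiUsbDataIn)
        return EfiUsbDataOut;
    if (dir == EfiUsbDataOut)
        return EfiUsbDataIn;
    return dir;                    /* EfiUsbNoData is unaffected */
}
```

The detection result would be computed once at startup, and every call to UsbControlTransfer would pass its direction through `fix_direction`.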
Problems getting USB responses (Fujitsu LifeBook E743, UEFI)
Outwardly, the problem was that not all CCID devices worked in the bootloader. Older families worked flawlessly; newer ones did not. It turned out that the problem occurred when calling the UsbBulkTransfer function of EFI_USB_IO_PROTOCOL: the function always returned EFI_DEVICE_ERROR.
A USB host controller exchanges data with devices in packets of a fixed maximum size. USB also allows a device to return a short packet, in which case the host controller reports the transfer completion status as "Short Packet" rather than "Success". The firmware's USB driver interpreted this status as an error, i.e. UsbBulkTransfer always returned EFI_DEVICE_ERROR whenever the device responded with a short packet.
As it turned out, the older CCID families always answered with full-size packets, while the newer ones answered with short packets. The problem was overcome by analyzing the output buffer. The structure below shows the format of the RDR_to_PC_DataBlock CCID message. The device returns this message in response to commands such as PC_to_RDR_IccPowerOn, PC_to_RDR_Secure, and PC_to_RDR_XfrBlock.
    #pragma pack( push, 1 )
    struct RDR_to_PC_DataBlock {
        UINT8  bMessageType;     /* always 0x80 for this message */
        UINT32 dwLength;         /* size of abData in bytes */
        UINT8  bSlot;
        UINT8  bSeq;
        UINT8  bStatus;
        UINT8  bError;
        UINT8  bChainParameter;
        UINT8  abData[0];        /* payload bytes follow the header */
    };
    #pragma pack( pop )
The bMessageType field identifies the message type, and for RDR_to_PC_DataBlock it is always 0x80. Before receiving a response from the device, the loader therefore zeroed this field in the buffer. If UsbBulkTransfer returned an error, the loader checked the field, and if it was equal to 0x80, it assumed that the device had in fact responded correctly. In that case the dwLength field was used to calculate the size of the response, and this size was returned to the original caller.
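A minimal sketch of this recovery logic, working on a raw byte buffer, might look as follows. The function names are hypothetical. The 10-byte header size and the little-endian dwLength follow from the structure above; the check that the claimed length fits into the buffer is my own addition and not something the original loader is known to have done.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RDR_TO_PC_DATABLOCK 0x80u
#define CCID_HEADER_SIZE    10u    /* 1 + 4 + 1 + 1 + 1 + 1 + 1 bytes */

/* Call before the bulk IN transfer: zero bMessageType so that a stale
 * 0x80 from a previous response cannot be mistaken for a fresh one. */
void ccid_prepare_response(uint8_t *buf, size_t buf_size)
{
    assert(buf != NULL && buf_size >= CCID_HEADER_SIZE);
    buf[0] = 0;
}

/* Call after the transfer reported an error. Returns true and stores the
 * total response size if the buffer nevertheless holds a plausible
 * RDR_to_PC_DataBlock message. */
bool ccid_recover_response(const uint8_t *buf, size_t buf_size,
                           size_t *resp_size)
{
    uint32_t payload;

    if (buf == NULL || resp_size == NULL || buf_size < CCID_HEADER_SIZE)
        return false;
    if (buf[0] != RDR_TO_PC_DATABLOCK)
        return false;              /* nothing was received */

    /* dwLength is stored little-endian at offset 1. */
    payload = (uint32_t)buf[1] | ((uint32_t)buf[2] << 8) |
              ((uint32_t)buf[3] << 16) | ((uint32_t)buf[4] << 24);

    /* Assumption: reject lengths that do not fit into the buffer. */
    if (payload > buf_size - CCID_HEADER_SIZE)
        return false;

    *resp_size = CCID_HEADER_SIZE + payload;
    return true;
}
```

The caller would invoke `ccid_prepare_response`, run the bulk transfer, and fall back to `ccid_recover_response` only when the transfer reports an error.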
Problems working with the memory map (Toshiba Satellite U200, BIOS)
Outwardly, the problem was that the bootloader refused to work because it could not find a memory region to place itself in. The analysis revealed problems while scanning the memory map: part of the ranges was skipped and never analyzed.
The function in question is service 0xe820 of interrupt int 15h. Since the loader leaves part of its code resident, it had to allocate memory and place that code there. This in turn required modifying the memory map so that the operating system would not use the allocated area when it started. Accordingly, during startup the entire map was read, modified as needed, and substituted by hooking int 15h.
Below are the input and output parameters of the function that returns the memory map.
- Input parameters:
- EAX - function code, always 0xe820;
- EBX - continuation value; 0 on the first call, and on each subsequent call the value returned by the previous call. It tells the function which record to continue from;
- ES:DI - pointer to the buffer that receives a record describing one memory range;
- ECX - buffer size; must be at least 20, because the first revisions of this function returned 20-byte records. On modern systems the record size is 24 bytes;
- EDX - signature, always 'SMAP'. Used to verify the caller.
- Output parameters:
- CF - error flag; 0 means no error;
- EAX - signature, always 'SMAP'. Used to verify the BIOS;
- ES:DI - buffer pointer, same as on input;
- ECX - size of the record returned by the function;
- EBX - value to pass to the function to get the next record. No assumptions should be made about the value itself: it may be an offset, an index, or anything else in the function's internal representation.
Using this function, the loader reads the entire memory map in a loop. The bootloader was designed to be forward compatible with future BIOS versions, i.e. on input the ECX register contained 64. According to the description of the function, on output ECX holds the size of the record that was written to the buffer. Since the maximum record size is currently 24, the returned value could not be larger than that. The function must also always return exactly one record.
On one particular laptop, however, the function interpreted the ECX value differently: not as the record size that the caller supports, but as the total amount of data the function may return in one call. As a result, each call returned not one record but two, and the loader always ignored one of them. This is why the loader could not find a region in which to place its resident code.
The problem was solved by passing the value 24 in ECX, so the idea of forward compatibility had to be abandoned. There were thoughts about detecting the record size dynamically, but given how inconsistent different BIOS versions are, there is a risk that such an algorithm would not work reliably either.
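The read loop with the fixed 24-byte buffer can be sketched as below. Real code issues int 15h in real mode; here the call is abstracted behind a hypothetical callback `e820_query_fn`, and `quirky_bios_query` is a simulated firmware that reproduces the behaviour described above, so that the fragment can run as ordinary C. All names and the sample map are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define E820_RECORD_SIZE 24u   /* 20 bytes in the first revisions of the call */
#define FAKE_MAP_COUNT   3u

typedef struct {
    uint64_t base;
    uint64_t length;
    uint32_t type;             /* 1 = usable RAM, 2 = reserved */
    uint32_t ext_attr;         /* present only in 24-byte records */
} e820_record;

/* Hypothetical stand-in for INT 15h, EAX=0xE820. `cont` plays the role of
 * EBX and `buf_size` of ECX on input; the return value is ECX on output,
 * with 0 standing for an error (CF set). */
typedef uint32_t (*e820_query_fn)(uint32_t *cont, void *buf, uint32_t buf_size);

/* Reads the whole map, always asking for exactly one 24-byte record. */
size_t e820_read_map(e820_query_fn query, e820_record *map, size_t max_records)
{
    uint32_t cont = 0;
    size_t count = 0;

    assert(query != NULL && map != NULL);
    while (count < max_records) {
        e820_record rec;
        uint32_t got;

        memset(&rec, 0, sizeof rec);
        got = query(&cont, &rec, E820_RECORD_SIZE);
        if (got < 20u || got > E820_RECORD_SIZE)
            break;             /* error, or not a single record */
        map[count++] = rec;
        if (cont == 0)
            break;             /* continuation 0: that was the last record */
    }
    return count;
}

/* Simulated firmware with the quirk: it packs as many records as fit into
 * buf_size instead of returning exactly one. */
static const e820_record fake_map[FAKE_MAP_COUNT] = {
    { 0x00000000u, 0x0009FC00u, 1u, 1u },
    { 0x0009FC00u, 0x00000400u, 2u, 1u },
    { 0x00100000u, 0x7FF00000u, 1u, 1u },
};

uint32_t quirky_bios_query(uint32_t *cont, void *buf, uint32_t buf_size)
{
    uint32_t fit = buf_size / E820_RECORD_SIZE;
    uint32_t left, n;

    if (*cont >= FAKE_MAP_COUNT || fit == 0u)
        return 0;
    left = FAKE_MAP_COUNT - *cont;
    n = fit < left ? fit : left;
    memcpy(buf, &fake_map[*cont], n * E820_RECORD_SIZE);
    *cont = (*cont + n == FAKE_MAP_COUNT) ? 0u : *cont + n;
    return n * E820_RECORD_SIZE;
}
```

With a 24-byte buffer the simulated firmware returns one record per call and the loop sees all three ranges; with a 64-byte buffer the same firmware returns two records at once, which is exactly the case the original loader mishandled.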
Problems stopping the USB 3.0 controller and reinitializing the PIC (HP, BIOS)
Visually, the problem looked like this: after the user connected the smart card and entered the PIN, the screen went dark, a message was displayed that the OS was being loaded, and everything stopped at that message. The PC hung hard.
Since the BIOS bootloader is based on an RTOS, the user shell runs in protected mode, which, of course, required reinitializing the classic PIC. When control is transferred to the OS loader, the processor returns to real mode, and this in turn requires returning the PIC to its original state.
A preliminary analysis showed that the processor did return to real mode, and the PC hung afterwards. It then turned out that the problem arose only if the bootloader had initialized the USB host controllers. Before returning to real mode and before restoring the PIC, the USB host controllers were also stopped.
A USB 3.0 host controller may have a USBLEGSUP register, which is used to hand control of the controller from the BIOS to the OS and back. BIOS ownership is needed, for example, to emulate the classic keyboard I/O ports for compatibility with older software: an access to these ports raises an SMI, and the SMI handler does the rest. This matters because modern machines increasingly have only USB keyboards. The register format is described below.
- Capability ID (Bits 0-7) - the identifier of the capability. For this register, the field is 1
- Next Capability Pointer (Bits 8-15) - pointer to the next capability register
- HC BIOS Owned Semaphore (Bit 16) - if set, the BIOS controls the host controller
- Reserved (Bits 17-23)
- HC OS Owned Semaphore (Bit 24) - before using the host controller, the operating system must set this bit; in response, the BIOS clears bit 16, after which the host controller can be used
- Reserved (Bits 25-31)
When stopping the host controller, the RTOS also clears bit 24 of the USBLEGSUP register, thereby returning control of the controller to the BIOS. The RTOS then returns the PIC to its original state. The PIC no longer exists as separate hardware; it, too, is emulated by SMM code, so accesses to its registers while it was being restored raised SMIs. The analysis showed that the RTOS did not wait for bit 16 of USBLEGSUP to be set and started restoring the PIC immediately after clearing bit 24. As a result, the SMM code was still taking back control of the host controller, and the SMI caused by the PIC register access was not processed at all. Since PIC initialization is performed in several steps, the controller was left partially initialized, and interrupt delivery broke. Right after the processor returned to real mode, the first interrupt sent it to an invalid vector, and it started executing a meaningless stream of instructions.
The problem was overcome by waiting for bit 16 of the USBLEGSUP register to be set before returning the PIC to its initial state.
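The fixed hand-off can be sketched as follows. Register access is abstracted behind hypothetical read and write callbacks, and `sim_read`/`sim_write` simulate firmware that needs a few polls before it takes ownership, so the fragment can run without hardware. The poll limit is an assumption; a real driver would also insert a delay between polls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define USBLEGSUP_BIOS_OWNED (1u << 16)   /* HC BIOS Owned Semaphore */
#define USBLEGSUP_OS_OWNED   (1u << 24)   /* HC OS Owned Semaphore */

typedef uint32_t (*reg_read_fn)(void *ctx);
typedef void (*reg_write_fn)(void *ctx, uint32_t value);

/* Clears the OS Owned bit and waits until the BIOS sets its own semaphore.
 * Only after this returns true is it safe to start restoring the PIC. */
bool usblegsup_release_to_bios(reg_read_fn rd, reg_write_fn wr, void *ctx,
                               unsigned max_polls)
{
    assert(rd != NULL && wr != NULL);

    wr(ctx, rd(ctx) & ~USBLEGSUP_OS_OWNED);
    while (max_polls-- > 0) {
        if (rd(ctx) & USBLEGSUP_BIOS_OWNED)
            return true;
    }
    return false;                  /* the BIOS never took ownership */
}

/* Simulated register: once the OS bit is clear, the "SMM code" sets the
 * BIOS bit after `delay` further reads. */
typedef struct {
    uint32_t value;
    unsigned delay;
} sim_reg;

uint32_t sim_read(void *ctx)
{
    sim_reg *r = ctx;

    if (!(r->value & (USBLEGSUP_OS_OWNED | USBLEGSUP_BIOS_OWNED))) {
        if (r->delay == 0)
            r->value |= USBLEGSUP_BIOS_OWNED;
        else
            r->delay--;
    }
    return r->value;
}

void sim_write(void *ctx, uint32_t value)
{
    ((sim_reg *)ctx)->value = value;
}
```

The important design point is the ordering: the PIC is touched only after the function has returned true, so the SMM code is no longer busy with the ownership change when the PIC register accesses raise their SMIs.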
Interrupt delivery problems from the PIC (Dell Latitude E7240, BIOS)
Outwardly, the problem looked like this: when the bootloader started and displayed the prompt to connect a smart card, it hung hard. The problem arose only after a restart of the PC; after a cold power-on everything worked fine.
A preliminary analysis showed that the processor was hitting a page fault. Further study showed that the RTOS uses a separate stack for each interrupt, and that these stacks are very small (256 bytes). All of these stacks are adjacent in memory, as shown in the figure below.
We also found that the page fault occurred on the memory page immediately preceding the page with the interrupt stacks, so the subsequent analysis concentrated on the interrupt handling.
When the RTOS initializes a USB host controller, it also enables delivery of PIC interrupts from the line to which the controller is attached. On entry, the interrupt handler enables interrupts on the processor and then sequentially calls the handlers registered for that line. After all registered handlers have been called, the interrupt handler sends the End of Interrupt (EOI) command to the PIC.
The PIC has an In-Service Register (ISR), which records which interrupts are currently being serviced by the processor. While an interrupt is in service, a new request on the same line is not delivered until the processor sends the EOI command to the PIC, after which the PIC resumes delivering that interrupt.
Further analysis showed that while the registered handlers were being called, the PIC delivered the same interrupt again, even though the EOI command had not yet been sent. This is, of course, a bug in the PIC emulation. As a result, the stack of the corresponding interrupt overflowed first, then the stacks of other interrupts were corrupted, and finally an unmapped memory page was accessed. That caused a page fault, whose handler halts the loader.
The problem was worked around by masking the corresponding interrupt on the PIC before calling the registered handlers and unmasking it after they return.
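A sketch of the dispatch routine with this workaround, for a line on the master PIC, could look like this. Port I/O is abstracted behind a hypothetical `port_io` structure and simulated by `sim_inb`/`sim_outb` so that the fragment runs as ordinary C. The port numbers and the EOI value are those of the classic PIC; sending EOI before unmasking is my assumption about the order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PIC1_CMD  0x20u   /* master PIC command port */
#define PIC1_DATA 0x21u   /* master PIC data port (interrupt mask register) */
#define PIC_EOI   0x20u   /* non-specific End of Interrupt command */

typedef struct {
    uint8_t (*inb)(uint16_t port);
    void (*outb)(uint16_t port, uint8_t value);
} port_io;

typedef void (*irq_handler)(void);

/* Calls all handlers registered for `irq` with the line masked, so that a
 * faulty PIC emulation cannot re-deliver the interrupt before EOI. */
void dispatch_irq(const port_io *io, unsigned irq,
                  const irq_handler *handlers, size_t count)
{
    uint8_t bit;

    assert(io != NULL && irq < 8u);
    bit = (uint8_t)(1u << irq);

    io->outb(PIC1_DATA, (uint8_t)(io->inb(PIC1_DATA) | bit));    /* mask */
    /* A real handler would enable CPU interrupts here. */
    for (size_t i = 0; i < count; i++)
        handlers[i]();
    /* ...and disable them again here. */
    io->outb(PIC1_CMD, PIC_EOI);
    io->outb(PIC1_DATA, (uint8_t)(io->inb(PIC1_DATA) & ~bit));   /* unmask */
}

/* Simulated PIC ports and a sample handler that records the mask it saw. */
uint8_t sim_imr;
unsigned sim_eoi_count;
uint8_t imr_seen_by_handler;

uint8_t sim_inb(uint16_t port)
{
    return port == PIC1_DATA ? sim_imr : 0;
}

void sim_outb(uint16_t port, uint8_t value)
{
    if (port == PIC1_DATA)
        sim_imr = value;
    else if (port == PIC1_CMD && value == PIC_EOI)
        sim_eoi_count++;
}

void sample_handler(void)
{
    imr_seen_by_handler = sim_imr;
}
```

While the handlers run, the mask bit for the line is set, so even an emulation that ignores the in-service state has no way to deliver the same interrupt again.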
Conclusion
The list of bugs is far from complete; only the cases I could remember are described. Worst of all, no fundamental solution to the stability problem has been found yet, and stability was achieved only in particular cases. There are bugs that even an experienced developer would find hard to imagine and, worse still, would spend three days analyzing and fixing. Some cases are far from simple. Three days per problem is not that much, but with a dozen such problems the work is already well behind schedule.
This reality pushed me to reverse-engineer the Windows loader in order to understand which mechanisms it uses. For me, this means that I can safely use the same mechanisms. If the loader deviates from them, its operation cannot be guaranteed.
After a couple more problems with USB in UEFI, I came to the conclusion that I would put my own host controller drivers into the bootloader. To do this, the drivers that run in the UEFI firmware itself have to be stopped and replaced with my own. I have never liked adding so-called "crutches", and code full of them eventually becomes hard to develop because of the clutter.
Having my own drivers makes a lot of sense for another reason: there is a FastBoot mode that does not guarantee that the USB drivers are loaded. This is not a bug, but a criticism of UEFI itself as a standard, which provides no mechanism for loading drivers that were skipped.
To conclude, I would like to note the following: it seems that current BIOS/UEFI implementations are either developed without a full understanding of how these systems work, or are not tested properly. In my experience, both happen. It is considered enough that Windows and Linux run on the manufactured PC; everything else is written off as production cost. And whom the customer will blame, I think, needs no explanation.
In my experience, BIOS and UEFI are the most unstable execution environments. The MacBook EFI deserves a special mention as the hardest to work with. But that's another story.