MIT course "Computer Systems Security". Lecture 7: "Sandbox Native Client", part 3
Massachusetts Institute of Technology. Lecture course # 6.858. "Security of computer systems". Nikolai Zeldovich, James Mykens. year 2014
Computer Systems Security is a course on the development and implementation of secure computer systems. Lectures cover threat models, attacks that compromise security, and security methods based on the latest scientific work. Topics include operating system (OS) security, capabilities, information flow control, language security, network protocols, hardware protection and security in web applications.
Rule C4 has one nuance: you cannot jump past the end of the program. The last thing you can jump to is the last instruction. This rule ensures that when the program executes, control never runs off the end of the code.
Rule C5 says that no instruction may cross a 32-byte boundary. We saw a variant of this rule when we talked about jump targets being multiples of 32 bytes; otherwise you could jump into the middle of an instruction, and a system call could "hide" there.
Rule C6 states that all available instructions can be disassembled from the very beginning. Thus, it ensures that we see every instruction and can check all the instructions that are launched during the execution of the program.
Rule C7 states that all direct jumps are correct. For example, you may jump straight to a target that is not a multiple of 32, as long as it is a valid instruction reached by disassembling from left to right.
Audience: what is the difference between C5 and C3?
Professor: I think C5 says that a multi-byte instruction cannot cross a 32-byte boundary. Suppose I have a stream of instructions, with an address 32 and an address 64 in it. Then no instruction may straddle the boundary at 64: it must not begin at an address below 64 and end at an address above 64.
That is what rule C5 says. Otherwise, by jumping to an address that is a multiple of 32, you could land in the middle of another instruction, where anything might happen.
And rule C3 is the mirror image of this on the jump side: it states that whenever you jump, the target must be a multiple of 32.
C5 also ensures that anything at a multiple of 32 is the start of a safe instruction.
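A minimal sketch of these rules in Python may help; this is a hypothetical model where each "instruction" is just its length in bytes, not real x86 decoding:

```python
# Hypothetical simplified model of the validator rules above: each instruction
# is represented only by its length; real NaCl validation decodes actual x86.

BUNDLE = 32  # the 32-byte alignment granularity the rules are built around

def validate(instr_lengths, jump_targets):
    """Check simplified forms of rules C4/C5/C7 over a stream of lengths."""
    starts, off = set(), 0
    for ln in instr_lengths:
        # C5: no instruction may straddle a 32-byte boundary
        if off // BUNDLE != (off + ln - 1) // BUNDLE:
            return False
        starts.add(off)
        off += ln
    for t in jump_targets:
        # C7: direct jump targets must be instruction starts found by
        # left-to-right disassembly; C4: you cannot jump past the end
        if t not in starts:
            return False
    return True
```

Running `validate([4, 4, 24], [0, 8])` accepts a well-formed stream, while a 4-byte instruction starting at offset 30 is rejected because it straddles the boundary at 32.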
After reading this list of rules I was left with a mixed feeling, since I could not judge whether the rules are sufficient — that is, whether the list is minimal or complete. So let's think about it in the context of the homework you need to do. I believe the Native Client validator actually had a bug in handling some complicated instruction. I think it computed the wrong length for it, which could lead to something bad, but I can't remember exactly what the error was.
Suppose the sandbox validator computes the wrong length for some instruction. What bad things can happen in that case? How would you exploit this slip?
Audience: for example, you could hide a system call or a ret instruction there.
Professor: yes. Suppose there is some fancy variant of the AND instruction, as written here. It's possible the validator made a mistake and decided its length is 6 bytes when its actual length is 5 bytes.
What happens then? The validator treats the instruction as 6 bytes long and expects the next valid instruction right after it. But the processor, running the code, uses the actual length, 5 bytes. As a result there is a spare byte at the end of the AND instruction where we can hide something. If we put the byte 0xCD there, the processor will decode it as the start of another instruction — 0xCD begins the int opcode used for system calls — even though to the validator it looked like part of the AND. We then lay out the following bytes so that the validator sees only harmless instructions. After that we can make a system call and "escape" from the sandbox.
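To see the attack concretely, here is a toy simulation of a validator and a CPU decoding the same bytes with different ideas about one opcode's length. All byte values and lengths here are invented for illustration, not real x86 encodings:

```python
# Toy simulation of the length-mismatch bug: the "validator" believes opcode
# 0x83 is 6 bytes long, while the "CPU" decodes it as 5 bytes, so the CPU
# finds an extra instruction starting at the hidden 0xCD byte.

def instruction_starts(code, length_of):
    """Return instruction start offsets under a given opcode-length function."""
    starts, off = [], 0
    while off < len(code):
        starts.append(off)
        off += length_of(code[off])
    return starts

# "Fancy AND" opcode 0x83 followed by operand bytes; 0xCD 0x80 is smuggled in
# as its last operand bytes, followed by a 2-byte filler instruction.
code = bytes([0x83, 0x01, 0x02, 0x03, 0x04, 0xCD, 0x80, 0x90, 0x90])

validator_starts = instruction_starts(code, lambda op: 6 if op == 0x83 else 2)
cpu_starts       = instruction_starts(code, lambda op: 5 if op == 0x83 else 2)
# The CPU sees an instruction at offset 5 whose first byte is 0xCD (int);
# the validator never inspected that offset at all.
```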
Thus, the Native Client validator must stay in sync with the CPU: it must "guess" exactly how the processor will interpret every instruction. And this has to hold at every level of the sandbox, which is quite difficult to implement.
There are actually other interesting bugs in Native Client. One of them is incorrect cleanup of processor state during a jump into the Trusted Service Runtime. I think we'll talk about it in a second. The Trusted Service Runtime largely works with the same set of CPU registers that the untrusted module runs on. So if something is not cleared or reloaded, the runtime can be tricked into treating data from the untrusted module as trusted and doing things that were never part of the developers' intentions.
So where are we now? At this point we understand how to disassemble all instructions and how to prevent execution of prohibited ones. Now let's see how we confine memory references, for both code and data, within the boundaries of the Native Client module.
For performance reasons, the Native Client authors use hardware support to make sure that confining memory references does not cause much overhead. But before I look at the hardware support they use, I want to hear suggestions: how could we do the same without hardware support? Could we bound every memory access of the process in software, the way we bounded jumps earlier?
Audience: You can instrument the instructions to clear all the higher bits.
Professor: yes, that's right. In fact, we already have this AND instruction that clears the low bits every time we jump somewhere. If we also want to confine all code to the low 256 MB, we can simply replace the leading f with a 0, turning $0xffffffe0 into $0x0fffffe0. This clears the low bits and caps the address at 256 MB.
So it does exactly what you propose: it guarantees that whenever you jump, you stay within 256 MB. And the fact that we disassemble everything lets us verify that all direct jumps are in range as well.
The reason they do not do this in their code is that on x86 you can encode an AND whose upper bits are all 1 very efficiently: there is a 3-byte encoding of that AND and a 2-byte encoding of the jump, so the overhead is 3 bytes. But if you need a zero in the upper bits, as with this 0 instead of f, you suddenly get a 5-byte instruction. So I think they are worried about the extra cost here.
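As a sketch, here is the effect of that single mask, with the constant taken straight from the discussion above:

```python
# The $0x0fffffe0 mask applied in Python: one AND both aligns a jump target
# down to a 32-byte boundary and clips it into the low 256 MB.

MASK = 0x0FFFFFE0  # top nibble 0 caps addresses below 2**28; low 5 bits cleared

def sandboxed_target(addr):
    """Mask an untrusted jump target the way the inserted AND would."""
    return addr & MASK

assert 256 * 1024 * 1024 == 2 ** 28  # the 256 MB limit in the top nibble
```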
Audience: isn't there a problem with instructions that add an offset to the address you are trying to mask? That is, an instruction could have a constant offset or something like that?
Professor: I think so. You would probably prohibit instructions that jump through some complicated address formula and only support instructions that jump directly through a register value, and that value always gets masked with AND first.
Audience: this is more needed for memory access than ...
Professor: yes, because that is just for code. For memory access on x86 there are many strange ways of addressing a particular memory location. You would typically have to compute the location first, then apply an extra AND, and only then do the access. I think that is the real reason they worry about the performance cost of this approach.
On x86, or at least on the 32-bit platform described in the paper, they use hardware support instead of instrumenting the untrusted module's code and data references.
Let's look at how this works before we figure out how NaCl uses it to sandbox the module. The hardware feature is called segmentation. It dates back to before the x86 platform acquired paging. On x86, for a running process, the hardware maintains a table; let's call it the segment descriptor table. It describes a bunch of segments, numbered from 0 up to the end of the table, whatever its size. It is somewhat like a file descriptor table in Unix, except that each entry consists of two values: a base and a length.
This table tells us that we have some set of segments, and whenever we refer to a particular segment, we are in effect talking about the chunk of memory that begins at its base address and extends for length bytes.
This helps us enforce memory boundaries on the x86 platform, because every instruction that refers to memory refers to a specific segment in this table.
For example, when we execute mov (%eax), (%ebx), that is, move a value from memory at the pointer stored in the EAX register to memory at the pointer stored in the EBX register, the hardware has to know which addresses are meant: where to load the value from and where to store it.
But in fact, on x86, every memory reference implicitly involves something called a segment selector, similar to a file descriptor in Unix. It is simply an index into the descriptor table, and unless otherwise specified, each opcode uses a default segment.
Therefore, when you execute mov (%eax), it goes through %ds, the data segment register, which is a special register in your processor. If I remember correctly, it is a 16-bit integer that indexes into this descriptor table.
And the same goes for (%ebx) — it uses the same %ds segment selector. In fact, x86 has a group of six segment selectors: CS, DS, ES, FS, GS and SS. CS, the code segment selector, is implicitly used for instruction fetches. So whatever your instruction pointer points to is interpreted relative to the CS selector.
Most data references implicitly use DS or ES; FS and GS are for special purposes, and SS is always used for stack operations. If you execute push and pop, they implicitly go through that selector. This is quite archaic machinery, but it turns out to be extremely useful in this particular case.
If you access some address through a selector, say %ds:addr, the hardware translates it into an access to addr + T[%ds].base, and it also checks addr against T[%ds].length from the same table entry. So whenever you access memory, the address you specify is offset by the base of the segment the selector names and bounds-checked against that segment's length.
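The translation just described can be sketched as follows; this uses a hypothetical two-field descriptor table, whereas real x86 descriptors carry more fields than base and length:

```python
# Sketch of segment translation: an access through selector s at offset addr
# becomes T[s].base + addr, after a bounds check against T[s].length.

class SegmentFault(Exception):
    pass

# Hypothetical descriptor table: selector index -> (base, length)
T = {1: (0x10000, 0x1000)}

def translate(selector, addr):
    base, length = T[selector]
    if addr >= length:          # hardware checks the offset against the length
        raise SegmentFault(hex(addr))
    return base + addr          # and adds the segment base
```

So `translate(1, 0x10)` yields the physical-style address 0x10010, while an offset at or beyond the 0x1000 length faults.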
Audience: so why not use it, for example, to protect the buffer?
Professor: yes, that's a good question! Could we use this to protect against buffer overflows? For example, for each buffer we have, we could put the buffer's base in one field of an entry and its size in the other.
Audience: but wouldn't you have to put an entry in the table just before you want to use the buffer? It doesn't need to be there all the time.
Professor: yes. I think the reason this approach is not widely used to protect against buffer overflows is that the number of entries in this table is limited. Selectors are 16 bits long, but a few of those bits are used for other things, so in fact you can only index 2^13 entries. So if your code has more than 2^13 buffers, the table overflows.
In addition, it would be awkward for the compiler to manage this table directly, because it is manipulated through system calls: you cannot write to it directly, you must first ask the operating system, which then installs the entry for you. So I think most compilers simply do not want to deal with such a heavyweight way of managing memory buffers.
By the way, Multics used this approach: it had 2^18 possible segments and 2^18 possible offsets within a segment, and each shared library or fragment of memory was a separate segment. They were all range-checked, although segments were not used at the granularity of individual variables.
Audience: presumably, the constant need to use the kernel will slow down the process.
Professor: yes, that's right. So there is overhead: every time you create a new buffer on the stack, you would need a system call to add an entry for it.
So how does Native Client actually use this segmentation machinery? You can guess how it works. By default, I think, all these segments on x86 have a base of 0 and a length of 2^32, so you can access the whole range of memory you want. For NaCl, they instead create an entry with base 0 and length 256 MB, and then point all six segment selector registers at that 256 MB entry. Thus, whenever the hardware performs a memory access, it bounds-checks it against the 256 MB limit, and the module's memory accesses are confined to that range.
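Under this setup, what the hardware check amounts to for the NaCl module can be sketched as follows (base 0 and a 256 MB limit; a simplification of real segment translation):

```python
# With base 0 and a 256 MB limit on every selector, segment translation
# degenerates into a simple bounds check: in-range addresses map to themselves.

NACL_LIMIT = 256 * 1024 * 1024  # 2**28

def nacl_access(addr):
    """Model a module memory access checked against the NaCl segment."""
    if addr >= NACL_LIMIT:      # the segment limit confines the module
        raise MemoryError(hex(addr))
    return 0 + addr             # base 0: the address is used unchanged
```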
I think you now see what this hardware provides and how NaCl ends up using these segment selectors. So what can go wrong if we just implement this plan? Can an untrusted module escape the segment? One thing to be careful about is that the segment selector registers are like ordinary registers: you can move values into and out of them. So you must ensure that the untrusted module does not tamper with them, because somewhere in the descriptor table there may well be an entry — the original segment descriptor for the process — with a base of 0 and a length of 2^32.
So if an untrusted module could change CS, or DS, or ES, or any of these selectors to point at that original descriptor covering the whole address space, it could reference memory through that segment and "jump out" of the sandbox.
Thus, Native Client had to add a few more instructions to the prohibited list. I think they forbid all instructions of the form mov %ds, mov %es, and so on. So once inside the sandbox, you cannot change which segment anything refers to. On x86, the instructions that modify the segment descriptor table are privileged, but loading %ds, %es, and the rest is completely unprivileged.
Audience: can you initialize the table so that all unused slots have zero length?
Professor: yes, you could size the table so that there are no unused slots. But it turns out you really do need that extra slot with base 0 and length 2^32, because the trusted runtime has to run in that segment and access the whole range of memory. So that entry is required for the trusted runtime.
Audience: what does it take to change this descriptor table?
Professor: you have to go through the kernel. Linux actually has a system call, modify_ldt(), for the local descriptor table, which lets any process change its own table — there is effectively one such table per process. On x86 it is a bit more complicated: there is both a global table and a local table, and a process can change its local table.
Now let's figure out how we jump into and out of the Native Client sandbox during execution. What does it mean to "jump out"?
So, we need to run this trusted code, and that trusted code "lives" somewhere above the 256 MB boundary. To jump there, we have to undo all the protections Native Client has installed. Basically, that boils down to changing these six selectors. I think the validator does not apply its rules to things located above the 256 MB boundary, so this part is quite simple.
But then we need to somehow jump into the trusted runtime and reset the segment selectors to the correct values for that giant segment covering the whole address space of the process, from 0 to 2^32. The mechanisms Native Client provides for this are called trampolines and springboards. The cool thing is that these trampolines and springboards are pieces of code that live in the bottom 64 KB of the process's address space. That means the untrusted module can jump there, because it is a valid code address within the 32-bit space and within 256 MB. So you can jump onto a trampoline.