Translation of David Albert's article "Understanding C by learning assembly". Last time, Allan O'Donnell talked about how to study C using GDB. Today I want to show how GDB can also help in understanding assembly.
Levels of abstraction are great tools for building things, but sometimes they get in the way of learning. The purpose of this post is to convince you that a solid understanding of C requires a good understanding of the assembly code the compiler generates. I will do this by disassembling and walking through a simple C program in GDB, and then we will use GDB and our new knowledge of assembly to study how static local variables work in C.
Author's note: All code in this article was compiled on an x86_64 processor under Mac OS X 10.8.1 using Clang 4.0 with optimization disabled (-O0).
Learning assembler with GDB
Let's start by disassembling a program with GDB and learning how to read the output. Type in the following program and save it as simple.c:
int main(void)
{
    int a = 5;
    int b = a + 6;
    return 0;
}
Now compile it in debug mode with optimization turned off and run GDB.
$ CFLAGS="-g -O0" make simple
cc -g -O0 simple.c -o simple
$ gdb simple
Put a breakpoint on the main function and continue execution until you reach the return statement. We put the number 2 after the next command to indicate that we want to execute it twice:
(gdb) break main
(gdb) run
(gdb) next 2
Now use the disassemble command to display the assembly instructions for the current function. You can also pass a function name to disassemble to examine a different function.
(gdb) disassemble
Dump of assembler code for function main:
0x0000000100000f50 <main+0>:    push   %rbp
0x0000000100000f51 <main+1>:    mov    %rsp,%rbp
0x0000000100000f54 <main+4>:    mov    $0x0,%eax
0x0000000100000f59 <main+9>:    movl   $0x0,-0x4(%rbp)
0x0000000100000f60 <main+16>:   movl   $0x5,-0x8(%rbp)
0x0000000100000f67 <main+23>:   mov    -0x8(%rbp),%ecx
0x0000000100000f6a <main+26>:   add    $0x6,%ecx
0x0000000100000f70 <main+32>:   mov    %ecx,-0xc(%rbp)
0x0000000100000f73 <main+35>:   pop    %rbp
0x0000000100000f74 <main+36>:   retq
End of assembler dump.
By default, the disassemble command prints instructions in AT&T syntax, the same syntax used by the GNU assembler. Instructions in AT&T syntax have the format mnemonic source, destination. The mnemonic is a human-readable name for the instruction. The source and destination are operands, which can be immediate values, registers, memory addresses, or labels. Immediate values are constants and are prefixed with $. For example, $0x5 is the number 5 in hexadecimal notation. Register names are prefixed with %.
Registers
It is worth spending some time on registers. Registers are data storage locations located directly on the CPU. With some exceptions, the size, or width, of a CPU's registers defines its architecture. So a 64-bit CPU has registers that are 64 bits wide, and the same holds for 32-bit CPUs, 16-bit CPUs, and so on. Registers are very fast to access, which is why they often hold the operands of arithmetic and logical operations.
The x86 family of processors has a number of special-purpose and general-purpose registers. General-purpose registers can be used for any operation, and the data stored in them has no special meaning to the processor. The processor relies on special-purpose registers for its own operation, however, and the data stored in them has a specific meaning that depends on the register. In our example, %eax and %ecx are general-purpose registers, while %rbp and %rsp are special-purpose registers. The %rbp register is the base pointer, which points to the base of the current stack frame, and %rsp is the stack pointer, which points to the top of the current stack frame. The value in %rbp is never lower than the value in %rsp, because the stack starts at a high memory address and grows towards lower addresses. If you are not familiar with the concept of a call stack, you can find a good explanation on Wikipedia.
A distinguishing feature of the x86 family is that it has remained backward compatible all the way back to the 16-bit 8086. As the x86 architecture moved from 16-bit to 32-bit and finally to 64-bit, the registers were widened and given new names so that code written for earlier processors kept working.
Take the general-purpose register AX, which is 16 bits wide. Its high byte can be accessed under the name AH and its low byte under the name AL. When the 32-bit 80386 appeared, the Extended AX, or EAX, referred to the 32-bit register, while AX continued to refer to the 16-bit register that forms the lower half of EAX. Similarly, when x86_64 appeared, the "R" prefix was used, and EAX became the lower half of the 64-bit RAX register. Below is a diagram based on a Wikipedia article that illustrates these relationships:
|__64__|__56__|__48__|__40__|__32__|__24__|__16__|__8___|
|__________________________RAX__________________________|
|xxxxxxxxxxxxxxxxxxxxxxxxxxx|____________EAX____________|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|_____AX______|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|__AH__|__AL__|
Back to code
This should be enough to start reading our disassembled program:
0x0000000100000f50 <main+0>:    push   %rbp
0x0000000100000f51 <main+1>:    mov    %rsp,%rbp
The first two instructions are called the function prologue or preamble. First we push the old base pointer onto the stack to save it for later. Then we copy the value of the stack pointer into the base pointer. After this, %rbp points to the base of main's stack frame.
0x0000000100000f54 <main+4>:    mov    $0x0,%eax
This instruction copies 0 into %eax. The x86 calling convention dictates that a function's return value is stored in %eax, so this instruction sets us up to return 0 at the end of our function.
0x0000000100000f59 <main+9>:    movl   $0x0,-0x4(%rbp)
Here we have something we have not seen before: -0x4(%rbp). The parentheses tell us that this is a memory address. Here %rbp is called the base register and -0x4 is the displacement, so the operand refers to the address %rbp + -0x4. Because the stack grows downward, subtracting 4 from the base of the stack frame moves us into the frame itself, where local variables are stored. This means that this instruction stores 0 at %rbp - 4. It took me some time to figure out what this line is for, and it seems that Clang allocates a hidden local variable for an implicit return value from main.
You may also notice that the mnemonic has the suffix l. This means that the operands are of type long (32 bits for integers). Other valid suffixes are b (byte), s (short), w (word), q (quad), and t (ten). If an instruction has no suffix, the operand size is inferred from the size of the source or destination register. For example, in the previous line %eax is 32 bits wide, so the mov instruction is in fact movl.
0x0000000100000f60 <main+16>:   movl   $0x5,-0x8(%rbp)
Now we get to the core of our sample program. This line of assembly corresponds to the first line of C in main: it stores the number 5 in the next available local variable slot (%rbp - 0x8), 4 bytes below our previous local variable. This is the location of the variable a. We can use GDB to check this:
(gdb) x &a
0x7fff5fbff768: 0x00000005
(gdb) x $rbp - 8
0x7fff5fbff768: 0x00000005
Note that the memory addresses are the same. You may also notice that GDB sets up variables for our registers and, like all GDB variables, their names are prefixed with $, whereas AT&T assembly uses the % prefix.
0x0000000100000f67 <main+23>:   mov    -0x8(%rbp),%ecx
0x0000000100000f6a <main+26>:   add    $0x6,%ecx
0x0000000100000f70 <main+32>:   mov    %ecx,-0xc(%rbp)
Next, we load a into %ecx, one of our general-purpose registers, add 6 to it, and store the result at %rbp - 0xc. This is the second line of C in main. You may have guessed that %rbp - 0xc corresponds to the variable b, which we can also check with GDB:
(gdb) x &b
0x7fff5fbff764: 0x0000000b
(gdb) x $rbp - 0xc
0x7fff5fbff764: 0x0000000b
The rest of main is just cleanup, called the function epilogue.
0x0000000100000f73 <main+35>:   pop    %rbp
0x0000000100000f74 <main+36>:   retq
We pop the old base pointer off the stack and store it back in %rbp, and then retq jumps back to our return address, which is also stored in the stack frame.
So far we have used GDB to disassemble a small C program, gone over how to read AT&T assembly syntax, and covered registers and memory address operands. We also used GDB to check where local variables are stored relative to %rbp. Now we will use this knowledge to explain how static local variables work.
Understanding static local variables
Static local variables are a very cool feature of C. In a nutshell, these are local variables that are initialized once and retain their value between calls to the function in which they were declared. A simple example of using static local variables is a Python-style generator. Here is one that generates all natural numbers up to INT_MAX.
#include <stdio.h>

int natural_generator()
{
    int a = 1;
    static int b = -1;
    b += 1;
    return a + b;
}

int main()
{
    printf("%d\n", natural_generator());
    printf("%d\n", natural_generator());
    printf("%d\n", natural_generator());
    return 0;
}
When compiled and run, this program prints the first three natural numbers:
$ CFLAGS="-g -O0" make static
cc -g -O0 static.c -o static
$ ./static
1
2
3
But how does this work? To find out, let's go into GDB and look at the assembly. I have removed the address information that GDB adds to the disassembly so that everything fits on the screen:
$ gdb static
(gdb) break natural_generator
(gdb) run
(gdb) disassemble
Dump of assembler code for function natural_generator:
push   %rbp
mov    %rsp,%rbp
movl   $0x1,-0x4(%rbp)
mov    0x177(%rip),%eax
The first thing we need to do is find out which instruction we are currently on. We can do this by examining the instruction pointer, also called the program counter. The instruction pointer is a register that stores the address of the next instruction. On x86_64 this register is called %rip. We can access it through the $rip variable or, alternatively, through the architecture-independent $pc:
(gdb) x/i $pc
0x100000e94 <natural_generator+4>:      movl   $0x1,-0x4(%rbp)
The instruction pointer always holds the address of the next instruction to execute, which means that the third instruction has not been executed yet but is about to be.
Because knowing the next instruction is so useful, we will make GDB show it every time the program stops. In GDB 7.0 and later, you can simply run set disassemble-next-line on, which shows all the instructions that make up the next line of source. But I am using Mac OS X, which ships with GDB 6.3, so I have to resort to the display command. display is like x, except that it evaluates its expression every time the program stops:
(gdb) display/i $pc
1: x/i $pc  0x100000e94 <natural_generator+4>:  movl   $0x1,-0x4(%rbp)
GDB is now set up to always show us the next instruction before showing its prompt.
We are already past the function prologue, which we covered earlier, so we start right at the third instruction. It corresponds to the first line of source, which assigns 1 to a. Instead of next, which moves to the next line of source, we will use nexti, which moves to the next assembly instruction. Afterwards we can examine %rbp - 0x4 and compare it with &a, as before, to test the hypothesis that a is stored there:
(gdb) nexti
7           b += 1;
1: x/i $pc  mov    0x177(%rip),%eax
And we see that the addresses are the same, as we expected. The next instruction is more interesting:
mov    0x177(%rip),%eax
Here we would expect to see the instructions for the line static int b = -1;, but this looks very different from anything we have seen so far. For one thing, there is no reference to the stack frame, where we would normally expect to find local variables. There is not even a -0x1! Instead, we have an instruction that loads something from address 0x100001018, located somewhere after the instruction pointer, into %eax. GDB gives us a helpful comment with the result of the memory operand calculation, which tells us that natural_generator.b is stored at this address. Let's execute this instruction and see what happens:
(gdb) nexti
(gdb) p $rax
$3 = 4294967295
(gdb) p/x $rax
$5 = 0xffffffff
Although the disassembly shows %eax as the destination, we print $rax, because GDB only sets up variables for full-width registers.
Here we have to remember that while variables have types that specify whether they are signed or unsigned, registers do not, so GDB prints the value of %rax as unsigned. Let's try again, casting the value of %rax to a signed int:
(gdb) p (int)$rax
$11 = -1
It looks like we have found b. We can double-check this with the x command:
(gdb) x/d 0x100001018
0x100001018 <natural_generator.b>:      -1
(gdb) x/d &b
0x100001018 <natural_generator.b>:      -1
So not only is b stored in a different part of memory, off the stack, it is also initialized to -1 before natural_generator is even called. In fact, even if you disassembled the entire program, you would not find any code that sets b to -1. This is because the value of b is hard-coded in a different section of the executable file, and it is loaded into memory along with all the machine code by the operating system's loader when the process is launched.
With that out of the way, things start to make more sense. After loading b into %eax, we move on to the next line of source, where we increment b. This corresponds to the next two instructions:
add    $0x1,%eax
mov    %eax,0x16c(%rip)
Here we add 1 to %eax and write the result back to memory. Let's execute these instructions and check the result:
(gdb) nexti 2
(gdb) x/d &b
0x100001018 <natural_generator.b>:      0
(gdb) p (int)$rax
$15 = 0
The next two instructions are responsible for returning a + b:
mov    -0x4(%rbp),%eax
add    0x163(%rip),%eax
Here we load a into %eax and then add b. At this point we expect %eax to hold 1. Let's check:
(gdb) nexti 2
(gdb) p $rax
$16 = 1
%eax holds the return value of natural_generator, and all that is left is the epilogue, which cleans up the stack and returns:
pop    %rbp
retq
We now understand how b is initialized. Let's see what happens when natural_generator is called again:
(gdb) continue
Continuing.
1

Breakpoint 1, natural_generator () at static.c:5
5           int a = 1;
1: x/i $pc  0x100000e94 <natural_generator+4>:  movl   $0x1,-0x4(%rbp)
(gdb) x &b
0x100001018 <natural_generator.b>:      0
Because b is not stored on the stack with the other local variables, it is still 0 when natural_generator is called again. No matter how many times our generator is called, b always retains its previous value. This is because it is stored off the stack and is initialized when the loader brings the program into memory, rather than by any of our machine code.
Conclusion
We started by going over how to read assembly and how to disassemble a program with GDB. Then we worked out how static local variables work, which we could not have done without disassembling the executable.
We spent a lot of time alternating between reading assembly instructions and verifying our hypotheses in GDB. This may seem tedious, but there is a good reason for this approach: the best way to learn something abstract is to make it more concrete, and one of the best ways to make something more concrete is to use tools that let you peel back the layers of abstraction. The best way to learn these tools is to force yourself to use them until they become second nature.
From the translator: Low-level programming is not my specialty, so if I have made any mistakes, I would be glad to hear about them in a private message.