As I wrote the C compiler in 40 days

I suggest you translate the diary of Rui Ueyama, a programmer from Google, which he led while working on the implementation of the C compiler about three and a half years ago (but only published last December).
This diary is not of any practical use and is not a tutorial, but I was very interested to read it, I hope you enjoy this story too :)

I wrote a C compiler in 40 days, which I called 8cc. This is a diary written by me at the time. The code and its history can be viewed on GitHub .

Day 8

I am writing a compiler. He started working after writing about 1000 lines of code. Here are some examples that already work:

int a = 1; a + 2; // => 3 int a = 61; int *b = &a; *b; // => 61

Arrays are correctly converted to pointers, so the code below also works. Calling functions are also supported.
')

 char *c = "ab" + 1; printf("%c", *c); // => b

It was not difficult to implement, because I do it the second time. I learned how to handle arrays and pointers better.

Day 15

I have come a long way in implementing the compiler and it works surprisingly well. Nontrivial programs, for example this one - the solving problem of eight queens , are compiled and run.

Of course he lacks many functions. These sample programs are chosen not to use them.
The implementation is fairly simple; there is not even register allocation.
It compiles the source programs into the code of the stack engine, which uses the system stack as the stack. Each operation requires access to the memory. But while it suits me.

At the beginning, the compiler would fit approximately 20 lines and the only thing that it was able to do is read the integer value from the standard input and start the program which ends immediately returning this integer.

Now it contains about 2000 lines. If you look at git, it seems that it developed in this way:

Added "+" and "-"
the phases of parsing and code generation are separated
Added "*" and "/"
added variables (implicitly implying int type)
function call added
lines added
tag generator (tokenizer) and syntax analysis are separated
support for declaring basic types
pointers and arrays added
support for array initialization expressions
added "if"
feature declaration supported
added “for” and “return”
pointer assignment is supported
"==" added
array indexing and pointer arithmetic added
Added "++", "-" and "!"

Day 17

I have successfully implemented the structure. A structure is an object that can occupy more than one machine word. They are harder to implement than primitive types, but it was easier than I expected.

It seems to work as it should; I can define a structure containing a structure. I can define a pointer to a structure and dereference it. Structures containing arrays and arrays of structures also work. Although I already knew that the code should theoretically work, I was still glad when it really worked, even in such a difficult case.

However, I do not feel that I fully understand why this code works correctly. It feels a bit magical due to its recursive nature.

I cannot transfer structures to functions. In the x86 calling convention, the structure is copied onto the stack and a pointer to it is passed to the function. But in x86-64, you have to split the structure into several pieces of data and pass them through registers. It is difficult, so I will postpone it. Passing structures by value is needed less often than passing pointers to them.

Day 18

Implementing the association was easier, because it is just a variant of the structure in which all fields have the same offset. Also implemented the operator "->". Easy peasy.

Arranging floating-point support was hard. It seems that implicit type conversion between int and float works, but floating-point numbers cannot be passed to functions. In my compiler, all function parameters are first placed on the stack, and then written to registers in the order defined in the x86-64 calling convention. But in this process apparently there is a bug. It returns a memory access error (SIGSEGV). It is difficult to debug it, considering the assembler output, because my compiler does not optimize the assembler for reading. I thought I could finish this in a day, but I was wrong.

Day 19

I wasted my time because I forgot that in accordance with the x86-64 calling convention, the stack frame must be 16 bytes aligned. I found that printf () drops from SEGV if I pass it some floating point numbers. I tried to find the conditions under which it can be reproduced. It turned out that the position of the stack frame matters, which made me think about the requirements of ABI x86-64.

I didn’t take care of it at all, so the stack frame was only 8 byte aligned, but print () didn’t complain until it took only integers. This problem can be easily corrected by adjusting the stack frame before calling the CALL instruction. But such problems can not be avoided if you carefully do not read the specification before writing code.

Day 20

I changed the indentation in the compiler code from 2 to 4. I am more used to using 2-space indents, because these are used in my work at Google. But for some reason it seems to me that 4 space spaces are more suitable for a “beautiful open source program.”

There is another, more significant, change. I rewrote the tests from the shell scripts in C. Before this change, each test function compiled by my compiler was linked to main () which was compiled by GCC and then launched by the shell script. It was slow, because spawned many processes for each test. I had no choice when I started the project, because my compiler didn't have many functions. For example, he could not compare the result with the expected value due to the lack of comparison operators. Now it is powerful enough to compile test code. So I rewrote them to make them faster.

I also implemented large types such as long and double. Code writing was fun because I very quickly succeeded in implementing these functions.

Day 21

I almost finished the implementation of C preprocessor in one day. Actually, this is the port from my previous attempt to write a compiler.

Implementing the C preprocessor is not an easy task.

This is part of the C standard, as defined in the specification. But the specification says too little for it to be useful for self-implementation. The specification includes several macros with their expanded form, but it says very little about the algorithm itself. I think she doesn't even explain the details of his expected behavior. In general, it is underspecified.

As far as I know, the PDF in this blog is the only and best resource on how to implement the C preprocessor. The algorithm described in the document (Dave Prosser's algorithm) attempts to deploy as many tokens as possible, avoiding the infinite unfolding of the macro. Each token has its own history of unfolding, so that tokens do not unfold with the same macro more than once.

Preprocessor C itself is an independent language. It has many features, and only experienced C programmers understand it well.

Day 22

Tried to get the compiler to read the system headers, so now he understands #include. While I tried, I was getting a lot of errors. This revealed that my preprocessor still lacks many functions, for example, operators that can only be used in #if. There are many bugs besides this. I corrected them as soon as I found them.

System header files are large and confusing. They require many functions from the compiler, such as enum and typedef. I implemented them one by one, but sometimes I cut corners. I am trying to read stdio.h. I have no idea how long this can take.

The compiler now consists of 4000 lines. The small LCC compiler contains 12,000 lines. Using it as a guide, I think my compiler will soon be able to work like a real C compiler.

I'm surprised I wrote 500 lines of code today. I can work in a stream for 12 hours, but this may be inefficient, because I get tired without noticing it. In any case, I must admit that I am a person with a lot of free time.

Day 24

I don't remember what I fixed, but now stdio.h can connect. This is very cool, because the types of functions defined in the header file are now correctly processed.

The scheme I use to implement my compiler involves creating a compiler for a small subset of the language and then developing it into a real C language. Until recently, I didn’t really try to implement functions that I don’t fully understand. I could write as much code as I want and leave the rest. It was fun.

Outside things, such as the header system, have caused many problems. Now I have to implement all the functions expected from the “real” C compiler. I did a lot of dirty hacks to read stdio.h. For example, I implemented a hack to ignore all occurrences of the token "const". It upsets me.

You may ask why not do it in the right way from the very beginning. I would say that it is not fun. Too. For example, the syntax of declaring types in C is too complicated without any sane reason and it is not at all interesting to implement it.

Despite this there are some things that I can not avoid. Probably, I should change my view in order to realize all the functions, from beginning to end. I may find this interesting as I approach the goal. Sometimes, I have to write more code than I want in order to achieve a goal.

Day 25

I was stuck for two days with the implementation of the syntax of definitions and declarations without any success. Why can't I finish this? I made pointers and structures in one day of work. I feel that I underestimated it. Maybe I need a plan?

Day 26

In a difficult situation like this, I probably should remember the fact that the compiler was just in one file to see how I’ve made progress in a month. It simply read the whole through scanf () and printed it through printf (). Indeed, I very seriously advanced in one month. Yes, I think I can do it.

Day 28

Finished writing a parser for ads and definitions. I think the reason why I failed was that I tried to write too many details from the very beginning, so I wrote pseudo-code to make sure I understood everything correctly and then converted it into real code.

I have been writing in C for almost 15 years, but only today I felt that I finally understand the type syntax in C. Not surprisingly, I could not write working code. This is because I just did not understand him correctly.

The code I just wrote is too complex and fragile, that even I hardly understand it. I do not believe that Dennis Ritchie, the creator of C, understood the consequences of what he did. I suspect that he invented the syntax, wrote the code, which turned out to be more complicated than he expected, and, as a result, was standardized by the ANSI committee. It’s hard to implement a standardized language, because you have to do it right. Rather, it is easier to write your own toy language.

Day 29

Implemented many more operators and cleaned the code.

Today, for the first time, my compiler managed to compile one of its files. When I linked it to other files compiled with GCC, it turned out to be working. And the resulting compiler also seems to work. Looks like the goal is getting closer.

Day 30

I implemented switch-case, continue, break and goto today. When I wrote test cases for goto, the test code quickly turned into a spaghetti code that I could not read. It made me laugh. I made sure why goto is considered harmful .

Day 31

Implemented functions for varargs, namely va_start, va_arg and va_end. They are not used often, but I needed them to compile functions, such as printf.

The vararg specification for C is not well thought out. If you pass all function arguments through the stack, va_start can be implemented fairly easily, but on modern processors and in modern calling conventions, arguments are passed through registers to reduce the overhead of calling functions. Therefore, the specification does not correspond to reality.

Roughly speaking, the ABI for x86-64, standardized by AMD, requires that functions with a variable number of arguments copy all registers to the stack in order to prepare for the subsequent call to va_start. I understand that they had no other choice, but it still looks clumsy.

I wondered how other compilers handle functions with a variable number of arguments. I looked at the TCC headers and it looks like they are not compatible with ABI x86-64. If the data structure for varargs is different, then the functions passing va_list (such as vprintf) become incompatible. Or I'm wrong? [And I’m really wrong — they are compatible.] I also looked at Clang, but it looks confusing. I did not read it. If I read too much code from other compilers, it can spoil the fun of my own implementation.

Day 32

After fixing minor problems and adding control sequences for string literals (there were still no '\ 0' and similar things), we managed to compile another file. I feel confident progress.

I tried to implement support functions with more than six parameters, but could not finish it in one day. In x86-64, the first 6 integer parameters are passed through registers, and the rest through the stack. Now only transfer via registers is supported. Passing through the stack is not difficult to implement, but it takes too much time to debug. I think in my compiler there are no functions with more than six parameters, so I’ll postpone their implementation for now.

Day 33

Three more files compiled today. Total 6 out of 11. But if we count the lines of code, then this is about 10% of the total. The remaining files are much larger because they contain the compiler core code.

Even worse, in the kernel files I use relatively new C features, such as compound literals and assigned initializers. They greatly complicate self-compilation. I should not have used them, but rewriting the code on plain old C will not be productive, so I want to support them in my compiler. Although it will take time.

Day 34

A few notes on debugging tools. Since the compiler is a complex piece of code that consists of many steps, a way is needed to somehow investigate it for debugging. My compiler is no exception; I implemented several features that I thought were useful.

First, the lexical analyzer remembers its reading position and when it is interrupted for unforeseen reasons, it returns this position. This makes it easy to find a bug when the compiler does not accept valid input data.

There is a command line option for printing an internal abstract syntax tree. If there is an error in the parser, I want to look at the syntax tree.

The code generator allows recursion to be used extensively because it generates fragments of assembly code when it crawls an abstract syntax tree. So I was able to print a mini stack trace for each line of assembly output. If I notice something wrong, I can trace the code generator by looking at its output.

Most internal data structures have functions for converting to strings. This is useful when using printf for debugging.

I always write unit tests when I write a new feature. Even implementing it, I try to keep the code compiled to run the tests. Tests are written to be performed in a short period of time, so that you can run them as often as you like.

Day 36

Implemented composite literals and rewrote the initializer of structures and arrays. I did not like the previous implementation. Now the initializer is better. I had to write beautiful code from the very beginning, but since I realized this, only by writing working code, rewriting was inevitable.

I think the only feature that is not enough for self-compilation is the assignment of structures. I hope everything will work as intended without much debugging when it is implemented.

Day 37

The file containing the tokenizer is compiled, but the resulting second-generation compiler does not generate the correct assembler code for some reason. Although the code generated by the first-generation compiler passes all the tests. Such an insidious bug.

I think that I have no choice but to use printf for debugging, because the second generation is compiled through my compiler, which does not support debugging information. I added printf in suspicious places. Printf debug messages were displayed when compiling the second generation, which surprised me a little. I wanted debug messages to be output only when I use the second generation, so I didn’t expect the output to work when the second generation is being created .

It reminds me of the film "The Beginning". We have to go deeper to reproduce this bug. This is the fun part of debugging a self-compiled compiler.

Day 38

I fixed the problem that occurred in the second generation, if the lexical analyzer was self-compiled. It caused a bug where -1> 0 sometimes returned true (I forgot about the significant extension). There is another bug in struct layout. Only three files left.

Day 39

The code generator can now also compile itself. There are two files left. The work is almost finished, although I probably should not be overly optimistic. There may still be unexpected pitfalls.

I fixed a lot of problems caused by the poor quality of the code that I wrote at the early stage of this project. That tired me out.

I believed that I had every opportunity for self-compilation, but this is not true. There is not even a prefix increment / decrement operator. For some features of C99, I rewrote part of the compiler to make it more convenient to compile. , C, .

40

Hooray! !

40 . , C. ? , — C, C . .

, , , .

. , , , .

. , , C, 5000 .

41

, . , , , , , . .

, , . , GCC .

42

, , .

, "\n" ( «n») "\n" ( ). , , ASCII "\n". , , . GCC.
, .

. , Unix , Unix. , ( ) . , , , . , ?

43

(operator-precedence parser). , , , C (, ). , , . .

— . .

, , , , . , , , . , . , .

44

: → → → → x86-64 . , . , , . , , , .

Dragon Book , , . , .

GCC . , . , .

45

, gcov . , . , . .

46

, , . , , .

. . , . . , , .

, Dragon Book. , - .

52

: 16- , . , .

: . , 16- . . .

, GCC , . , . , — . , .

, : . C , , , , . ( ), . . , .

, . - .

53

. GCC, — , . ABI. , , .

, . , , . , . . , GCC .

: , true, GCC 8 . , , 512 (= 2 ⁹ 0x100), true, GCC -. GCC 8 , false.

- , , , GCC ( ) . , . - . , , , , .

x86-64 ABI , 8 . , , , . . — , , .

55

.

, . GCC (, , ):

. , «int x:5», x 4- (, int — 32 ).
. , «int x:10», x 4- . , , .

, CPU . , , & . , , , &, «», , .

56

goto (computed goto). goto , goto . C GCC. - switch-case, , , . , switch-case goto.

goto . , . , .

57

C11 _Generic, . , , GCC 4.6.1 _Generic. , GCC .

typeof(), GCC. , , .

58

C99. — , . , "<:" "[". , , .

89 , . — , , . , printf(«huh??!») «huh??!», «huh|» "??!" "|". . , .

62

TCC . - .

TCC — C, 20 30 . x86-64, , , 10-20. , . , , — .

TCC, . , , , . . , , -, , .

73

TCC. , , TCC. , .

Source: https://habr.com/ru/post/274083/

All Articles