In the last semester of the university, I chose
the CS444 compiler course . There, each group of 1-3 people had to write a compiler from a substantial Java subset in x86. Language to choose from. It was a rare opportunity to compare the implementation of large programs of the same functionality, written by very competent programmers in different languages, and compare the difference in the design and choice of language. Such a comparison has generated a lot of interesting thoughts. It is rare to find such a controlled comparison of languages. It is not perfect, but much better than most of the subjective stories on which people's opinions about programming languages ​​are based.
We made our compiler on Rust, and at first I compared it with the project of the Haskell team. I expected their program to be much shorter, but it was the same size or larger. Same for OCaml. Then I compared it with the compiler in C ++, and there, as expected, the compiler was about 30% more, mainly because of headers, the lack of sum types, and pattern matching. The following comparison was with my girlfriend, who made the compiler independently in Python and used less than half of the code, compared to us, because of the power of metaprogramming and dynamic types. Another companion program on Scala was also less than ours. Most of all I was surprised by the comparison with another team that also used Rust, but they had three times more code due to different design decisions. In the end, the biggest difference in the amount of code was within the same language!
I will tell you why I consider this a good comparison, give some information about each project and explain some reasons for differences in the compiler size. Also draw conclusions from each comparison. Feel free to use these links to go to the section of interest:
Content
- Why I think it is meaningful.
- Rust (basis for comparison)
- Haskell : 1.0-1.6 sizes depending on how you count, for interesting reasons.
- C ++ : 1.4 sizes for obvious reasons
- Python : 0.5 size due to bizarre metaprogramming!
- Rust (another group) : triple size due to a different design!
- Scala : 0.7 size
- OCaml : 1.0-1.6 sizes, depending on how you count, is similar to Haskell
Why I think it is meaningful.
Before you say that the amount of code (I compared both strings and bytes) is a terrible metric, I want to note that in this case it can provide a good understanding. At the very least, this is the most well-controlled example when different teams write the same great program I have heard or read about.
')
- No one (including me) knew that I would measure this parameter, so no one tried to play metrics, everyone just tried to finish the project quickly and correctly.
- All (with the exception of the Python project, which I will talk about later) implemented the program with the sole purpose of passing the same automated test suite in the same time frame, so the results cannot be greatly distorted by groups that solve different problems.
- The project was completed within a few months, with the team, and had to gradually expand and pass both known and unknown tests. This means that it was useful to write clean, clear code.
- In addition to passing the tests of the course, the code will not be used for anything else, no one will read it and, being a compiler for a limited subset of Java into a text assembler, it will not be useful.
- No libraries other than the standard library are allowed, and no parsing helpers, even if they are in the standard library. This means that the comparison cannot be distorted by the powerful compiler libraries that only some commands have.
- There were not only public, but also secret tests. They were run once after the final delivery. This meant that there was an incentive to write your own test code and make sure that the compiler is reliable, correct, and handles complex border situations.
- Although all participants are students, but I consider them to be quite competent programmers. Every at least two years passed an internship, mainly in high-tech companies, sometimes even worked on compilers. Almost all of them program for 7-13 years and are enthusiasts who read a lot on the Internet beyond their courses.
- The generated code was not taken into account, but the grammar files and the code that generated the other code were taken into account.
Thus, I think that the amount of code provides a decent understanding of how much effort it will take to support each project, if it were long-term. I think that not too much difference between the projects also allows you to refute some extraordinary statements that I read, for example, that the Haskell compiler will be more than half the size of C ++ due to the language.
Rust (basis for comparison)
I and one of my comrades used to write more than 10k lines on Rust, and a third colleague wrote, perhaps 500 lines on some hackathons. Our compiler came out in 6806 lines of
wc -l
, 5900 lines of source code (without spaces and no comments), and 220 KB of
wc -c
.
I found that in other projects these proportions are approximately respected, with a few exceptions, which I will note. For the rest of the article, when I refer to strings or a sum, I mean
wc -l
, but it doesn’t matter (if I don’t notice the difference), and you can convert with a factor.
I wrote
another article describing our design , which passed all public and secret tests. It also contains several additional functions that we did for fun, not for passing tests, which probably added about 400 lines. There are also about 500 lines of our unit tests.
Haskell
The Haskell team included two of my friends who wrote perhaps a couple of thousand lines of Haskell each, plus read a lot of online content about Haskell and other similar functional languages, such as OCaml and Lean. They had another teammate whom I didn’t know very well, but it seems that a strong programmer used Haskell before.
Their compiler was 9750 lines for
wc -l
, 357 KB and 7777 lines of code (SLOC). This command also has the only significant differences between these ratios: their compiler is 1.4 times larger than ours in rows, 1.3 times in SLOC and 1.6 times in bytes. They did not implement any additional functions, passed 100% of public and secret tests.
It is important to note that the inclusion of tests most affected this team. Since they carefully approached the correctness of the code, they included 1600 lines of tests. They caught several borderline situations that our team did not catch, but these cases were simply not verified by course tests. So without tests on both sides (6.3 thousand lines versus 8.1 thousand lines) their compiler is only 30% more than ours.
Here I tend to bytes as a more reasonable measure of volume comparison, because in the Haskell project, on average, longer lines, since it does not have a large number of lines from one closing bracket, and
rustfmt
does not break single-line chains of functions into several lines.
Having rummaged with one of my teammates, we came up with the following explanation for this difference:
- We used the handwritten lexical analyzer and the recursive descent method, and they used the generator on NFA and DFA and LR parser , and then the pass to convert the parse tree to AST ( abstract syntax tree , more convenient code presentation). This gave them significantly more code: 2677 lines compared to our 1705, about 1000 lines more.
- They used the fancy AST generic, which shifted to different type parameters as more information was added in each pass. This and more rewriting support functions probably explain why their AST code is about 500 lines longer than our implementation, where we collect struct literals and mutate
Option<_>
fields to add information as we go.
- They also have about 400 lines of code during generation, which are mainly related to the larger abstraction needed to generate and combine the code in a purely functional way, where we simply use mutation and string writing.
These differences plus tests explain all differences in volume. In fact, our files for the folding of constants and the resolution of the context are very close in size. But still, there is still some difference in bytes due to longer lines: probably because more code is needed to rewrite the whole tree on each pass.
In the end, postponing the design decisions, in my opinion, Rust and Haskell are equally expressive, perhaps with a slight advantage of Rust due to the ability to easily use mutation when it is convenient. It was also interesting to know that my choice of the recursive descent method and the handwritten lexical analyzer paid off: it was a risk that contradicted the recommendations and instructions of the professor, but I decided that it was easier and that was right.
Haskell fans will argue that the team probably did not take full advantage of Haskell’s capabilities, and if they knew the language better, they could have done a project with less code. I agree, someone like
Edward Kmett can write the same compiler in a much smaller volume. Indeed, my friend's team did not use many fancy super-abstractions and fancy combinator libraries such as
lens . However, all this affects the readability of the code. All the people in the team are experienced programmers, they knew that Haskell is capable of very fancy things, but decided not to use them, because they decided that their understanding would take longer than they would save, and would make the code harder for others to understand. This seems like a real compromise to me, and the statement that Haskell is magically suitable for compilers goes into something like “Haskell requires extremely high qualifications for writing compilers if you do not care about supporting code with people who are also not very skilled at Haskell.”
It is also interesting to note that at the beginning of each project, the professor says that students can use any language that runs on the university server, but warns that the Haskell teams are different from the others: they have the largest variation in grades. Many people overestimate their abilities and the Haskell teams have the most bad grades, while others do great, like my friends.
C ++
Then I talked to my friend from the C ++ team. I knew only one person in this team, but C ++ is used in several courses at our university, so probably everyone on the team had C ++ experience.
Their project consisted of 8733 lines and 280 KB, not including the test code, but including about 500 lines of additional functions. What makes it 1.4 times the size of our code without tests, which also has about 500 lines of additional functions. They passed 100% of public tests, but only 90% of secret tests. Presumably because they did not implement the fancy vtables arrays required by the specification, which may take 50-100 lines of code.
I didn’t delve very deeply into these differences in size. I guess this is mainly due to:
- They use LR parser and tree rewriter instead of recursive descent method.
- The lack of sum types and pattern matching in C ++, which we used extensively and which were very useful.
- The need to duplicate all the signatures in the header files, which is not in Rust.
We also compared the compile time. On my laptop, the net debug build of our compiler takes 9.7 seconds, the net release is 12.5 seconds, and the incremental debug build is 3.5 seconds. My friend did not have the timings on hand for their C ++ build (using parallel make), but he said that the numbers are similar, with the proviso that they put the implementations of many small functions in the header files to reduce duplication of signatures at the cost of a longer time ( therefore, I cannot measure the net overhead of lines in header files).
Python
My friend, a very good programmer, decided to do the project alone in Python. She also implemented more advanced features (for fun) than any other command, including an intermediate SSA representation with register allocation and other optimizations. On the other hand, since it worked alone and implemented many additional functions, it paid the least attention to the quality of the code, for example, throwing undifferentiated exceptions for all errors (relying on backtracks for debugging) instead of implementing error types and corresponding messages like us.
Its compiler consisted of 4581 lines and passed all public and secret tests. She also implemented more additional functions than any other command, but it’s hard to determine how much additional code it took because many of the additional functions were more powerful versions of simple things that everyone needed to implement, such as folding constants and generating code. Additional functions are probably 1000-2000 lines, at least, so I am sure that its code is at least twice as expressive as ours.
One big part of this difference is probably dynamic typing. Only in our
ast.rs
500 lines of type definitions, and many more types defined elsewhere in the compiler. We are also always limited by the type system itself. For example, we need infrastructure for ergonomic addition of new information to AST as it passes and access to it later. While in Python you can simply set new fields on the AST nodes.
Powerful metaprogramming also explains some of the difference. For example, although she used the LR parser instead of the recursive descent method, in her case, I think it took less code, because instead of going overwriting the tree, her LR grammar included Python code fragments to build AST, which the generator could turn into Python functions using
eval
. Part of the reason why we did not use the LR parser is that building an AST without overwriting a tree will require a lot of ceremonies (creating either Rust files or procedural macros) to link the grammar with Rust code fragments.
Another example of the power of metaprogramming and dynamic typing is the 400-line
visit.rs
file, which is basically a repeating pattern code that implements the visitor on a heap of AST structures. In Python, this may be a short function of about 10 lines, which recursively introspect the fields of the AST node and visit them (using the
__dict__
attribute).
As a fan of Rust and statically typed languages ​​in general, I tend to point out that the type system is very useful for preventing errors and for performance. Unusual metaprogramming can also make it difficult to understand how code works. However, this comparison surprised me that I did not expect that the difference in the amount of code would be so big. If the difference in general is really close to having to write twice as much code, I still think that Rust is a suitable compromise, but still half the code is an argument, and in the future I tend to do something in Ruby / Python if you just need to quickly build something alone, and then throw it away.
Rust (another group)
The most interesting comparison for me was with my friend, who also did a project in Rust with one teammate (whom I did not know). My friend had a good Rust experience. He contributed to the development of the compiler Rust and read a lot. I know nothing about his friend.
Their project consisted of 17,211 raw lines, 15k source lines and 637 KB, not including the test code and the generated code. He had no additional functions, and he passed only 4 of 10 secret tests and 90% of public tests for code generation, because they didn’t have enough time before the deadline to implement more fancy parts of the specification. Their program is three times larger than ours, written in the same language, and with less functionality!
This result was really amazing for me and overshadowed all the differences between the languages ​​that I have studied so far. Therefore, we compared the lists of file sizes
wc -l
, and also checked how each of us implemented some specific things that resulted in a different code size.
It seems that it all comes down to the consistent adoption of various design decisions. For example, their frontend (lexical analysis, parsing, AST building) took 7597 lines against our 2164. They used the lexical analyzer on DFA and the parser LALR (1), but other groups did similar things without so much code. Looking at their weeder file, I noticed a number of design solutions that were different from ours:
- They decided to use a fully typed parse tree instead of the standard homogeneous string-based parse tree. This probably required much more type definitions and additional conversion code at the parsing stage or a more complex parser generator.
- They used tryfrom implementations of traits to convert between the types of the parse tree and the AST types when checking their correctness. This results in a set of 10–20-line
impl
blocks. We used for this purpose functions that return Result
types, which generates fewer lines, and also frees us a bit from the type structure, simplifying parameterization and reuse. Some of the things that for us were single-line branches of match
, they had 10-line impl
blocks.
- Our types are structured to reduce copy-paste. For example, they used separate fields
is_abstract
, is_native
and is_static
, where the constraint-check code had to be copied twice: once for void-typed methods and once for methods with a return type, with minor changes. While our void
was just a special type, we came up with a modifier taxonomy with mode
and visibility
that applied constraints at the type level, and the constraint errors were generated by default for the match operator, which translated the modifier sets into mode
and visibility
.
I did not look at the code for analyzing their compiler passes, but they are great too. I talked to my friend, and it seems that they did not implement anything similar to the visitor infrastructure like ours. I suppose that this, along with some other smaller differences in design, explains the difference in the size of this part. The visitor allows our analysis passes to focus only on the parts of the AST they need, instead of matching the pattern across the entire AST structure. It saves a lot of code.
Their part for generating code consists of 3594 lines, and ours is 1560. I looked at their code and it seems that almost all the difference is that they chose an intermediate data structure for assembler instructions, where we just used string formatting to directly assembler output . They had to define the types and output functions for all the instructions used and the types of the operands. It also meant that building the assembler instructions took more code. Where we had a format statement with short instructions, such as
mov ecx, [edx]
, they needed a giant
rustfmt
operator, broken up into 6 lines, which built instructions with a bunch of intermediate nested types for operands that included up to 6 levels of nested brackets. We could also output blocks of related instructions, such as the preamble of a function, in one format statement, where they had to specify the complete structure for each instruction.
Our team considered the possibility of using such an abstraction as theirs. It was easier to be able to either output the text assembly or directly output the machine code, but this was not a requirement of the course. The same could be done with less code and better performance, using X86Writer
X86Writer
with methods like
push(reg: Register)
. We also took into account that this could simplify debugging and testing, but we realized that viewing the generated text assembler is actually easier to read and test with the help of
testing snapshots , if liberal comments are inserted. But we (apparently, correctly) predicted that it would take a lot of additional code, and there was no real benefit, considering our real needs, so we did not worry.
It is good to compare this with the intermediate representation, which the C ++ team used as an additional function, which robbed them of only 500 additional lines. They used a very simple structure (for simple type definitions and build code) that used operations close to what Java required. This meant that their intermediate representation was much smaller (and, therefore, required less build code) than the resulting assembler, since many language operations, such as calls and casts, were extended to many assembly instructions. They also say that it really helped debug, since it cut a lot of garbage and improved readability. The presentation of a higher level also made it possible to do some simple optimizations on their intermediate presentation. The C ++ team came up with a really good design that brought them much more benefit with a lot less code.
In general, it seems that the common cause of the threefold difference in volume is due to the consistent adoption of various design decisions, both large and small, in the direction of a larger amount of code. They implemented a number of abstractions that we didn't do — they added more code, and missed some of our abstractions, which reduce the amount of code.
This result really surprised me. I knew that design decisions matter, but I wouldn’t have guessed in advance that they would lead to any differences of this size, given that I only examined people I consider to be strong, competent programmers. Of all the comparison results, this is the most significant for me.
It probably helped me that I read a lot about how to write compilers before I took this course, so I could use clever projects that other people invented and found to work well, like AST visitors and the recursive descent method, although they were not taught we have on the course.
What really made me think is the cost of abstraction. Abstractions can facilitate expansion in the future or protect against some types of errors, but they need to be taken into account given that you can get three times more code for understanding and refactoring, three times more places for errors and less time for testing and further development Our training course was different from the real world: we knew for sure that we would never touch the code after it was developed, this eliminates the advantages of proactive abstraction. However, if I had to choose which compiler to expand with an arbitrary function that you will say later, I would choose ours, even without considering my familiarity with it. Just because it has a lot less code to understand, and I could potentially choose the best abstraction for the requirements (for example,intermediate presentation of the C ++ team when I know the specific requirements.
Also in my view the taxonomy of abstractions has strengthened: there are those that shorten the code, considering only current requirements, like our visitor template, and there are abstractions that add code, but provide the advantages of extensibility, debugging, or correctness.
Scala
I also talked with a friend who was doing a Scala project in the previous semester, but the project and tests were exactly the same. Their compiler consisted of 4141 lines and ~ 160 KB of code, not counting the tests. They passed 8 of 10 secret tests and 100% of open tests and did not implement any additional functions. Thus, in comparison with our 5906 lines without additional functions and tests, their compiler is 30% less.
One of the factors of small-scale design was a different approach to parsing. The course allowed the use of the command line tool for the LR table generator. Nobody used it except this command. This eliminated the need to implement the LR table generator. They also managed to avoid writing the LR grammar using a 150-line Python script that scraped the Java grammar web page they found on the Internet and translated it into generator input format. They still needed to do some kind of tree in Scala, but on the whole the parsing stage was 1073 lines compared to our 1443, although our gradient descent method here gave an advantage in terms of volume compared to all other teams.
The rest of their compiler was also smaller than ours, without any obvious big differences in design, although I didn’t delve into the code. I suspect that this is due to differences in the expressiveness of Scala and Rust. Scala and Rust have similar programming functionality, useful for compilers, such as pattern matching, but Scala managed memory saves the code needed for borrow borrower in Rust. In addition, Scala has more varied syntactic sugar than Rust.
Ocaml
Since all members of our team are interning at Jane Street (technology computational trading company - approx. Lane.), I was particularly interested to see the results of other former Jane Street interns who chose OCaml to write the compiler.
Their compiler was 10914 lines and 377 KB, including a small amount of test code and no additional functions. They passed 9/10 secret tests and all public tests.
Like other groups, it seems that the main difference in size is associated with using the LR parser and tree rewriting for parsing, as well as the regex-> NFA-> DFA conversion pipeline for lexical analysis. Their frontend (lexical analysis, syntax analysis, AST construction) is 5548 lines, and ours is 2164, with similar ratios for bytes. They also used testing for their parser with the expectation that looks like our snapshot tests, which put the expected output out of the code, so their tests of the parser were ~ 600 lines of the total, and ours were about 200.
This leaves for the rest of the compiler 5366 lines for them (461 lines of which are interface files with type declarations) and 4642 for us, the difference is only 15% if their interface files are counted, and almost the same size if not counted. It seems that, apart from our design solutions for parsing, Rust and OCaml seem equally expressive, except that OCaml needs interface files, but Rust does not.
Conclusion
In general, I am very glad that I made this comparison, I learned a lot and was surprised many times. I think the general conclusion is that design decisions are much more important than language, but language matters because it gives you the tools to implement different designs.