Lessons to learn from the LLVM/Clang codebase

From the translator: in the article below, the authors examine the LLVM/Clang codebase with CppDepend, a code analysis tool that computes various code metrics and helps analyze large projects in order to improve code quality.

Time has proven that Clang is as mature a C and C++ compiler as GCC and Microsoft's compiler. What makes it special is that it is not just a compiler: it is an infrastructure for building tools. Because its architecture is library-based, reusing its functionality and integrating it into your own project is easier and more flexible.

Clang structure

Like many other compilers, Clang consists of three phases:

Frontend: parses the source code, checks it for errors, and builds a language-specific abstract syntax tree (AST) that represents the input code.
Optimizer: optimizes the code produced by the frontend (in Clang's case, the LLVM IR generated from the AST).
Backend: generates the final machine code for the target architecture.

What is the difference between Clang and other compilers?

The biggest difference is that Clang is built on LLVM, whose central idea is an intermediate representation (IR), somewhat similar to Java bytecode.

LLVM IR is designed to host the mid-level analyses and transformations found in the optimizer stage of a compiler. It was designed with many specific goals in mind, including support for lightweight runtime optimizations, cross-function (interprocedural) optimizations, whole-program analysis, and aggressive restructuring transformations. The most important aspect, however, is that the intermediate representation is itself a first-class language with well-defined semantics.

With this structure, most of the compiler can be reused to build other compilers; for example, you can replace the frontend to support another language.

It is very interesting to climb inside this powerful toy and see how it is designed and implemented. C++ developers can learn a lot of good practices from this codebase.

Let's X-ray the source code using CppDepend and CQLinq to understand some of the developers' decisions.

1. Modularity

1.1. Modularity and use of libraries

The main concept behind the development of Clang was the use of libraries. The different parts of the frontend are cleanly divided into separate libraries that can be combined for different purposes. This approach encourages good interfaces and makes it easier for new developers to get started, since they only need to understand a small part of the big picture.

A DSM (Dependency Structure Matrix) is a compact way of representing dependencies between components. A non-empty cell contains a number that expresses the strength of the coupling the cell represents. Coupling strength can be measured as the number of members, methods, fields, types, or namespaces involved in the dependency.

This dependency graph shows us the libraries that clang uses directly.

As we can see, there are three cyclic dependencies: between clangBasic and clangFrontend, between clangBasic and clangDriver, and between clangBasic and clangLex. It is recommended to remove cyclic dependencies between libraries so that the code is more readable and easier to maintain.

Why does clangFrontend use the clangBasic library?

A single enum field causes the cyclic dependency, so the code can be refactored and the dependency easily eliminated.

1.2. Modularity using namespaces

In C++, namespaces are used to make code modular. In LLVM/Clang they are used for three main reasons:

Many namespaces contain only enums, as shown in the following CQLinq query.

In a large project, there is no guarantee that two different enumerations will not share a name or an enumerator. C++11 solved this problem with scoped enumerations (enum class), whose values must be qualified with the enumeration name. The code may be refactored in the near future to use C++11 enum classes.
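A minimal sketch of the difference between the two approaches. The names legacy_color, Color, Fruit and toInt are hypothetical and do not come from the LLVM/Clang sources:

```cpp
// Pre-C++11 idiom: wrap an unscoped enum in a namespace so that its
// enumerators do not leak into the enclosing scope.
namespace legacy_color {
enum Kind { Red, Green, Orange };
} // namespace legacy_color

// C++11 scoped enumerations: enumerators must be qualified with the
// enumeration name, so two enums may both define `Orange`.
enum class Color { Red, Green, Orange };
enum class Fruit { Apple, Orange };

// Scoped enums do not convert to int implicitly; the cast is explicit.
constexpr int toInt(Fruit f) { return static_cast<int>(f); }
constexpr int toInt(Color c) { return static_cast<int>(c); }
```

With unscoped enums at global scope, the two Orange enumerators would collide; here Color::Orange and Fruit::Orange coexist without a wrapping namespace.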

Anonymous namespaces: unnamed namespaces let you avoid global static variables. Everything declared in an anonymous namespace is visible only in the file in which it is declared. Here is a list of all the anonymous namespaces used.

All non-anonymous namespaces:

Namespaces are a good way to make an application modular. LLVM/Clang defines more than 500 namespaces to keep the code modular, readable, and maintainable.

2. Using paradigms

C++ is not just an object-oriented language. As Bjarne Stroustrup has pointed out, C++ is a multi-paradigm language: it supports many programming styles, or paradigms, and object orientation is only one of them. Others include procedural programming and generic programming.

2.1. Procedural programming

2.1.1. Global functions

Find all global functions defined in the LLVM/Clang source:

We can divide these functions into three categories:

1 - Utilities: For example, conversion functions from one type to another.

2 - Operators: many operators are defined, as the CQLinq result shows:

Almost every kind of operator is overloaded somewhere in the LLVM/Clang source code.

3 - Functions related to compiler logic: A set of global functions that implement various compiler functions.

Functions of this kind could perhaps be grouped, either as static class methods or within namespaces.

2.1.2. Static global functions

It is best practice to declare global functions as static, except in those specific cases where you need to call them from another source file.
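As an illustration, here is a small sketch of the two ways to keep a helper local to one source file. All names (scale, addTag, kTag, encode) are hypothetical and are not taken from LLVM/Clang:

```cpp
#include <cassert>

// Option 1: `static` gives a free function internal linkage, so it is
// visible only inside this translation unit.
static int scale(int x) { return x * 8; }

// Option 2: an anonymous namespace has the same effect and also works
// for types and variables.
namespace {
const int kTag = 1;
int addTag(int x) { return x + kTag; }
} // namespace

// The only function meant to be called from other source files.
int encode(int x) {
  assert(x >= 0 && "encode() expects a non-negative value");
  return addTag(scale(x));
}
```

Because scale and addTag have internal linkage, another source file can define its own functions with the same names without causing a link-time clash.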

Almost all global functions are declared as static.

2.1.3. Global Functions - Static Candidates

Global functions that are not exported, not declared in an anonymous namespace, and not used by any method outside the file in which they are declared are good candidates to be made static.

As we can see, only a few functions can be refactored to static.

2.2. Object Oriented Paradigm

2.2.1. Inheritance

In object-oriented programming (OOP), inheritance is a way to establish an “is-a” relationship between objects. It is often mistaken for a way to reuse existing code, which is not good practice: inheriting in order to reuse an implementation leads to tight coupling. Code reuse is better achieved through composition (prefer composition to inheritance). Let's look for all classes that have at least one base class:

The Metric View is the best way to look at the result of this query.

In Metric View, the codebase is represented as a treemap. This is a method of displaying a tree data structure using nested rectangles. The tree structure used in CppDepend is the usual code hierarchy:

Projects contain namespaces.
Namespaces contain types.
Types contain methods and fields.

A treemap is a useful way to present the results of CQLinq queries: the blue rectangles represent the result, so we can see at a glance which types match the query.

As we can see, inheritance is widely used in the LLVM/Clang source code.

Multiple inheritance: let's find the classes that inherit from more than one base class.

Multiple inheritance is rarely used: fewer than 1% of classes inherit from more than one base class.

2.2.2. Virtual methods

Let's find all the virtual methods defined in the source code:

Many methods are virtual, and some of them are pure virtual:

The OOP paradigm is widely used in the LLVM/Clang source code. What about generic programming?

2.3. Generic programming

C++ offers unique opportunities to express generic programming ideas through templates. Templates are a form of parametric polymorphism that lets you express generic algorithms and data structures. The C++ template instantiation mechanism ensures that, when generic algorithms and data structures are used, a fully optimized version specialized for the given parameters is generated, which allows generic algorithms to be as efficient as their non-generic counterparts.
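A short sketch of both ideas, a generic algorithm and a generic data structure. The names maxOf and FixedArray are hypothetical illustrations, not types from LLVM/Clang:

```cpp
#include <cstddef>

// A generic algorithm: works for any type T that supports operator<.
template <typename T>
constexpr const T &maxOf(const T &a, const T &b) {
  return a < b ? b : a;
}

// A generic data structure: a fixed-capacity array whose element type and
// size are template parameters, resolved entirely at compile time.
template <typename T, std::size_t N>
struct FixedArray {
  T data[N];
  constexpr std::size_t size() const { return N; }
  constexpr const T &operator[](std::size_t i) const { return data[i]; }
};
```

Each use such as maxOf<int> or FixedArray<double, 4> makes the compiler generate a separate, fully typed version of the code, which is why no runtime dispatch is involved.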

2.3.1. Generic types

Let's find all the generic types defined in the source code:

Many types are defined as generic. Let's find the generic methods:

Fewer than 1% of methods are generic.

So the LLVM/Clang source code uses all three paradigms.

3. PODs define the data model

In object-oriented programming, plain old data (POD) is a data structure that is only a passive collection of field values and uses no object-oriented features. In computer science it is also known as a passive data structure.
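In C++11 and later, a POD type is one that is both trivial and standard-layout, which can be checked at compile time. The sketch below uses hypothetical types (SourcePosition, NamedPosition) and a hypothetical helper isPodLike; how CppDepend itself detects PODs is not described in the article:

```cpp
#include <string>
#include <type_traits>

// A passive collection of values: no user-provided constructors, no
// virtual functions, no non-trivial members.
struct SourcePosition {
  unsigned line;
  unsigned column;
};

// Not a POD: std::string has non-trivial constructors and a destructor.
struct NamedPosition {
  std::string file;
  SourcePosition pos;
};

// Since C++11 a POD is a type that is both trivial and standard-layout.
template <typename T>
constexpr bool isPodLike() {
  return std::is_trivial<T>::value && std::is_standard_layout<T>::value;
}
```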

Let's look for the POD types in the source code.

More than 1,500 types are defined as POD types, many of which are used to define the compiler data model.

4. Gang of Four design patterns

Design patterns are a software engineering concept that describes reusable solutions to common problems in software design. The Gang of Four patterns are the best known. Let's look for their use in the LLVM/Clang source code.

4.1. Factory

List of factory methods defined in the source code:

List of abstract factory methods:

4.2. Observer

The observer pattern is a design pattern in which an object maintains a list of observer objects and automatically notifies them of any state change, usually by calling one of their methods.
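The description above can be sketched in a few lines. This is a generic illustration with a hypothetical Temperature subject that stores callbacks; it is not the observer found in the LLVM/Clang sources:

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// The subject keeps a list of observers and notifies each of them
// whenever its state changes.
class Temperature {
public:
  using Observer = std::function<void(double)>;

  void subscribe(Observer o) {
    assert(o && "observer must be callable");
    observers_.push_back(std::move(o));
  }

  void set(double value) {
    value_ = value;
    for (const Observer &o : observers_)
      o(value_); // notify every registered observer
  }

private:
  double value_ = 0.0;
  std::vector<Observer> observers_;
};
```

The subject knows nothing about its observers beyond the callback signature, which keeps the two sides loosely coupled.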

There is only one observer in the source code:

4.3. Visitor

The visitor pattern is recommended when you need to traverse a structure and perform a specific action at each of its nodes.
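A minimal sketch of the classic double-dispatch form of the pattern on a tiny expression tree. The node types (Number, Add) and the Evaluator are hypothetical; Clang's own visitors are implemented differently, so this only illustrates the idea:

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct Number;
struct Add;

// The visitor declares one overload per concrete node type.
struct Visitor {
  virtual ~Visitor() = default;
  virtual void visit(const Number &n) = 0;
  virtual void visit(const Add &a) = 0;
};

struct Node {
  virtual ~Node() = default;
  virtual void accept(Visitor &v) const = 0;
};

struct Number : Node {
  explicit Number(int v) : value(v) {}
  void accept(Visitor &v) const override { v.visit(*this); }
  int value;
};

struct Add : Node {
  Add(std::unique_ptr<Node> l, std::unique_ptr<Node> r)
      : lhs(std::move(l)), rhs(std::move(r)) {
    assert(lhs && rhs && "operands must not be null");
  }
  void accept(Visitor &v) const override { v.visit(*this); }
  std::unique_ptr<Node> lhs, rhs;
};

// A concrete operation added without touching the node classes.
struct Evaluator : Visitor {
  void visit(const Number &n) override { result = n.value; }
  void visit(const Add &a) override {
    a.lhs->accept(*this);
    const int left = result;
    a.rhs->accept(*this);
    result = left + result;
  }
  int result = 0;
};

// Convenience wrapper: walks the tree and returns the computed value.
int evaluate(const Node &root) {
  Evaluator e;
  root.accept(e);
  return e.result;
}
```

New operations (printing, type checking, and so on) can be added as further Visitor subclasses while the node hierarchy stays unchanged.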

In the LLVM/Clang source code, the visitor pattern is widely used:

5. Coupling and cohesion

5.1. Coupling

A low degree of coupling is desirable, since changes in one part of the application will require fewer changes in the rest of the application. In the long run, this can save a lot of time, effort, and money associated with modifying and adding new features to the application.

A low degree of coupling can be achieved using abstract classes or using generic types and methods.

Let's find all the abstract classes defined in the source code:

More than 280 types are declared as abstract. Low coupling is also achieved through the use of generic types and generic methods.

5.2. Cohesion

The single responsibility principle states that a class should have no more than one reason to change. Such a class is said to be cohesive. A high LCOM value generally indicates a poorly cohesive class. There are several LCOM metrics: LCOM takes values in the range [0, 1], while LCOM HS (HS stands for Henderson-Sellers) takes values in the range [0, 2]. An LCOM HS value greater than 1 should be considered alarming. The metrics are computed as follows:

LCOM = 1 - sum(MF) / (M * F)
LCOM HS = (M - sum(MF) / F) / (M - 1)

Where:

M is the number of class methods (including static methods, constructors, getters / setters, methods for adding and deleting events).
F is the number of non-static class fields.
MF is the number of class methods that have access to a specific non-static field.
Sum (MF) - the sum of the MF over all non-static fields of the class.

The idea behind these formulas is as follows: a class is fully cohesive if all of its methods use all of its non-static fields, that is, sum(MF) = M * F, in which case LCOM = 0 and LCOM HS = 0.
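To make the arithmetic concrete, here is a small sketch that evaluates both metrics from M and the per-field MF counts, using the standard forms LCOM = 1 - sum(MF) / (M * F) and LCOM HS = (M - sum(MF) / F) / (M - 1). The names LcomResult and computeLcom are hypothetical; this is not CppDepend's implementation:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

struct LcomResult {
  double lcom;   // in [0, 1]
  double lcomHS; // in [0, 2]
};

// M is the number of methods of the class. methodsPerField[i] is MF for
// field i: the number of methods that access that non-static field.
LcomResult computeLcom(int M, const std::vector<int> &methodsPerField) {
  assert(M > 1 && !methodsPerField.empty());
  const double F = static_cast<double>(methodsPerField.size());
  const double sumMF =
      std::accumulate(methodsPerField.begin(), methodsPerField.end(), 0.0);
  LcomResult r;
  r.lcom = 1.0 - sumMF / (M * F);
  r.lcomHS = (M - sumMF / F) / (M - 1);
  return r;
}
```

For a class with 4 methods and 2 fields where every method touches every field, computeLcom(4, {4, 4}) gives LCOM = 0 and LCOM HS = 0. If each field is used by only one method, computeLcom(4, {1, 1}) gives LCOM = 0.75 and LCOM HS = 1.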

The query returns 235 classes; some of them could perhaps be refactored to improve cohesion.

6. Immutability, purity and side effects

6.1. Immutable types

An object is said to be immutable if its state does not change from the moment it was created. Therefore, a class is called immutable if its instances are immutable.

There is one strong argument in favor of immutable objects: they greatly simplify concurrent programming. Why is writing correct multithreaded code so hard? Because it is difficult to synchronize threads' access to a resource (objects or other OS resources). Why is it difficult to synchronize access? Because it is difficult to guarantee that there will be no race conditions between multiple threads. What if there is no write access at all? In other words, what if the state of the objects that the threads access never changes? Then there is no need for synchronization.
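A minimal sketch of an immutable class in C++. The Range type and its members are hypothetical and only illustrate the idea:

```cpp
// An immutable value type: state is set once in the constructor, there
// are no setters, and every member function is const.
class Range {
public:
  constexpr Range(int first, int last) : first_(first), last_(last) {}

  constexpr int first() const { return first_; }
  constexpr int last() const { return last_; }
  constexpr int length() const { return last_ - first_; }

  // "Modification" produces a new object; the original stays unchanged,
  // so it can be read from several threads without synchronization.
  constexpr Range shifted(int offset) const {
    return Range(first_ + offset, last_ + offset);
  }

private:
  const int first_;
  const int last_;
};
```

Since no member function can change first_ or last_ after construction, concurrent readers never observe a partially updated object.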

Another advantage of immutable classes is that they never violate the Liskov substitution principle. Here is how Wikipedia describes the principle:

"the subclass should not create new mutators of the properties of the base class. If the base class does not provide methods for changing the properties defined in it, a subtype of this class should not create such methods. In other words, the immutable data of the base class should not be changeable in the subclass. "

Here is a list of immutable types in the source code:

6.2. Purity and side effects

The main advantage of immutable types comes from the fact that they eliminate side effects. I can’t say this better than Wes Dyer, so I’ll quote him:

“We all know that using global variables is not a good idea. This is due to the danger of side effects (global scope). Many programmers who do not use global variables do not understand that the same principle applies to fields, properties, parameters, and variables on a more limited scale: do not change them without good reasons (...)”

One way to increase the reliability of a module is to eliminate side effects. This makes composing and integrating modules easier and more robust. Code without side effects always behaves the same way, regardless of the environment. This is called referential transparency.

Write functions and methods without side effects, that is, pure functions that do not modify the object; your program is more likely to be correct.

Here is a list of all methods without side effects:

More than 100,000 methods are pure.

7. Quality of implementation

7.1. Methods that are too large

Methods with many lines of code are difficult to understand and maintain. Let's find the methods that are longer than 60 lines.

The LLVM/Clang source code contains more than 100,000 methods, and fewer than 2% of them can be considered too large.

7.2. Methods with a large number of parameters

Several methods have more than 8 parameters.

7.3. Methods with many local variables

Fewer than 1% of methods have many local variables.

7.4. Overly complex methods

Several simple metrics help detect complex functions: the number of lines of code, the number of parameters, and the number of local variables.

Other interesting metrics for detecting complex functions include:

Cyclomatic complexity: a popular procedural metric equal to the number of decisions that can be taken in a procedure.

Nesting depth: the maximum nesting depth of scopes in a method body.

Maximum loop nesting: the maximum depth of nested loops in a method.

The maximum acceptable values for these metrics are largely a matter of choice for each development team; there are no generally accepted standards.

Let's find the methods that can be considered complex.

Only 1.5% of the methods are candidates for complexity reduction.

7.5. Halstead complexity

Halstead complexity is a set of software metrics introduced by Maurice Howard Halstead in 1977. Halstead observed that software metrics should reflect the implementation or expression of algorithms in different languages while remaining independent of their execution on a specific platform. These metrics are therefore computed statically from the code.

Halstead introduced a number of different metrics. Consider one of them as an example: TimeToImplement, which estimates, in seconds, the time needed to implement a method.
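For reference, the standard Halstead definitions lead to a time estimate as follows: volume V = N * log2(n), difficulty D = (n1 / 2) * (N2 / n2), effort E = D * V, and time T = E / 18 seconds, where n1 and n2 are the numbers of distinct operators and operands, N1 and N2 their total occurrences, n = n1 + n2 and N = N1 + N2. The sketch below applies these formulas; the struct and function names are hypothetical, and it is only an assumption that CppDepend's TimeToImplement follows the same definitions:

```cpp
#include <cassert>
#include <cmath>

struct HalsteadCounts {
  int distinctOperators; // n1
  int distinctOperands;  // n2
  int totalOperators;    // N1
  int totalOperands;     // N2
};

// Volume V = N * log2(n), with N = N1 + N2 and n = n1 + n2.
double halsteadVolume(const HalsteadCounts &c) {
  const int vocabulary = c.distinctOperators + c.distinctOperands;
  const int length = c.totalOperators + c.totalOperands;
  assert(vocabulary > 0);
  return length * std::log2(static_cast<double>(vocabulary));
}

// Difficulty D = (n1 / 2) * (N2 / n2).
double halsteadDifficulty(const HalsteadCounts &c) {
  assert(c.distinctOperands > 0);
  return (c.distinctOperators / 2.0) *
         (static_cast<double>(c.totalOperands) / c.distinctOperands);
}

// Effort E = D * V; the estimated implementation time is E / 18 seconds.
double halsteadTimeSeconds(const HalsteadCounts &c) {
  return halsteadDifficulty(c) * halsteadVolume(c) / 18.0;
}
```

For example, with n1 = 4, n2 = 4, N1 = 8 and N2 = 8, the volume is 16 * log2(8) = 48, the difficulty is 4, the effort is 192, and the estimated time is about 10.7 seconds.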

2690 methods have an estimated implementation time of more than an hour.

8. RTTI

RTTI (run-time type information) is a mechanism that exposes the dynamic type of an object at run time rather than at compile time. However, the use of RTTI is considered controversial in the C++ community, and many C++ developers do not use it.

What is the LLVM/Clang development team's position on it?

No method uses the dynamic_cast keyword. The LLVM/Clang development team has chosen not to use the RTTI mechanism.
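Instead of dynamic_cast, LLVM provides its own isa<>, cast<> and dyn_cast<> templates, which rely on a static classof method in each class. The sketch below is a simplified, self-contained imitation of that idea in a hypothetical demo namespace with hypothetical Shape, Circle and Square classes; it is not LLVM's actual implementation:

```cpp
#include <cassert>

namespace demo {

// Each object carries a kind tag that is set once by its constructor.
class Shape {
public:
  enum ShapeKind { SK_Circle, SK_Square };
  ShapeKind getKind() const { return kind_; }

protected:
  explicit Shape(ShapeKind k) : kind_(k) {}

private:
  const ShapeKind kind_;
};

class Circle : public Shape {
public:
  explicit Circle(double r) : Shape(SK_Circle), radius(r) {}
  // Answers "is this Shape really a Circle?" without compiler RTTI.
  static bool classof(const Shape *s) { return s->getKind() == SK_Circle; }
  double radius;
};

class Square : public Shape {
public:
  explicit Square(double s) : Shape(SK_Square), side(s) {}
  static bool classof(const Shape *s) { return s->getKind() == SK_Square; }
  double side;
};

// Type test based on the kind tag.
template <typename To, typename From>
bool isa(const From *v) {
  assert(v && "isa<> used on a null pointer");
  return To::classof(v);
}

// Checked downcast: returns nullptr when the dynamic type does not match.
template <typename To, typename From>
const To *dyn_cast(const From *v) {
  return isa<To>(v) ? static_cast<const To *>(v) : nullptr;
}

} // namespace demo
```

The cost of a type test is a single integer comparison, and classes that do not need such casts pay nothing, which is the usual motivation given for this style.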

9. Exceptions

Exception support is another controversial feature of C ++. Many well-known open source C ++ projects do not use it.

Let's see if exceptions are thrown anywhere in the code.

Like RTTI, the exception mechanism is not used.

10. Some statistics

10.1. Most popular types

It is interesting to know which types are used most in a project, since those types deserve the most careful design, implementation, and testing. Any change to them affects the project as a whole.

Find them using the TypesUsingMe metric:

There is another interesting metric for finding popular types: TypeRank.

The TypeRank value is calculated by applying the Google PageRank algorithm to the type dependency graph. A homothety of center 0.15 is applied so that the average TypeRank equals 1.
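To give an idea of how such a score behaves, here is a sketch of a PageRank-style iteration over a type dependency graph. It assumes the common damping factor of 0.85 (so each type keeps a base score of 0.15) and scales the scores so that their average is 1. The function name typeRank and its parameters are hypothetical, and the exact computation used by CppDepend may differ:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// usedTypes[i] lists the indices of the types that type i depends on.
// Rank flows from a type to the types it uses, so heavily used types
// accumulate a high score. Scores are scaled so that their average is 1.
std::vector<double>
typeRank(const std::vector<std::vector<std::size_t>> &usedTypes,
         double damping = 0.85, int iterations = 50) {
  const std::size_t n = usedTypes.size();
  assert(n > 0);
  std::vector<double> rank(n, 1.0);
  for (int it = 0; it < iterations; ++it) {
    std::vector<double> next(n, 1.0 - damping);
    for (std::size_t i = 0; i < n; ++i) {
      if (usedTypes[i].empty()) {
        // A type with no dependencies spreads its rank evenly.
        for (std::size_t j = 0; j < n; ++j)
          next[j] += damping * rank[i] / n;
      } else {
        for (std::size_t j : usedTypes[i])
          next[j] += damping * rank[i] / usedTypes[i].size();
      }
    }
    rank.swap(next);
  }
  return rank;
}
```

In a graph where types 0 and 1 both depend on type 2, type 2 ends up with the highest score, which matches the intuition that a bug in it would affect the most code.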

Types with a high TypeRank should be tested more carefully, because bugs in them can have far-reaching consequences.

Below are the most popular types according to the TypeRank metric:

10.2. Most popular methods

10.3. Methods calling many other methods

It is interesting to know which methods call many other methods, since this can reveal design problems. In some cases, refactoring is needed to make them more readable and easier to maintain.

Conclusion

LLVM/Clang is very well designed and implemented and, like any other project, it can be improved by some refactoring. In this post we have pointed out a few small changes that could be made to the source code. Do not be afraid to explore the source code and improve your knowledge of C++.

Source: https://habr.com/ru/post/350508/
