
How to train your dragon. Short example on clang-c

One evening, sitting at the computer in a melancholy mood and pondering the frailty of all things, I idly typed the abbreviation LLVM into the search box of a large job site, not really hoping to find anything special, and began to look through the meager catch.

As expected, almost nothing turned up, but one ad caught my interest. It contained the following lines:

"Whom we hire without a second look, or the level of the tasks involved:
You have downloaded any open source project that builds with gcc (more than 10 megabytes of source code) and, for its largest cpp file, managed to build an AST using clang with -fsyntax-only;
You have downloaded any open source project that builds with Visual C++ (more than 10 megabytes of source code) and, for its largest cpp file, managed to build an AST using clang with -fsyntax-only;
You have written a utility that highlights every place where local variables are declared and used, as well as every function not defined in this file."
Well, I thought, some kind of entertainment for the evening.



We will take a brief look at the first two points; everything is very simple there.

How to build an AST


Take any C++ project; clang itself will do (it builds with both gcc and VC++).

clang -std=c++11 -Xclang -ast-dump ////cpp -I/////include/ -D_ -fsyntax-only 

We get the AST in text form. For a large file the dump is huge, so I will not reproduce it in full, but here is a small piece for illustration:

AST fragment of the clang itself
TranslationUnitDecl 0x576e190 <<invalid sloc>> <invalid sloc>
|-TypedefDecl 0x576e718 <<invalid sloc>> <invalid sloc> implicit __int128_t '__int128'
| `-BuiltinType 0x576e400 '__int128'
|-TypedefDecl 0x576e778 <<invalid sloc>> <invalid sloc> implicit __uint128_t 'unsigned __int128'
| `-BuiltinType 0x576e420 'unsigned __int128'
|-TypedefDecl 0x576eaa8 <<invalid sloc>> <invalid sloc> implicit __NSConstantString 'struct __NSConstantString_tag'
| `-RecordType 0x576e860 'struct __NSConstantString_tag'
|   `-CXXRecord 0x576e7c8 '__NSConstantString_tag'
|-TypedefDecl 0x576eb38 <<invalid sloc>> <invalid sloc> implicit __builtin_ms_va_list 'char *'
| `-PointerType 0x576eb00 'char *'
|   `-BuiltinType 0x576e220 'char'
|-TypedefDecl 0x576ee58 <<invalid sloc>> <invalid sloc> implicit referenced __builtin_va_list 'struct __va_list_tag [1]'
| `-ConstantArrayType 0x576ee00 'struct __va_list_tag [1]' 1
|   `-RecordType 0x576ec20 'struct __va_list_tag'
|     `-CXXRecord 0x576eb88 '__va_list_tag'
|-NamespaceDecl 0x57cc578 </home/user/LLVM/llvm-3.7.1.src/tools/cfe-3.7.1.src/include/clang/AST/ASTFwd.h:18:1, line:31:1> line:18:11 clang
| |-CXXRecordDecl 0x57cc5e0 <line:20:1, col:7> col:7 class Decl
| |-CXXRecordDecl 0x57cc6a0 <line:21:29, <scratch space>:2:1> col:1 referenced class AccessSpecDecl
| |-CXXRecordDecl 0x57cc760 </home/user/LLVM/llvm-3.7.1.src/tools/cfe-3.7.1.src/include/clang/AST/ASTFwd.h:21:29, <scratch space>:3:1> col:1 class BlockDecl
| |-CXXRecordDecl 0x57cc820 </home/user/LLVM/llvm-3.7.1.src/tools/cfe-3.7.1.src/include/clang/AST/ASTFwd.h:21:29, <scratch space>:4:1> col:1 class CapturedDecl
| |-CXXRecordDecl 0x57cc8e0 </home/user/LLVM/llvm-3.7.1.src/tools/cfe-3.7.1.src/include/clang/AST/ASTFwd.h:21:29, <scratch space>:5:1> col:1 referenced class ClassScopeFunctionSpecializationDecl
| |-CXXRecordDecl 0x57cc9a0 </home/user/LLVM/llvm-3.7.1.src/tools/cfe-3.7.1.src/include/clang/AST/ASTFwd.h:21:29, <scratch space>:6:1> col:1 class EmptyDecl
| |-CXXRecordDecl 0x57cca60 </home/user/LLVM/llvm-3.7.1.src/tools/cfe-3.7.1.src/include/clang/AST/ASTFwd.h:21:29, <scratch space>:7:1> col:1 class ExternCContextDecl
| |-CXXRecordDecl 0x57ccb20 </home/user/LLVM/llvm-3.7.1.src/tools/cfe-3.7.1.src/include/clang/AST/ASTFwd.h:21:29, <scratch space>:8:1> col:1 class FileScopeAsmDecl
| |-CXXRecordDecl 0x57ccbe0 </home/user/LLVM/llvm-3.7.1.src/tools/cfe-3.7.1.src/include/clang/AST/ASTFwd.h:21:29, <scratch space>:9:1> col:1 referenced class FriendDecl
| |-CXXRecordDecl 0x57ccca0 </home/user/LLVM/llvm-3.7.1.src/tools/cfe-3.7.1.src/include/clang/AST/ASTFwd.h:21:29, <scratch space>:10:1> col:1 referenced class FriendTemplateDecl
| |-CXXRecordDecl 0x57ccd60 </home/user/LLVM/llvm-3.7.1.src/tools/cfe-3.7.1.src/include/clang/AST/ASTFwd.h:21:29, <scratch space>:11:1> col:1 class ImportDecl
| |-CXXRecordDecl 0x57cce20 </home/user/LLVM/llvm-3.7.1.src/tools/cfe-3.7.1.src/include/clang/AST/ASTFwd.h:21:29, <scratch space>:12:1> col:1 class LinkageSpecDecl
| |-CXXRecordDecl 0x57ccee0 </home/user/LLVM/llvm-3.7.1.src/tools/cfe-3.7.1.src/include/clang/AST/ASTFwd.h:21:29, <scratch space>:13:1> col:1 class NamedDecl
| |-CXXRecordDecl 0x57ccfa0 </home/user/LLVM/llvm-3.7.1.src/tools/cfe-3.7.1.src/include/clang/AST/ASTFwd.h:21:29, <scratch space>:14:1> col:1 class LabelDecl
| |-CXXRecordDecl 0x57cd060 </home/user/LLVM/llvm-3.7.1.src/tools/cfe-3.7.1.src/include/clang/AST/ASTFwd.h:21:29, <scratch space>:15:1> col:1 class NamespaceDecl
| |-CXXRecordDecl 0x57cd120 </home/user/LLVM/llvm-3.7.1.src/tools/cfe-3.7.1.src/include/clang/AST/ASTFwd.h:21:29, <scratch space>:16:1> col:1 class NamespaceAliasDecl


Another option, -ast-print, pretty-prints the AST back as source code. Its output is also text and also huge, so I will not show it either.

Finally, you can get a graphical representation of the tree through Graphviz (widely used for debugging in LLVM) with the -ast-view option. For this to work, Graphviz must be installed and the paths to the 'dot' and 'gv' executables must be configured. Many windows will then open, each showing a small section of the AST.



However, before continuing, I would like to explain briefly what an AST is and why it is needed. Readers with a degree in Computer Science can safely skip the next section; others may find it interesting.

Abstract Syntax Tree


Strict definitions can be found on Wikipedia; I will try to explain it informally.
An abstract syntax tree is the structure in which the compiler represents the source code of a program in a form convenient for further compilation. The leaves of the tree hold variables (to be exact, references to variable declarations) and constants; the other nodes hold operators, data type declarations, and so on. The root of the tree is the program's translation unit.

For example, a simple program:

    // file foo.h
    void foo(int x, int y);

    // file main.c
    #include "foo.h"

    typedef struct {
        int x, y;
    } st_coord;

    int main() {
        st_coord coord;
        foo(coord.x, coord.y);
    }

Clang generates the following AST for it:

Many letters
TranslationUnitDecl 0x4e64ad0 <<invalid sloc>> <invalid sloc>
|-TypedefDecl 0x4e65018 <<invalid sloc>> <invalid sloc> implicit __int128_t '__int128'
| `-BuiltinType 0x4e64d40 '__int128'
|-TypedefDecl 0x4e65078 <<invalid sloc>> <invalid sloc> implicit __uint128_t 'unsigned __int128'
| `-BuiltinType 0x4e64d60 'unsigned __int128'
|-TypedefDecl 0x4e65338 <<invalid sloc>> <invalid sloc> implicit __NSConstantString 'struct __NSConstantString_tag'
| `-RecordType 0x4e65150 'struct __NSConstantString_tag'
|   `-Record 0x4e650c8 '__NSConstantString_tag'
|-TypedefDecl 0x4e653c8 <<invalid sloc>> <invalid sloc> implicit __builtin_ms_va_list 'char *'
| `-PointerType 0x4e65390 'char *'
|   `-BuiltinType 0x4e64b60 'char'
|-TypedefDecl 0x4e65678 <<invalid sloc>> <invalid sloc> implicit __builtin_va_list 'struct __va_list_tag [1]'
| `-ConstantArrayType 0x4e65620 'struct __va_list_tag [1]' 1
|   `-RecordType 0x4e654a0 'struct __va_list_tag'
|     `-Record 0x4e65418 '__va_list_tag'
|-FunctionDecl 0x4ebc390 </home/user/llvm3.9/foo.h:1:1, col:22> col:6 used foo 'void (int, int)'
| |-ParmVarDecl 0x4e656d8 <col:10, col:14> col:14 x 'int'
| `-ParmVarDecl 0x4e65748 <col:17, col:21> col:21 y 'int'
|-RecordDecl 0x4ebc488 </home/user/llvm3.9/test.c:3:9, line:5:1> line:3:9 struct definition
| |-FieldDecl 0x4ebc540 <line:4:2, col:6> col:6 referenced x 'int'
| `-FieldDecl 0x4ebc598 <col:2, col:9> col:9 referenced y 'int'
|-TypedefDecl 0x4ebc630 <line:3:1, line:5:3> col:3 referenced st_coord 'struct st_coord':'st_coord'
| `-ElaboratedType 0x4ebc5e0 'struct st_coord' sugar
|   `-RecordType 0x4ebc510 'st_coord'
|     `-Record 0x4ebc488 ''
`-FunctionDecl 0x4ebc6e8 <line:7:1, line:10:1> line:7:5 main 'int ()'
  `-CompoundStmt 0x4ebc9c8 <col:12, line:10:1>
    |-DeclStmt 0x4ebc820 <line:8:2, col:16>
    | `-VarDecl 0x4ebc7c0 <col:2, col:11> col:11 used coord 'st_coord':'st_coord'
    `-CallExpr 0x4ebc960 <line:9:2, col:22> 'void'
      |-ImplicitCastExpr 0x4ebc948 <col:2> 'void (*)(int, int)' <FunctionToPointerDecay>
      | `-DeclRefExpr 0x4ebc838 <col:2> 'void (int, int)' Function 0x4ebc390 'foo' 'void (int, int)'
      |-ImplicitCastExpr 0x4ebc998 <col:6, col:12> 'int' <LValueToRValue>
      | `-MemberExpr 0x4ebc888 <col:6, col:12> 'int' lvalue .x 0x4ebc540
      |   `-DeclRefExpr 0x4ebc860 <col:6> 'st_coord':'st_coord' lvalue Var 0x4ebc7c0 'coord' 'st_coord':'st_coord'
      `-ImplicitCastExpr 0x4ebc9b0 <col:15, col:21> 'int' <LValueToRValue>
        `-MemberExpr 0x4ebc8e8 <col:15, col:21> 'int' lvalue .y 0x4ebc598
          `-DeclRefExpr 0x4ebc8c0 <col:15> 'st_coord':'st_coord' lvalue Var 0x4ebc7c0 'coord' 'st_coord':'st_coord'

Using the clang-c API


In the clang compiler, each AST node is represented by an object of a particular class, and there are only three base classes: clang::Decl (declarations), clang::Stmt (statements, which also cover all expressions and operators), and clang::Type (data types).

So, clang is written in C++ and has an object-oriented API that you can use to write your own utilities and tools. However, knowledgeable people prefer another API, clang-c, a wrapper over the clang API written in pure C. The reasoning is simple: first, it is simpler, and second, the clang-c API is stable, unlike the C++ API, which changes with each release. Finally, using clang-c does not rule out using the C++ API, as we will see shortly.



AST traversal


The first thing to do is to write a depth-first traversal of the tree. This is a completely standard operation:

    #include <clang-c/Index.h>
    #include <iostream>
    #include <string>

    void printCursor(CXCursor cursor) {
        CXString displayName = clang_getCursorDisplayName(cursor);
        std::cout << clang_getCString(displayName) << "\n";
        clang_disposeString(displayName);
    }

    CXChildVisitResult visitor(CXCursor cursor, CXCursor /* parent */, CXClientData /* clientData */) {
        // Skip everything that comes from included headers.
        CXSourceLocation location = clang_getCursorLocation(cursor);
        if (clang_Location_isFromMainFile(location) == 0)
            return CXChildVisit_Continue;

        printCursor(cursor);
        // Descend into the children of this cursor.
        clang_visitChildren(cursor, visitor, nullptr);
        return CXChildVisit_Continue;
    }

    int main(int argc, char** argv) {
        CXIndex index = clang_createIndex(
            0,  // excludeDeclarationsFromPCH
            1   // displayDiagnostics
        );
        CXTranslationUnit unit = clang_parseTranslationUnit(
            index,                  // CIdx
            0,                      // source_filename
            argv,                   // command_line_args
            argc,                   // num_command_line_args
            0,                      // unsaved_files
            0,                      // num_unsaved_files
            CXTranslationUnit_None  // options
        );
        if (!unit) {
            std::cout << "Translation unit was not created\n";
        } else {
            CXCursor root = clang_getTranslationUnitCursor(unit);
            clang_visitChildren(root, visitor, nullptr);
        }
        clang_disposeTranslationUnit(unit);
        clang_disposeIndex(index);
    }

Everything here is straightforward. clang_parseTranslationUnit performs all compilation steps up to and including building the AST, and it accepts any compiler options. The file name can be passed either among the command-line arguments or directly (source_filename). The source text can be supplied not only as a file on disk but also as a CXUnsavedFile structure representing text in memory. After parsing, the tree is traversed depth-first: clang_visitChildren calls the visitor function for each child node and passes it a CXCursor, a structure representing a node of the tree. The visitor also receives a CXClientData parameter, which carries arbitrary user data.

We write the visitor function


Let's try to find all the local variables of the program.
    CXCursorKind cursorKind = clang_getCursorKind(cursor);

    // finding local variables
    if (cursorKind == CXCursor_VarDecl) {
        if (const VarDecl* VD = dyn_cast_or_null<const VarDecl>(getCursorDecl(cursor))) {
            if (VD->isLocalVarDecl()) {
                std::cout << "local variable: ";
                printCursor(cursor);
            }
        }
    }

Here, too, everything is simple: CXCursor_VarDecl means that the cursor points to a variable declaration, and dyn_cast_or_null is one of LLVM's type-casting templates.

LLVM and RTTI
LLVM is built without C++ RTTI, so the usual dynamic_cast will not work.
Instead, LLVM provides the following templates for type casting:
isa<B>(A) - checks that object A is of type B.
cast<B>(A) - converts object A to type B. Whether A really is a B is checked only by an assertion, so builds without assertions verify nothing. A must not be nullptr.
cast_or_null<B>(A) - the same as cast<B>(A), except that if A == nullptr, the result is nullptr.
dyn_cast<B>(A) - converts object A to type B with a type check; the result is nullptr if A is not a B. A must not be nullptr.
dyn_cast_or_null<B>(A) - the same as dyn_cast<B>(A), except that if A == nullptr, the result is nullptr.
How to use the LLVM implementation of RTTI in your own classes is described in the LLVM documentation, in the article "How to set up LLVM-style RTTI for your class hierarchy".

Next, we convert the cursor to an instance of the VarDecl class and check whether the variable is local. If so, we print the name of the cursor and its location in the source code, using these auxiliary functions:

    // logging functions
    #include <sstream>

    std::string getLocationString(CXSourceLocation Loc) {
        CXFile File;
        unsigned Line, Column;
        clang_getFileLocation(Loc, &File, &Line, &Column, nullptr);
        CXString FileName = clang_getFileName(File);
        std::ostringstream ostr;
        ostr << clang_getCString(FileName) << ":" << Line << ":" << Column;
        clang_disposeString(FileName);
        return ostr.str();
    }

    void printCursor(CXCursor cursor) {
        CXString displayName = clang_getCursorDisplayName(cursor);
        std::cout << clang_getCString(displayName) << "@"
                  << getLocationString(clang_getCursorLocation(cursor)) << "\n";
        clang_disposeString(displayName);
    }

To get at the underlying Decl, Stmt and Expr objects, we use auxiliary functions:

    // extracted from CXCursor.cpp
    #include <clang/AST/Decl.h>
    #include <clang/AST/Expr.h>
    #include <clang/AST/Stmt.h>

    using namespace clang;

    const Decl *getCursorDecl(CXCursor Cursor) {
        return static_cast<const Decl *>(Cursor.data[0]);
    }

    const Stmt *getCursorStmt(CXCursor Cursor) {
        // These Objective-C reference cursors carry no statement.
        if (Cursor.kind == CXCursor_ObjCSuperClassRef ||
            Cursor.kind == CXCursor_ObjCProtocolRef ||
            Cursor.kind == CXCursor_ObjCClassRef)
            return nullptr;
        return static_cast<const Stmt *>(Cursor.data[1]);
    }

    const Expr *getCursorExpr(CXCursor Cursor) {
        return dyn_cast_or_null<Expr>(getCursorStmt(Cursor));
    }

Next, we look for all uses of local variables:

    // finding referenced variables
    if (cursorKind == CXCursor_DeclRefExpr) {
        if (const DeclRefExpr* DRE = dyn_cast_or_null<const DeclRefExpr>(getCursorExpr(cursor))) {
            if (const VarDecl* VD = dyn_cast_or_null<const VarDecl>(DRE->getDecl())) {
                if (VD->isLocalVarDecl()) {
                    std::cout << "reference to local variable: ";
                    printCursor(cursor);
                }
            }
        }
    }

And finally, we find all calls to functions that are not defined in this file:

    // finding functions not defined in the module
    if (cursorKind == CXCursor_CallExpr) {
        if (const Expr *E = getCursorExpr(cursor)) {
            if (isa<const CallExpr>(E)) {
                CXCursor Definition = clang_getCursorDefinition(cursor);
                if (clang_equalCursors(Definition, clang_getNullCursor())) {
                    std::cout << "function is not defined here: ";
                    printCursor(cursor);
                }
            }
        }
    }

Here we check whether the cursor is a function call (CXCursor_CallExpr). Note, however, that CXCursor_CallExpr covers not only function calls but also calls to constructors, destructors and methods, so an additional check (isa) is needed. After that, we look for the definition of the function (clang_getCursorDefinition), and if none is found (clang_equalCursors(Definition, clang_getNullCursor())), we have a function that is not defined in this file.


Test


For the test, we will write two simple programs, one in C and one in C++.
First, the C program:

    //file func.h
    void foo_ext(int x);

    //file simple.c
    #include "func.h"

    int global1;

    int foo(int x) {
        return x;
    }

    int global2;

    int main(int arg) {
        int local;
        local = arg;
        foo_ext(arg);
        return foo(local);
    }

Running our utility on it produces the following output:

    local variable: local@simple.c:13:9
    reference to local variable: local@simple.c:14:5
    function is not defined here: foo_ext@simple.c:15:5
    reference to local variable: local@simple.c:16:16

That looks right. Now let's check a C++ file:

    #include "func.h"

    class MyClass {
    public:
        MyClass() {
            int SomeLocal_1;
        }
        void foo() {
            int SomeLocal_2;
        }
        ~MyClass() {
            int SomeLocal_3;
        }
    };

    MyClass myClass_global;

    int foo(int x) {return 0;}

    int main(int argc, char** argv)
    {
        int local;
        MyClass myClass_local;
        foo(argc);
        foo_ext(local);
        return 1;
    }

The output is:

    local variable: SomeLocal_1@cpptest.cpp:6:13
    local variable: SomeLocal_2@cpptest.cpp:9:13
    local variable: SomeLocal_3@cpptest.cpp:12:13
    local variable: local@cpptest.cpp:22:9
    local variable: myClass_local@cpptest.cpp:23:13
    function is not defined here: foo_ext@cpptest.cpp:25:5
    reference to local variable: local@cpptest.cpp:25:13

OK, it seems it works.

How can this be used?


Clang's wide range of features can be used for many purposes, including analyzing and transforming source code in C, C++ and Objective-C.

One more thing
You could also use it to look for a job, but while I was writing all this, the ad disappeared from the site. Alas.

Literature


List of sources on the topic:

1. Project code on GitHub.
2. http://bastian.rieck.ru/blog/posts/2015/baby_steps_libclang_ast/
3. http://bastian.rieck.ru/blog/posts/2016/baby_steps_libclang_function_extents/
4. https://jonasdevlieghere.com/understanding-the-clang-ast/
5. https://habrahabr.ru/post/148508/

Source: https://habr.com/ru/post/320074/

