We can now say with confidence that the era of hand-written C++ parsers is gradually coming to an end. Clang, a full-fledged C++ front end and compiler that provides its users with a rich API, is slowly but surely taking the stage. Using this API, you can parse C / C++ / Objective-C source text and pull out everything you need from it: from the plain lexical meaning of tokens to the symbol table, the AST, and the results of static analysis of the code for all sorts of problems. Combined with llvm, and given a strong desire, C++ can even be used as a scripting language, with C++ programs parsed and executed on the fly. In short, the opportunities for programmers are rich; you just need to understand how to use them correctly. And that, as often happens, is where the fun begins.
1. Clang or clang-c?
To begin with, the clang developers provide their clients with two kinds of API. The first is pure C++, but ... potentially unstable (in the sense that it can change from version to version). The second is guaranteed to be stable, but ... pure C. The choice between them should be made case by case, based on the needs of the product being built on top of clang.
1.1 clang-c API
In the clang source tree, the implementation of this library lives in the tools directory (the implementation of the clang core itself is in lib). The library is built as a dynamically loadable module, and its interface gives the client a number of guarantees:
- Stability and backward compatibility. The client can safely move (that is, move its code) from one version of clang to another without fear that something will break or, worse, stop compiling.
- The capabilities of the clang implementation in use can be determined at runtime, and the client can adapt to them.
- High resilience: fatal errors in the clang core will not crash the client.
- Its own thread management for heavyweight operations (such as parsing).
- There is no need to build the front end yourself, since all the functionality of the compiler and the API ships as a single dynamically loaded library.
But these advantages come at a price. The clang-c API has the following disadvantages:
- Design following the ship-in-a-bottle pattern. The entities the client of this API interacts with are, in essence, wrappers over the original classes provided by the clang API.
- (as a result) Manual resource management. For convenient use from C++ code, you have to write wrappers that provide RAII.
- A very "narrow" interface. The client is given a small set of C functions and types through which it interacts with the core.
- (as a result) A rather poor set of functionality. Many of the facilities offered by the clang API are simply unavailable to the client, or are available only in reduced form.
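The RAII wrappers mentioned in the list above can be quite small. Below is a minimal sketch, assuming a C-style create/dispose pair. The names DemoHandle, demo_create and demo_dispose are hypothetical stand-ins chosen so that the example builds without libclang; in real code they would be replaced by a pair such as clang_createIndex / clang_disposeIndex.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for an opaque C handle such as CXIndex.
typedef void* DemoHandle;

static int g_liveHandles = 0;  // number of handles not yet disposed

// Hypothetical C-style API: every create must be paired with a dispose.
DemoHandle demo_create() {
    ++g_liveHandles;
    return new int(0);
}

void demo_dispose(DemoHandle h) {
    if (h == nullptr) return;
    assert(g_liveHandles > 0);
    delete static_cast<int*>(h);
    --g_liveHandles;
}

// RAII wrapper: std::unique_ptr calls the dispose function automatically
// when the wrapper goes out of scope.
typedef std::unique_ptr<void, void (*)(DemoHandle)> ScopedHandle;

ScopedHandle makeScopedHandle() {
    return ScopedHandle(demo_create(), &demo_dispose);
}

int liveHandles() { return g_liveHandles; }
```

With such a typedef, the handle is released on every exit path, including early returns and exceptions, so the explicit dispose calls disappear from the client code.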
It makes sense to use this API when the "pluses" listed above are essential for the client code, or when the "minuses" are not that important. This API is quite suitable for extracting semantic information from source text (both as an AST and as the semantic role of each individual token), for infrequent indexing, for on-the-fly verification with collection of all diagnostics, and similar tasks. Accordingly, it suits various kinds of standalone translators, meta-information generators, static analyzers, code verifiers, and so on.
In turn, this API is poorly suited to tasks that require high performance or tighter interaction with the compiler core.
1.2 clang API
This version of the API is essentially the interface of the compiler core itself. It is pure C++ and gives broad access to all features of the clang core. Its advantages include:
- As already mentioned, direct and convenient access to all features of the compiler.
- Convenient (at least compared to clang-c) interface.
- A large number of all kinds of customizations.
- Slightly higher performance (compared to clang-c).
And, partly as a consequence, the disadvantages:
- No protection of the client from crashes inside the core.
- No guarantee of backward compatibility of the interface.
- Delivery as static libraries. The client has to link directly against the core and, as a result, build clang and llvm for its own configuration.
- Verbosity. In a number of scenarios, more source code is needed than with the clang-c API.
- Not everything is documented.
- Tight coupling with the llvm API. Clang cannot be used without llvm; you just have to accept this.
How significant the disadvantages are, and whether they outweigh the advantages, has to be decided case by case. In my opinion, this API should be chosen wherever good performance is required, or access to specific features not available through clang-c. In particular, when using clang as an on-the-fly parser for an IDE, it makes sense to use this API.
2. Getting started, or source parsing
Why, one might ask, is clang needed at all if not for this? Indeed, parsing source text in order to extract the information the developer needs from it, or to turn it into bytecode, is the main task of the clang front end. And clang provides rich facilities for solving it. Perhaps too rich. Frankly, I was somewhat taken aback when I first opened one of the clang examples: the manipulations performed there looked like black magic to me, because they did not match my intuitive idea of what such parsing should look like. In the end everything turned out to be quite logical, although setting the parsing options by passing an array of strings that describe command-line arguments still puzzles me.
2.1 Parsing with clang-c
If my acquaintance with clang had started with an example based on this API, there would have been less surprise. In essence, parsing a file takes two calls. The first creates an instance of CXIndex; the second starts the actual parsing of the source text and the construction of the AST. Here is what it looks like in source code:
#include <iostream>
#include <clang-c/Index.h>

int main(int argc, char** argv)
{
    CXIndex index = clang_createIndex(
        false,  // excludeDeclarationsFromPCH
        true    // displayDiagnostics
    );

    CXTranslationUnit unit = clang_parseTranslationUnit(
        index,                  // CIdx
        "main.cpp",             // source_filename
        argv + 1,               // command_line_args
        argc - 1,               // num_command_line_args
        0,                      // unsaved_files
        0,                      // num_unsaved_files
        CXTranslationUnit_None  // options
    );

    if (unit != 0)
        std::cout << "Translation unit successfully created" << std::endl;
    else
        std::cout << "Translation unit was not created" << std::endl;

    // Release the handles in reverse order of creation
    clang_disposeTranslationUnit(unit);
    clang_disposeIndex(index);
}
The first function (clang_createIndex) creates a context within which translation units (CXTranslationUnit) will be created and parsed. It takes two parameters. The first (excludeDeclarationsFromPCH) controls the visibility of declarations read from a precompiled header during traversal of the resulting AST. A value of 1 means that such declarations will be excluded from the final AST. The second parameter (displayDiagnostics) controls whether diagnostics produced during translation are printed to the console.
The second function (clang_parseTranslationUnit) performs the actual parsing of the source file. It has the following parameters:
- CIdx is a pointer to the context created by calling clang_createIndex.
- source_filename - path to the file to be parsed.
- command_line_args - command line arguments that will be converted to compiler options.
- num_command_line_args - the number of arguments in the command line passed as the previous parameter.
- unsaved_files - a collection of files whose actual contents are in memory rather than on disk.
- num_unsaved_files - the number of elements in the collection of unsaved files.
- options - additional parsing options.
As you can see, the parser is configured entirely by passing it command-line arguments in text form. The unsaved_files parameter is useful when clang is used from editors or IDEs. Through it, you can hand the parser files that the user has modified but not yet saved to disk. It is a collection of structures of type CXUnsavedFile, each containing the file name, its contents, and the size of the contents in bytes. The name and contents are given as C strings, and the size as an unsigned integer.
The last parameter (options) is a combination of the following flags:
- CXTranslationUnit_None - everything is obvious. No special parsing options are set.
- CXTranslationUnit_DetailedPreprocessingRecord - tells the parser to generate detailed information about how and where the preprocessor is used in the source text. According to the documentation, this option is rarely needed and consumes a lot of memory, so it should be set only when such information is really required.
- CXTranslationUnit_Incomplete - indicates that an incomplete translation unit is being processed, for example a header file. In this case the translator will not attempt to instantiate templates that would otherwise be instantiated at the end of translation.
- CXTranslationUnit_PrecompiledPreamble - tells the parser to automatically create a precompiled header for all the header files included at the beginning of the translation unit. This option is useful if the file will be reparsed frequently (using clang_reparseTranslationUnit), but it has its own quirks, which are described in the next section.
- CXTranslationUnit_CacheCompletionResults - makes the parser cache some of the code-completion results after each subsequent reparse.
- CXTranslationUnit_SkipFunctionBodies - makes the parser skip the bodies of functions and methods during translation. Useful for quickly finding declarations and definitions of particular symbols.
Flags can be combined using the '|' operator.
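A small sketch of combining and testing such flags follows. The DemoParseFlags enum is an illustrative mirror of a few CXTranslationUnit_Flags constants: its names and numeric values are assumptions made for the demo, and real code should take the constants from clang-c/Index.h. The helper functions are hypothetical as well.

```cpp
#include <cassert>

// Illustrative mirror of a few parsing flags; the values are assumptions
// for this demo, the real constants live in <clang-c/Index.h>.
enum DemoParseFlags {
    Demo_None                   = 0x00,
    Demo_Incomplete             = 0x02,
    Demo_PrecompiledPreamble    = 0x04,
    Demo_CacheCompletionResults = 0x08,
    Demo_SkipFunctionBodies     = 0x40
};

// Options an editor might use for a file that is reparsed often.
unsigned editorParseOptions() {
    return Demo_PrecompiledPreamble | Demo_CacheCompletionResults;
}

// Options for a quick scan of the declarations in a header.
unsigned headerScanOptions() {
    return Demo_Incomplete | Demo_SkipFunctionBodies;
}

// Checks whether a single flag is present in a combined value.
bool hasFlag(unsigned options, DemoParseFlags flag) {
    return (options & flag) != 0;
}
```

Because each flag occupies its own bit, any subset can be passed in a single unsigned value, which is the form the options parameter expects.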
The last two functions (clang_disposeTranslationUnit and clang_disposeIndex) destroy the previously created handles for the translation unit and the context.
To build this sample successfully, it is enough to link against the libclang library.
2.2 Parsing with the clang API
Functionally similar code using the clang API looks like this:
#include <vector>
#include <iostream>

#include <clang/Basic/Diagnostic.h>
#include <clang/Frontend/DiagnosticOptions.h>
#include <clang/Frontend/CompilerInstance.h>
#include <clang/Frontend/CompilerInvocation.h>
#include <clang/Frontend/Utils.h>
#include <clang/Frontend/ASTUnit.h>

int main(int argc, char** argv)
{
    using namespace clang;
    using namespace llvm;

    // Initialize compiler options list
    std::vector<const char*> args;
    for (int n = 1; n < argc; ++n)
        args.push_back(argv[n]);
    args.push_back("main_clang.cpp");

    const char** opts = &args.front();
    int opts_num = args.size();

    // Create and setup diagnostic consumer
    DiagnosticOptions diagOpts;
    IntrusiveRefCntPtr<DiagnosticsEngine> diags(CompilerInstance::createDiagnostics(
        diagOpts,  // Opts
        opts_num,  // Argc
        opts,      // Argv
        0,         // Client
        true,      // ShouldOwnClient
        false      // ShouldCloneClient
    ));

    // Create compiler invocation
    IntrusiveRefCntPtr<CompilerInvocation> compInvoke = clang::createInvocationFromCommandLine(
        makeArrayRef(opts, opts + opts_num),  // Args
        diags                                 // Diags
    );
    if (!compInvoke) {
        std::cout << "Can't create compiler invocation for given args";
        return -1;
    }

    // Parse file
    clang::ASTUnit* tu = ASTUnit::LoadFromCompilerInvocation(
        compInvoke.getPtr(),  // CI
        diags,                // Diags
        false,                // OnlyLocalDecls
        true,                 // CaptureDiagnostics
        false,                // PrecompilePreamble
        TU_Complete,          // TUKind
        false                 // CacheCodeCompletionResults
    );

    if (tu == 0)
        std::cout << "Translation unit was not created";
    else
        std::cout << "Translation unit successfully created";

    return 0;
}
It contains noticeably more code, and building it requires the following set of libraries:
clangLex, clangBasic, clangAST, clangSerialization, clangEdit, clangAnalysis, clangFrontend, clangSema, clangDriver, clangParse, LLVMCO…
When building under Windows, you also need to add advapi32 and shell32. On the other hand, the output is an executable with no extra external dependencies.
The above code can be divided into four parts:
- Creating the collection of command-line parameters for the compiler. In this API version, the path to the file to be parsed is also passed as one of the elements of the collection, so argv and argc cannot be passed through directly.
- Creating an instance of the Diagnostic Engine. An object of this class is responsible for collecting and storing all the error messages, warnings, and other diagnostics that the parser may generate while parsing the source text.
- Creating an instance of the Compiler Invocation.
- Actually parsing the source text.
Creating a collection of command line arguments
As I wrote above, clang's options are set by passing a collection of strings describing them to the appropriate classes. The strings are passed as an array of pointers, and the most convenient way to build it is through an intermediate vector. This also lets you append any number of your own arguments to those received from outside, in particular the name of the file to be parsed.
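The intermediate vector can be wrapped in a small helper that owns the strings, so that the pointers handed to clang stay valid. This is a sketch only: the ArgList class and its member names are hypothetical and are not part of the clang API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Owns the argument strings and exposes them as an array of C strings.
class ArgList {
public:
    // Copies the caller's arguments, skipping argv[0] (the program name).
    ArgList(int argc, const char* const* argv) {
        assert(argc >= 0);
        for (int n = 1; n < argc; ++n)
            storage_.push_back(argv[n]);
    }

    // Appends one more argument, for example the file to be parsed.
    void add(const std::string& arg) { storage_.push_back(arg); }

    // Returns the pointer array; it stays valid until the next call to
    // add() or until this object is destroyed.
    const std::vector<const char*>& pointers() {
        pointers_.clear();
        for (const std::string& s : storage_)
            pointers_.push_back(s.c_str());
        return pointers_;
    }

private:
    std::vector<std::string> storage_;
    std::vector<const char*> pointers_;
};
```

Keeping the strings in std::string storage avoids dangling pointers when the extra arguments are built at runtime rather than taken from string literals.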
Creating a Diagnostic Engine
The DE is needed in order to receive the diagnostic information that the clang parser generates while parsing the source text. Parameters such as the maximum number of reported errors, which errors / warnings to report, and so on are taken by the DE from the command line, which is passed in the second and third parameters. The last three parameters describe the "diagnostic client". This is a special class to which the DE forwards parser messages (as they arise) for further processing in whatever way the user of clang needs. The DE can take over control of the client's lifetime, or work with a clone of the object passed in. This allows for different client implementations: as a static / automatic object, as an object on the heap, as part of the class whose methods work with the clang API, and so on.
Creating a Compiler Invocation
This step, in effect, creates the context within which parsing will be performed. All the command-line parameters passed in and the environment variables are analyzed, the entire internal infrastructure is created (in accordance with those parameters), and the Diagnostic Engine is attached. After that, clang is fully ready to parse the file that was passed as the last argument.
Source parsing
Parsing is done by calling one of the static methods of the clang::ASTUnit class. There are several such methods, tailored to different scenarios. The example shows one of the possible options. In this case, the parser is given the compiler invocation instance (the parser will then delete it itself!), the Diagnostic Engine instance (which the parser will not delete automatically), and several parameters controlling the parser's behavior:
- OnlyLocalDecls - only declarations from the translation unit being parsed will be included in the final AST. Declarations from the PCH and from included header files will be excluded.
- CaptureDiagnostics - controls how diagnostics are collected. If this parameter is false, all collected diagnostics are forwarded to the diagnostic client specified when the Diagnostic Engine was created. Otherwise, the diagnostics are stored in the internal structures of ASTUnit.
- PrecompilePreamble - as mentioned above, when this option is enabled, the parser automatically creates a PCH for all the headers included in the source text. It is indeed useful for repeated parsing. But, as it turned out, there are a couple of unpleasant details. First, the PCH is actually created by the first call to ASTUnit::Reparse on the resulting ASTUnit instance. Second, if a header file with #ifdef guards is parsed, then, alas, nothing will be created.
- TUKind - the kind of translation unit. The options are:
  - TU_Complete - a complete translation unit is parsed. In this case, all template instantiations used in the source text will also be placed in the final AST.
  - TU_Prefix - a "prefix for a translation unit" is parsed; in this case the source text is not considered complete.
  - TU_Module - some kind of "module" is parsed. What that is, the documentation does not say.
- CacheCodeCompletionResults - code-completion results will be cached during parsing. This really helps with subsequent code-completion requests.
3. Little tricks in the set of options
In my first experiments (parsing header files to extract declarations), I could not understand why parsing ended with a lot of errors. In the end, it all turned out to be quite simple. So, the options that may come in handy:
- -x language - specifies the language of the file being parsed. Compatible with the corresponding gcc option.
- -std=standard - specifies the standard the source text conforms to. The values are compatible with the corresponding gcc option.
- -ferror-limit=N - sets the maximum number of errors to N, after which parsing stops. If you want the file parsed in full regardless of any errors, N must be 0.
- -include <prefix-file> - specifies a file (usually a header) that should be parsed before the main file. This option was originally intended for pulling in a PCH header, but when parsing files it can also be useful, for example, for defining various macros.
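Put together, the options above might be assembled like this when a standalone header has to be parsed as C++. This is a sketch under assumptions: the function name headerParseArgs and its parameters are hypothetical, and the language value "c++" is just one possible choice for -x.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds compiler arguments for parsing a standalone header as C++.
// `standard` is a value for -std (for example "c++11"); `prefixFile`
// may be empty, in which case no -include option is added.
std::vector<std::string> headerParseArgs(const std::string& standard,
                                         const std::string& prefixFile) {
    assert(!standard.empty());
    std::vector<std::string> args;
    args.push_back("-x");
    args.push_back("c++");               // treat the input as C++ regardless of extension
    args.push_back("-std=" + standard);
    args.push_back("-ferror-limit=0");   // keep parsing no matter how many errors occur
    if (!prefixFile.empty()) {
        args.push_back("-include");      // parse this file before the main one
        args.push_back(prefixFile);
    }
    return args;
}
```

The resulting strings would then be converted to an array of C strings and passed as the command-line arguments of either API.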
This concludes the first acquaintance with the clang API. You can read more about the clang-c API on the official clang website:
clang.llvm.org/doxygen/group__CINDEX.html
There you can also explore the entire class hierarchy of the clang API. Unfortunately, the documentation is generated automatically from upstream clang, so the function signatures, the set of functions, and so on described on the documentation site may differ from what is present in a particular release.
In the next article I will discuss how you can get a declaration tree from the AST created with clang.