Processing preprocessor directives in Objective-C

A programming language with preprocessor directives is difficult to process: the values of the directives have to be evaluated, the non-compiled code fragments cut out, and only then can the cleaned code be parsed. Alternatively, directives can be processed while parsing the regular code. This article describes both approaches in detail as applied to Objective-C and discusses their advantages and disadvantages. These approaches are not just theoretical: they have been implemented and are used in practice in web services such as Swiftify and Codebeat.

Swiftify is a web service for converting Objective-C source code to Swift. The service currently supports both single files and entire projects, so it can save time for developers who want to learn Apple's new language.

Codebeat is an automated system for calculating code metrics and analyzing various programming languages, including Objective-C.

Introduction

Preprocessor directives are processed while the code is being parsed. We will not describe the basic concepts of parsing here, but we will use the terms from the article on the theory and parsing of source code using ANTLR and Roslyn. Both services use ANTLR as the parser generator, and the Objective-C grammars themselves are published in the official ANTLR grammar repository (Objective-C grammar).

We have identified two ways of processing preprocessor directives:

single-stage processing;
two-stage processing.

Single-stage processing

Single-stage processing involves parsing the directives and the tokens of the main language at the same time. ANTLR has a channel mechanism that lets you isolate tokens of various types: for example, tokens of the main language and hidden tokens (comments and whitespace). Directive tokens can also be placed in a separate named channel.

Directive tokens usually begin with a pound sign (#, also called sharp) and end with a line break (\r\n). To capture such tokens, it is therefore advisable to use a separate token recognition mode. ANTLR supports such modes, which are declared as follows: mode DIRECTIVE_MODE; A fragment of the lexer with the mode section for preprocessor directives looks like this:

 SHARP: '#' -> channel(DIRECTIVE_CHANNEL), mode(DIRECTIVE_MODE);
 mode DIRECTIVE_MODE;
 DIRECTIVE_IMPORT: 'import' [ \t]+ -> channel(DIRECTIVE_CHANNEL), mode(DIRECTIVE_TEXT_MODE);
 DIRECTIVE_INCLUDE: 'include' [ \t]+ -> channel(DIRECTIVE_CHANNEL), mode(DIRECTIVE_TEXT_MODE);
 DIRECTIVE_PRAGMA: 'pragma' -> channel(DIRECTIVE_CHANNEL), mode(DIRECTIVE_TEXT_MODE);
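To make the channel idea concrete outside of ANTLR, here is a minimal Python sketch of a toy lexer that routes every token to one of three channels. It is only an illustration, not the grammar used by either service: the channel numbers, the Token type and the helper names are assumptions, a whole directive line is treated as a single token, and a # inside a string literal is not handled.

```python
import re
from typing import List, NamedTuple

# Channel numbers are arbitrary in this sketch; the names are hypothetical.
DEFAULT_CHANNEL, HIDDEN_CHANNEL, DIRECTIVE_CHANNEL = 0, 1, 2


class Token(NamedTuple):
    text: str
    channel: int


# Order matters: comments and directives are tried before ordinary code.
TOKEN_RE = re.compile(
    r"(?P<comment>/\*.*?\*/|//[^\n]*)"   # block or line comment
    r"|(?P<directive>#[^\n]*)"           # '#' up to the end of the line
    r"|(?P<space>\s+)"                   # whitespace
    r"|(?P<code>[A-Za-z_]\w*|\d+|.)",    # identifier, number or any single character
    re.DOTALL,
)


def tokenize(source: str) -> List[Token]:
    """Split source into tokens and assign each one to a channel."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        if kind == "directive":
            channel = DIRECTIVE_CHANNEL
        elif kind in ("comment", "space"):
            channel = HIDDEN_CHANNEL
        else:
            channel = DEFAULT_CHANNEL
        tokens.append(Token(match.group(), channel))
    return tokens


def on_channel(tokens: List[Token], channel: int) -> List[str]:
    """Return the texts of all tokens that live on the given channel."""
    return [t.text for t in tokens if t.channel == channel]


source = '#import "Foo.h"\nint x = 1; // counter\n'
tokens = tokenize(source)
print(on_channel(tokens, DEFAULT_CHANNEL))
print(on_channel(tokens, DIRECTIVE_CHANNEL))
```

A parser that reads only the default channel never sees the directive or the comment, yet both remain available in the token list for later processing.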

Some Objective-C preprocessor directives are converted into specific Swift code (for example, using the let syntax), some remain unchanged, and the rest are converted into comments. The table below contains examples:

Objective-C	Swift
`#define SERVICE_UUID @"c381de0d-32bb-8224-c540-e8ba9a620152"`	`let SERVICE_UUID = "c381de0d-32bb-8224-c540-e8ba9a620152"`
`#define ApplicationDelegate ((AppDelegate *)[UIApplication sharedApplication].delegate)`	`let ApplicationDelegate = (UIApplication.shared.delegate as? AppDelegate)`
`#define DEGREES_TO_RADIANS(degrees) (M_PI * (degrees) / 180)`	`func DEGREES_TO_RADIANS(degrees: Double) -> Double { return (.pi * degrees)/180; }`
`#if defined(__IPHONE_OS_VERSION_MIN_REQUIRED)`	`#if __IPHONE_OS_VERSION_MIN_REQUIRED`
`#pragma mark - Directive between comments.`	`// MARK: - Directive between comments.`
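Two rows of the table can be reproduced with a few lines of Python. This is a deliberately narrow sketch and not how Swiftify works internally (it operates on a parse tree): the regular expressions and the function name translate_directive are assumptions made for illustration, and anything they do not recognize is simply turned into a comment.

```python
import re

# Hypothetical, deliberately narrow patterns: real #define bodies need a full parser.
STRING_DEFINE = re.compile(r'#define\s+(\w+)\s+@("(?:[^"\\]|\\.)*")\s*$')
PRAGMA_MARK = re.compile(r'#pragma\s+mark\s+(.*)$')


def translate_directive(line: str) -> str:
    """Translate one Objective-C directive line into a line of Swift."""
    line = line.strip()
    match = STRING_DEFINE.match(line)
    if match:
        # A string constant becomes a Swift `let` declaration.
        return f"let {match.group(1)} = {match.group(2)}"
    match = PRAGMA_MARK.match(line)
    if match:
        # `#pragma mark` becomes a MARK comment.
        return f"// MARK: {match.group(1)}"
    # Everything else is kept as a comment in this sketch.
    return f"// {line}"


print(translate_directive('#define SERVICE_UUID @"c381de0d-32bb-8224-c540-e8ba9a620152"'))
print(translate_directive("#pragma mark - Directive between comments."))
```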

Comments also need to be placed in the correct position in the resulting Swift code. However, hidden tokens live on separate channels and therefore do not appear in the parse tree.

What if you include hidden tokens in the parse tree?

Indeed, hidden tokens can be included in the grammar, but the grammar then becomes too complicated and redundant, since COMMENT and DIRECTIVE tokens have to appear in every rule between the significant tokens:

 declaration: property COMMENT* COLON COMMENT* expr COMMENT* prio?;

Therefore, this approach can be ruled out immediately.

The question arises: how can such tokens still be extracted when traversing the parse tree?

As it turned out, there are several solutions to this problem, in which hidden tokens are associated with either non-terminal or terminal (leaf) nodes of the parse tree.

Linking hidden tokens with non-terminal nodes

This method is borrowed from a relatively old 2012 article on ANTLR 3.

In this case, all hidden tokens are divided into sets of the following types:

preceding tokens (preceding);
following tokens (following);
orphan tokens (orphans).

To better understand what these types mean, consider a simple rule in which the curly braces are terminal symbols and a statement is any expression ending with a semicolon, for example the assignment a = b;.

 root : '{' statement* '}' ;

In this case, all comments in the following code fragment end up in the preceding list: it holds the comment that comes first in the file and the comments located in front of non-terminal nodes of the parse tree.

 /*First comment*/ '{' /*Preceding1*/ a = b; /*Preceding2*/ b = c; '}'

If a comment is the last one in the file, or is placed after all the statements (and is followed by the terminal brace), it goes into the following list.

 '{' a = b; b = c; /*Following*/ '}' /*Last comment*/

All other comments go into the orphans list (essentially, they are surrounded only by terminal tokens, in this case curly braces):

 '{' /*Orphan*/ '}'
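The three cases above can be summarized in a small Python sketch. It is a simplification under stated assumptions: instead of a real parse tree it works on a flat list of items, and the kind labels "T", "S" and "C" as well as the function name classify_comments are hypothetical. A comment is classified only by looking at its nearest significant neighbors.

```python
from typing import Dict, List, Tuple

# Each item is (kind, text); kinds are hypothetical labels for this sketch:
#   "T" = terminal token, "S" = statement (non-terminal node), "C" = comment.
Item = Tuple[str, str]


def classify_comments(items: List[Item]) -> Dict[str, List[str]]:
    """Sort comments into the preceding, following and orphans lists."""
    result: Dict[str, List[str]] = {"preceding": [], "following": [], "orphans": []}
    significant = [i for i, (kind, _) in enumerate(items) if kind != "C"]
    for index, (kind, text) in enumerate(items):
        if kind != "C":
            continue
        # Kinds of the nearest significant items on both sides (None at file edges).
        prev_kind = next((items[i][0] for i in reversed(significant) if i < index), None)
        next_kind = next((items[i][0] for i in significant if i > index), None)
        if prev_kind is None or next_kind == "S":
            result["preceding"].append(text)   # first in file or in front of a statement
        elif next_kind is None or prev_kind == "S":
            result["following"].append(text)   # last in file or after the last statement
        else:
            result["orphans"].append(text)     # surrounded by terminal tokens only
    return result


example = [("T", "{"), ("S", "a = b;"), ("S", "b = c;"),
           ("C", "Following"), ("T", "}"), ("C", "Last comment")]
print(classify_comments(example))
```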

Thanks to this split, all hidden tokens can be processed in the general Visit method. This approach is still used in Swiftify, but it is rather complicated, and it is difficult to build a full-fidelity parse tree with it. A full-fidelity tree is one that can be converted back into the source code character for character, including spaces, comments, and preprocessor directives. In the future, we plan to switch to the method of processing preprocessor directives and other hidden tokens described below.

Linking hidden tokens to terminal nodes

In this case, hidden tokens are associated with specific significant tokens and can be either leading (LeadingTrivia) or trailing (TrailingTrivia). This method is currently used in the Roslyn parser (for C# and Visual Basic), where hidden tokens are called trivia.

All trivia on the same line between a significant token and the next significant token go into that token's trailing set. All other hidden tokens go into the leading set and are associated with the next significant token. The first significant token holds the initial trivia of the file. Hidden tokens at the end of the file are associated with the final, special zero-length end-of-file token. More information about the parse tree and the kinds of trivia can be found in the official Roslyn documentation.

In ANTLR, there are methods that return all tokens from a specific channel to the left or to the right of the token with index i: getHiddenTokensToLeft(int tokenIndex, int channel) and getHiddenTokensToRight(int tokenIndex, int channel). Thus, an ANTLR-based parser can be made to build a full-fidelity parse tree similar to the Roslyn one.
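The attachment rule described above can be sketched in plain Python, independent of both ANTLR and Roslyn. The Tok class, its fields and the attach_trivia function are assumptions introduced for this illustration. Each token knows only its text, its line and whether it is hidden; multi-line comments are not treated specially.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tok:
    text: str
    line: int
    hidden: bool = False                      # True for comments, whitespace, directives
    leading: List[str] = field(default_factory=list)
    trailing: List[str] = field(default_factory=list)


def attach_trivia(tokens: List[Tok]) -> List[Tok]:
    """Return the significant tokens plus a zero-length EOF token, with trivia attached."""
    significant: List[Tok] = []
    pending: List[str] = []                   # hidden tokens waiting for the next significant token
    for tok in tokens:
        if not tok.hidden:
            tok.leading, pending = pending, []
            significant.append(tok)
        elif significant and not pending and tok.line == significant[-1].line:
            # Same line as the previous significant token: trailing trivia.
            significant[-1].trailing.append(tok.text)
        else:
            pending.append(tok.text)
    last_line = tokens[-1].line if tokens else 1
    # Whatever is left at the end of the file becomes leading trivia of EOF.
    significant.append(Tok("", last_line, leading=pending))
    return significant


stream = [
    Tok("// header", 1, hidden=True),
    Tok("int", 2), Tok("x", 2), Tok(";", 2),
    Tok("// counter", 2, hidden=True),
    Tok("// footer", 3, hidden=True),
]
for tok in attach_trivia(stream):
    print(repr(tok.text), tok.leading, tok.trailing)
```

In an ANTLR-based implementation, the pending and trailing lists would presumably be filled from the results of getHiddenTokensToLeft and getHiddenTokensToRight instead of a manual scan.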

Ignored macros

Since macros are not replaced with Objective-C code fragments during single-stage processing, they can be ignored or placed in a separate, isolated channel. This avoids problems when parsing ordinary Objective-C code and removes the need to include macros in grammar rules (as with comments). The same applies to standard macros such as NS_ASSUME_NONNULL_BEGIN, NS_AVAILABLE_IOS(3_0), and others:

 NS_ASSUME_NONNULL_BEGIN : 'NS_ASSUME_NONNULL_BEGIN' ~[\r\n]* -> channel(IGNORED_MACROS);
 IOS_SUFFIX : [_A-Z]+ '_IOS(' ~')'+ ')' -> channel(IGNORED_MACROS);

Two-stage processing

The two-stage processing algorithm can be represented as the following sequence of steps:

Tokenization and parsing of the preprocessor directives. Regular code fragments are recognized as plain text at this step.
Evaluation of the conditional directives (#if, #elif, #else) and determination of the code blocks to be compiled.
Evaluation of the #define directives and substitution of their values into the appropriate places in the compiled code blocks.
Replacement of the directives in the source with space characters (to preserve the correct positions of the tokens in the source code).
Tokenization and parsing of the resulting text with the directives removed.
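The third step (macro substitution) can be illustrated with a short Python sketch. It is an assumption-laden simplification rather than the Codebeat implementation: only object-like macros of the form #define NAME value are collected, function-like macros are skipped, expansion is not recursive, and occurrences inside strings or comments are not protected. The names DEFINE_RE, collect_defines and substitute are hypothetical.

```python
import re
from typing import Dict

# Matches object-like macros only: "#define NAME value".
# Function-like macros such as "#define F(x) ..." do not match on purpose.
DEFINE_RE = re.compile(r"^[ \t]*#[ \t]*define[ \t]+(\w+)[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def collect_defines(source: str) -> Dict[str, str]:
    """Collect the names and replacement texts of object-like macros."""
    return {m.group(1): m.group(2) for m in DEFINE_RE.finditer(source)}


def substitute(code: str, defines: Dict[str, str]) -> str:
    """Replace whole-word occurrences of macro names with their values."""
    if not defines:
        return code
    names = re.compile(r"\b(" + "|".join(map(re.escape, defines)) + r")\b")
    return names.sub(lambda m: defines[m.group(1)], code)


source = "#define MAX_COUNT 10\nint n = MAX_COUNT; int MAX_COUNT_2 = 1;\n"
defines = collect_defines(source)
print(substitute("int n = MAX_COUNT; int MAX_COUNT_2 = 1;", defines))
```

Note that a substitution changes the length of the text, so token positions after the replaced name shift unless the positions are tracked separately.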

The third step can be skipped and macros included directly in the grammar, at least partially. Even so, this method is more difficult to implement than single-stage processing: after the first step, the code of the preprocessor directives has to be replaced with spaces if the correct positions of the tokens in the ordinary source code must be preserved. Nevertheless, this algorithm for processing preprocessor directives has also been implemented and is now used in Codebeat. The grammar is published on GitHub together with a visitor that processes the preprocessor directives. An additional advantage of this method is that the grammars are presented in a more structured form.

For two-stage processing, the following components are used:

preprocessor lexer;
preprocessor parser;
preprocessor;
lexer;
parser.

Recall that the lexer groups the characters of the source code into meaningful sequences called lexemes or tokens, and the parser builds a connected tree structure, called a parse tree, from the stream of tokens. Visitor is a design pattern that lets you put the processing logic for each tree node into a separate method.

Preprocessor lexer

A lexer that separates the tokens of preprocessor directives from ordinary Objective-C code. DEFAULT_MODE is used for regular code tokens and DIRECTIVE_MODE for directive code. Below are the tokens from DEFAULT_MODE.

 SHARP: '#' -> mode(DIRECTIVE_MODE);
 COMMENT: '/*' .*? '*/' -> type(CODE);
 LINE_COMMENT: '//' ~[\r\n]* -> type(CODE);
 SLASH: '/' -> type(CODE);
 CHARACTER_LITERAL: '\'' (EscapeSequence | ~('\''|'\\')) '\'' -> type(CODE);
 QUOTE_STRING: '\'' (EscapeSequence | ~('\''|'\\'))* '\'' -> type(CODE);
 STRING: StringFragment -> type(CODE);
 CODE: ~[#'"/]+;

Looking at this code fragment, one may ask why additional tokens (COMMENT, QUOTE_STRING, and others) are needed when only one token, CODE, is used for Objective-C code. The reason is that the # character can be hidden inside ordinary strings and comments, so these tokens have to be recognized separately. This is not a problem, since their type is still changed to CODE, and the preprocessor parser contains the following rules for separating tokens:

 text : code | SHARP directive (NEW_LINE | EOF) ;
 code : CODE+ ;

Preprocessor parser
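The reason for the separate comment and string tokens can be demonstrated with a regex-based Python sketch of the same idea. It is only an approximation of the lexer above, with assumptions: the names SEGMENT_RE and split_source are hypothetical, a directive is taken to run to the end of the line, and backslash line continuations are ignored.

```python
import re
from typing import List, Tuple

SEGMENT_RE = re.compile(
    r"(?P<code_comment>/\*.*?\*/|//[^\n]*)"     # comments count as code
    r'|(?P<code_string>"(?:[^"\\\n]|\\.)*"'     # double-quoted string counts as code
    r"|'(?:[^'\\\n]|\\.)*')"                    # single-quoted literal counts as code
    r"|(?P<directive>#[^\n]*)"                  # a real directive: '#' to end of line
    r"|(?P<code>[^#'\"/]+|.)",                  # everything else
    re.DOTALL,
)


def split_source(source: str) -> List[Tuple[str, str]]:
    """Split source into ("CODE", text) and ("DIRECTIVE", text) segments."""
    segments: List[Tuple[str, str]] = []
    for match in SEGMENT_RE.finditer(source):
        kind = "DIRECTIVE" if match.lastgroup == "directive" else "CODE"
        if segments and segments[-1][0] == kind == "CODE":
            # Merge adjacent code pieces, like the CODE+ parser rule does.
            segments[-1] = ("CODE", segments[-1][1] + match.group())
        else:
            segments.append((kind, match.group()))
    return segments


source = 'NSString *s = @"#not a directive"; // #neither\n#define X 1\nint y;\n'
segments = split_source(source)
for kind, text in segments:
    print(kind, repr(text))
```

Because strings and comments are matched before the directive pattern, the # characters inside them never start a directive.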

A parser that separates Objective-C code tokens and processes preprocessor directive tokens. The resulting parse tree is then passed to the preprocessor.

Preprocessor

A visitor that evaluates the preprocessor directives. Each node traversal method returns a string. If the value of a directive evaluates to true, the Objective-C code fragment that follows it is returned; otherwise, that code is replaced with spaces. As mentioned earlier, this is necessary to preserve the correct positions of the tokens of the main code. To make this easier to understand, take the following Objective-C code fragment as an example:

 BOOL trueFlag =
 #if DEBUG
     YES
 #else
     arc4random_uniform(100) > 95 ? YES : NO
 #endif
 ;

With the conditional symbol DEBUG defined, two-stage processing converts this fragment into the following Objective-C code:

 BOOL trueFlag = YES ;

Note that all directives and non-compiled code have been turned into spaces. Directives can also be nested in each other:

 #if __IPHONE_OS_VERSION_MIN_REQUIRED >= 60000
     #define MBLabelAlignmentCenter NSTextAlignmentCenter
 #else
     #define MBLabelAlignmentCenter UITextAlignmentCenter
 #endif
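The replacement of directives and non-compiled code with spaces, including nested conditionals, can be sketched in Python as follows. This is a line-based approximation and not the visitor used in Codebeat. The assumptions are: conditions are single symbol names looked up in a dictionary (no expressions such as >= 60000), #elif is not supported, other directives such as #define are left untouched, and unbalanced directives are not diagnosed. The names preprocess and blank are hypothetical.

```python
import re
from typing import Dict, List

DIRECTIVE_RE = re.compile(r"^\s*#\s*(if|ifdef|ifndef|else|endif)\b\s*(.*)$")


def blank(line: str) -> str:
    """Replace every character with a space so later tokens keep their columns."""
    return " " * len(line)


def preprocess(source: str, symbols: Dict[str, bool]) -> str:
    """Blank out conditional directives and the code in inactive branches."""
    out: List[str] = []
    stack: List[List[bool]] = []     # [parent_active, branch_taken] for each open #if
    active = True
    for line in source.split("\n"):
        match = DIRECTIVE_RE.match(line)
        if not match:
            out.append(line if active else blank(line))
            continue
        keyword, arg = match.group(1), match.group(2).strip()
        if keyword in ("if", "ifdef", "ifndef"):
            value = bool(symbols.get(arg, False))
            if keyword == "ifndef":
                value = not value
            stack.append([active, value])
            active = active and value
        elif keyword == "else":
            parent_active, taken = stack[-1]
            active = parent_active and not taken
            stack[-1][1] = True
        else:                        # endif: restore the state of the enclosing block
            active = stack.pop()[0]
        out.append(blank(line))      # the directive line itself always becomes spaces
    return "\n".join(out)


source = ("BOOL trueFlag =\n#if DEBUG\n    YES\n#else\n"
          "    arc4random_uniform(100) > 95 ? YES : NO\n#endif\n;")
result = preprocess(source, {"DEBUG": True})
print(result)
```

The output has exactly the same length and line structure as the input, so every remaining token keeps its original line and column.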

Lexer

A regular Objective-C lexer without rules for recognizing preprocessor directive tokens. If the source file contains no directives, the lexer receives the original file unchanged.

Parser

A parser for ordinary Objective-C code. Its grammar is the same as the parser grammar used in single-stage processing.

Other processing methods

There are other ways to handle preprocessor directives; for example, you can use a lexerless parser. In theory, such a parser could combine the advantages of both single-stage and two-stage processing: it would evaluate the directives and determine the non-compiled code blocks in a single pass. However, such parsers also have disadvantages: they are harder to understand and debug.

Since ANTLR is tightly bound to the tokenization process, such solutions were not considered, although it is already possible to create lexerless grammars, and this capability will be refined in the future (see discussion).

Conclusion

This article has examined approaches to processing preprocessor directives that can be used when parsing C-like languages. These approaches have already been implemented for Objective-C code and are used in commercial services such as Swiftify and Codebeat. The parser with two-stage processing was tested on 20 projects, in which more than 95% of all files were processed correctly. In addition, single-stage processing has also been implemented for parsing C# and is available as open source: C# grammar.

Swiftify uses single-stage processing of preprocessor directives, since our task is not to do the work of the preprocessor but to translate the preprocessor directives into the appropriate Swift language constructs, despite the potential for parsing errors. For example, Objective-C #define directives are commonly used to declare global constants and macros. In Swift, constants (let) and functions (func) are used for the same purpose.

Source: https://habr.com/ru/post/318954/