
Parsing Python code using Flex and Bison

Introduction


For about two years now I have been contributing to the open-source SourceAnalyzer project, and recently I needed to write a parser for the Python language that could be used to build a call graph and a class dependency graph. More precisely, the graphs are built by other tools; the parser only has to prepare the data for them.

Working on the parser turned out to be quite entertaining, and I would like to share the experience I gained, as well as describe some of the pitfalls encountered along the way.

Terminology


If you have already worked with grammars and know how a compiler works, feel free to skip to the next section; everyone else, welcome.

The development of the parser breaks down into two main parts:

 - lexical analysis: splitting the input stream into tokens (handled by Flex);
 - syntactic analysis: matching the token stream against grammar rules (handled by Bison).

Let's look at a small example. Suppose we have an expression (an input data stream):

 3 + 4 * 15 


Construct the rules for converting the input stream:

 [0-9]+ -> NUMBER
 +      -> PLUS
 *      -> MULT


Thus, after lexical analysis we get:
 NUMBER(3) PLUS(+) NUMBER(4) MULT(*) NUMBER(15) 


The following is an example grammar with which the expression can then be parsed, along with one possible parse tree (note that this toy grammar is ambiguous, so other trees are possible too):

 exp:  NUMBER | exp sign exp
 sign: PLUS | MULT

       *
      / \
     +   15
    / \
   3   4


In this example NUMBER, MULT, and PLUS are, by definition, terminals (tokens), defined at the lexical analysis stage; exp and sign are nonterminals, since they are composite units.

This introduction is by no means exhaustive, so for the theory I refer you to the O'Reilly book flex & bison.

Grammar


The full grammar for the Python language (version 2.5.2) can be found here:
http://docs.python.org/release/2.5.2/ref/grammar.txt

My task was only to recognize class definitions, function definitions, and function calls.

To begin, let us describe the necessary part of the grammar in extended Backus–Naur form (EBNF) (wiki).

 class_def       = CLASS classname [inheritance] COLON suite
 classname       = ID
 inheritance     = LBRACE class_args_list RBRACE
 class_args_list = [class_arg]
 class_arg       = dotted_name {COMMA dotted_name}
 dotted_name     = ID {DOT ID}

Here [X] means 0 or 1 occurrences of X, and {Y} means 0 or more occurrences of Y.

That is enough to determine a class name and its dependencies. Now for the functions.

 func_def       = DEF funcname LBRACE [func_args_list] RBRACE COLON suite
 funcname       = ID
 func_args_list = [func_arg]
 func_arg       = (dotted_name | star_arg) {OTHER | COMMA | dotted_name | star_arg | MESSAGE}
 star_arg       = [STAR] STAR ID
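For example, in the definition def foo(self, *args, **kwargs): the argument self is matched by dotted_name, while *args and **kwargs are matched by star_arg (one or two STAR tokens followed by an ID).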

By the way, the user code is assumed to be correct (from the interpreter's point of view); otherwise the grammar rules would have to be defined more strictly.
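Note that Bison itself understands plain BNF rather than EBNF, so when the grammar is transcribed, the {X} and [X] constructs have to be expanded into recursion and empty alternatives. A sketch of how the class argument rules might look after expansion (the actual rules live in the gist linked below and may differ):

 class_args_list:
       /* empty */
     | class_arg
     ;

 class_arg:
       dotted_name
     | class_arg COMMA dotted_name
     ;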

That finishes the grammar; on to the lexer (lexical analyzer), since the Python source code must be split into tokens before the grammar can be parsed.

Lexical analyzer


The lexer is generated by the Flex program. The simplest example:

 %{
 #include <stdio.h>
 %}
 %%
 start  printf("START\n");
 stop   printf("STOP\n");
 .      ;  /* ignore everything else */
 %%

How to compile this example:

 % flex lexer.l && gcc lex.yy.c -o lexer -lfl 
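If it compiled, the toy lexer can be tried right away (an output sketch; flex echoes unmatched newlines by default, so the exact spacing may vary):

 % echo "foo start bar stop" | ./lexer
 START
 STOP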


To learn how to write rules for the lexer, see:
http://rus-linux.net/lib.php?name=/MyLDP/algol/lex-yacc-howto.html

OK, now let's define the tokens. In the grammar we have already used the following:
CLASS, COLON, COMMA, DEF, DOT, ID, LBRACE, MESSAGE, RBRACE, OTHER, STAR
We also need DEFINED, a token covering reserved Python words.

Now we write the lexer.
Code: https://gist.github.com/2158334

A brief analysis: comments, blank lines, and whitespace are ignored. Everything else (the so-called token stream) is passed as input to Bison.

The characters matched by a pattern (for example, by the [ \t]+ pattern) are placed in yytext. By default yytext is a char pointer, but if you pass the -l (lex compatibility) option when running flex, yytext is treated as a char array instead. Before returning a token to Bison, we store the value extracted from the match in yylval (more on this below).
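For illustration, the lexer rule for identifiers might look something like this (a sketch: the union field name on yylval is my assumption, and the actual gist code may differ):

 [a-zA-Z_][a-zA-Z0-9_]*  { yylval.string = strdup(yytext); /* save the matched name for Bison */ return ID; }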

Now it is time to describe the grammar for Bison.

Bison


To learn how to write a grammar for Bison, see: http://rus-linux.net/lib.php?name=/MyLDP/algol/lex-yacc-howto.html

I refer to this article again because anyone who cannot yet write a Bison grammar will pick it up easily from that material. If you already know how, read on.

So, with the Grammar section above in mind, let's write the grammar for Bison:

Code: https://gist.github.com/2158315

In a Bison rule, every symbol has a value. The value of the grouping assembled by the current rule is stored in $$; the values of the individual symbols are available as $1, $2, …

 test_rule: token1 token2 token3;
 /*  $$       $1     $2     $3  */

The value stored in $n is nothing more than the value yylval had at the moment the lexer returned that token.
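For example, a rule assembling a dotted name could propagate values like this (a sketch assuming the semantic values are C strings; concat3 is a hypothetical helper, not something from the gist):

 dotted_name:
       ID                 { $$ = $1; }
     | dotted_name DOT ID { $$ = concat3($1, ".", $3); /* concat3: hypothetical string-joining helper */ }
     ;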

When Bison is started with the -d option, it generates a .tab.h file containing the macro definitions of the token types used in the lexer. Each token corresponds to a number > 257. This file must be included in the lexer, which we did: #include "pyparser.tab.h".
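The generated header contains definitions along these lines (an illustrative excerpt; the actual numbers depend on the order of the %token declarations):

 /* pyparser.tab.h, illustrative excerpt */
 #define CLASS 258
 #define DEF   259
 #define ID    260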

How does the parser work? The yyparse function is called from main and starts the analysis: it reads tokens, performs the associated actions, and exits either when it reaches the end of the input text (return 0) or when it encounters a syntax error (return 1).
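In the simplest case, main just delegates to the parser. A minimal sketch (the real main in the repository may do more):

 #include <stdio.h>

 extern int yyparse(void);
 extern FILE* yyin;  // flex reads from here; defaults to stdin

 int main(int argc, char** argv) {
     if (argc > 1 && !(yyin = fopen(argv[1], "r"))) {
         perror(argv[1]);   // could not open the input file
         return 2;
     }
     return yyparse();      // 0 on success, 1 on a syntax error
 }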

Learn more about the work of Bison: http://www.linux.org.ru/books/GNU/bison/bison_7.html

We try to compile and test what we have:
 % bison -d pyparser.y --verbose && flex lexer.l && g++ -c lex.yy.c pyparser.tab.c && g++ -o parser lex.yy.o pyparser.tab.o -ll -ly 

Test example:
 class Foo:
     pass

 class Test(Foo):
     class Test2:
         def __init__(self):
             pass

         def foo():
             pass

Result:
 >> CLASS: Foo()
 >> CLASS: Test(Foo)
 >> CLASS: Test2()
 >> FUNC: __init__(self)
 >> FUNC: foo()

In principle this is correct, though not quite what we want. I would like to see the "full name" in the function definitions, that is:
 >> CLASS: Foo()
 >> CLASS: Test(Foo)
 >> CLASS: Test2()
 >> FUNC: Test.Test2.__init__(self)
 >> FUNC: Test.Test2.foo()

To achieve this, we proceed as follows: we maintain a stack onto which we push the name of the current class. As soon as a function definition is encountered, we pull the class names off (a copy of) the stack one by one, concatenating them with the function name. If a new class is found at a deeper nesting level, we push it onto the stack; otherwise we pop elements off the stack until we get down to the current nesting level (and one more), and then push the new class.

The idea and its implementation are below.

Python Features


The obvious problem is determining the nesting level. As you know, Python uses indentation (tabs or spaces) for this, so the current indent has to be stored in some global variable accessible to both the parser and the lexer. The Bison manual says that the yyparse function expects the textual position of the token it has just parsed to be in the global variable yylloc.

yylloc is a structure of four elements: first_line, first_column, last_line, and last_column. We will store the current line number in first_line (useful for debugging, and part of my task) and keep the indent in last_column.
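For reference, when locations are used and no custom type is given, Bison defines the location type as:

 typedef struct YYLTYPE {
     int first_line;
     int first_column;
     int last_line;
     int last_column;
 } YYLTYPE;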

Let's make the changes to the code. In the lexer, we declare the variable yylloc and set the initial line number:

 extern YYLTYPE yylloc;
 #define YY_USER_INIT yylloc.first_line = 1;  // lines are numbered from 1, not 0

When we encounter a newline:

 yylloc.first_line++;
 yylloc.last_column = 0;
 isNewLine = true;

If the line starts with spaces:
 if (isNewLine == true && 0 == yyleng % SPACES_PER_INDENT)
     yylloc.last_column = yyleng / SPACES_PER_INDENT;
 isNewLine = false;

yyleng is the length of the matched text. SPACES_PER_INDENT is set to 4 by default (the standard Python indent).

If the line starts with a tab character:
 if (isNewLine == true)
     yylloc.last_column = yyleng;
 isNewLine = false;

Now let's fix up the line counting. Python has triple quotes, usually used for long documentation strings. To skip them, we write a function:
 static string skipMessage(string _ch) {
     string ch;
     string str = _ch;
     _ch = _ch[0];              // the quote character itself: " or '
     int count = 1;             // consecutive quote characters seen so far
     for (;; ch = yyinput()) {
         str += ch;
         if (ch == _ch) {
             if (count == 3)    // the closing triple quote is complete
                 return str;
             else
                 count++;
         } else
             count = 1;
         if ("\n" == ch || "\r" == ch)
             yylloc.first_line++;
     }
 }

Here, in fact, one could get by with a regular expression, but then the line number could not be maintained correctly: we would not be able to find out how many lines the regex has eaten (or could we? If you know a way, write in).
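(One possible answer: if a single pattern did match the whole triple-quoted string, the consumed text would be available in yytext, so the newlines could be counted after the fact. A sketch:)

 /* count the newlines consumed by the last match */
 for (char *p = yytext; *p; ++p)
     if (*p == '\n')
         yylloc.first_line++;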

Also, do not forget to ignore comment lines:
 static void skipComments() {
     for (char ch = yyinput();; ch = yyinput()) {
         if ('\0' == ch || '\n' == ch || '\r' == ch) {
             yylloc.first_line++;
             break;
         }
     }
 }


Now on to the stack algorithm.

Determining function nesting


We act according to the algorithm I described above.

For convenience, we define types:
 typedef pair<string, int> meta_data;  // <class name, indent>
 typedef stack<meta_data> meta_stack;
 static meta_stack class_st;

We push a pair <class name, indent> onto the stack:

 class_def:
     CLASS classname inheritance COLON suite
     {
         int indent = @1.last_column;  // indent of the current class
         meta_data new_class($2, indent);
         clean_stack(class_st, indent);
         class_st.push(new_class);
     }
     ;

The function that cleans the stack down to the current indent:
 static void clean_stack(meta_stack& stack, int indent) {
     while (!stack.empty()) {
         if (indent > stack.top().second)
             break;
         stack.pop();
     }
 }

Finally, we determine the full name of the function:
 func_def:
     DEF funcname LBRACE func_args_list RBRACE COLON suite
     {
         string fnc_name = $2;
         int indent = @1.last_column;
         clean_stack(class_st, indent);
         meta_stack tmp_class_st(class_st);  // work on a copy so the real stack survives
         while (!tmp_class_st.empty()) {
             fnc_name = tmp_class_st.top().first + "." + fnc_name;
             tmp_class_st.pop();
         }
     }
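To see the stack in action on the test file above: by the time the parser reaches def __init__, class_st holds ("Test", 0) and ("Test2", 1) (the indent values assume the units computed by the lexer rules above), clean_stack pops nothing because the function sits one level deeper, and the loop assembles Test.Test2.__init__.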


Conclusion


I did not describe function-call detection in this article, since it is quite similar to finding a function definition, although it has its own difficulties. If there is interest, here is my repository on GitHub: https://github.com/sagod/pyparser . Comments, remarks, and pull requests are welcome.


Source: https://habr.com/ru/post/141756/

