Introduction
For about two years I have been participating in the SourceAnalyzer open source project, and recently I had to write a parser for the Python language that can build a call graph and a class dependency graph. More precisely, the graphs are constructed by other tools; the parser only has to prepare the data for them.
Working on the parser turned out to be quite entertaining, and I would like to share the experience I gained and describe some of the pitfalls I ran into during development.
Terminology
If you have already worked with grammars and know how a compiler works, feel free to skip to the next section; everyone else, welcome.
The parser is built from two main stages:
- Analyzing the input stream and splitting it into tokens (lexical analysis)
- Matching the token stream against a set of syntax rules (parsing)
Let's look at a small example. Suppose we have an expression (or input data stream):

3 + 4 * 15

We construct the rules for converting the input stream into tokens:

[0-9]+ -> NUMBER
+      -> PLUS
*      -> MULT

Thus, after lexical analysis we get:

NUMBER(3) PLUS(+) NUMBER(4) MULT(*) NUMBER(15)
The following is an example of a grammar with which the expression can then be parsed according to these rules, together with the resulting parse tree:

exp:  NUMBER
    | exp sign exp

sign: PLUS
    | MULT

      *
     / \
    +   15
   / \
  3   4
In this example, NUMBER, MULT, and PLUS are, by definition, terminals (or tokens), defined at the lexical analysis stage; exp and sign are nonterminals, since they are composite units.
This introduction is by no means exhaustive; for the theory, refer to the book Flex & Bison (O'Reilly).
Grammar
The full grammar for the Python language (version 2.5.2) can be found here: http://docs.python.org/release/2.5.2/ref/grammar.txt

My task was only to detect class definitions, function definitions, and function calls.
To begin, let us describe the necessary part of the grammar using extended Backus-Naur form (EBNF).
class_def       = CLASS classname [inheritance] COLON suite
classname       = ID
inheritance     = LBRACE class_args_list RBRACE
class_args_list = [class_arg]
class_arg       = dotted_name {COMMA dotted_name}
dotted_name     = ID {DOT ID}
Here [X] means 0 or 1 occurrences of X, and {Y} means 0 or more occurrences of Y.
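For example, a class header like this one (an illustrative snippet, the names are made up) is matched by the rules above:

# CLASS classname LBRACE class_args_list RBRACE COLON suite
class MyClass(base.Module, Other):   # base.Module is a dotted_name
    pass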
This is enough to determine the class name and its dependencies. Now for the functions.
func_def       = DEF funcname LBRACE [func_args_list] RBRACE COLON suite
funcname       = ID
func_args_list = [func_arg]
func_arg       = (dotted_name | star_arg) {OTHER | COMMA | dotted_name | star_arg | MESSAGE}
star_arg       = [STAR] STAR ID
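For example, a definition like the following (again, the names are made up) is covered by these rules:

# DEF funcname LBRACE func_args_list RBRACE COLON suite
def foo(self, data, *args, **kwargs):   # *args is STAR ID, **kwargs is STAR STAR ID
    pass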
By the way, we assume that the user's code is correct (from the interpreter's point of view); otherwise the grammar rules would have to be defined more strictly.
That's it for the grammar; let's move on to the lexer (lexical analyzer), since the Python source code has to be split into tokens before the grammar can be parsed.
Lexical analyzer
The lexer is generated by the flex program. The simplest example:
%{
#include <stdio.h>
%}
%%
start  printf("START\n");
stop   printf("STOP\n");
.      ;  /* ignore everything else */
%%
How to compile this example:
% flex lexer.l && gcc lex.yy.c -o lexer -lfl
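If everything compiled, feeding it a line with these words should produce something like the following (the final newline is simply echoed by flex's default rule, since no pattern matches it):

% echo "start stop" | ./lexer
START
STOP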
Here you can learn how to describe rules for the lexer: http://rus-linux.net/lib.php?name=/MyLDP/algol/lex-yacc-howto.html

OK, now let's define the tokens. In the grammar we have already used the following:
CLASS, COLON, COMMA, DEF, DOT, ID, LBRACE, MESSAGE, RBRACE, OTHER, STAR
We also need DEFINED, for reserved Python words.
Now we write the lexer. Code: https://gist.github.com/2158334

A brief analysis: comments, blank lines, and whitespace are ignored. Everything else (the resulting stream of tokens) is passed on as input to bison.
The sequence of characters matched by a pattern (for example, by the pattern [ \t]+) is placed in yytext. By default, yytext is a char pointer, but if you add the -l option (lex compatibility mode) when compiling, yytext will be treated as a char array. Before returning a token to bison, we write the value extracted from the pattern into yylval (for more details, see below).
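To make this concrete, here is a minimal sketch of what such lexer rules might look like (the real rules are in the gist above; the %union field name str is my assumption):

"class"                 { return CLASS; }
"def"                   { return DEF; }
":"                     { return COLON; }
[a-zA-Z_][a-zA-Z0-9_]*  { yylval.str = strdup(yytext); return ID; }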
Now it's time to describe the grammar for bison.
Bison
Here you can learn how to describe a grammar for bison: http://rus-linux.net/lib.php?name=/MyLDP/algol/lex-yacc-howto.html

I refer you to this article once again, because anyone who cannot yet write a grammar for bison will easily pick it up from the material at that link. And those who already know how can just keep going.
So, looking back at the Grammar section, let's try to write a grammar for bison:
Code: https://gist.github.com/2158315

In a bison rule, every lexeme has a value. The value of the group being assembled by the current rule is stored in $$, and the values of the component tokens are stored in $1, $2, …

test_rule: token1 token2 token3 ;
/*  $$       $1     $2     $3  */
The value stored in $n is nothing more than the value of yylval at the moment the lexer returned that token.
Running bison with the -d parameter generates a .tab.h file, which contains the macro definitions of the token types used in the lexer. Each token corresponds to a number > 257. This file should be included in the lexer, which is what we did: #include "pyparser.tab.h".
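To give an idea of its contents, the generated header looks roughly like this (an illustrative excerpt; the actual numbers are assigned by bison):

/* pyparser.tab.h, excerpt */
#define CLASS 258
#define DEF   259
#define ID    260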
How does the analyzer work? The yyparse function is called from main and starts the analysis: it reads tokens, performs the associated actions, and exits when it encounters the end of the input text (return 0) or a syntax error (return 1).
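A minimal main for such a parser might look like this (a sketch of my own; yyin is flex's standard input stream, and the file handling here is illustrative, not code from the project):

#include <stdio.h>

extern FILE* yyin;        /* flex reads its input from here */
extern int yyparse(void); /* generated by bison */

int main(int argc, char** argv) {
    if (argc > 1) {                      /* parse the file given on the command line... */
        yyin = fopen(argv[1], "r");
        if (!yyin) {
            perror(argv[1]);
            return 1;
        }
    }                                    /* ...or stdin by default */
    return yyparse();                    /* 0 on success, 1 on a syntax error */
}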
Learn more about how bison works: http://www.linux.org.ru/books/GNU/bison/bison_7.html

Let's try to compile and test what we have:
% bison -d pyparser.y --verbose && flex lexer.l && g++ -c lex.yy.c pyparser.tab.c && g++ -o parser lex.yy.o pyparser.tab.o -ll -ly
Test example:
class Foo: pass

class Test(Foo):
    class Test2:
        def __init__(self):
            pass

        def foo():
            pass
Result:
>> CLASS: Foo()
>> CLASS: Test(Foo)
>> CLASS: Test2()
>> FUNC: __init__(self)
>> FUNC: foo()
In principle this is correct, but not quite what we want. We would like to see the "full name" in the function definition, that is:
>> CLASS: Foo()
>> CLASS: Test(Foo)
>> CLASS: Test2()
>> FUNC: Test.Test2.__init__(self)
>> FUNC: Test.Test2.foo()
To achieve this, we proceed as follows: we keep a stack onto which we push the name of the current class. As soon as a function definition is encountered, we pull the class names off the stack one by one, concatenating them with the function name. If a new class is found at a deeper nesting level, we push it onto the stack; otherwise we pop elements off the stack until we reach the current nesting level (minus one), and then push the new class.
The idea and implementation below.
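For the test example above, the stack would evolve roughly like this (an illustrative trace of the idea, not actual parser output):

class Test    (indent 0)  ->  stack: [Test]
class Test2   (indent 1)  ->  deeper, push: [Test, Test2]
def __init__  (indent 2)  ->  full name: Test.Test2.__init__
def foo       (indent 2)  ->  full name: Test.Test2.foo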
Python Features
The obvious problem is finding out the nesting level. As you know, Python uses indentation (tabs or spaces) for this. Therefore, the current indent must be stored in some global variable accessible to both the parser and the lexer. The bison guide says that the yyparse function expects the position of the just-parsed token to be in the global variable yylloc, a structure of four fields: first_line, first_column, last_line, last_column. In first_line we will store the current line number (useful for debugging, and part of my task), and in last_column we will keep the indent.
Make changes to the code. In the lexer, we define the type of the variable yylloc and the default value for the line number:
extern YYLTYPE yylloc;
#define YY_USER_INIT yylloc.first_line = 1;
When we meet a new line:
yylloc.first_line++;
yylloc.last_column = 0;
isNewLine = true;
If the line starts with spaces:
if (isNewLine == true && 0 == yyleng % SPACES_PER_INDENT) {
    yylloc.last_column = yyleng / SPACES_PER_INDENT;
    isNewLine = false;
}
yyleng is the length of the matched text. SPACES_PER_INDENT is set to 4 by default (the usual standard).
If the line starts with a tab character:
if (isNewLine == true) {
    yylloc.last_column = yyleng;
    isNewLine = false;
}
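Put together, these actions might be attached to the flex patterns roughly like this (a sketch; the exact patterns in the project may differ):

\n     { yylloc.first_line++; yylloc.last_column = 0; isNewLine = true; }
^[ ]+  { if (isNewLine == true && 0 == yyleng % SPACES_PER_INDENT) {
             yylloc.last_column = yyleng / SPACES_PER_INDENT;
             isNewLine = false;
         } }
^\t+   { if (isNewLine == true) {
             yylloc.last_column = yyleng;
             isNewLine = false;
         } }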
Now let's adjust the line count. Python has triple quotes, usually used for long documentation comments (docstrings). To skip them, we write this function:
static string skipMessage(string _ch) {
    string ch;
    string str = _ch;
    _ch = _ch[0];               // the quote character we are looking for
    int count = 1;              // counts consecutive quote characters
    for (;; ch = yyinput()) {
        str += ch;
        if (ch == _ch) {
            if (count == 3)
                return str;     // third quote in a row: end of the literal
            else
                count++;
        } else
            count = 1;
        if ("\n" == ch || "\r" == ch)
            yylloc.first_line++;
    }
}
Here, in fact, one could get by with a regular expression, but then it would not be possible to determine the line number correctly: we would have no way to find out how many lines the regex has eaten (or can we? If so, write in with a way).
Also, do not forget to skip comment lines:
static void skipComments() {
    for (char ch = yyinput();; ch = yyinput()) {
        if ('\0' == ch || '\n' == ch || '\r' == ch) {
            yylloc.first_line++;
            break;
        }
    }
}
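These helpers would be hooked into the lexer rules roughly like this (a sketch; the real lexer presumably also returns a MESSAGE token for string literals):

"#"    { skipComments(); }       /* a '#' starts a comment line */
['"]   { skipMessage(yytext); }  /* a quote may open a (triple-quoted) literal */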
Now on to the stack algorithm.
Determining function nesting
We act according to the algorithm I described above.
For convenience, we define the types (the meta_stack typedef is implied by the code below):

typedef pair<string, int> meta_data;   // <class name, indent level>
typedef stack<meta_data>  meta_stack;
We push onto the stack a pair <class name, indent>. The action body below is reconstructed from the algorithm described above; the exact code is in the repository:

class_def: CLASS classname inheritance COLON suite
    {
        int indent = @1.last_column;
        clean_stack(class_st, indent);          /* drop classes at the same or deeper level */
        class_st.push(meta_data($2, indent));   /* remember the new class and its indent */
    }
The function that cleans the stack down to the current indent:

static void clean_stack(meta_stack& stack, int indent) {
    while (!stack.empty()) {
        if (indent > stack.top().second)
            break;          // everything at the same or deeper level is popped
        stack.pop();
    }
}
Finally, we determine the full name of the function:

func_def: DEF funcname LBRACE func_args_list RBRACE COLON suite
    {
        string fnc_name = $2;
        int indent = @1.last_column;
        clean_stack(class_st, indent);
        meta_stack tmp_class_st(class_st);
        while (!tmp_class_st.empty()) {
            fnc_name = tmp_class_st.top().first + "." + fnc_name;
            tmp_class_st.pop();
        }
    }
Conclusion
I did not describe function call detection in this article, since it is quite similar to finding function declarations, although it has its own difficulties. If there is interest, here is my repository on github: https://github.com/sagod/pyparser . Comments, suggestions, and pull requests are welcome.