The Python grammar is described in the Grammar/Grammar file in the form of regular expressions (with a slightly different syntax). From this grammar, when python itself is built, Parser/pgen generates a whole set of finite automata that recognize these regular expressions, one automaton per nonterminal. The format of the resulting set of automata is described in Include/grammar.h, and the automata themselves are defined in Python/graminit.c in the form of the global structure _PyParser_Grammar. Terminal symbols are defined in Include/token.h and correspond to the numbers 0..56; nonterminal numbers start at 256.
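These numbers can be checked from Python itself: interpreters that still ship the old parser (CPython 3.9 and earlier) expose them through the stdlib token and symbol modules, which mirror Include/token.h and Include/graminit.h. A quick sanity check (the exact values below are the ones from the 3.7 tree this article describes and may shift slightly between versions):

import token, symbol

print(token.NAME, token.NUMBER, token.STRING, token.NEWLINE, token.COLON)
# 1 2 3 4 11: terminal numbers from Include/token.h

print(symbol.single_input, symbol.simple_stmt, symbol.compound_stmt,
      symbol.if_stmt, symbol.test)
# 256 270 295 297 305: nonterminal numbers start at 256

print(token.ISNONTERMINAL(symbol.if_stmt))   # True: everything >= 256 is a nonterminal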
As a running example, consider how the parser handles the one-line program if 42: print("Hello world"). The grammar productions involved in parsing it are:
single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE
compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated | async_stmt
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
small_stmt: expr_stmt | del_stmt | pass_stmt | flow_stmt | import_stmt | global_stmt | nonlocal_stmt | assert_stmt
expr_stmt: testlist_star_expr (annassign | augassign (yield_expr | testlist) | ('=' (yield_expr | testlist_star_expr))*)
testlist_star_expr: (test | star_expr) (',' (test | star_expr))* [',']
test: or_test ['if' or_test 'else' test] | lambdef
or_test: and_test ('or' and_test)*
and_test: not_test ('and' not_test)*
not_test: 'not' not_test | comparison
comparison: expr (comp_op expr)*
expr: xor_expr ('|' xor_expr)*
xor_expr: and_expr ('^' and_expr)*
and_expr: shift_expr ('&' shift_expr)*
shift_expr: arith_expr (('<<' | '>>') arith_expr)*
arith_expr: term (('+' | '-') term)*
term: factor (('*' | '@' | '/' | '%' | '//') factor)*
factor: ('+' | '-' | '~') factor | power
power: atom_expr ['**' factor]
atom_expr: [AWAIT] atom trailer*
atom: '(' [yield_expr | testlist_comp] ')' | '[' [testlist_comp] ']' | '{' [dictorsetmaker] '}' | NAME | NUMBER | STRING+ | '...' | 'None' | 'True' | 'False'
trailer: '(' [arglist] ')' | '[' subscriptlist ']' | '.' NAME
arglist: argument (',' argument)* [',']
argument: test [comp_for] | test '=' test | '**' test | '*' test
static arc arcs_0_0[3] = {
    {2, 1},
    {3, 1},
    {4, 2},
};
static arc arcs_0_1[1] = {
    {0, 1},
};
static arc arcs_0_2[1] = {
    {2, 1},
};
static state states_0[3] = {
    {3, arcs_0_0},
    {1, arcs_0_1},
    {1, arcs_0_2},
};
//...
static arc arcs_39_0[9] = {
    {91, 1},
    {92, 1},
    {93, 1},
    {94, 1},
    {95, 1},
    {19, 1},
    {18, 1},
    {17, 1},
    {96, 1},
};
static arc arcs_39_1[1] = {
    {0, 1},
};
static state states_39[2] = {
    {9, arcs_39_0},
    {1, arcs_39_1},
};
//...
static arc arcs_41_0[1] = {
    {97, 1},
};
static arc arcs_41_1[1] = {
    {26, 2},
};
static arc arcs_41_2[1] = {
    {27, 3},
};
static arc arcs_41_3[1] = {
    {28, 4},
};
static arc arcs_41_4[3] = {
    {98, 1},
    {99, 5},
    {0, 4},
};
static arc arcs_41_5[1] = {
    {27, 6},
};
static arc arcs_41_6[1] = {
    {28, 7},
};
static arc arcs_41_7[1] = {
    {0, 7},
};
static state states_41[8] = {
    {1, arcs_41_0},
    {1, arcs_41_1},
    {1, arcs_41_2},
    {1, arcs_41_3},
    {3, arcs_41_4},
    {1, arcs_41_5},
    {1, arcs_41_6},
    {1, arcs_41_7},
};
//...
static dfa dfas[85] = {
    {256, "single_input", 0, 3, states_0,
     "\004\050\340\000\002\000\000\000\012\076\011\007\262\004\020\002\000\300\220\050\037\102"},
    //...
    {270, "simple_stmt", 0, 4, states_14,
     "\000\040\200\000\002\000\000\000\012\076\011\007\000\000\020\002\000\300\220\050\037\100"},
    //...
    {295, "compound_stmt", 0, 2, states_39,
     "\000\010\140\000\000\000\000\000\000\000\000\000\262\004\000\000\000\000\000\000\000\002"},
    {296, "async_stmt", 0, 3, states_40,
     "\000\000\040\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"},
    {297, "if_stmt", 0, 8, states_41,
     "\000\000\000\000\000\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\000\000"},
    //                                              ^^^
    //...
};
static label labels[176] = {
    {0, "EMPTY"},
    {256, 0},
    {4, 0},      // №2
    {270, 0},    // №3
    {295, 0},    // №4
    //...
    {305, 0},    // №26
    {11, 0},     // №27
    //...
    {297, 0},    // №91
    {298, 0},
    {299, 0},
    {300, 0},
    {301, 0},
    {296, 0},
    {1, "if"},   // №97
    //...
};
grammar _PyParser_Grammar = {
    86,
    dfas,
    {176, labels},
    256
};
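Besides its states, each dfa entry above carries a bitset (the d_first set discussed below) of the labels that may begin that nonterminal. The walkthrough below relies on one particular bit of it, which is easy to verify in Python by copying the two octal strings from the listing (the has_label helper here is my own, purely for illustration):

# d_first bitsets copied verbatim from the dfas[] listing above
compound_stmt_first = b"\000\010\140\000\000\000\000\000\000\000\000\000\262\004\000\000\000\000\000\000\000\002"
simple_stmt_first   = b"\000\040\200\000\002\000\000\000\012\076\011\007\000\000\020\002\000\300\220\050\037\100"

def has_label(bitset, label):
    # label N lives in bit (N % 8) of byte (N // 8)
    return bool(bitset[label // 8] & (1 << (label % 8)))

print(has_label(compound_stmt_first, 97))   # True:  label 97 ('if') can start a compound_stmt
print(has_label(simple_stmt_first, 97))     # False: it cannot start a simple_stmt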
Parsing begins with the start nonterminal single_input; its automaton is defined by the states_0 array of three states. From the initial state there are three arcs (arcs_0_0), corresponding to the input symbols NEWLINE (label №2, symbol №4), simple_stmt (label №3, symbol №270) and compound_stmt (label №4, symbol №295). The input terminal "if" (symbol №1, label №97) belongs to the d_first set for compound_stmt, but not to the one for simple_stmt: it corresponds to the \002 bit in the 13th byte of the set. (If during parsing the next terminal turns out to belong to the d_first sets of several outgoing arcs at once, the parser reports that the grammar is ambiguous and refuses to continue.) So the parser follows the {4, 2} arc to state №2 and at the same time switches to the compound_stmt automaton defined by the states_39 array. In this automaton, nine arcs leave the initial state (arcs_39_0); among them the parser picks the arc {91, 1}, which corresponds to the input symbol if_stmt (symbol №297). The parser moves to state №1 and switches to the if_stmt automaton defined by the states_41 array.
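To make these tables a bit more tangible, here is a rough Python transcription of the if_stmt automaton from states_41 (a sketch for illustration only, not CPython code; if_stmt_dfa and matches are hypothetical names, and the numbers are labels from the labels[] array above):

# Each state maps a label to the next state; an arc with label 0 marks an accepting state.
if_stmt_dfa = [
    {97: 1},                 # state 0: 'if'
    {26: 2},                 # state 1: test
    {27: 3},                 # state 2: ':'
    {28: 4},                 # state 3: suite
    {98: 1, 99: 5, 0: 4},    # state 4: 'elif' loops back, 'else' goes to 5, or accept
    {27: 6},                 # state 5: ':'
    {28: 7},                 # state 6: suite
    {0: 7},                  # state 7: accept
]

def matches(dfa, labels):
    """Walk the automaton over a sequence of label numbers."""
    state = 0
    for lab in labels:
        if lab not in dfa[state]:
            return False              # no arc for this symbol: syntax error
        state = dfa[state][lab]
    return 0 in dfa[state]            # accepting iff the state has an arc with label 0

print(matches(if_stmt_dfa, [97, 26, 27, 28]))               # 'if' test ':' suite
print(matches(if_stmt_dfa, [97, 26, 27, 28, 99, 27, 28]))   # ... with an 'else' branch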
From the initial state of this automaton there is a single arc, {97, 1}, corresponding to our input terminal "if". Following it, the parser moves to state №1, out of which, again, a single arc {26, 2} leads, corresponding to the nonterminal test (№305). Along this arc the parser moves to state №2 and switches to the test automaton, and so on. After a long, long chain of reductions that turns the number 42 into a test nonterminal, the parser returns to state №2, follows its only arc {27, 3}, corresponding to the COLON terminal (symbol №11), and continues parsing the nonterminal if_stmt.
As a result of parsing if_stmt, the parser creates a node of the concrete syntax tree (CST; a node structure) of type 297 with four children of types 1 (NAME), 305 (test), 11 (COLON) and 304 (suite). In state №4 the node for if_stmt is complete; the parser returns to state №1 of the compound_stmt automaton, and the freshly created if_stmt node becomes the only child of the node corresponding to compound_stmt. The complete CST of our program is shown on the right. Notice how a short program of just seven terminals, NAME NUMBER COLON NAME LPAR STRING RPAR, turns into a "bamboo stalk": a huge, almost unbranching parse tree of as many as 64 nodes! Counted in bytes, the source program takes 27 bytes, while its CST is two orders of magnitude larger: one and a half kilobytes on a 32-bit system and three kilobytes on a 64-bit one (64 nodes of 24 or 48 bytes each). It is because of this runaway CST growth that trying to push source files of just a few tens of megabytes through python ends in a fatal MemoryError.
The CST can be examined from Python itself using the parser module:
$ python
Python 3.7.0a0 (default:98c078fca8e0, Oct 31 2016, 08:33:23)
[GCC 4.7.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import parser
>>> parser.suite('if 42: print("Hello world")').tolist()
[257, [269, [295, [297, [1, 'if'], [305, [309, [310, [311, [312, [315, [316, [317, [318, [319, [320, [321, [322, [323, [324, [2, '42']]]]]]]]]]]]]]]], [11, ':'], [304, [270, [271, [272, [274, [305, [309, [310, [311, [312, [315, [316, [317, [318, [319, [320, [321, [322, [323, [324, [1, 'print']], [326, [7, '('], [334, [335, [305, [309, [310, [311, [312, [315, [316, [317, [318, [319, [320, [321, [322, [323, [324, [3, '"Hello world"']]]]]]]]]]]]]]]]]], [8, ')']]]]]]]]]]]]]]]]]]], [4, '']]]]]], [4, ''], [0, '']]
>>>
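The numeric node types in this dump can be decoded with the symbol and token modules mentioned earlier. The following helpers (name, count_nodes and dump are my own sketch, not part of CPython) print the tree with names and give a back-of-the-envelope size estimate using the 24- and 48-byte node sizes quoted above:

import parser, symbol, token

def name(num):
    # symbol.sym_name covers nonterminals (>= 256), token.tok_name covers terminals
    return symbol.sym_name.get(num) or token.tok_name.get(num, str(num))

def count_nodes(tree):
    # tree is the nested-list form returned by .tolist(): [type, child-or-text, ...]
    return 1 + sum(count_nodes(child) for child in tree if isinstance(child, list))

def dump(tree, depth=0):
    kids = [c for c in tree if isinstance(c, list)]
    text = '' if kids else ' ' + repr(tree[1])
    print('  ' * depth + name(tree[0]) + text)
    for child in kids:
        dump(child, depth + 1)

cst = parser.suite('if 42: print("Hello world")').tolist()
dump(cst)
n = count_nodes(cst)
print(n, 'nodes,', n * 24, 'bytes on 32-bit,', n * 48, 'bytes on 64-bit')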
Inside the parser module (Modules/parsermodule.c) there used to be more than 2000 handwritten lines of code checking the CST for conformance with the Python grammar; they looked like this:
//...
/* simple_stmt | compound_stmt
 *
 */
static int
validate_stmt(node *tree)
{
    int res = (validate_ntype(tree, stmt)
               && validate_numnodes(tree, 1, "stmt"));

    if (res) {
        tree = CHILD(tree, 0);

        if (TYPE(tree) == simple_stmt)
            res = validate_simple_stmt(tree);
        else
            res = validate_compound_stmt(tree);
    }
    return (res);
}

static int
validate_small_stmt(node *tree)
{
    int nch = NCH(tree);
    int res = validate_numnodes(tree, 1, "small_stmt");

    if (res) {
        int ntype = TYPE(CHILD(tree, 0));

        if (  (ntype == expr_stmt)
              || (ntype == del_stmt)
              || (ntype == pass_stmt)
              || (ntype == flow_stmt)
              || (ntype == import_stmt)
              || (ntype == global_stmt)
              || (ntype == nonlocal_stmt)
              || (ntype == assert_stmt))
            res = validate_node(CHILD(tree, 0));
        else {
            res = 0;
            err_string("illegal small_stmt child type");
        }
    }
    else if (nch == 1) {
        res = 0;
        PyErr_Format(parser_error,
                     "Unrecognized child node of small_stmt: %d.",
                     TYPE(CHILD(tree, 0)));
    }
    return (res);
}

/* compound_stmt:
 *      if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated
 */
static int
validate_compound_stmt(node *tree)
{
    int res = (validate_ntype(tree, compound_stmt)
               && validate_numnodes(tree, 1, "compound_stmt"));
    int ntype;

    if (!res)
        return (0);

    tree = CHILD(tree, 0);
    ntype = TYPE(tree);
    if (  (ntype == if_stmt)
          || (ntype == while_stmt)
          || (ntype == for_stmt)
          || (ntype == try_stmt)
          || (ntype == with_stmt)
          || (ntype == funcdef)
          || (ntype == async_stmt)
          || (ntype == classdef)
          || (ntype == decorated))
        res = validate_node(tree);
    else {
        res = 0;
        PyErr_Format(parser_error,
                     "Illegal compound statement type: %d.",
                     TYPE(tree));
    }
    return (res);
}
//...
Those two thousand lines in Modules/parsermodule.c are no longer there; in their place are a few dozen lines of my code.
The next stage of compilation (Python/ast.c), written entirely by hand, traverses the CST and builds an abstract syntax tree (AST) from it. The grammar of the AST is described in Parser/Python.asdl; for our program, the AST is shown on the right.
Instead of the parser module, the documentation advises using the ast module and working with the abstract tree:
$ python
Python 3.7.0a0 (default:98c078fca8e0, Oct 31 2016, 08:33:23)
[GCC 4.7.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ast
>>> ast.dump(ast.parse('if 42: print("Hello world")'))
"Module(body=[If(test=Num(n=42), body=[Expr(value=Call(func=Name(id='print', ctx=Load()), args=[Str(s='Hello world')], keywords=[]))], orelse=[])])"
>>>
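How much smaller the AST is than the CST is easy to measure; the sketch below reuses the same node-counting idea as before (count_nodes is my own helper, ast.walk and parser.suite are stdlib):

import ast, parser

def count_nodes(tree):   # same helper as in the CST sketch above
    return 1 + sum(count_nodes(c) for c in tree if isinstance(c, list))

src = 'if 42: print("Hello world")'
print(count_nodes(parser.suite(src).tolist()),        # dozens of CST nodes
      sum(1 for _ in ast.walk(ast.parse(src))))       # versus well under a dozen AST nodes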
When a script's source is dominated by one huge literal (for example, a dict literal), the size of the CST determines the script's memory consumption; and in any case, the CST size determines whether there is enough memory to load and run the program at all.
The resulting code in Python/ast.c is just awful:
static expr_ty
ast_for_expr(struct compiling *c, const node *n)
{
    //...
 loop:
    switch (TYPE(n)) {
        case test:
        case test_nocond:
            if (TYPE(CHILD(n, 0)) == lambdef ||
                TYPE(CHILD(n, 0)) == lambdef_nocond)
                return ast_for_lambdef(c, CHILD(n, 0));
            else if (NCH(n) > 1)
                return ast_for_ifexpr(c, n);
            /* Fallthrough */
        case or_test:
        case and_test:
            if (NCH(n) == 1) {
                n = CHILD(n, 0);
                goto loop;
            }
            //...
        case not_test:
            if (NCH(n) == 1) {
                n = CHILD(n, 0);
                goto loop;
            }
            //... not_test
        case comparison:
            if (NCH(n) == 1) {
                n = CHILD(n, 0);
                goto loop;
            }
            //... comparison
        case star_expr:
            return ast_for_starred(c, n);
        /* The next five cases all handle BinOps.  The main body of code is
           the same in each case, but the switch turned inside out to
           reuse the code for each type of operator.
         */
        case expr:
        case xor_expr:
        case and_expr:
        case shift_expr:
        case arith_expr:
        case term:
            if (NCH(n) == 1) {
                n = CHILD(n, 0);
                goto loop;
            }
            return ast_for_binop(c, n);
        // case yield_expr:
        case factor:
            if (NCH(n) == 1) {
                n = CHILD(n, 0);
                goto loop;
            }
            return ast_for_factor(c, n);
        case power:
            return ast_for_power(c, n);
        default:
            PyErr_Format(PyExc_SystemError, "unhandled expr: %d", TYPE(n));
            return NULL;
    }
    /* should never get here unless if error is set */
    return NULL;
}
The recurring check if (NCH(n) == 1) n = CHILD(n, 0); (sometimes, as in this function, wrapped in a loop) means "if the current node has only one child, look at that child instead of the current node". Hence the goto loop; scattered all over the code, at the sight of which Dijkstra must be spinning in his grave!
In addition to the fix for Modules/parsermodule.c, I proposed another patch, which compresses the CST on the fly, as the parser finishes each node:
diff -r ffc915a55a72 Parser/parser.c
--- a/Parser/parser.c   Mon Sep 05 00:01:47 2016 -0400
+++ b/Parser/parser.c   Mon Sep 05 08:30:19 2016 +0100
@@ -52,16 +52,16 @@
 #ifdef Py_DEBUG
 static void
 s_pop(stack *s)
 {
     if (s_empty(s))
         Py_FatalError("s_pop: parser stack underflow -- FATAL");
-    s->s_top++;
+    PyNode_Compress(s->s_top++->s_parent);
 }

 #else /* !Py_DEBUG */

-#define s_pop(s) (s)->s_top++
+#define s_pop(s) PyNode_Compress((s)->s_top++->s_parent)

 #endif
diff -r ffc915a55a72 Python/ast.c
--- a/Python/ast.c      Mon Sep 05 00:01:47 2016 -0400
+++ b/Python/ast.c      Mon Sep 05 08:30:19 2016 +0100
@@ -5070,3 +5056,24 @@
     FstringParser_Dealloc(&state);
     return NULL;
 }
+
+void PyNode_Compress(node* n) {
+    if (NCH(n) == 1) {
+        node* ch;
+        switch (TYPE(n)) {
+        case suite: case comp_op: case subscript: case atom_expr:
+        case power: case factor: case expr: case xor_expr:
+        case and_expr: case shift_expr: case arith_expr: case term:
+        case comparison: case testlist_star_expr: case testlist:
+        case test: case test_nocond: case or_test: case and_test:
+        case not_test: case stmt: case dotted_as_name:
+            if (STR(n) != NULL)
+                PyObject_FREE(STR(n));
+            ch = CHILD(n, 0);
+            *n = *ch;
+            // All grandchildren are now adopted; don't need to free them,
+            // so no need for PyNode_Free
+            PyObject_FREE(ch);
+        }
+    }
+}
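The effect of PyNode_Compress on the shape of the tree can be mimicked in pure Python on the nested lists returned by parser.suite().tolist(). The sketch below (compress and count_nodes are my own names) is deliberately cruder than the patch: it collapses every single-child chain, while the real code only touches the whitelist of node types in the switch above:

import parser

def compress(tree):
    """Collapse every chain of nodes that have exactly one child (illustration only)."""
    kids = [c for c in tree if isinstance(c, list)]
    while len(kids) == 1 and len(tree) == 2:   # a node whose only payload is one child
        tree = kids[0]
        kids = [c for c in tree if isinstance(c, list)]
    return [tree[0]] + [compress(c) if isinstance(c, list) else c for c in tree[1:]]

def count_nodes(tree):
    return 1 + sum(count_nodes(c) for c in tree if isinstance(c, list))

src = 'if 42: print("Hello world")'
full = parser.suite(src).tolist()
print(count_nodes(full), count_nodes(compress(full)))
# the collapsed tree keeps only a fraction of the original nodes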
With the CST compressed this way, the ast_for_expr function shown earlier is replaced with:
static expr_ty
ast_for_expr(struct compiling *c, const node *n)
{
    //...
    switch (TYPE(n)) {
        case lambdef:
        case lambdef_nocond:
            return ast_for_lambdef(c, n);
        case test:
        case test_nocond:
            assert(NCH(n) > 1);
            return ast_for_ifexpr(c, n);
        case or_test:
        case and_test:
            assert(NCH(n) > 1);
            //...
        case not_test:
            assert(NCH(n) > 1);
            //... not_test
        case comparison:
            assert(NCH(n) > 1);
            //... comparison
        case star_expr:
            return ast_for_starred(c, n);
        /* The next five cases all handle BinOps.  The main body of code is
           the same in each case, but the switch turned inside out to
           reuse the code for each type of operator.
         */
        case expr:
        case xor_expr:
        case and_expr:
        case shift_expr:
        case arith_expr:
        case term:
            assert(NCH(n) > 1);
            return ast_for_binop(c, n);
        // case yield_expr:
        case factor:
            assert(NCH(n) > 1);
            return ast_for_factor(c, n);
        case power:
            return ast_for_power(c, n);
        case atom:
            return ast_for_atom(c, n);
        case atom_expr:
            return ast_for_atom_expr(c, n);
        default:
            PyErr_Format(PyExc_SystemError, "unhandled expr: %d", TYPE(n));
            return NULL;
    }
    /* should never get here unless if error is set */
    return NULL;
}
To validate such compressed trees, Modules/parsermodule.c now has to use, instead of the "raw" _PyParser_Grammar, automata corresponding to the "transitive closure" of the grammar. For example, for the pair of productions

or_test ::= and_test
test ::= or_test 'if' or_test 'else' test

the following "transitive" productions are added:

test ::= or_test 'if' and_test 'else' test
test ::= and_test 'if' or_test 'else' test
test ::= and_test 'if' and_test 'else' test
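The closure itself is a straightforward fixed-point computation. Below is a rough sketch of the idea (this is not the actual Init_ValidationGrammar code; single_child and collapsible_to are hypothetical names, and the toy grammar lists only the single-child productions relevant to the example above):

# For each nonterminal: which symbols it may consist of when it has exactly one child.
single_child = {
    'test':     {'or_test'},
    'or_test':  {'and_test'},
    'and_test': {'not_test'},
    'not_test': {'comparison'},
}

def collapsible_to(grammar):
    """Compute, for every nonterminal, everything it may be collapsed into."""
    closure = {nt: set(targets) for nt, targets in grammar.items()}
    changed = True
    while changed:
        changed = False
        for targets in closure.values():
            for t in list(targets):
                extra = closure.get(t, set()) - targets
                if extra:
                    targets |= extra
                    changed = True
    return closure

print(collapsible_to(single_child)['test'])
# {'or_test', 'and_test', 'not_test', 'comparison'}: wherever the validator expects
# a test node, any of these node types may now appear instead.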
On initialization of the parser module, the Init_ValidationGrammar function traverses the auto-generated automata in _PyParser_Grammar, builds "transitively closed" automata from them, and stores them in the ValidationGrammar structure. It is ValidationGrammar that is then used to check the correctness of the CST.

$ python
Python 3.7.0a0 (default:98c078fca8e0+, Oct 31 2016, 09:00:27)
[GCC 4.7.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import parser
>>> parser.suite('if 42: print("Hello world")').tolist()
[257, [295, [297, [1, 'if'], [324, [2, '42']], [11, ':'], [270, [271, [272, [323, [324, [1, 'print']], [326, [7, '('], [334, [335, [324, [3, '"Hello world"']]]], [8, ')']]]]], [4, '']]]], [4, ''], [0, '']]
>>>
Together, these patches reduce the memory consumption of the python process by up to 30%, with running time unchanged. On degenerate examples, the reduction in memory consumption reaches threefold!
Source: https://habr.com/ru/post/314062/