📜 ⬆️ ⬇️

We write an elegant parser on Python

In C ++ 17 (no, no, Python will be soon, you entered correctly!) A new syntax for the if appears, allowing you to declare variables right in the block header. This is quite convenient because the constructions of the form

 Foo foo = make_foo(); if(foo.is_nice()) { // do work with foo } // never use foo again // foo gets deleted 

pretty common. The code above with a slight movement of the programmer’s hand (and the heavy movement of the standardization committee’s hand) turns into:

 if(Foo foo = make_foo(); foo.is_nice()) { // do work with foo } // foo gets deleted // never use foo again (well, you can't anyway) 

It has become a little better, although it still does not look perfect. There is no such thing in Python, but if you hate if in Python code as much as I do and want to learn how to quickly write simple parsers, then welcome to cat. In this article we will try to write a short and elegant parser for JSON in Python 2 (without any additional modules, of course).

What is parsing and what it eats


Parsing (in Russian, “syntactic analysis”) is the immortal task of disassembling and transforming into meaningful units something written in a fixed language, be it a programming language, a markup language, a structured query language, or the main language of life, the Universe, and all that. A typical sequence of problem solving steps looks like this:
')
  1. Describe the language . Of course, you first need to decide which task we are solving. Usually, a language description is another variation of the Backus-Naur form . ( Here , for example, the description of the Python grammar, which is used when building its parser.) At the same time, both the rules of "constructing sentences" in the language and the rules for defining valid words are established.

  2. Split the input into tokens . A lexical analyzer is written (popularly a tokenizer) that breaks the input string or file into a sequence of tokens , that is, valid words of our language (or whines, that this cannot be done).

  3. Check syntax and build syntax tree . We check if the sequence of tokens matches the description of our language. Here, algorithms like the recursive descent method are used. Each valid sentence of a language includes a finite number of valid words or other valid sentences; if the tokens were able to form a coherent picture, then at the output we automatically get a tree, which is called an abstract syntax tree .

  4. Make it finally work . You have a syntax tree and you can finally do what you wanted: calculate the value of the arithmetic expression, organize the query in the database, compile the program, display the web page, and so on.

In general, this area has been studied along and across and is full of remarkable results, and hundreds (possibly good) books have been written on it. However, the theoretical solvability of the problem and writing code is not the same thing.

Model problem


We will illustrate parser writing in a simple, but not completely trivial example - parsing JSON. The grammar looks like this:

 root ::= value value ::= string | number | object | array | 'true' | 'false' | 'null' array ::= '[' ']' | '[' comma-separated-values ']' comma-separated-values ::= value | value ',' comma-separated-values object ::= '{' '}' | '{' comma-separated-keyvalues '}' comma-separated-keyvalues ::= keyvalue | keyvalue ',' comma-separated-keyvalues keyvalue ::= string ':' value 

There are no rules for string and number - they, along with all the strings in quotes, will be our tokens.

Parsim json


We will not write a full tokenizer (this is boring and not quite the topic of the article) - we will work with the whole line and beat it with tokens as necessary. We write the first two functions:

 import re #  re.DOTALL        number_regex = re.compile(r"(-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)\s*(.*)", re.DOTALL) def parse_number(src): match = number_regex.match(src) if match is not None: number, src = match.groups() return eval(number), src #  eval -   ,    string_regex = re.compile(r"('(?:[^\\']|\\['\\/bfnrt]|\\u[0-9a-fA-F]{4})*?')\s*(.*)", re.DOTALL) def parse_string(src): match = string_regex.match(src) if match is not None: string, src = match.groups() return eval(string), src #     JSON' #    ,    

(I promised without if'ov, but this is the last, honestly!)

For everything else, we will write one function that generates simple parser functions:

 def parse_word(word, value=None): l = len(word) def result(src): #      .lower()  case-insensitive ! if src.startswith(word): #  if!   ! return value, src[l:].lstrip() # lstrip    , .  result.__name__ = "parse_%s" % word #   return result parse_true = parse_word("true", True) parse_false = parse_word("false", False) parse_null = parse_word("null", None) 

Total, on what basis we build our functions:

  1. They take the string to parse.
  2. They return a pair (the result, the remaining_string) for success (that is, when the required construction was found at the beginning of the line) and None for failure.
  3. They send into oblivion all the whitespace between the tokens. (Do not do this if you are writing the Python parser!)

Actually, on these three functions, problems with tokens are solved, and we can move on to the interesting part.

Parsim rule with branching


What should the parse_value function look like for the grammar above? Usually something like this:

 def parse_value(src): #     match = parse_string(src) if match is not None: # ! return match #  ;      match = parse_number(src) if match is not None: return match #    .    ... # ... 

Well, no, these if got me!

Let's change the three functions above in a most unexpected way: replace return with yield ! Now they return the generators — empty if the parsing failed, and with exactly one element if successful. Yes, we are expanding our principle number 2 by 90 degrees: we will now write all our functions in this style:

 number_regex = re.compile(r"(-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)\s*(.*)", re.DOTALL) def parse_number(src): match = number_regex.match(src) if match is not None: number, src = match.groups() yield eval(number), src #       yield,    . #   ,    . string_regex = re.compile(r"('(?:[^\\']|\\['\\/bfnrt]|\\u[0-9a-fA-F]{4})*?')\s*(.*)", re.DOTALL) def parse_string(src): match = string_regex.match(src) if match is not None: string, src = match.groups() yield eval(string), src def parse_word(word, value=None): l = len(word) def result(src): if src.startswith(word): yield value, src[l:].rstrip() result.__name__ = "parse_%s" % word return result #   -,  yield' parse_true = parse_word("true", True) parse_false = parse_word("false", False) parse_null = parse_word("null", None) 

What will our parse_value turn parse_value ? At first glance, in something like this:

 def parse_value(src): for match in parse_string(src): yield match return for match in parse_number(src): yield match return # ... 

But at a second glance, we will see that each option can take only one line!

 #   itertools.chain     #  ,       from itertools import chain def parse_value(src): for match in chain( parse_string(src), parse_number(src), parse_array(src), parse_object(src), parse_true(src), parse_false(src), parse_null(src), ): #   ,     yield match return 

At the same time, the efficiency remains at the same level - each function will start to be executed (and therefore, to do work, checking regular expressions) only when the previous one fails. return ensures that the extra work will not be performed if somewhere in the middle of the list the parsing is successful.

Parsim construction sequences


Let's move on to the next number of our program - the parse_array function. It should look something like this:

 parse_left_square_bracket = parse_word("[") parse_right_square_bracket = parse_word("]") def parse_array(src): #     tsrc,     #        "" for _, tsrc in parse_left_square_bracket(src): for _, tsrc in parse_right_square_bracket(tsrc): #   ,      '['  ']' yield [], tsrc return #   src    --        for _, src in parse_left_square_bracket(src): for items, src in parse_comma_separated_values(src): for _, src in parse_right_square_bracket(src): yield items, src #      yield,      

Not a single if , as promised, but something is still wrong ... Let's write a small helper function that will help us connect the parser functions in sequence, just like the chain helped connect them in the "or" mode. This function will have to carefully take all the results and return all the first elements of the results (analysis results) and the last second element (the remaining unanalyzed part of the line). My version looks like this:

 def sequence(*funcs): if len(funcs) == 0: #  ,      if' def result(src): yield (), src return result def result(src): for arg1, src in funcs[0](src): for others, src in sequence(*funcs[1:])(src): yield (arg1,) + others, src #    return result 

With this powerful (albeit scary) tool, our function will be rewritten as:

 parse_left_square_bracket = parse_word("[") parse_right_square_bracket = parse_word("]") parse_empty_array = sequence(parse_left_square_bracket, parse_right_square_bracket) def parse_array(src): for _, src in parse_empty_array(src): # ,    ,   [] yield [], src return #  return ,      #     {} {"a": 1} for (_, items, _), src in sequence( parse_left_square_bracket, parse_comma_separated_values, parse_right_square_bracket, )(src): yield items, src #      yield,      

Well, add the function parse_comma_separated_values - just spit:

 parse_comma = parse_word(",") def parse_comma_separated_values(src): for (value, _, values), src in sequence( parse_value, parse_comma, parse_comma_separated_values #  ,       if?   )(src): yield [value] + values, src return for value, src in parse_value(src): yield [value], src 

Will such a decision lead to infinite recursion? Not! Once the function parse_comma does not find another comma, and the subsequent parse_comma_separated_values will not parse_comma_separated_values executed anymore.

Go ahead! An object:

 parse_left_curly_bracket = parse_word("{") parse_right_curly_bracket = parse_word("}") parse_empty_object = sequence(parse_left_curly_bracket, parse_right_curly_bracket) def parse_object(src): for _, src in parse_empty_object(src): yield {}, src return for (_, items, _), src in sequence( parse_left_curly_bracket, parse_comma_separated_keyvalues, parse_right_curly_bracket, )(src): yield items, src parse_colon = parse_word(":") def parse_keyvalue(src): for (key, _, value), src in sequence( parse_string, parse_colon, parse_value )(src): yield {key: value}, src def parse_comma_separated_keyvalues(src): for (keyvalue, _, keyvalues), src in sequence( parse_keyvalue, parse_comma, parse_comma_separated_keyvalues, #   ,   )(src): keyvalue.update(keyvalues) yield keyvalue, src return for keyvalue, src in parse_keyvalue(src): #  ,         yield keyvalue, src 

Well, what's next?


Actually, everything! It remains to add a simple interface function:

 def parse(s): s = s.strip() #      ,     match = list(parse_value(s)) if len(match) != 1: # - -   .       :) raise ValueError("not a valid JSON string") result, src = match[0] if src.strip(): #  ,     - .  . raise ValueError("not a valid JSON string") return result 

Voila!

All code together
 from itertools import chain import re def sequence(*funcs): if len(funcs) == 0: def result(src): yield (), src return result def result(src): for arg1, src in funcs[0](src): for others, src in sequence(*funcs[1:])(src): yield (arg1,) + others, src return result number_regex = re.compile(r"(-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)\s*(.*)", re.DOTALL) def parse_number(src): match = number_regex.match(src) if match is not None: number, src = match.groups() yield eval(number), src string_regex = re.compile(r"('(?:[^\\']|\\['\\/bfnrt]|\\u[0-9a-fA-F]{4})*?')\s*(.*)", re.DOTALL) def parse_string(src): match = string_regex.match(src) if match is not None: string, src = match.groups() yield eval(string), src def parse_word(word, value=None): l = len(word) def result(src): if src.startswith(word): yield value, src[l:].lstrip() result.__name__ = "parse_%s" % word return result parse_true = parse_word("true", True) parse_false = parse_word("false", False) parse_null = parse_word("null", None) def parse_value(src): for match in chain( parse_string(src), parse_number(src), parse_array(src), parse_object(src), parse_true(src), parse_false(src), parse_null(src), ): yield match return parse_left_square_bracket = parse_word("[") parse_right_square_bracket = parse_word("]") parse_empty_array = sequence(parse_left_square_bracket, parse_right_square_bracket) def parse_array(src): for _, src in parse_empty_array(src): yield [], src return for (_, items, _), src in sequence( parse_left_square_bracket, parse_comma_separated_values, parse_right_square_bracket, )(src): yield items, src parse_comma = parse_word(",") def parse_comma_separated_values(src): for (value, _, values), src in sequence( parse_value, parse_comma, parse_comma_separated_values )(src): yield [value] + values, src return for value, src in parse_value(src): yield [value], src parse_left_curly_bracket = parse_word("{") parse_right_curly_bracket = parse_word("}") parse_empty_object = sequence(parse_left_curly_bracket, parse_right_curly_bracket) def parse_object(src): for _, src in parse_empty_object(src): yield {}, src return for (_, items, _), src in sequence( parse_left_curly_bracket, parse_comma_separated_keyvalues, parse_right_curly_bracket, )(src): yield items, src parse_colon = parse_word(":") def parse_keyvalue(src): for (key, _, value), src in sequence( parse_string, parse_colon, parse_value )(src): yield {key: value}, src def parse_comma_separated_keyvalues(src): for (keyvalue, _, keyvalues), src in sequence( parse_keyvalue, parse_comma, parse_comma_separated_keyvalues, )(src): keyvalue.update(keyvalues) yield keyvalue, src return for keyvalue, src in parse_keyvalue(src): yield keyvalue, src def parse(s): s = s.strip() match = list(parse_value(s)) if len(match) != 1: raise ValueError("not a valid JSON string") result, src = match[0] if src.strip(): raise ValueError("not a valid JSON string") return result 


130 lines. Let's try to run:

 >>> import my_json >>> my_json.parse("null") >>> my_json.parse("true") True >>> my_json.parse("false") False >>> my_json.parse("0.31415926E1") 3.1415926 >>> my_json.parse("[1, true, '1']") [1, True, '1'] >>> my_json.parse("{}") {} >>> my_json.parse("{'a': 1, 'b': null}") {'a': 1, 'b': None} 

Success!

Conclusion


Of course, I have considered far from all the situations that may arise when writing parsers. Sometimes a programmer may need to manually control execution, rather than launching a sequence of chain ov and sequence ov. Fortunately, this is not so inconvenient in the considered approach, as it may seem. So, if you need to try to parse an optional construction and make an action depending on its presence, you can write:

 for stuff, src in parse_optional_stuff(src): #     --  break #   else else: #    --  pass 

Here we use the unpopular Python chip - the else block of cycles, which is executed if the cycle has reached the end without a break . It doesn’t look as attractive as our code in the article, but it’s definitely not worse than those if we’ve got rid of so gracefully.

Despite the incompleteness and lack of academic presentation, I hope that this article will be useful for beginner programmers, and maybe even surprised by the new approach of advanced programmers. At the same time, I am well aware that this is just a new form for the good old recursive descent; but if programming is an art, isn't the form important in it, if not on a par, then at least to a degree close to the content? ..

As usual, without delay, write to the PM about any inaccuracies found, spelling, grammatical and factual errors - otherwise I will burn with shame!

Source: https://habr.com/ru/post/309242/


All Articles