As part of building a framework for an enterprise-class system, I was tasked with writing a utility for automated code generation from a UML model. For a quick and effective solution, nothing turned out to be more suitable than Ruby and its built-in ERB template engine.
The project file produced by the UML modeling environment was an SQLite3 database; however, the environment stored some of the information in that database as serialized BLOB objects. The serialization format was textual, but not compatible with any well-known one: it was not XML or YAML, and only remotely resembled JSON. No existing parser could handle it.
In simple cases, when you do not need the entire object but only a couple of scalar fields of a specific instance, you can of course just fish them out with regular expressions. For everything else, there is a universal approach that lets you quickly create your own parser for such structures and deserialize them into Ruby objects.
This is what the original serialized objects look like.
The task is to turn this into a nested Ruby structure built from Arrays and Hashes.
data.txt
7ghoJdyGAqACCgeT:"User":Class { FromEndRelationships=( <vdHoJdyGAqACCgfe>, <9bToJdyGAqACCgfF> ); _masterViewId="7ghoJdyGAqACCgeS"; pmLastModified="1355667704781"; pmAuthor="author"; Child=( {UwZoJdyGAqACCgei:"name":Attribute { visibility=71; pmLastModified="1355667655234"; pmAuthor="author"; type=<_n2oJdyGAqACCgXh>; pmCreateDateTime="1355667628234"; _modelViews=NULL; _modelEditable=T; }}, {9lZoJdyGAqACCgel:"created":Attribute { visibility=71; type_string="date"; pmLastModified="1355667655234"; pmAuthor="author"; pmCreateDateTime="1355667630703"; _modelViews=NULL; _modelEditable=T; }}, {nLFoJdyGAqACCgeo:"active":Attribute { visibility=71; pmLastModified="1355667655234"; pmAuthor="author"; type=<_n2oJdyGAqACCgXY>; pmCreateDateTime="1355667639609"; _modelViews=NULL; _modelEditable=T; }} ); pmCreateDateTime="1355667607671"; _modelViews=( {4QhoJdyGAqACCgeU:"View":ModelView { container=<hguoJdyGAqACCgeL>; view="7ghoJdyGAqACCgeS"; }} ); _modelEditable=T; }
Analyzing the original format, we can distinguish the following elements, which define its grammar:
- An object. It starts with a header of the form Id:"Name":Type, followed by a set of attributes in curly braces.
- Object attributes. Written as key=value, terminated by a semicolon.
- Attribute values. A value may be a string, a number, a reference, another object, or a collection of any of the listed values.
- Each value type has its own notation: strings are enclosed in quotes, references in angle brackets, objects in curly braces, and a collection of values in parentheses, with elements separated by commas.
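Putting these observations together, the grammar can be sketched in a BNF-like notation (the nonterminal names here are my own choice, not part of the format):

```text
object     ::= OBJECTID "{" attributes "}"
attributes ::= attribute | attributes attribute
attribute  ::= IDENTIFIER "=" value ";"
value      ::= STRING | NUMBER | BOOLEAN | NULL | REFERENCE
             | "{" object "}"
             | "(" values ")"
values     ::= value | values "," value
```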
Such a grammar is context-free, and can easily be described using the Backus-Naur form. For parsing, we can use a bottom-up shift-reduce algorithm. One of the most common algorithms in this category is LALR(1); it is used by such well-known "compiler compilers" as Yacc and GNU Bison. There is also an implementation for the Ruby platform we are interested in: Racc. We will use it to create our parser.
The bottom-up shift-reduce algorithm
To get a basic understanding of Racc and build the elements of our future parser, it is enough to understand, in simplified terms, how the bottom-up shift-reduce algorithm works.
Suppose we have an input stream of tokens:
1 + 2 + 3
We also define grammar rules (E stands for an expression, NUMBER for a numeric token):
E: E + E
E: NUMBER
The incoming stream is divided into two parts by a marker: the already analyzed part on the left and the not-yet-analyzed part on the right. The marker is shown as the '|' symbol. The algorithm sequentially reads one token from the stream; this is the shift operation, which moves the marker one position to the right. Then, whenever the part to the left of the marker matches the right-hand side of a grammar rule, the algorithm performs a reduce operation, replacing the matched sequence with the rule's nonterminal. This continues until the stream is exhausted and all reductions have been performed.
s: 1 | + 2 + 3
r: E | + 2 + 3      (E: NUMBER)
s: E + | 2 + 3
s: E + 2 | + 3
r: E + E | + 3      (E: NUMBER)
r: E | + 3          (E: E + E)
s: E + | 3
s: E + 3 |
r: E + E |          (E: NUMBER)
r: E |              (E: E + E)
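The shift-reduce loop can be sketched in a few lines of Ruby. This is a toy recognizer of my own, for a grammar like E: E '+' E | NUMBER; it is only an illustration of the algorithm, not part of Racc:

```ruby
# Toy shift-reduce recognizer for the grammar:
#   E: E '+' E
#   E: NUMBER
def shift_reduce(tokens)
  stack = []
  loop do
    if stack.last.is_a?(Integer)
      stack[-1] = :E                 # reduce by E: NUMBER
    elsif stack.last(3) == [:E, :+, :E]
      stack.pop(3)
      stack.push(:E)                 # reduce by E: E '+' E
    elsif !tokens.empty?
      stack.push(tokens.shift)       # shift the next token
    else
      break                          # no shift or reduce possible
    end
  end
  stack == [:E] ? :accepted : :rejected
end

shift_reduce([1, :+, 2, :+, 3])  # => :accepted
shift_reduce([1, :+])            # => :rejected
```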
Install Racc
The runtime for Racc-based parsers has been included with Ruby since version 1.8.x, so any parser created with Racc will work without extra steps. However, to be able to generate your own parsers, you need to install the racc package, which contains the parser generator. This is done in the standard way via gem:
# gem install racc
Creating a parser with Racc
The source for the Racc parser generator is a file (usually with a .y extension) containing the definition of the parser's Ruby class, plus additional Racc directives. The parser definition must contain the following elements:
- Definitions of the grammar's terminal symbols (tokens)
- Grammar and reduction rules
- A lexical analyzer
Tokens
So, the first thing to do is identify the grammar's terminal symbols, the elementary particles from which more complex structures are composed, and give them names. We get something like this:
- T_OBJECTID - object identifier
- T_IDENTIFIER - attribute identifier
- T_STRING - text string
- T_NUMBER - number
- T_BOOLEAN - boolean value
- T_NULL - null
- T_REFERENCE - reference (in angle brackets)
- T_LBR, T_RBR, T_LPAR, T_RPAR, T_EQ, T_COMMA, T_SEMICOLON - punctuation marks: different brackets, equal sign, comma, semicolon
In the parser class definition, the valid tokens are declared using the token keyword.
Lexical analyzer
The next step is to create a lexical analyzer. Its job is to scan the incoming stream, detect tokens in it, and so transform the stream into a sequence of tokens. When performing a shift operation, Racc requests the next token by calling the parser class's next_token method, which must return a structure containing the name and value of the token:
[:TOKEN_NAME, 'Token value']
The end of the stream is signaled by the value:
[false, false]
The StringScanner class is convenient for building a lexical analyzer. The order of the scanning patterns can matter, since some tokens can overlap others. In this example, the lexical analyzer processes the entire incoming stream up front, when the parser starts, via the tokenize method, storing the resulting tokens in an array from which next_token then returns them one at a time.
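As a minimal illustration of the StringScanner-based approach, here is a simplified tokenizer of my own (not the full one from this article) that recognizes just identifiers, numbers, and the equals sign:

```ruby
require 'strscan'

# Simplified tokenizer sketch: identifiers, numbers and '=' only.
def tiny_tokenize(text)
  s = StringScanner.new(text)
  tokens = []
  until s.eos?
    case
    when s.skip(/\s+/) then next                         # ignore whitespace
    when s.scan(/\d+/) then tokens << [:T_NUMBER, s[0]]  # before \w+, or digits
                                                         # would match as words
    when s.scan(/\w+/) then tokens << [:T_IDENTIFIER, s[0]]
    when s.scan(/=/)   then tokens << [:T_EQ, nil]
    else s.getch                                         # skip unknown characters
    end
  end
  tokens << [false, false]   # end-of-stream marker expected by Racc
  tokens
end

tiny_tokenize("visibility = 71")
# => [[:T_IDENTIFIER, "visibility"], [:T_EQ, nil], [:T_NUMBER, "71"], [false, false]]
```

Note how the pattern order matters even here: the digit pattern must be tried before the general word pattern.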
Grammar rules
The next step is to describe the grammar rules. In Racc, this is done with the rule directive, after which the rules are written in Backus-Naur form: nonterminal symbols are expressed through terminals and other nonterminals. For example, the following fragment defines the nonterminal attribute, an attribute of an object, which according to our grammar is an identifier, followed by an equals sign, an arbitrary value expressed by another nonterminal, and a trailing semicolon:
attribute: T_IDENTIFIER T_EQ value T_SEMICOLON { result = { val[0] => val[2] } };
The Ruby expression in curly braces after the rule performs the reduction of the matched sequence of symbols into the declared nonterminal, in this case attribute. The variables val and result are predefined by Racc. The val array contains the values of the sequence being reduced: here val[0] holds the value of T_IDENTIFIER and val[2] the value of the value nonterminal. The result variable receives the result of the reduction.
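To make the val/result mechanics concrete, here is what happens when the attribute rule fires on visibility=71; (the values below are hand-written for illustration: the identifier text, nil for the punctuation tokens, and the already-reduced number):

```ruby
# For the rule:
#   attribute: T_IDENTIFIER T_EQ value T_SEMICOLON
# val holds one entry per right-hand-side symbol, in order.
val = ['visibility', nil, 71, nil]

# The action { result = { val[0] => val[2] } } reduces this sequence to:
result = { val[0] => val[2] }
# => {"visibility"=>71}
```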
To start the parser, call the do_parse method. It is defined in the Racc::Parser class, from which our parser class inherits.
Below is the complete source code for the parser definition.
parser.y
class Parser

token T_OBJECTID T_STRING T_REFERENCE T_IDENTIFIER T_NUMBER T_BOOLEAN T_NULL
token T_LBR T_RBR T_LPAR T_RPAR T_EQ T_COMMA T_SEMICOLON

start input

rule

input: object { @result = val[0] };

object: T_OBJECTID T_LBR attributes T_RBR
          {
            oid = val[0].split(':')
            result = {
              :id   => dequote(oid[0]),
              :name => convertNULL(dequote(oid[1])),
              :type => dequote(oid[2])
            }.merge(val[2])
          };

attributes: attribute            { result = val[0] }
          | attributes attribute { result = val[0].merge(val[1]) }
          ;

attribute: T_IDENTIFIER T_EQ value T_SEMICOLON { result = { val[0] => val[2] } };

values: value                { result = [val[0]] }
      | values T_COMMA value { val[0] << val[2] }
      ;

value: T_STRING             { result = dequote(val[0]) }
     | T_REFERENCE          { result = dequote(val[0]) }
     | T_NUMBER             { result = val[0].to_i }
     | T_BOOLEAN            { result = val[0] == 'T' }
     | T_NULL               { result = nil }
     | T_LBR object T_RBR   { result = val[1] }
     | T_LPAR values T_RPAR { result = val[1] }
     ;

end

---- header ----

require 'strscan'

---- inner ----

def tokenize(text)
  tokens = []
  s = StringScanner.new(text)
  until s.eos?
    case
    when s.skip(/\s+/)
      next
    when s.scan(/\A[\w\.]+:\"*\w*\"*:\w+/)
      tokens << [:T_OBJECTID, s[0]]
      next
    when s.scan(/\A\{/)
      tokens << [:T_LBR, nil]
      next
    when s.scan(/\A\}/)
      tokens << [:T_RBR, nil]
      next
    when s.scan(/\A\(/)
      tokens << [:T_LPAR, nil]
      next
    when s.scan(/\A\)/)
      tokens << [:T_RPAR, nil]
      next
    when s.scan(/\A\=/)
      tokens << [:T_EQ, nil]
      next
    when s.scan(/\A\,/)
      tokens << [:T_COMMA, nil]
      next
    when s.scan(/\A\;/)
      tokens << [:T_SEMICOLON, nil]
      next
    when s.scan(/\w+/)
      # Classify the bare word: number, boolean, NULL or identifier.
      # The patterns are anchored with \A...\Z so that identifiers
      # containing digits are not mistaken for numbers.
      if s[0].match(/\A[0-9]+\Z/)
        tokens << [:T_NUMBER, s[0]]
      elsif s[0].match(/\A[TF]\Z/)
        tokens << [:T_BOOLEAN, s[0]]
      elsif s[0].match(/\ANULL\Z/)
        tokens << [:T_NULL, nil]
      else
        tokens << [:T_IDENTIFIER, s[0]]
      end
      next
    when s.scan(/(["]).*?(?<!\\)\1/m)
      tokens << [:T_STRING, s[0]]
      next
    when s.scan(/\<.*?\>/)
      tokens << [:T_REFERENCE, s[0]]
      next
    else
      s.getch
      next
    end
  end
  tokens << [false, false]
  return tokens
end

def parse(text)
  @tokens = tokenize(text)
  do_parse
  return @result
end

def next_token
  @tokens.shift
end

def dequote(text)
  text.gsub(/\A["<]|[">]\Z/, '').strip
end

def convertNULL(text)
  text.upcase == "NULL" ? nil : text
end
The .y file is not yet a parser. The parser must be generated from it with the racc utility, and this must be redone every time the parser definition in the .y file changes:
# racc parser.y -o parser.rb
The generated file with the parser class can then be required and used in the usual way:
main.rb
require 'pp'
require './parser.rb'

parser = Parser.new
obj = parser.parse(File.read("data.txt"))
puts obj.pretty_inspect
The result of the parser's work is a nested Ruby structure:
output
{ :id => "7ghoJdyGAqACCgeT",
  :name => "User",
  :type => "Class",
  "FromEndRelationships" => ["vdHoJdyGAqACCgfe", "9bToJdyGAqACCgfF"],
  "_masterViewId" => "7ghoJdyGAqACCgeS",
  "pmLastModified" => "1355667704781",
  "pmAuthor" => "author",
  "Child" => [
    { :id => "UwZoJdyGAqACCgei",
      :name => "name",
      :type => "Attribute",
      "visibility" => 71,
      "pmLastModified" => "1355667655234",
      "pmAuthor" => "author",
      "type" => "_n2oJdyGAqACCgXh",
      "pmCreateDateTime" => "1355667628234",
      "_modelViews" => nil,
      "_modelEditable" => true },
    { :id => "9lZoJdyGAqACCgel",
      :name => "created",
      :type => "Attribute",
      "visibility" => 71,
      "type_string" => "date",
      "pmLastModified" => "1355667655234",
      "pmAuthor" => "author",
      "pmCreateDateTime" => "1355667630703",
      "_modelViews" => nil,
      "_modelEditable" => true },
    { :id => "nLFoJdyGAqACCgeo",
      :name => "active",
      :type => "Attribute",
      "visibility" => 71,
      "pmLastModified" => "1355667655234",
      "pmAuthor" => "author",
      "type" => "_n2oJdyGAqACCgXY",
      "pmCreateDateTime" => "1355667639609",
      "_modelViews" => nil,
      "_modelEditable" => true }
  ],
  "pmCreateDateTime" => "1355667607671",
  "_modelViews" => [
    { :id => "4QhoJdyGAqACCgeU",
      :name => "View",
      :type => "ModelView",
      "container" => "hguoJdyGAqACCgeL",
      "view" => "7ghoJdyGAqACCgeS" }
  ],
  "_modelEditable" => true }
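Once deserialized, the result is an ordinary Hash/Array structure that can be navigated with plain Ruby. For example (using a trimmed-down hand-written copy of the structure above):

```ruby
# A trimmed-down version of the parsed structure, written by hand
# for illustration.
obj = {
  :id => "7ghoJdyGAqACCgeT", :name => "User", :type => "Class",
  "Child" => [
    { :name => "name",    "type"        => "_n2oJdyGAqACCgXh" },
    { :name => "created", "type_string" => "date" },
    { :name => "active",  "type"        => "_n2oJdyGAqACCgXY" }
  ]
}

# Collect the attribute names of the class, as a code generator would.
attr_names = obj["Child"].map { |a| a[:name] }
# => ["name", "created", "active"]
```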