As part of building a framework for an enterprise-class system, I was tasked with writing a utility for automated code generation from a UML model. For a quick and effective solution, nothing turned out to be more suitable than Ruby and its built-in ERB template engine.
The project file produced by the UML modeling environment was an SQLite3 database; however, the environment stored some of the information in that database as serialized BLOB objects. The serialization format was textual, but not compatible with any well-known one: it was not XML or YAML, and only remotely resembled JSON. No existing parser could handle it.
In simple cases, when you do not need the entire object but only a couple of scalar fields of a specific instance, you can of course just fish them out with regular expressions. For everything else, there is a universal approach that lets you quickly create your own parser for such structures and deserialize them into Ruby objects.
This is what the original serialized objects look like.
The task is to turn this into a nested Ruby structure built from Arrays and Hashes.
data.txt
7ghoJdyGAqACCgeT:"User":Class { FromEndRelationships=( <vdHoJdyGAqACCgfe>, <9bToJdyGAqACCgfF> ); _masterViewId="7ghoJdyGAqACCgeS"; pmLastModified="1355667704781"; pmAuthor="author"; Child=( {UwZoJdyGAqACCgei:"name":Attribute { visibility=71; pmLastModified="1355667655234"; pmAuthor="author"; type=<_n2oJdyGAqACCgXh>; pmCreateDateTime="1355667628234"; _modelViews=NULL; _modelEditable=T; }}, {9lZoJdyGAqACCgel:"created":Attribute { visibility=71; type_string="date"; pmLastModified="1355667655234"; pmAuthor="author"; pmCreateDateTime="1355667630703"; _modelViews=NULL; _modelEditable=T; }}, {nLFoJdyGAqACCgeo:"active":Attribute { visibility=71; pmLastModified="1355667655234"; pmAuthor="author"; type=<_n2oJdyGAqACCgXY>; pmCreateDateTime="1355667639609"; _modelViews=NULL; _modelEditable=T; }} ); pmCreateDateTime="1355667607671"; _modelViews=( {4QhoJdyGAqACCgeU:"View":ModelView { container=<hguoJdyGAqACCgeL>; view="7ghoJdyGAqACCgeS"; }} ); _modelEditable=T; }
Analyzing the original format, we can distinguish the following elements, which define its grammar:
- An object. It starts with a header of the form Id:"Name":Type, followed by a set of attributes in curly braces.
- Object attributes. Written as key=value, terminated by a semicolon.
- Attribute values. A value may be a string, a number, a reference, another object, or a collection of any of the listed values.
- Each value type has its own notation: strings are enclosed in quotes, references in angle brackets, objects in curly braces, and a collection of values in parentheses, with elements separated by commas.
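Putting these observations together, the grammar can be sketched in a BNF-like notation (the nonterminal names here are my own choice, not part of the format):

```text
object     ::= OBJECTID "{" attributes "}"
attributes ::= attribute | attributes attribute
attribute  ::= IDENTIFIER "=" value ";"
value      ::= STRING | NUMBER | BOOLEAN | NULL | REFERENCE
             | "{" object "}"
             | "(" values ")"
values     ::= value | values "," value
```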
Such a grammar is context-free, and can easily be described using the Backus-Naur form. For parsing, we can use a bottom-up shift-reduce algorithm. One of the most common algorithms in this category is LALR(1); it is used by such well-known "compiler compilers" as Yacc and GNU Bison. There is also an implementation for the Ruby platform we are interested in: Racc. We will use it to create our parser.
The bottom-up shift-reduce algorithm
To get a basic understanding of Racc and build the elements of our future parser, it is enough to understand, in simplified terms, how the bottom-up shift-reduce algorithm works.
Suppose we have an input stream of tokens:
1 + 2 + 3
We also define grammar rules (E stands for an expression, NUMBER for a numeric token):
E: E + E
E: NUMBER
The incoming stream is divided into two parts by a marker: the already analyzed part on the left and the not-yet-analyzed part on the right. The marker is shown as the '|' symbol. The algorithm sequentially reads one token from the stream; this is the shift operation, which moves the marker one position to the right. Then, whenever the part to the left of the marker matches the right-hand side of a grammar rule, the algorithm performs a reduce operation, replacing the matched sequence with the rule's nonterminal. This continues until the stream is exhausted and all reductions have been performed.
s: 1 | + 2 + 3
r: E | + 2 + 3      (E: NUMBER)
s: E + | 2 + 3
s: E + 2 | + 3
r: E + E | + 3      (E: NUMBER)
r: E | + 3          (E: E + E)
s: E + | 3
s: E + 3 |
r: E + E |          (E: NUMBER)
r: E |              (E: E + E)
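The shift-reduce loop can be sketched in a few lines of Ruby. This is a toy recognizer of my own, for a grammar like E: E '+' E | NUMBER; it is only an illustration of the algorithm, not part of Racc:

```ruby
# Toy shift-reduce recognizer for the grammar:
#   E: E '+' E
#   E: NUMBER
def shift_reduce(tokens)
  stack = []
  loop do
    if stack.last.is_a?(Integer)
      stack[-1] = :E                 # reduce by E: NUMBER
    elsif stack.last(3) == [:E, :+, :E]
      stack.pop(3)
      stack.push(:E)                 # reduce by E: E '+' E
    elsif !tokens.empty?
      stack.push(tokens.shift)       # shift the next token
    else
      break                          # no shift or reduce possible
    end
  end
  stack == [:E] ? :accepted : :rejected
end

shift_reduce([1, :+, 2, :+, 3])  # => :accepted
shift_reduce([1, :+])            # => :rejected
```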
Install Racc
The runtime for Racc-based parsers has been included with Ruby since version 1.8.x, so any parser created with Racc will work without extra steps. However, to be able to generate your own parsers, you need to install the racc package, which contains the parser generator. This is done in the standard way via gem:
# gem install racc
Creating a parser with Racc
The source for the Racc parser generator is a file (usually with a .y extension) containing the definition of the parser's Ruby class, plus additional Racc directives. The parser definition must contain the following elements:
- Definitions of the grammar's terminal symbols (tokens)
- Grammar and reduction rules
- A lexical analyzer
Tokens
So, the first thing to do is identify the grammar's terminal symbols, the elementary particles from which more complex structures are composed, and give them names. We get something like this:
- T_OBJECTID - object identifier
- T_IDENTIFIER - attribute identifier
- T_STRING - text string
- T_NUMBER - number
- T_BOOLEAN - boolean value
- T_NULL - null
- T_REFERENCE - reference (in angle brackets)
- T_LBR, T_RBR, T_LPAR, T_RPAR, T_EQ, T_COMMA, T_SEMICOLON - punctuation marks: different brackets, equal sign, comma, semicolon
In the parser class definition, the valid tokens are declared using the token keyword.
Lexical analyzer
The next step is to create a lexical analyzer. Its job is to scan the incoming stream, detect tokens in it, and so transform the stream into a sequence of tokens. When performing a shift operation, Racc requests the next token by calling the parser class's next_token method, which must return a structure containing the name and value of the token:
[:TOKEN_NAME, 'Token value']
The end of the stream is signaled by the value:
[false, false]
The StringScanner class is convenient for building a lexical analyzer. The order of the scanning patterns can matter, since some tokens can overlap others. In this example, the lexical analyzer processes the entire incoming stream up front, when the parser starts, via the tokenize method, storing the resulting tokens in an array from which next_token then returns them one at a time.
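As a minimal illustration of the StringScanner-based approach, here is a simplified tokenizer of my own (not the full one from this article) that recognizes just identifiers, numbers, and the equals sign:

```ruby
require 'strscan'

# Simplified tokenizer sketch: identifiers, numbers and '=' only.
def tiny_tokenize(text)
  s = StringScanner.new(text)
  tokens = []
  until s.eos?
    case
    when s.skip(/\s+/) then next                         # ignore whitespace
    when s.scan(/\d+/) then tokens << [:T_NUMBER, s[0]]  # before \w+, or digits
                                                         # would match as words
    when s.scan(/\w+/) then tokens << [:T_IDENTIFIER, s[0]]
    when s.scan(/=/)   then tokens << [:T_EQ, nil]
    else s.getch                                         # skip unknown characters
    end
  end
  tokens << [false, false]   # end-of-stream marker expected by Racc
  tokens
end

tiny_tokenize("visibility = 71")
# => [[:T_IDENTIFIER, "visibility"], [:T_EQ, nil], [:T_NUMBER, "71"], [false, false]]
```

Note how the pattern order matters even here: the digit pattern must be tried before the general word pattern.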
Grammar rules
The next step is to describe the grammar rules. In Racc, this is done with the rule directive, after which the rules are written in Backus-Naur form: nonterminal symbols are expressed through terminals and other nonterminals. For example, the following fragment defines the nonterminal attribute, an attribute of an object, which according to our grammar is an identifier, followed by an equals sign, an arbitrary value expressed by another nonterminal, and a trailing semicolon:
attribute: T_IDENTIFIER T_EQ value T_SEMICOLON { result = { val[0] => val[2] } };
The Ruby expression in curly braces after the rule performs the reduction of the matched sequence of symbols into the declared nonterminal, in this case attribute. The variables val and result are predefined by Racc. The val array contains the values of the sequence being reduced: here val[0] holds the value of T_IDENTIFIER and val[2] the value of the value nonterminal. The result variable receives the result of the reduction.
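To make the val/result mechanics concrete, here is what happens when the attribute rule fires on visibility=71; (the values below are hand-written for illustration: the identifier text, nil for the punctuation tokens, and the already-reduced number):

```ruby
# For the rule:
#   attribute: T_IDENTIFIER T_EQ value T_SEMICOLON
# val holds one entry per right-hand-side symbol, in order.
val = ['visibility', nil, 71, nil]

# The action { result = { val[0] => val[2] } } reduces this sequence to:
result = { val[0] => val[2] }
# => {"visibility"=>71}
```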
To start the parser, call the do_parse method. It is defined in the Racc::Parser class, from which our parser class inherits.
Below is the complete source code for the parser definition.
parser.y
class Parser

token T_OBJECTID T_STRING T_REFERENCE T_IDENTIFIER T_NUMBER T_BOOLEAN T_NULL
token T_LBR T_RBR T_LPAR T_RPAR T_EQ T_COMMA T_SEMICOLON

start input

rule

input: object { @result = val[0] };

object: T_OBJECTID T_LBR attributes T_RBR
          {
            oid = val[0].split(':')
            result = {
              :id   => dequote(oid[0]),
              :name => convertNULL(dequote(oid[1])),
              :type => dequote(oid[2])
            }.merge(val[2])
          };

attributes: attribute            { result = val[0] }
          | attributes attribute { result = val[0].merge(val[1]) }
          ;

attribute: T_IDENTIFIER T_EQ value T_SEMICOLON { result = { val[0] => val[2] } };

values: value                { result = [val[0]] }
      | values T_COMMA value { val[0] << val[2] }
      ;

value: T_STRING             { result = dequote(val[0]) }
     | T_REFERENCE          { result = dequote(val[0]) }
     | T_NUMBER             { result = val[0].to_i }
     | T_BOOLEAN            { result = val[0] == 'T' }
     | T_NULL               { result = nil }
     | T_LBR object T_RBR   { result = val[1] }
     | T_LPAR values T_RPAR { result = val[1] }
     ;

end

---- header ----

require 'strscan'

---- inner ----

def tokenize(text)
  tokens = []
  s = StringScanner.new(text)
  until s.eos?
    case
    when s.skip(/\s+/)
      next
    when s.scan(/\A[\w\.]+:\"*\w*\"*:\w+/)
      tokens << [:T_OBJECTID, s[0]]
      next
    when s.scan(/\A\{/)
      tokens << [:T_LBR, nil]
      next
    when s.scan(/\A\}/)
      tokens << [:T_RBR, nil]
      next
    when s.scan(/\A\(/)
      tokens << [:T_LPAR, nil]
      next
    when s.scan(/\A\)/)
      tokens << [:T_RPAR, nil]
      next
    when s.scan(/\A\=/)
      tokens << [:T_EQ, nil]
      next
    when s.scan(/\A\,/)
      tokens << [:T_COMMA, nil]
      next
    when s.scan(/\A\;/)
      tokens << [:T_SEMICOLON, nil]
      next
    when s.scan(/\w+/)
      # Classify the bare word: number, boolean, NULL or identifier.
      # The patterns are anchored with \A...\Z so that identifiers
      # containing digits are not mistaken for numbers.
      if s[0].match(/\A[0-9]+\Z/)
        tokens << [:T_NUMBER, s[0]]
      elsif s[0].match(/\A[TF]\Z/)
        tokens << [:T_BOOLEAN, s[0]]
      elsif s[0].match(/\ANULL\Z/)
        tokens << [:T_NULL, nil]
      else
        tokens << [:T_IDENTIFIER, s[0]]
      end
      next
    when s.scan(/(["]).*?(?<!\\)\1/m)
      tokens << [:T_STRING, s[0]]
      next
    when s.scan(/\<.*?\>/)
      tokens << [:T_REFERENCE, s[0]]
      next
    else
      s.getch
      next
    end
  end
  tokens << [false, false]
  return tokens
end

def parse(text)
  @tokens = tokenize(text)
  do_parse
  return @result
end

def next_token
  @tokens.shift
end

def dequote(text)
  text.gsub(/\A["<]|[">]\Z/, '').strip
end

def convertNULL(text)
  text.upcase == "NULL" ? nil : text
end
The .y file is not yet a parser. The parser must be generated from it with the racc utility, and this must be redone every time the parser definition in the .y file changes:
# racc parser.y -o parser.rb
The generated file with the parser class can then be required and used in the usual way:
main.rb
require 'pp'
require './parser.rb'

parser = Parser.new
obj = parser.parse(File.read("data.txt"))
puts obj.pretty_inspect
The result of the parser's work is a nested Ruby structure:
output
{ :id => "7ghoJdyGAqACCgeT",
  :name => "User",
  :type => "Class",
  "FromEndRelationships" => ["vdHoJdyGAqACCgfe", "9bToJdyGAqACCgfF"],
  "_masterViewId" => "7ghoJdyGAqACCgeS",
  "pmLastModified" => "1355667704781",
  "pmAuthor" => "author",
  "Child" => [
    { :id => "UwZoJdyGAqACCgei",
      :name => "name",
      :type => "Attribute",
      "visibility" => 71,
      "pmLastModified" => "1355667655234",
      "pmAuthor" => "author",
      "type" => "_n2oJdyGAqACCgXh",
      "pmCreateDateTime" => "1355667628234",
      "_modelViews" => nil,
      "_modelEditable" => true },
    { :id => "9lZoJdyGAqACCgel",
      :name => "created",
      :type => "Attribute",
      "visibility" => 71,
      "type_string" => "date",
      "pmLastModified" => "1355667655234",
      "pmAuthor" => "author",
      "pmCreateDateTime" => "1355667630703",
      "_modelViews" => nil,
      "_modelEditable" => true },
    { :id => "nLFoJdyGAqACCgeo",
      :name => "active",
      :type => "Attribute",
      "visibility" => 71,
      "pmLastModified" => "1355667655234",
      "pmAuthor" => "author",
      "type" => "_n2oJdyGAqACCgXY",
      "pmCreateDateTime" => "1355667639609",
      "_modelViews" => nil,
      "_modelEditable" => true }
  ],
  "pmCreateDateTime" => "1355667607671",
  "_modelViews" => [
    { :id => "4QhoJdyGAqACCgeU",
      :name => "View",
      :type => "ModelView",
      "container" => "hguoJdyGAqACCgeL",
      "view" => "7ghoJdyGAqACCgeS" }
  ],
  "_modelEditable" => true }
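Once deserialized, the result is an ordinary Hash/Array structure that can be navigated with plain Ruby. For example (using a trimmed-down hand-written copy of the structure above):

```ruby
# A trimmed-down version of the parsed structure, written by hand
# for illustration.
obj = {
  :id => "7ghoJdyGAqACCgeT", :name => "User", :type => "Class",
  "Child" => [
    { :name => "name",    "type"        => "_n2oJdyGAqACCgXh" },
    { :name => "created", "type_string" => "date" },
    { :name => "active",  "type"        => "_n2oJdyGAqACCgXY" }
  ]
}

# Collect the attribute names of the class, as a code generator would.
attr_names = obj["Child"].map { |a| a[:name] }
# => ["name", "created", "active"]
```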