📜 ⬆️ ⬇️

Lightweight Parser Designer with Interactive Mode

Periodically faced with small puzzles to develop simple text analyzers, I decided to automate this task, and the academic interest did not give rest. Initially I looked in the direction of Racc (one of the interpretations of Yacc ), but it seemed to me not enough for a simple solution for my small tasks and then I decided to develop my own simple analyzer. Although, of course, if you are developing a compiler or something like that and still on an industrial scale, then you should definitely look in the direction of Racc .

But if you want to figure out for yourself what a parser is and how quickly you can write it yourself, without reading a bunch of articles about lexical analyzers like a dragon book, then go ahead under the cat (although the book is very good).

Input flow analysis


If we consider the task of analyzing the input stream in a very abstract way, and I think we should start with this, then it all comes down to the implementation of the finite state machine, i.e. state machines, where the condition for the transition to a certain state is the presence in the input stream of the necessary data (hereinafter referred to as templates). In turn, the transition to the state can be performed only when the unique characters and the full match of the characters in the input stream with a specific template, why unambiguous, because transition from one state to several states at once is possible and while the input symbols of the stream coincide with several state patterns simultaneously, the analyzer is in a state of uncertainty. Based on the foregoing, the analyzer itself may be in the following states:


Of course, erroneous states are possible, but since we consider the generalized model of the analyzer and do not know on which sets of characters an error should be issued, then this task falls on the shoulders of a specific implementation.
')
In a state of certainty, we know exactly which state handler we need to call, but in a state of uncertainty, we don’t know what the next state will be, i.e. which template will be a match or eventually there will be no one-to-one match, then in this case we should buffer the input stream and then transfer the characters from the accumulated buffer to an already defined state, i.e. in case of uncertainty, we buffer the input stream and already in the state change mode we must transfer the buffer data to a new state or return the current one if the transition has not happened.

Now, you can decompose, or more precisely, the detailing of the process of analyzing the input stream to:
classification of the input stream (coincides with one, with many and does not coincide with any template) and processing the result of the classification (transition, buffering, etc.).

Implementation


Although, in my main occupation, I am an embedded system developer and my main programming language is ANSI C, I decided to implement this model in Ruby, because on it you think less about the technical details of the implementation and planned to implement this model as Ruby Gem. I really love everything universal (within certain limits), short and concise. In the future, I plan to implement this model in emdedded for the input traffic analyzer, but in a more simplified form and possibly on HDL.

So, let's start, I repeat, Ruby doesn’t know very deeply, so I relied on the accumulated experience and precise algorithmization of the process in my head, therefore I considered that my knowledge of Ruby is quite enough for this, as my music teacher once said: “Learn to improvise and make music on three-five notes, and then you will thrash the steep passages, otherwise your continuous arpeggio is not possible to listen”. Sometimes, remembering him, now I think that he was right, although then I wanted cool solyak, okay, let's go back to the topic.

I think the development process should go in stages, and the stages should be “tangible”, i.e. you need to understand what needs to be done to implement this stage and in what amounts (complexity), I’ll not go deep about this now, not this topic of our conversation, but to implement the parser, its build-up and testing should go on gradually building up the code with each state. And here I could not do without interactive mode, if someone uses my gem, I think he should also initially debug the behavior of his model in interactive mode, and only then write parser state handlers.

Example


For greater clarity, I will give an example of the implementation of a simple source file parser for the automatic generation of documentation, remotely resembling doxygen. Initially, I need to install my gem:

$ gem install iparser 

As you already understood, the gem is called iparser and after its installation we will create a file named 'parser_example.rb', in which we will write the code of our parser based on iparser . First of all, we connect the library and create a parser machine object that implements the main logic of the analyzer, i.e. copy of the parser itself:

 require 'iparser' parser = Iparser::Machine.new 

The next step is to describe the state of the parser. We will have a very simple parser and it will have only three states:


Now we can describe each state separately (I will give the script code as a whole below). The state of 'idle' I will not describe in detail, because in our case, it is empty, and the state of 'comment-line' will be considered closer. To create an instance of the state, use the following code:

 ps_idle = Iparser::State.new('idle') ps_cline = Iparser::State.new('comment-line') 

In the parameters of the state constructor, you specify a string with the name (identifier) ​​of the state, preferably understandable to you, since used for remote debugging of the model. Each parser state object has an entry field, this is a pattern that matches the transition to this state, i.e. condition entry state. Comparison with the template is performed character-by-character, as well as processing the input stream. Entry to the state will be made in the presence in the input stream of characters '\\\', going in a row. There is also a leave field to specify a pattern to exit the state (return to the previous one), in our case these are the end of line '\ n \ r' characters, although it is enough to indicate the first one, but for clarity, we will indicate both. Now we will describe the 'comment-line':

 ps_cline.entry << /\// ps_cline.entry << /\// ps_cline.entry << /\// ps_cline.leave << /[\n\r]/ 

Please note that patterns are set using regular expressions, you can and without them, but this later.

Next, create a state for processing multi-line comments, we will get into it when we meet the characters '/ **', and leave it if there is a '* /'. Also write:

 ps_cblock = Iparser::State.new('comment-block') ps_cblock.entry << /\// ps_cblock.entry << /\*/ ps_cblock.entry << /\*/ ps_cblock.leave << /\*/ ps_cblock.leave << /\// ps_cblock.ignore << '*' 

We also specified characters that should be ignored while in this state. We have a star symbol, because I like to write multiline comments like:

 /** * ... * ... */ 

Now we should link our three states into a single chain, from 'idle' we can get into 'comment-line' and 'comment-block', and only one of them back into 'idle'. Linking is performed by specifying state indexes in the branches field of each of the states. The index will be determined by the order of adding state objects to the parser instance; the addstate parser method is used to add objects to the parser.

 ps_idle.branches << 1 ps_idle.branches << 2 parser.addstate ps_idle parser.addstate ps_cline parser.addstate ps_cblock 

And finally, we need to check whether we have correctly created a chain of states, for this we will launch an interactive mode (the prestart method will set all initial parameters):

 parser.prestart parser.interactive_parser 

For greater clarity, I will give the script code as a whole, of course, I rearranged it a bit, but it contains only the code I wrote above:

 require 'iparser' # # Create parser-machine object. # parser = Iparser::Machine.new # # Create startup state for this parser-machine. # ps_idle = Iparser::State.new('idle') # # Add branch indexes to 'comment-line' and 'comment-block' state. # ps_idle.branches << 1 ps_idle.branches << 2 # # Create single line comment state for this parser-machine. # ps_cline = Iparser::State.new('comment-line') ps_cline.entry << /\// ps_cline.entry << /\// ps_cline.entry << /\// ps_cline.leave << /[\n\r]/ # # Create multiline comment state for this parser-machine. # ps_cblock = Iparser::State.new('comment-block') ps_cblock.entry << /\// ps_cblock.entry << /\*/ ps_cblock.entry << /\*/ ps_cblock.leave << /\*/ ps_cblock.leave << /\// ps_cblock.ignore << '*' # # Add all states to parser-machine. # parser.addstate ps_idle parser.addstate ps_cline parser.addstate ps_cblock # # Call parser startup method. # parser.prestart # # Call interactive mode for check state-machine. # parser.interactive_parser 

Not so big code, now run the script:

 $ ruby parser_example.rb 

To exit the interactive mode, you must enter an empty line (just press enter immediately).

Enter '\\\' and see that on the last character the parser goes into the state of 'comment-line' ( branch to comment-line ), now enter '\ n' or '\ r' and see that the parser goes back to the 'state idle '( back ). The character '\' is used to enter esc-characters, respectively, to enter the character '\' you need to type it twice '\\', as well as when specifying strings in C and other languages. In interactive mode, you can mess around as much as you like, until you are sure that the chain of all states is connected correctly.

The final stage, considering that we checked and debugged the logic of transitions of our machine, you can add handlers for our states (and only now, because I wrote above about the development process), they will be quite simple. Each state can have three handlers:


The state constructor will be called only once upon entering the state and will receive as an argument an array of buffered characters that the parser has accumulated while in the uncertainty mode. The handler will be called continuously and will receive the input symbol as an argument, until it leaves the state, and the destructor will be called only once when leaving the state.

The handler method can also control the operation of the parser, although it may be that I implemented it for nothing, while there is, then it is necessary to describe. If the handler returns the data type Fixnum whose value is within (> = 0), then it will be interpreted as an index to go into some state, if the index goes beyond the state array, the parser will throw an exception. If the handler returns nil , the parser will hold this state, i.e. there is no reaction and if any other value, then it is regarded as an error and the parser returns false , which signals a parsing error, because in all other cases, it returns true . Below I will give the modified code completely, because it seems that I have tried to describe everything in detail.

 require 'iparser' # # Simple check startup arguments. # if( ARGV.size != 1 || !File.exist?(ARGV[0]) ) puts puts "ERROR: unable to open file #{ARGV[0]}" puts exit end # # Create output file. # $fout = File.new( 'index.html', 'w' ) # # Create initializer method for parser-states. # def doc_init ( str ) $fout.print "<p>" end # # Create handler method for parser-states. # def doc_handler ( c ) $fout.print c end # # Create finalizer method for parser-states. # def doc_fini ( str ) $fout.puts "</p>" end # # Create parser-machine object. # parser = Iparser::Machine.new # # Create startup state for this parser-machine. # ps_idle = Iparser::State.new('idle') # # Add branch indexes to 'comment-line' and 'comment-block' state. # ps_idle.branches << 1 ps_idle.branches << 2 # # Create single line comment state for this parser-machine. # ps_cline = Iparser::State.new('comment-line') ps_cline.entry << /\// ps_cline.entry << /\// ps_cline.entry << /\// ps_cline.leave << /[\n\r]/ # # Create multiline comment state for this parser-machine. # ps_cblock = Iparser::State.new('comment-block') ps_cblock.entry << /\// ps_cblock.entry << /\*/ ps_cblock.entry << /\*/ ps_cblock.leave << /\*/ ps_cblock.leave << /\// ps_cblock.ignore << '*' # # Add handlers for states. # ps_cline.init( method(:doc_init) ) ps_cline.handler( method(:doc_handler) ) ps_cline.fini( method(:doc_fini) ) ps_cblock.init( method(:doc_init) ) ps_cblock.handler( method(:doc_handler) ) ps_cblock.fini( method(:doc_fini) ) # # Add all states to parser-machine. # parser.addstate ps_idle parser.addstate ps_cline parser.addstate ps_cblock # # Call parser startup method. # parser.prestart # # Call interactive mode for check state-machine. # $fout.puts "<html>" $fout.puts "<body>" File.open( ARGV[0], 'r' ).each do |line| line.each_char do |c| parser.parse(c) end end $fout.puts "</body>" $fout.puts "</html>" $fout.close 

Yes, note that our handler (doc_handler) does not return nil , since we do not analyze the result of the parser (method parser.parse ). For the last check we will create a test file with the name 'test.c' and fill it with the following contents:

 #include <stdlib.h> ///Test function - 1. void test1 ( void ) { } /** * Test function - 2. */ void test2 ( void ) { } 

Run our script for the last time and see the result of our parser.

 $ ruby parser_example.rb test.c 

This is the file 'index.html', that's all, thank you all for your attention!

I hope that someone will help and reduce the time to learn the basic principles of analyzers, and maybe someone will want to use my gem for personal purposes without finding a simpler option, if you didn’t like it, don’t kick much, although of course, criticize After all, one cannot become a good creator without good criticism, either. If anyone is interested, here is a link to gem iparser .

Source: https://habr.com/ru/post/271969/


All Articles