If your schooling ended after the second grade, your vocabulary is limited, your speech is unclear, you are just plain dumb and those incomprehensible Latin letters mean nothing to you, but you still want to become a programmer, our bydlo-language Yoba will help you. Yoba is a language for real lads!
But seriously: at work someone once jokingly suggested writing a gopnik language so that anyone could feel like a programmer. Constructs starting with the word "cho" and all that. I should note here that, never having received a computer science education, I missed all those interesting courses on compiler construction, formal grammars and other goodies that normal students take in their second or third year. Wirth's book on compiler construction, while it did teach me clever terms like BNF, brought me no practical benefit: I never wrote a single compiler. So the task turned out to be quite interesting for me.
If you are over 18, take the obscene vocabulary of our native language in stride, and are wondering where to start, welcome under the cut.
Of all the tools for working with grammars, I had met only a few by mention. First, of course, lex, flex, yacc, bison and antlr, which are used primarily with C/C++ compilers. Second, ocamllex and ocamlyacc, intended, as you can guess, for developing a compiler in OCaml. And third, fslex and fsyacc: the same thing, but for F#, a language that copied OCaml a little less than completely and later grew a fair number of extra perks. F# I dismissed as absent from my system, and all the C/C++ tools because of their sheer number. Yes, sometimes a rich choice is also a bad thing; I was afraid of getting lost among all these lexers and parsers.
By the way, it was a rather strange and unpleasant discovery for me that the analysis of any language consists of two separate stages performed by different tools; it had seemed to me that everything could be done in one. But alas.
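To make that split concrete, here is a minimal sketch in plain OCaml (not ocamllex/ocamlyacc; the token and constructor names are invented for the illustration) of the two stages: the lexer turns raw text into tokens, and the parser turns tokens into a structure.

```ocaml
(* Toy two-stage pipeline; names are illustrative, not from the real project. *)
type token = TGive | TInt of int | TId of string

type ast = Give of string * int

(* Stage 1, lexing: raw text -> token stream. *)
let lex (s : string) : token list =
  String.split_on_char ' ' s
  |> List.filter (fun w -> w <> "")
  |> List.map (fun w ->
       match int_of_string_opt w with
       | Some n -> TInt n
       | None -> if w = "give" then TGive else TId w)

(* Stage 2, parsing: token stream -> instruction, or None on a syntax error. *)
let parse (tokens : token list) : ast option =
  match tokens with
  | [TGive; TId name; TInt n] -> Some (Give (name, n))
  | _ -> None
```

Real lexer and parser generators do the same two jobs, only driven by declarative rule files instead of hand-written pattern matches.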
The language's types
I won't lie: I did not think through the whole architecture of the language at once, and the features were finished gradually. But to keep this material short (and to walk quickly through the path that took me six beautiful Sunday-morning hours), let's pretend everything was planned in advance. Each instruction of our language is, let's say, an object. In C++ it could be represented as a descendant of some base class Instruction, or as a complex structure, or even as a union. Fortunately, OCaml lets us do it as simply as possible. Our first file, yobaType.ml, which describes all possible instruction types, is as simple as can be:
type action =
    DoNothing
  | AddFunction of string * action list
  | CallFunction of string
  | Stats
  | Create of string
  | Conditional of int * string * action * action
  | Decrement of int * string
  | Increment of int * string;;
Each language construct will be mapped to one of these types. DoNothing is just a NOP operator; it does nothing. Create creates a variable (all our variables are integers); Decrement and Increment respectively decrease and increase the given variable by a given number. In addition, there is Stats for printing statistics on all created variables, and Conditional, our implementation of if, which can check whether a given variable holds the required value (or greater). At the very end I added AddFunction and CallFunction: the ability to create and call your own functions, which are in fact really just procedures.
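To see what the interpreter will eventually receive, here is the same type with a couple of sample values: a hypothetical "if x holds at least 5, give x 1, else do nothing" instruction, and a function body (variable and amount are invented for the example).

```ocaml
(* The action type from yobaType.ml, reproduced so the snippet stands alone. *)
type action =
    DoNothing
  | AddFunction of string * action list
  | CallFunction of string
  | Stats
  | Create of string
  | Conditional of int * string * action * action
  | Decrement of int * string
  | Increment of int * string

(* "if x holds at least 5, increment x by 1, otherwise do nothing": *)
let example : action = Conditional (5, "x", Increment (1, "x"), DoNothing)

(* A user-defined function body is just a list of such values: *)
let func_body : action list = [Create "y"; Increment (3, "y"); Stats]
```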
The grammar
At the next stage we describe the grammar: the constructions of our language that will be turned into instructions of one of these types.
Before showing the code, let me explain how our language works. Every construction, other than the statistics request (our service command, so to speak) and function creation, begins and ends with keywords. Thanks to this we can freely insert any line breaks and indentation. On top of that (we are working with Russian, after all), I deliberately created pairs of rules for the cases where both a variable and a value must be passed; later you will see why this was needed. So, our file yobaParser.mly:
%{
open YobaType
%}
%token <string> ID
%token <int> INT
%token RULEZ
%token GIVE TAKE
%token WASSUP DAMN
%token CONTAINS THEN ELSE
%token FUCKOFF
%token STATS
%token MEMORIZE IS
%token CALL
%start main
%type <YobaType.action> main
%%
main:
    expr { $1 }
;
expr:
    fullcommand { $1 }
  | MEMORIZE ID IS fullcommandlist DAMN { AddFunction($2, $4) }
;
fullcommandlist:
    fullcommand { $1 :: [] }
  | fullcommand fullcommandlist { $1 :: $2 }
;
fullcommand:
    WASSUP command DAMN { $2 }
  | STATS { Stats }
;
command:
    FUCKOFF { DoNothing }
  | GIVE ID INT { Increment($3, $2) }
  | GIVE INT ID { Increment($2, $3) }
  | TAKE ID INT { Decrement($3, $2) }
  | TAKE INT ID { Decrement($2, $3) }
  | RULEZ ID { Create($2) }
  | CALL ID { CallFunction($2) }
  | CONTAINS ID INT THEN command ELSE command { Conditional($3, $2, $5, $7) }
  | CONTAINS INT ID THEN command ELSE command { Conditional($2, $3, $5, $7) }
;
%%
First comes the header: opening the YobaType module, which contains our action type described at the very beginning. For numbers and for strings that are not language keywords (that is, variables), we declare two special tokens stating what they carry. For each keyword we likewise create its own token with the %token directive, which will identify that word in the grammar. They could all be declared on a single line; this layout simply groups them by instruction type. Keep in mind that the tokens we create are exactly the placeholder types by which the parser decides what to do. You can name them whatever you like; how they look in the language itself we will describe later. We declare that the grammar's entry point is main, and that it must always return a value of type action, an instruction for the interpreter. Finally, after the %% separator, we describe the grammar itself:
- An instruction is either a command (fullcommand) or a function definition.
- A function, in turn, consists of a list of commands (fullcommandlist).
- A command is either a service one (STATS) or an ordinary one (command); in the latter case it must be wrapped in keywords.
- Ordinary commands are so simple I won't even spell them out.
In curly brackets we specify what to do when the input matches the given alternative, and $N denotes the Nth member of the construction. For example, if we meet "CALL ID" (ID is a string, don't forget), we create a CallFunction instruction and pass it $2 (just the ID), the name of the function being called.
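If the $N notation feels opaque: a semantic action is just an OCaml expression in which $2, $3 and so on are replaced by the matched parts. A hand-written equivalent of the two GIVE rules might look like this (illustrative names, not actual generated code):

```ocaml
type action = Increment of int * string

(* GIVE ID INT { Increment($3, $2) }: here $2 is the ID, $3 is the INT. *)
let reduce_give_id_int (id : string) (n : int) : action = Increment (n, id)

(* GIVE INT ID { Increment($2, $3) }: swapped on the surface,
   but both argument orders produce the same AST node. *)
let reduce_give_int_id (n : int) (id : string) : action = Increment (n, id)
```

This is exactly why the grammar can afford two surface forms per command: they collapse into one instruction before the interpreter ever sees them.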
The lexer: turning words into tokens
We have reached at once the simplest and the most tedious part. Simple, because it merely describes the conversion of the language's words into our tokens. Tedious, because lexers (or maybe just the OCaml one) are poorly suited to Russian, so Russian characters can only be handled as strings. And since I wanted the keywords to be case-insensitive, that added a pile of pain: instead of simply writing "дай", I had to spell out both cases of every letter. See for yourself in the file yobaLexer.mll:
{
open YobaParser
exception Eof
}
rule token = parse
    ("и"|"И") ("д"|"Д") ("и"|"И") (' ')+
    ("н"|"Н") ("а"|"А") ("х"|"Х") ("у"|"У") ("й"|"Й") { FUCKOFF }
  | ("б"|"Б") ("а"|"А") ("л"|"Л") ("а"|"А") ("н"|"Н") ("с"|"С") (' ')+
    ("н"|"Н") ("а"|"А") ("х"|"Х") { STATS }
  | [' ' '\t' '\n' '\r'] { token lexbuf }
  | ['0'-'9']+ { INT (int_of_string (Lexing.lexeme lexbuf)) }
  | ("д"|"Д") ("а"|"А") ("й"|"Й") { GIVE }
  | ("н"|"Н") ("а"|"А") { TAKE }
  | ("ч"|"Ч") ("о"|"О") { WASSUP }
  | ("ё"|"Ё") ("б"|"Б") ("а"|"А") { DAMN }
  | ("л"|"Л") ("ю"|"Ю") ("б"|"Б") ("л"|"Л") ("ю"|"Ю") { RULEZ }
  | ("е"|"Е") ("с"|"С") ("т"|"Т") ("ь"|"Ь") { CONTAINS }
  | ("т"|"Т") ("а"|"А") ("д"|"Д") ("а"|"А") { THEN }
  | ("и"|"И") ("л"|"Л") ("и"|"И") { ELSE }
  | ("у"|"У") ("с"|"С") ("е"|"Е") ("к"|"К") ("и"|"И") { MEMORIZE }
  | ("э"|"Э") ("т"|"Т") ("о"|"О") { IS }
  | ("х"|"Х") ("у"|"У") ("й"|"Й") ("н"|"Н") ("и"|"И") { CALL }
  | ("а"|"б"|"в"|"г"|"д"|"е"|"ё"|"ж"
    |"з"|"и"|"й"|"к"|"л"|"м"|"н"|"о"
    |"п"|"р"|"с"|"т"|"у"|"ф"|"х"|"ц"
    |"ч"|"ш"|"щ"|"ъ"|"ы"|"ь"|"э"|"ю"|"я")+ { ID (Lexing.lexeme lexbuf) }
  | eof { raise Eof }
I'll just note that this code must not be shown to children or to people of fragile mental constitution, since it spells out the language's entire obscene vocabulary. At the beginning we open our parser's module, where all the tokens (STATS, GIVE, TAKE and so on) are defined, and create an exception that will be thrown when the input ends. Note the whitespace rule: it says that spaces and other garbage should be ignored, and it deliberately comes after the rules that themselves contain spaces. The digit rule and the long alternation near the end handle numbers and variable names, and the eof rule throws the very exception that marks the end of the file.
Interpreter
The last part remains - the interpreter itself, which processes our language constructs. First the code, then the explanation:
open YobaType

let identifiers = Hashtbl.create 10;;
let funcs = Hashtbl.create 10;;

let print_stats () =
  let print_item id amount =
    Printf.printf ">> Yo! You have %s: %d" id amount;
    print_newline ();
    flush stdout in
  Hashtbl.iter print_item identifiers;;

let arithm id op value () =
  try
    Hashtbl.replace identifiers id (op (Hashtbl.find identifiers id) value);
    Printf.printf ">> shit issued\n"; flush stdout
  with Not_found -> Printf.printf ">> X@# no, you don't love %s\n" id; flush stdout;;

let rec cond amount id act1 act2 () =
  try
    if Hashtbl.find identifiers id >= amount then process_action act1 () else process_action act2 ()
  with Not_found ->
    Printf.printf ">> more often?!\n";
    flush stdout

and process_action = function
  | Create (id) -> (function () -> Hashtbl.add identifiers id 0)
  | Decrement (amount, id) -> arithm id (-) amount
  | Increment (amount, id) -> arithm id (+) amount
  | Conditional (amount, id, act1, act2) -> cond amount id act1 act2
  | DoNothing -> (function () -> ())
  | Stats -> print_stats
  | AddFunction (id, funclist) -> (function () -> Hashtbl.add funcs id funclist)
  | CallFunction (id) -> callfun id

and callfun id () =
  let f : YobaType.action list = Hashtbl.find funcs id in
  List.iter (function x -> process_action x ()) f
;;

while true do
  try
    let lexbuf = Lexing.from_channel stdin in
    process_action (YobaParser.main YobaLexer.token lexbuf) ()
  with
    YobaLexer.Eof ->
      print_stats ();
      exit 0
  | Parsing.Parse_error ->
      Printf.printf ">> Nee@# I don't understand b@#!\n";
      flush stdout
  | Failure (_) ->
      Printf.printf ">> Nee@# I don't understand b@#!\n";
      flush stdout
done
First we create two hash tables: one for variables and one for functions. The initial size of 10 is picked out of thin air; this is a toy language, why would we need many functions at once.
Then we declare two small helpers: one prints the statistics, the other increments or decrements variables.
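The increment/decrement helper boils down to a read-modify-write on the variable table. A standalone sketch of the same pattern (using Hashtbl.find_opt instead of the article's try/with; the helper name is invented):

```ocaml
(* Variable table, same shape as the interpreter's: name -> integer value. *)
let identifiers : (string, int) Hashtbl.t = Hashtbl.create 10

(* op is (+) for Increment and (-) for Decrement. *)
let arithm_sketch (id : string) (op : int -> int -> int) (value : int) : unit =
  match Hashtbl.find_opt identifiers id with
  | Some current -> Hashtbl.replace identifiers id (op current value)
  | None -> print_endline (">> you don't love " ^ id)  (* unknown variable *)
```

Passing the operator as a function is what lets one helper serve both Increment and Decrement.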
Then comes a group of three mutually recursive functions: cond processes conditional constructs (our if), callfun is responsible for calling functions, and process_action handles whatever instruction arrived at the input. Why all three depend on one another hardly needs explaining, I hope.
Note that the branches of process_action do not perform the action; they merely return a function that will execute it. Initially this was not the case, but exactly this small change later let me add support for user functions to the language quickly and painlessly.
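The trick can be shown in miniature: compile each instruction to a thunk (unit -> unit) first and run it later, which is what makes a stored function body trivially replayable. The tiny instruction type and accumulator below are invented for the illustration:

```ocaml
(* A toy instruction set; a mutable accumulator stands in for the variable table. *)
type instr = Add of int | Nop

let acc = ref 0

(* Compile to a closure now, execute later: the same shape as process_action. *)
let compile : instr -> (unit -> unit) = function
  | Add n -> (fun () -> acc := !acc + n)
  | Nop -> (fun () -> ())

(* A "user function" is just a stored list of instructions, replayed on call. *)
let call (body : instr list) : unit =
  List.iter (fun i -> compile i ()) body
```

Because compilation and execution are separated, the same body can be stored once and called any number of times.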
Finally, the last part of the code reads and processes the parser's output in an endless loop.
Add a Makefile to this and you can play:
all:
	ocamlc -c yobaType.ml
	ocamllex yobaLexer.mll
	ocamlyacc yobaParser.mly
	ocamlc -c yobaParser.mli
	ocamlc -c yobaLexer.ml
	ocamlc -c yobaParser.ml
	ocamlc -c yoba.ml
	ocamlc -o yoba yobaLexer.cmo yobaParser.cmo yoba.cmo
clean:
	rm -f *.cmo *.cmi *.mli yoba yobaLexer.ml yobaParser.ml
Trying out the language
And now, attention: why did I support both orders of values and variables? The thing is, the language is very, um, conversational; the interpreter literally talks to you. Because of that I myself kept mixing up the order in which to write things:
$ ./yoba
(the Russian session transcript did not survive translation; what remains shows variables being created and incremented, with the interpreter's statistics replies reporting values of 7 and -2, then 7 and -1, then 14 and -3)
Among the shortcomings: when a function is called, you get one success message about the increment/decrement for every such operation in its body.
I hope this was interesting. All the interpreter code can be downloaded here (2 KB).
If more experienced people share their comments on the code, I will be grateful.