⬆️ ⬇️

re2c - regular expression compiler

The task of extracting certain tokens from a stream of characters is quite common. Often it is solved with the help of lexical analyzers configured by regular expressions. Many analyzers are built on the principle of generating program code, which in turn implements the logic of regular expressions. In fact, this is a compilation of the regular expression language into the programming language code.



For example, flex is one of these analyzers. Old, but proven over the years.



I used a lot of flex, it has both good and bad sides, but by and large, I didn't have to complain.

')

But yesterday I came across an interesting project - re2c . In fact, on this thing you can write lexical analyzers directly on your knee in a few minutes.





Suppose you need to select from the line some commands, integers and fractional numbers. You can uncover flex, and you can write like this:



#include <stdio.h> #include <stdlib.h> enum { CMD, INT, FLOAT, SPACE, END }; int scan(char** p, char** lex) { char* marker; if (lex) *lex = *p; /*!re2c re2c:define:YYCTYPE = "unsigned char"; re2c:define:YYCURSOR = *p; re2c:define:YYMARKER = marker; re2c:yyfill:enable = 0; re2c:yych:conversion = 1; re2c:indent:top = 1; "GET"|"PUT"|"EXIT" { return CMD; } [0-9]+ { return INT; } [0-9]+ '.' [0-9]* { return FLOAT; } [ \t]+ { return SPACE; } [^] { return END; } */ } int main(int argc, char* argv[]) { char *p, *last; int token; if (argc < 2) return 1; p = argv[1]; while ((token = scan(&p, &last)) != END) { int sz = p - last; switch (token) { case CMD: printf("Command: '%.*s'\n", sz, last); break; case INT: printf("Number: '%.*s'\n", sz, last); break; case FLOAT: printf("Float: '%.*s'\n", sz, last); break; } } return 0; } 


And that's it!



It is clear that all the magic happens in the “scan ()” function between the lines limited by the comments “/ *! Re2c” and “* /”.



So, re2c is a regular expression compiler that embeds code directly into the program text.



If we run our source through re2c:



  re2c.exe -is test.re2c >test.c 


So we get this:



 /* Generated by re2c 0.13.5 on Tue Apr 19 21:08:57 2011 */ #include <stdio.h> #include <stdlib.h> enum { CMD, INT, FLOAT, SPACE, END }; int scan(char** p, char** lex) { char* marker; if (lex) *lex = *p; { unsigned char yych; yych = (unsigned char)**p; if (yych <= '9') { if (yych <= 0x1F) { if (yych == '\t') goto yy8; goto yy10; } else { if (yych <= ' ') goto yy8; if (yych <= '/') goto yy10; goto yy6; } } else { if (yych <= 'F') { if (yych == 'E') goto yy5; goto yy10; } else { if (yych <= 'G') goto yy2; if (yych == 'P') goto yy4; goto yy10; } } yy2: yych = (unsigned char)*(marker = ++*p); if (yych == 'E') goto yy24; yy3: { return END; } yy4: yych = (unsigned char)*(marker = ++*p); if (yych == 'U') goto yy23; goto yy3; yy5: yych = (unsigned char)*(marker = ++*p); if (yych == 'X') goto yy18; goto yy3; yy6: ++*p; if ((yych = (unsigned char)**p) == '.') goto yy13; if (yych <= '/') goto yy7; if (yych <= '9') goto yy16; yy7: { return INT; } yy8: ++*p; yych = (unsigned char)**p; goto yy12; yy9: { return SPACE; } yy10: yych = (unsigned char)*++*p; goto yy3; yy11: ++*p; yych = (unsigned char)**p; yy12: if (yych == '\t') goto yy11; if (yych == ' ') goto yy11; goto yy9; yy13: ++*p; yych = (unsigned char)**p; if (yych <= '/') goto yy15; if (yych <= '9') goto yy13; yy15: { return FLOAT; } yy16: ++*p; yych = (unsigned char)**p; if (yych == '.') goto yy13; if (yych <= '/') goto yy7; if (yych <= '9') goto yy16; goto yy7; yy18: yych = (unsigned char)*++*p; if (yych == 'I') goto yy20; yy19: *p = marker; goto yy3; yy20: yych = (unsigned char)*++*p; if (yych != 'T') goto yy19; yy21: ++*p; { return CMD; } yy23: yych = (unsigned char)*++*p; if (yych == 'T') goto yy21; goto yy19; yy24: ++*p; if ((yych = (unsigned char)**p) == 'T') goto yy21; goto yy19; } } int main(int argc, char* argv[]) { char *p, *last; int token; if (argc < 2) return 1; p = argv[1]; while ((token = scan(&p, &last)) != END) { int sz = p - last; switch (token) { case CMD: printf("Command: '%.*s'\n", sz, last); break; case INT: printf("Number: '%.*s'\n", sz, last); break; case FLOAT: printf("Float: '%.*s'\n", sz, last); break; } } return 0; } 


Fearfully? Yes, the code is not for manual editing, but this is not required.



Compile:



  re2c.exe -is test.re2c >test.c && cl test.c 


Run:



  test "GET 123.0 12344 PUT 10." 


Result:



  Command: 'GET' Float: '123.0' Number: '12344' Command: 'PUT' Float: '10.' 


As they say, fast, cheap and angry. In order to fully master re2c, one and only documentation page should be read.



By the way, the simplicity of working with re2c does not mean that you cannot make complex analyzers on it. The distribution has examples for the grammar of the tokens of the languages C and Rexx .



If you play with re2c flags , then instead of “if / else” you can generate code based on “switch / case”. The choice should be made on the basis of understanding which code your compiler optimizes better.



As I understand it, the analyzer generated by re2c should be very fast, even very fast. Although it would be interesting to measure it against the same flex, ANTLR 'a or Spirit ' a.



Posts almost on the topic:

Source: https://habr.com/ru/post/117843/



All Articles