Flex & utf8

“Long ago, it seems, last Friday,” I needed a lexical analyzer that could work with Unicode data.

I wanted to use Flex as the lexer generator, and this turned out to be a real problem.
On its own, Flex does not know how to work with Unicode data: when it builds the automaton, it assumes that characters are 7 or 8 bits wide.

I came across flex-2.5.4a-unicode-patch, but it handles only 16-bit characters and is tied to one specific Flex version, with all the consequences that entails.

Meanwhile, there is a simple and perfectly workable solution that does not require you to ~~lay unwashed hands on the holy of holies and~~ rebuild the tools.

We declare:

    %option 8bit
    %option c++
    ...
    alpha   [A-Za-z]
    U1      [\x80-\xbf]
    U2      [\xc2-\xdf]
    U3      [\xe0-\xef]
    U4      [\xf0-\xf4]
    ualpha  {alpha}|{U2}{U1}|{U3}{U1}{U1}|{U4}{U1}{U1}{U1}
    uname   ({ualpha}|\_)*
    ...
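To make it clearer what these byte classes accept, here is a minimal hand-written C++ sketch of the same matching logic. The function names `ualpha_length` and `uname_length` are hypothetical and are not part of Flex or of the original scanner; the sketch only mirrors the ranges from the definitions above. Note that, like the Flex pattern, it treats every multi-byte sequence in these ranges as a "letter" and does not reject overlong or otherwise invalid sequences such as `E0 80 80`.

```cpp
#include <cstddef>
#include <string>

// alpha: [A-Za-z]
static bool is_alpha(unsigned char b) {
    return (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z');
}

// U1: [\x80-\xbf], a UTF-8 continuation byte
static bool is_u1(unsigned char b) {
    return b >= 0x80 && b <= 0xBF;
}

// Number of bytes {ualpha} would consume at s[pos], or 0 if it does not match.
std::size_t ualpha_length(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) return 0;
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    if (is_alpha(lead)) return 1;

    std::size_t need = 0;                                   // continuation bytes required
    if (lead >= 0xC2 && lead <= 0xDF)      need = 1;        // U2
    else if (lead >= 0xE0 && lead <= 0xEF) need = 2;        // U3
    else if (lead >= 0xF0 && lead <= 0xF4) need = 3;        // U4
    else return 0;

    if (pos + need >= s.size()) return 0;                   // sequence is truncated
    for (std::size_t i = 1; i <= need; ++i) {
        if (!is_u1(static_cast<unsigned char>(s[pos + i]))) return 0;
    }
    return need + 1;
}

// Byte length of the longest prefix matching ({ualpha}|_)*.
std::size_t uname_length(const std::string& s) {
    std::size_t pos = 0;
    while (pos < s.size()) {
        if (s[pos] == '_') { ++pos; continue; }
        const std::size_t n = ualpha_length(s, pos);
        if (n == 0) break;
        pos += n;
    }
    return pos;
}
```

For example, for the UTF-8 bytes of a three-letter Cyrillic word followed by `_x`, `uname_length` reports 8 bytes: three two-byte letters plus two ASCII characters.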

And voilà, {uname} can now be used in a rule:

    %%
    ...
    {uname} {
        ...
        yylval.str_ = std::string(yytext);
        return XyzParser::ttName;
    }
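One thing to keep in mind with this approach: `yytext` holds raw UTF-8 bytes, so the `std::string` stored in the action measures its length in bytes, not in characters. If a character count is needed, a small helper along these lines could be used. This is only a sketch; `count_code_points` is a hypothetical name, and it assumes the input is already valid UTF-8, which the scanner pattern does not fully guarantee.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by counting every byte that is
// not a continuation byte (continuation bytes have the form 10xxxxxx).
// Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For a name made of three Cyrillic letters, `size()` on the string returns 6 while `count_code_points` returns 3.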

Source: https://habr.com/ru/post/192556/
