Regular Expressions For All (REFA)
Main idea
There are many systems for finding substrings that match a given pattern. Unfortunately, they lose their power once many factors have to be taken into account: the expressions become cumbersome, incomprehensible, and hard to maintain.
That is why I tried to create an alternative: REFA, regular expressions for everyone.
The idea is as follows. As soon as a regular expression stops being obvious, split it in two. The optimizer will, where possible, still merge the pieces back into one, so there is no loss in speed, but the code becomes clearer.
For easier reading
ge.tt/9snPkzG/v/0 (.odt format)
Examples
Searching for C++ functions
Find the implementations of all methods of the dummy class.
The input is assumed to be one large string containing all of the project's code. It could be read from a file instead, but that would make the example harder to follow.
PROGRAM "FindMethods"
    ^name^ = ~\w?[\w|\d]*~

    BLOCK "FindClass"
        //
        PUSH
        BLOCKVAR $regexp = "class "+%classname%+"\s*\{.*\}.*;"
        MATCH $regexp
        CATCH MATCH_FAIL
            RETURN array() AS $list;
            RETURN array() AS $result;
        FINISH
        BLOCKVAR $class_code = MATCHED
        INCOMING = $class_code
        BLOCKVAR $method = ^name^+~\w*~+^name^+~\([\^name\^\w*\^name\^\w*\,?]*\)\w*~
        BLOCKVAR $declarations = array();
        BLOCKVAR $realisations = array();
        TRY
            WHILE 1
                MATCH PASS LIMIT 1 $method
                IF select(0,1) INCOMING != ";"
                    CALL "SearchEndOfFunction" REMAINED
                    $realisations ADD (MATCHED + RESULT $body)
                ELSE
                    $declarations ADD MATCHED
                ENDIF
            END
        ON MATCH_FAIL OR END_OF_STRING
            RETURN $declarations AS $list
            RETURN $realisations AS $result
        FINISH
        POP
    ENDBLOCK

    BLOCK "SearchEndOfFunction"
        BLOCKVAR UINT $level = 0
        MATCH ~[\{|\}]~
        FOREACH ALL_MATCHED AS $t
            IF $t == "{"
                $level++;
            ELSE
                $level--;
            ENDIF
            IF $level == 0
                BLOCKVAR STRING $ret = select(ALL_MATCHED[0], ALL_MATCHED[ITERATION]) INCOMING_BLOCK
                RETURN $ret AS $body
            ENDIF
        END
    ENDBLOCK

    BLOCK "AddClassName"
        MATCH PASS LIMIT 1 ^name^+"\w*"
        BLOCKVAR $ret = MATCHED
        $ret += "[\^name\^\w*::\w*]*"+%classname%+"\w*::\w*"
        $ret += REMAINED
        RETURN $ret
    ENDBLOCK

    BLOCK "SearchDeclaredFunctions"
        BLOCKVAR $dec = %declared%
        IMPLODE ($dec, "|") $string
        $string = "["+$string+"]"
        MATCH $string
        BLOCKVAR $realisations = array()
        FOREACH ALL_TILES AS $tile
            IF ITERATION % 2 == 1
                IF select(0,1) INCOMING != ";"
                    CALL "SearchEndOfFunction" ALL_TILES[ITERATION + 1]
                    $realisations ADD (ALL_TILES[ITERATION] + RESULT $body)
                ENDIF
            ENDIF
        END
        RETURN $realisations AS $result
    ENDBLOCK

    //
    BLOCKVAR $classname = $arg1
    CALL "FindClass"
    BLOCKVAR $ret = RESULT $result
    BLOCKVAR $declared = RESULT $list
    CALL "SearchDeclaredFunctions"
    $ret ADD RESULT $result
    RETURN $ret
ENDPROGRAM
The program turned out to be not very small, but at least it is more or less understandable. A single regular expression that does the same thing ... I would not advise it.
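For comparison, the brace counting done by the SearchEndOfFunction block can be sketched in an ordinary language. This is only a Python illustration of the idea, not part of REFA: the function name search_end_of_function is made up for the example, and braces inside string literals or comments are deliberately not handled.

```python
def search_end_of_function(text):
    """Return the text from the first '{' through its matching '}'.

    Hypothetical helper: it mirrors the $level counter of the
    SearchEndOfFunction block and ignores braces in strings or comments.
    """
    level = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "{":
            if start is None:
                start = i          # remember where the body begins
            level += 1
        elif ch == "}":
            if level == 0:
                # the REFA version would fail here because $level is a UINT
                raise ValueError("closing brace without an opening one")
            level -= 1
            if level == 0:
                return text[start:i + 1]
    raise ValueError("no balanced block found")


body = search_end_of_function(" { if (x) { y(); } } int z;")
```

Here body holds the whole function body including nested blocks, and the trailing " int z;" is left out.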
Documentation
Data types
INT
The default type. An integer. The range is -2^31 to +2^31-1. The default value is 0.
LONG
An integer. The range is -2^63 to +2^63-1. The default value is 0.
UINT
ULONG
STRING
A string. The maximum length is limited by UINT. Private fields START and COUNT.
There is no default value; relying on one raises an exception.
TILE
A part of a string. Private fields START, END, COUNT, PARENT_STRING.
Predefined Variables
INCOMING
The string being processed. It is used whenever no variable is specified.
INCOMING is a synonym for INCOMING_CURRENT.
- INCOMING_PROGRAM - the string passed into the program
- INCOMING_BLOCK - the string passed into the block
- INCOMING_CURRENT - the current string
- INCOMING_LAST - the string as it was before the last modification
MATCHED
The first match found by the last MATCH.
ALL_MATCHED
An array with all the matches of the last expression.
REMAINED
Starts at the first character after MATCHED.
ALL_REMAINED
The same, for every match in ALL_MATCHED.
ALL_TILES
The odd-numbered elements are ALL_MATCHED. The rest are the unmatched pieces between them, in the order in which they occur in the string.
ITERATION
The iteration number in the current loop. To get the iteration number of an outer loop, save it in a separate variable.
CALLSTACK
The call stack, with parameters.
QUERY_LOG
A log of the commands that affected the string in one way or another. Copies of strings are always recorded (in case they are processed later). The incoming data is stored as a single copy.
EXCEPTION_STRING
A string explaining the essence of the error: where it occurred, the incoming parameters, and the result.
Minimal set
What is needed for the simplest use of the system.
MATCH
MATCH [IGNORE {ignore_count | FIRST}] [PASS] [LIMIT {limit_count}] reg_exp [processing_string]
Matches reg_exp against processing_string. By default, START of processing_string is moved to MATCHED.
IGNORE - skip the first ignore_count matches. Default: IGNORE 0.
PASS - move START to the last element of ALL_REMAINED.
LIMIT - the maximum number of matches, after which the command stops. The default, LIMIT 0, means that it works until the end of the input.
reg_exp - either a regular expression written between ~ characters or a variable.
processing_string - the string to process. Default: INCOMING.
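The way IGNORE and LIMIT select matches can be illustrated with a small Python sketch. It reflects only my reading of the description above: the names match and MatchFail are invented for the example, IGNORE FIRST and PASS are not modelled, and an empty result is assumed to raise MATCH_FAIL.

```python
import re
from itertools import islice


class MatchFail(Exception):
    """Stand-in for the MATCH_FAIL error (hypothetical name)."""


def match(pattern, text, ignore=0, limit=0):
    """Return the matches left after skipping `ignore` of them.

    limit=0 means "no limit", as described for LIMIT 0.
    """
    stop = ignore + limit if limit else None
    found = [m.group(0) for m in islice(re.finditer(pattern, text), ignore, stop)]
    if not found:
        raise MatchFail(pattern)
    return found


# Skip the first digit, then take at most two matches.
digits = match(r"\d", "a1b2c3d4", ignore=1, limit=2)
```

With these arguments digits contains "2" and "3": the first match is ignored and the search stops after two results.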
ECHO
ECHO string
Outputs a string to the result.
The simplest example of a regular-expression replacement:
MATCH PASS ~some_regexp~
FOREACH ALL_TILES AS $tile
    IF ITERATION % 2
        // replace all matched pieces with a string
        ECHO "REPLACED"
    ELSE
        // all the pieces between the matched ones are returned unchanged
        ECHO $tile
    ENDIF
END
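The replacement example relies on ALL_TILES alternating between unmatched and matched pieces. A rough Python equivalent is sketched below, assuming zero-based indexes so that odd positions hold the matches; the function names all_tiles and replace_matches are made up for the illustration.

```python
import re


def all_tiles(pattern, text):
    """Split text into alternating tiles: gap, match, gap, match, ..., gap."""
    tiles = []
    pos = 0
    for m in re.finditer(pattern, text):
        tiles.append(text[pos:m.start()])  # unmatched piece before the match
        tiles.append(m.group(0))           # the match itself (odd index)
        pos = m.end()
    tiles.append(text[pos:])               # whatever is left after the last match
    return tiles


def replace_matches(pattern, text, replacement):
    """Same loop as the REFA example: swap odd tiles, keep even ones."""
    return "".join(
        replacement if i % 2 else tile
        for i, tile in enumerate(all_tiles(pattern, text))
    )


tiles = all_tiles(r"\d+", "a1b22c")
replaced = replace_matches(r"\d+", "a1b22c", "X")
```

Joining the tiles unchanged gives back the original string, which is why replacing only the odd ones is enough.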
IF ELSE ENDIF
IF expr then [ELSE else] ENDIF
If expr is not zero, the then code is executed; otherwise the else code is executed.
Extended set
PROGRAM
A program is an atomic set of executable commands that performs a given task. Only programs can have parameters other than the "default" one.
Generally speaking, a program can be a separate process (or thread) and run in parallel. There is no way to jump from one program into another, but you can use the blocks of a neighboring program if they are declared. Programs can call other programs.
The program is the scope for all of its blocks.
By default, all commands are enclosed in a program with a null name (it cannot be called from other programs).
PROGRAM name arg0 [arg1 arg2 ...] code ENDPROGRAM
name - the name of the program
arg0 - the string to be processed; it becomes INCOMING_PROGRAM
code - the program code, including declarations.
Blocks are accessed using the construction
program_name::block_name.
BLOCK
BLOCK name [string]
A section of code that is not independent. Calling it is equivalent to two goto jumps: one into the block and one back. If string is specified, the corresponding INCOMING is changed before the block starts and again after it returns.
PUSH POP
PUSH [var1 var2]
Saves the state of the system variables. You can also list local variables to be saved, and explicitly exclude individual system variables using !var.
POP - restores the state as it was just before the PUSH.
BLOCKVAR
A temporary variable available only in the current scope, and destroyed on exit.
RETURN RESULT
Used to return the values of temporary variables from a block or program.
RETURN name
RESULT name is used to access the variable in the calling code.
The value is valid until the next block is called.
Error handling
Various exceptions can occur during script execution that should not disrupt the execution flow. The exception system exists for this purpose.
exceptions: exception_name [OR exception_name ...]
CATCH FINISH
CATCH exceptions code [CATCH exceptions code ...] FINISH
Catches an error raised by the line immediately preceding the first CATCH.
Use it when an exception is expected at that point and must be handled.
TRY ON FINISH
TRY code ON exceptions code [ON exceptions code ...] FINISH
THROW
THROW exception
Raises an error manually.
Types of errors
- MATCH_FAIL - no occurrences of the regexp were found
- END_OF_STRING - the end of the string was reached before anything was found (implies MATCH_FAIL)
- WRONG_REGEXP - the regular expression could not be compiled
- VARIABLE_OVERFLOW - variable overflow
- UNSIGNED_NEGATIVE - a negative value was assigned to an unsigned number
- WRONG_STRING_INDEXES - an attempt to access a string by indexes beyond its boundaries
- OUT_OF_ARRAY - an attempt to access a nonexistent array element (out of bounds)
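The note that END_OF_STRING implies MATCH_FAIL can be read as a small hierarchy: a handler for MATCH_FAIL also receives END_OF_STRING. The Python sketch below models that reading with subclassing. This is an assumption about the intended semantics, and all class and function names are invented for the illustration.

```python
class RefaError(Exception):
    """Hypothetical common base for the errors listed above."""


class MatchFail(RefaError):
    """MATCH_FAIL: no occurrence of the regexp was found."""


class EndOfString(MatchFail):
    """END_OF_STRING: modelled as a subclass because it implies MATCH_FAIL."""


class WrongRegexp(RefaError):
    """WRONG_REGEXP: unrelated to MATCH_FAIL."""


def handle(error):
    """Mimic `TRY ... ON MATCH_FAIL ... FINISH` and report which branch ran."""
    try:
        raise error
    except MatchFail:
        return "handled as MATCH_FAIL"
    except RefaError:
        return "not a MATCH_FAIL"


outcome = handle(EndOfString())
```

Under this reading, writing ON MATCH_FAIL OR END_OF_STRING is redundant but harmless, since the first name already covers the second.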
Special constructions
~regexp~
The content is a regular expression.
%name%
At runtime it is replaced by a copy of the value of the variable $name (the closest one on the stack).
#name#
An analogue of define.
^name^
A reference to a regular expression. Inside ~~ it is written as \^
^hello^ = ~hel{2}o~
~\^hello\^ world~
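One possible way for an implementation to resolve such references is plain textual substitution before the expression is compiled. The Python sketch below shows the idea; expand_refs and the dictionary of named expressions are hypothetical, and wrapping each reference in a non-capturing group is my own choice to keep quantifiers from leaking across the boundary.

```python
import re


def expand_refs(pattern, named):
    """Replace every \\^name\\^ reference with the named sub-expression."""
    def substitute(m):
        # Group the inserted expression so it behaves as one unit.
        return "(?:" + named[m.group(1)] + ")"
    return re.sub(r"\\\^(\w+)\\\^", substitute, pattern)


named = {"hello": r"hel{2}o"}               # ^hello^ = ~hel{2}o~
full = expand_refs(r"\^hello\^ world", named)
is_match = re.fullmatch(full, "hello world") is not None
```

The expanded expression is an ordinary regular expression, so after this step the optimizer mentioned at the beginning would see a single pattern.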
Working with strings
array {tile} SPLIT (delimiter) [string]
tile SELECT (start, end) [string]
PASS (count) [&string]
CUT (count) [&string]
CUT_AFTER (index) [&string]
IMPLODE (array [, delimiter]) &string