The range of applications for parsing mathematical expressions is easy to imagine: all sorts of SQL query parsers, handlers for formulas entered by the user (plotting graphs, say, or building database filters), right up to creating your own languages (I deliberately avoid the word "programming", since these are often data-description languages and the like).
Perhaps I'm wrong, but I couldn't find a more or less usable expression parser for PHP anywhere on the web, and, as regular readers of my articles know, I'm used to doing such things myself, that is, reinventing the wheel. :^)
You can find the result of my attempts here. The archive contains the scripts the library needs to work and an example of its use (sample.php). The library is built as a standalone component.
But I think it would be interesting to look at how it all works.
As you can easily guess, the library is built on RPN (reverse Polish notation). However, RPN had to be modified to fit my needs, since its textbook description covers only the basic machinery for parsing expressions. So, here is how the library operates:
Expression-parsing rules are described in a class inherited from the CIpr class. Grammars and command classes are defined in its protected p_build method.
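For orientation, here is a minimal skeleton of such a class. The include path and class name are hypothetical; load whatever scripts the archive actually ships (sample.php shows the real bootstrap).

<?php
// NOTE: the file name below is an assumption - use the actual script names
// from the archive.
require_once 'cipr.php';

class CCalc extends CIpr            // parsing rules live in a CIpr subclass
{
    protected function p_build()    // grammars and command classes are defined here
    {
        // filled in step by step in the sketches below
    }
}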
First you need to define grammar classes (or grammars), on the basis of which the parser splits the string passed to it into lexemes. Each lexeme consists of a token (the actual word) and the identifier of the class that matched it. The class identifier is an array index, i.e. it can be a string or an integer (I used strings for clarity). I defined the following grammar types:
grammar_char(characters [, combinations]) - grammars of this type are defined by a character set (a string listing those characters) and, optionally, by combinations of characters (an array of strings).
The definition looks like this: grammar_char("+-*/=", array("+=", "++")). This grammar defines the single characters +, -, *, / and = and their combinations += and ++.
grammar_list(tokens [, case sensitive = true]) - this type of grammar is defined by a predefined set of tokens. You can also specify whether case should be taken into account when matching a token (by default it is).
You can use it like this: grammar_list(array("prefix1", "prefix2"), false). This grammar defines the case-insensitive tokens prefix1 and prefix2. Note that a token from the list is simply looked for at the beginning of the string, without regard to separators, i.e. the lexeme prefix1 will be extracted from the expression prefix1something.
grammar_preg(start [, extract]) - this type of grammar, along with the character one, will, I think, be the most useful. These are grammars based on regular expressions. They are described by a "start" regular expression and, optionally, an "extract" regular expression. If no extract expression is given, the start one is used instead. Here, for example, is a regular expression that matches comments of the form // comment: /^\/\/.*/
grammar_quot(quote character [, escape mode = C_ESCAPE_CHAR]) - grammars of this type are quoted strings. The escape mode can take the following values:
C_ESCAPE_NONE - characters inside the string are not escaped (the string "10\"" will be recognized as 10\).
C_ESCAPE_CHAR - only the escaping characters themselves can be escaped (the string "10\'-\"" will be recognized as 10\'-").
C_ESCAPE_ALL - the escape character \ may be put before any character (the string "1\0\'-\"" will be recognized as 10'-").
grammar_brac(bracket characters [, escape mode = C_ESCAPE_ALL [, allow nesting = true]]) - these are bracket grammars (i.e., they extract bracketed expressions such as <...>). Their usefulness in this context is doubtful to me personally; they are intended for other purposes (this library is only part of a larger platform). But you can use them if you need to. :^)
grammar_user(start, extract) - a custom grammar defined by user-supplied start and extract functions.
Lexical analysis proceeds as follows: the input string is checked against all grammars in turn, and if the beginning of one of them is detected (as a rule, coinciding with the beginning of the string), the lexeme is extracted. The result of this analysis is an array of lexemes.
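To illustrate, the grammar definitions for a small calculator dialect might look like the sketch below. How exactly each grammar is attached to its class identifier is my assumption (an associative array is used here purely for illustration); sample.php shows the library's real calls.

protected function p_build()
{
    // Class identifiers are the array indexes; strings are used for clarity.
    // Storing the grammars in $this->grammars is an assumption, not the
    // library's documented API - check sample.php for the actual registration.
    $this->grammars = array(
        'op'  => grammar_char("+-*/=?:(),;", array("+=", "++")), // operator, bracket and delimiter characters plus the combinations += and ++
        'kw'  => grammar_list(array("and", "or"), false),        // case-insensitive keywords
        'num' => grammar_preg('/^\d+(\.\d+)?/'),                 // numbers, start regex only
        'str' => grammar_quot('"', C_ESCAPE_ALL),                // quoted strings, \ escapes anything
    );
}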
Next comes the more interesting part: the classes of commands that may appear in the expression. You can define the following types of commands:
ipr_lexeme(lexeme) - specifies that the given lexeme should be processed as a command of the "lexeme" type. The lexeme may be either a full-fledged lexeme defined via the lexeme and ilexeme functions, or simply a string specifying the lexeme's token.
ipr_operator(lexeme, priority [, arity = 2 [, associativity = none (which is currently equivalent to left, which is not quite right)]]) - defines an operator.
You can also use the helper functions:
ipr_prefix - prefix unary operator;
ipr_postfix - postfix unary operator;
ipr_binary - binary operator.
ipr_compound(array of lexemes, priority [, arity = 3 [, associativity = right [, calculate automatically = no]]]) - defines a complex operator. How does it differ from a simple one? The operands of a simple operator are computed by the engine itself, which, say, for the ?: operator is not entirely correct, because only one of its second and third operands must be computed. That is what the complex operator is for: only its first operand is computed automatically, and all subsequent ones are passed to the user's method for processing. It therefore makes sense to define complex operators with an arity of at least 2.
You can also use the helper functions:
ipr_access - binary operator for accessing an object field;
ipr_ternary - ternary operator.
ipr_brackets(opening lexeme, closing lexeme, comma [, can be used without operand = no [, arity = not defined [, calculate automatically = yes]]]) - defines brackets. Brackets can be used without an operand (for example, (a + b)) or with one (for example, arr[10]). The arity of brackets can also be specified when they must take a strictly defined number of operands. Automatic calculation is described above.
You can also use the helper function:
ipr_default - defines the brackets used to set priority (i.e. the default brackets).
ipr_instruction(lexeme, opening lexeme, closing lexeme, comma [, arity = not defined [, calculate automatically = no]]) - defines an instruction. In essence it is the same as brackets that cannot appear without an operand, except that instead of an operand in front of the brackets the instruction's lexeme is used.
ipr_end(lexeme) - defines a lexeme at which the expression is considered finished.
ipr_until(lexeme) - defines a lexeme at which the expression is considered finished; in this case the lexeme remains at the beginning of the remainder of the expression.
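Continuing the same sketch, the command classes for ordinary arithmetic plus the ?: operator could be declared as follows. Collecting them into a $this->commands array is only my guess at the registration mechanism, the helper argument lists are assumed to mirror ipr_operator and ipr_compound, and the priorities are purely illustrative.

// Still inside p_build(), after the grammars. $this->commands is an assumed
// registration point - sample.php shows the library's actual calls.
$this->commands = array(
    ipr_binary("+", 10),
    ipr_binary("-", 10),
    ipr_binary("*", 20),                 // higher priority than + and -
    ipr_binary("/", 20),
    ipr_prefix("-", 30),                 // unary minus
    ipr_compound(array("?", ":"), 5),    // ?: - only the first operand is computed automatically
    ipr_default("(", ")"),               // default (priority) brackets; argument list assumed
    ipr_end(";"),                        // the expression ends at ';'
);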
Now you need to define handlers for the command types, of which there are six: operand, lexeme, operator, complex operator, brackets, and instruction. These are the following protected methods:
p_run_operand(command)
p_run_lexeme(command)
p_run_operator(command, operands)
p_run_compound(command, operand, operands)
p_run_brackets(command, operand, operands)
p_run_instruction(command, operands)
The command argument contains the command type (not that important, since the type already determines which handler is called, so a situation where an operator command arrives at the operand handler is impossible), the command class, and the lexeme that was defined as the command. The operand argument contains an already computed operand, and operands contains operands that are either computed or pending processing via the p_run method (depending on the command type and the "calculate automatically" flag).
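For a plain calculator the handlers might boil down to something like this. The internal layout of the command argument (read here as an array with a 'lexeme' key) and the way pending operands are passed to p_run are assumptions on my part; sample.php demonstrates the genuine handlers.

// Handler sketches for the CCalc class above.
protected function p_run_operand($command)
{
    return (float)$command['lexeme'];            // a number token becomes a float
}

protected function p_run_operator($command, $operands)
{
    switch ($command['lexeme']) {                // operands are already computed here
        case '+': return $operands[0] + $operands[1];
        case '-': return count($operands) == 1 ? -$operands[0]
                                               : $operands[0] - $operands[1];
        case '*': return $operands[0] * $operands[1];
        case '/': return $operands[0] / $operands[1];
    }
}

protected function p_run_compound($command, $operand, $operands)
{
    // ?: - the first operand is already computed; the two branches are still
    // pending and are evaluated lazily through p_run
    return $operand ? $this->p_run($operands[0]) : $this->p_run($operands[1]);
}

protected function p_run_brackets($command, $operand, $operands)
{
    return $operands[0];                         // default brackets just pass the value through
}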
Now, to create an expression handler, you create an instance of your handler class and pass the expression to be processed to its constructor. The constructor parses and compiles the expression. To evaluate the expression, call the run method.
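Putting the sketch together, using such a class would look roughly like this (the class name and expression are, of course, just examples):

$calc = new CCalc("2+3*4;");   // the constructor parses and compiles the expression
echo $calc->run();             // run() evaluates it and prints 14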
Well, that seems to be all. Error handling and non-associative operators are next on the list; I hope to finish all of that in the near future. As usual, I ask those for whom none of this is new not to judge too strictly: I publish my work for those who may find it useful, since it seems to me these pieces are not entirely useless.
Thank you for your attention! I would be very glad to hear your feedback and remarks in the comments, however positive or otherwise they may be. :^)