
Adding a Range Operator to PHP

Pictured: Ancient Psychic Tandem War Elephant © Adventure Time

This article walks through implementing a new operator in PHP. The following steps are involved:

  1. Updating the lexer.
  2. Updating the parser.
  3. Updating the compilation stage.
  4. Updating the Zend virtual machine.

Along the way, the article briefly covers a few aspects of PHP's internals. My sincere thanks to Nikita Popov for his help in finalizing this article.

Range operator


This is the operator we will add to PHP. It is written with two characters: |>. To keep things simple, its semantics are defined as follows:
  1. The increment step will always be equal to one.
  2. Operands can be integer or floating point.
  3. If min = max, then a singleton array containing min will be returned.

These points are revisited in the last section, “Updating the Zend Virtual Machine,” where the semantics are implemented.

If any of these conditions is not met, an Error exception is thrown. That is:

  • if an operand is neither an integer nor a float;
  • if min > max;
  • if the range (max - min) is too large.

Examples:

    1 |> 3; // [1, 2, 3]
    2.5 |> 5; // [2.5, 3.5, 4.5]

    $a = $b = 1;
    $a |> $b; // [1]

    2 |> 1; // Error exception
    1 |> '1'; // Error exception
    new StdClass |> 1; // Error exception
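The rules above can be sketched in plain C, outside the Zend engine. This is only an illustration of the intended semantics: the names make_range, range_status and RANGE_MAX_ELEMS are invented for this sketch, the size cap is arbitrary, and the operand type check is omitted because C's type system already forces numeric arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical status codes, invented for this sketch. */
typedef enum {
    RANGE_OK,
    RANGE_ERR_ORDER,     /* min > max */
    RANGE_ERR_TOO_LARGE, /* max - min exceeds the cap */
    RANGE_ERR_NOMEM
} range_status;

/* Arbitrary illustrative cap; the real limit in the engine is HT_MAX_SIZE. */
#define RANGE_MAX_ELEMS 1000000u

/* Build [min, min + 1, ...] up to and including the last value <= max. */
range_status make_range(double min, double max, double **out, size_t *count)
{
    if (min > max) {
        return RANGE_ERR_ORDER;
    }
    double span = max - min;
    if (span >= (double) RANGE_MAX_ELEMS - 1) {
        return RANGE_ERR_TOO_LARGE;
    }
    /* Truncating the span drops the fractional part: 2.5..5 has 3 elements. */
    size_t n = (size_t) span + 1;
    double *items = malloc(n * sizeof *items);
    if (items == NULL) {
        return RANGE_ERR_NOMEM;
    }
    for (size_t i = 0; i < n; ++i) {
        items[i] = min + (double) i; /* the step is always one */
    }
    *out = items;
    *count = n;
    return RANGE_OK;
}
```

The caller owns the returned buffer and releases it with free(). When min equals max, the span is zero and a single-element array is produced, matching the third rule.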

Updating the lexer


First, the new token has to be registered in the lexer, so that when the source code is tokenized, the character sequence |> is matched and returned as the T_RANGE token. To do this, update the Zend/zend_language_scanner.l file. Add the following code to it (in the section where all the tokens are declared, around line 1200):

 <ST_IN_SCRIPTING>"|>" { RETURN_TOKEN(T_RANGE); } 

ST_IN_SCRIPTING is the state the lexer is currently in. This means the |> character sequence is only matched while the lexer is in normal scripting mode. The code between the curly braces is C code that is executed when |> is found in the source code. In this example, it simply returns the T_RANGE token.
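What the generated scanner does for this rule can be approximated by hand. The following is a minimal sketch, not the re2c output: the token names and next_token are invented for illustration, and only the | family of tokens is distinguished. It shows why the two-character sequence has to be tried before the single | character.

```c
#include <assert.h>

/* Hypothetical token kinds for this sketch only. */
typedef enum { TOK_EOF, TOK_PIPE, TOK_BOOL_OR, TOK_RANGE, TOK_OTHER } token_kind;

/* Scan one token starting at *cursor and advance the cursor past it. */
token_kind next_token(const char **cursor)
{
    const char *p = *cursor;

    while (*p == ' ') {
        ++p; /* skip whitespace between tokens */
    }
    if (*p == '\0') {
        *cursor = p;
        return TOK_EOF;
    }
    /* Longest match first: "|>" and "||" must win over a lone "|". */
    if (p[0] == '|' && p[1] == '>') {
        *cursor = p + 2;
        return TOK_RANGE;
    }
    if (p[0] == '|' && p[1] == '|') {
        *cursor = p + 2;
        return TOK_BOOL_OR;
    }
    if (p[0] == '|') {
        *cursor = p + 1;
        return TOK_PIPE;
    }
    *cursor = p + 1; /* everything else is lumped together here */
    return TOK_OTHER;
}
```

Reading p[1] is safe even when | is the last character, because the string's terminating NUL byte is still part of the buffer.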

Aside. Because we are modifying the lexer, Re2c is needed to regenerate it. Ordinary PHP builds do not need this dependency.

The T_RANGE identifier must be declared in the Zend/zend_language_parser.y file. To do this, add the following at the end of the section where the other token identifiers are declared (around line 220):

 %token T_RANGE "|> (T_RANGE)" 

PHP now recognizes the new operator:

 1 |> 2; // Parse error: syntax error, unexpected '|>' (T_RANGE) in... 

But because the grammar does not yet say how the token may be used, we get a parse error. We will fix that in the next section.

Now we need to regenerate the ext/tokenizer/tokenizer_data.c file so that the tokenizer extension can work with the new token. This extension simply provides an interface between the lexer and userland through the token_get_all and token_name functions. At the moment it is blissfully unaware of the T_RANGE token:

 echo token_name(token_get_all('<?php 1|>2;')[2][0]); // UNKNOWN 

To regenerate ext/tokenizer/tokenizer_data.c, go to the ext/tokenizer directory and run the tokenizer_data_gen.sh script. Then return to the php-src root directory and rebuild PHP. Check the tokenizer extension again:

 echo token_name(token_get_all('<?php 1|>2;')[2][0]); // T_RANGE 

Updating the parser


The parser needs to be updated so that it can validate where the new T_RANGE token is used in PHP scripts. The parser is also responsible for:

  • declaring the precedence and associativity of the new operator;
  • creating the new AST node.

All of this is done in the Zend/zend_language_parser.y grammar file, which contains the token declarations and the production rules that Bison uses to generate the parser.

Aside. Precedence determines how expressions are grouped. For example, in the expression 3 + 4 * 2, the * operator has higher precedence than +, so the expression is grouped as 3 + (4 * 2).

Associativity describes how an operator behaves when it is chained: whether it can be chained at all and, if so, how it is grouped within an expression. PHP's ternary operator is left-associative, so it is grouped and evaluated from left to right. That is, the expression

    1 ? 0 : 1 ? 0 : 1; // 1

is evaluated as

    (1 ? 0 : 1) ? 0 : 1; // 1

If the operator were right-associative instead, the expression would be evaluated as follows:

    1 ? 0 : (1 ? 0 : 1); // 0

Non-associative operators cannot be chained at all. The < operator is one of them, so this expression is invalid:

    1 < $a < 2;
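For contrast, C's own conditional operator is right-associative, so the two groupings discussed above can be compared directly in C. This is a small sketch with invented function names; it says nothing about how PHP itself implements the ternary operator.

```c
#include <assert.h>

/* C parses a ? b : c ? d : e as a ? b : (c ? d : e), i.e. right-associative. */
int ternary_default_grouping(void)
{
    return 1 ? 0 : 1 ? 0 : 1;
}

/* Explicit parentheses reproduce the left-associative grouping. */
int ternary_left_grouping(void)
{
    return (1 ? 0 : 1) ? 0 : 1;
}
```

The first function returns 0 and the second returns 1, which are exactly the two results shown in the aside.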

Because the range operator evaluates to an array, it makes no sense to use its result as an operand of another range operator (for example, 1 |> 3 |> 5). So let's make it non-associative. At the same time, we give it the same precedence as the combined comparison operator (T_SPACESHIP). This is done by appending the T_RANGE token to the end of the following line (around line 70):

 %nonassoc T_IS_EQUAL T_IS_NOT_EQUAL T_IS_IDENTICAL T_IS_NOT_IDENTICAL T_SPACESHIP T_RANGE 

Next, to handle the new operator, the expr_without_variable rule needs to be updated. Add the following code to it (for example, right before the T_SPACESHIP rule, around line 930):

 | expr T_RANGE expr { $$ = zend_ast_create(ZEND_AST_RANGE, $1, $3); } 

The | symbol means “or”: any one of the listed rules may match. When a match is found, the code inside the curly braces is executed. $$ refers to the result node, which stores the value of the expression. The zend_ast_create function creates the AST node for our new operator. The node is named ZEND_AST_RANGE and has two children: $1 refers to the left operand (the first expr in expr T_RANGE expr) and $3 refers to the right operand (the second expr).

Now we need to define the ZEND_AST_RANGE constant for the AST. To do this, update the Zend/zend_ast.h file by adding the constant to the list of nodes with two children (for example, under ZEND_AST_COALESCE):

 ZEND_AST_RANGE, 

At this point, using the range operator simply makes the interpreter hang:

 1 |> 2; 

Updating the compilation stage


The parser produces an AST, which is then traversed recursively. As each node of the tree is visited, a corresponding compile function is invoked. These functions emit opcodes that the Zend virtual machine later interprets.

Compilation happens in Zend/zend_compile.c. Let's add our new AST node (ZEND_AST_RANGE) to the large switch statement in the zend_compile_expr function (for example, immediately after ZEND_AST_COALESCE):

    case ZEND_AST_RANGE:
        zend_compile_range(result, ast);
        return;

Now somewhere in the same file you need to declare the function zend_compile_range :

    void zend_compile_range(znode *result, zend_ast *ast) /* {{{ */
    {
        zend_ast *left_ast = ast->child[0];
        zend_ast *right_ast = ast->child[1];
        znode left_node, right_node;

        zend_compile_expr(&left_node, left_ast);
        zend_compile_expr(&right_node, right_ast);

        zend_emit_op_tmp(result, ZEND_RANGE, &left_node, &right_node);
    }
    /* }}} */

We start by reading the left and right operands of the ZEND_AST_RANGE node into the pointer variables left_ast and right_ast. Next, we declare two znode variables that will hold the result of compiling the AST node of each operand. This is the recursive part of processing the tree and compiling its nodes into opcodes.

Then, using the zend_emit_op_tmp function, we emit the ZEND_RANGE opcode with its two operands.
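The shape of this step (compile both children first, then emit one instruction whose result is a fresh temporary) can be shown with a toy compiler. Everything below is a simplified stand-in invented for illustration: node, operand, op, op_array and compile_expr are not the engine's types or functions, and the fixed-size instruction buffer is an arbitrary simplification.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { NODE_NUM, NODE_RANGE } node_kind;

typedef struct node {
    node_kind kind;
    long value;            /* used by NODE_NUM */
    struct node *child[2]; /* used by NODE_RANGE */
} node;

typedef enum { OPERAND_CONST, OPERAND_TMP } operand_type;

typedef struct {
    operand_type type;
    long payload; /* literal value for CONST, slot number for TMP */
} operand;

typedef enum { OP_RANGE } opcode;

typedef struct {
    opcode code;
    operand op1, op2, result;
} op;

#define MAX_OPS 16

typedef struct {
    op ops[MAX_OPS];
    size_t count;
    long next_tmp; /* next free temporary slot */
} op_array;

/* Compile a node and return the operand that holds its value. */
operand compile_expr(op_array *out, const node *n)
{
    if (n->kind == NODE_NUM) {
        operand literal = { OPERAND_CONST, n->value };
        return literal; /* literals need no instruction */
    }

    /* Recursive part: both operands are compiled before the operator. */
    operand left = compile_expr(out, n->child[0]);
    operand right = compile_expr(out, n->child[1]);

    assert(out->count < MAX_OPS);
    operand result = { OPERAND_TMP, out->next_tmp++ };
    op *emitted = &out->ops[out->count++];
    emitted->code = OP_RANGE;
    emitted->op1 = left;
    emitted->op2 = right;
    emitted->result = result;
    return result; /* a temporary, usable as an operand of a later op */
}
```

Returning the result operand to the caller is what lets an enclosing expression use the range's value as one of its own operands.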

Let's briefly discuss the operation codes and their types in order to better understand the meaning of using the zend_emit_op_tmp function.

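Opcodes are the instructions executed by the virtual machine. Each of them has:

  • a name (a constant that maps to an integer);
  • an op1 node (optional);
  • an op2 node (optional);
  • a result node (optional);
  • an extended value (optional).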
Aside. The opcodes of a PHP script can be inspected using:

  • PHPDBG: sapi/phpdbg/phpdbg -np* program.php
  • Opcache
  • the Vulcan Logic Disassembler (VLD) extension: sapi/cli/php -dvld.active=1 program.php
  • 3v4l, if the script is short and simple

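Opcode nodes (znode_op structures) can be of several types:

  • IS_CONST: literal values (such as 1 or 5.0);
  • IS_TMP_VAR: temporary variables holding the intermediate result of an expression;
  • IS_VAR: variable-like expressions that are not simple variables (such as $cmplx->var);
  • IS_CV: compiled variables, that is, simple variables such as $a;
  • IS_UNUSED: marks an operand that is not used.

This brings us back to the zend_emit_op_tmp function, which emits a zend_op whose result is of type IS_TMP_VAR. We need this because our operator is an expression, and the value (array) it produces is a temporary that can be used as an operand of another opcode (for example, ASSIGN in $var = 1 |> 3;).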

Updating the Zend Virtual Machine


To handle the execution of our new opcode, the virtual machine needs to be updated. This means updating the Zend/zend_vm_def.h file. Add the following at the very end:

    ZEND_VM_HANDLER(182, ZEND_RANGE, CONST|TMP|VAR|CV, CONST|TMP|VAR|CV)
    {
        USE_OPLINE
        zend_free_op free_op1, free_op2;
        zval *op1, *op2, *result, tmp;

        SAVE_OPLINE();
        op1 = GET_OP1_ZVAL_PTR_DEREF(BP_VAR_R);
        op2 = GET_OP2_ZVAL_PTR_DEREF(BP_VAR_R);
        result = EX_VAR(opline->result.var);

        // if both operands are integers
        if (Z_TYPE_P(op1) == IS_LONG && Z_TYPE_P(op2) == IS_LONG) {
            // for when min and max are integers
        } else if ( // if both operands are either integers or doubles
            (Z_TYPE_P(op1) == IS_LONG || Z_TYPE_P(op1) == IS_DOUBLE)
            && (Z_TYPE_P(op2) == IS_LONG || Z_TYPE_P(op2) == IS_DOUBLE)
        ) {
            // for when min and max are either integers or floats
        } else {
            // for when min and max are neither integers nor floats
        }

        FREE_OP1();
        FREE_OP2();
        ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION();
    }

The opcode number must be one greater than the current highest opcode number, which is why 182 is used here. To quickly find the current maximum, look at the end of the Zend/zend_vm_opcodes.h file, where the ZEND_VM_LAST_OPCODE constant is defined.

Aside. The code above contains several pseudo-macros (such as USE_OPLINE and GET_OP1_ZVAL_PTR_DEREF). These are not real C macros: they are replaced by the Zend/zend_vm_gen.php script when the virtual machine is generated, rather than by the C preprocessor when the source code is compiled. So if you want to see their definitions, look in the Zend/zend_vm_gen.php file.

The ZEND_VM_HANDLER pseudo-macro contains the definition of each opcode. It can take five parameters:

  1. The opcode number (182).
  2. The opcode name (ZEND_RANGE).
  3. The valid types of the left operand (CONST|TMP|VAR|CV) (see $vm_op_decode in Zend/zend_vm_gen.php).
  4. The valid types of the right operand (CONST|TMP|VAR|CV) (see $vm_op_decode in Zend/zend_vm_gen.php).
  5. An optional flag for the extended value of overloaded opcodes (see $vm_ext_decode in Zend/zend_vm_gen.php).

Given the operand types described earlier, we can see that:

    // CONST enables for
    1 |> 5.0;
    // TMP enables for
    (2**2) |> (1 + 3);
    // VAR enables for
    $cmplx->var |> $var[1];
    // CV enables for
    $a |> $b;

Aside. If one or both operands are not used, they are marked with ANY.

Aside. TMPVAR was introduced in ZE 3. It handles the same opcode node types as TMP|VAR, but generates different code. TMPVAR generates a single handler that processes both TMP and VAR, which reduces the size of the virtual machine but requires more conditional logic. TMP|VAR generates separate handlers for TMP and VAR, which increases the size of the virtual machine but requires fewer conditionals.

Let's turn to the body of our opcode definition. We start by invoking the USE_OPLINE pseudo-macro to declare the opline variable (a zend_op structure). It is used to read the operands (with pseudo-macros such as GET_OP1_ZVAL_PTR_DEREF) and to set the return value of the opcode.

Next, we declare two zend_free_op variables. These are simply zval pointers, one for each operand we use. They are used when checking whether an operand needs to be freed. Then we declare four zval variables. op1 and op2 are pointers to zvals that hold the operand values. The result variable stores the result of the opcode. Finally, tmp holds the intermediate value of each iteration of the loop over the range; this value is copied into the hash table on every iteration.

The op1 and op2 variables are initialized with the GET_OP1_ZVAL_PTR_DEREF and GET_OP2_ZVAL_PTR_DEREF pseudo-macros. These pseudo-macros are also responsible for initializing free_op1 and free_op2. The BP_VAR_R constant passed to them is a type flag. Its name stands for BackPatching Variable Read, and it is used when reading compiled variables. Finally, we dereference opline and assign its result slot to result for later use.

Now let's fill in the body of the first if, which handles the case where min and max are both integers:

    zend_long min = Z_LVAL_P(op1), max = Z_LVAL_P(op2);
    zend_ulong size, i;

    if (min > max) {
        zend_throw_error(NULL, "Min should be less than (or equal to) max");
        HANDLE_EXCEPTION();
    }

    // calculate size (one less than the total size for an inclusive range)
    // the unsigned subtraction avoids signed overflow for extreme operands
    size = (zend_ulong) max - (zend_ulong) min;

    // the size cannot be greater than or equal to HT_MAX_SIZE
    // HT_MAX_SIZE - 1 takes into account the inclusive range size
    if (size >= HT_MAX_SIZE - 1) {
        zend_throw_error(NULL, "Range size is too large");
        HANDLE_EXCEPTION();
    }

    // increment the size to take into account the inclusive range
    ++size;

    // set the zval type to be a long
    Z_TYPE_INFO(tmp) = IS_LONG;

    // initialise the array to a given size
    array_init_size(result, size);
    zend_hash_real_init(Z_ARRVAL_P(result), 1);
    ZEND_HASH_FILL_PACKED(Z_ARRVAL_P(result)) {
        for (i = 0; i < size; ++i) {
            Z_LVAL(tmp) = min + i;
            ZEND_HASH_FILL_ADD(&tmp);
        }
    } ZEND_HASH_FILL_END();
    ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION();

We begin by defining the min and max variables. They are declared as zend_long, the type to use for long integers (just as zend_ulong is used for unsigned long integers). The size variable is then declared as zend_ulong; it holds the size of the array to be generated.

Next comes a check: if min > max, an Error exception is thrown. Passing NULL as the first argument of zend_throw_error makes Error the default exception class. The exception could be made more specific through inheritance by creating a new class entry in Zend/zend_exceptions.c, but that is a topic for another time. If the exception is thrown, we invoke the HANDLE_EXCEPTION pseudo-macro, which abandons the rest of the handler and moves on to the next opcode to be executed, where the engine deals with the exception.

Now we calculate the size of the array to be generated. At this point it is one less than the actual size, because adding one immediately could overflow when min = ZEND_LONG_MIN (PHP_INT_MIN) and max = ZEND_LONG_MAX (PHP_INT_MAX).

After that, the calculated size is compared with the HT_MAX_SIZE constant to make sure that the array fits into a hash table. The total size of the array must not be greater than or equal to HT_MAX_SIZE. Otherwise, we again throw an Error exception and leave the handler.

We know that HT_MAX_SIZE = INT_MAX + 1. Once size is known to be smaller than this value, it can be incremented without fear of overflow. That is the next step, so that size matches the actual number of elements in the range.
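The order of these checks (compare, subtract, bound-check, and only then add one) can be tried in isolation. The sketch below uses fixed-width standard types instead of zend_long and zend_ulong; range_element_count and SKETCH_MAX_SIZE are names invented here, the cap simply reuses the INT_MAX + 1 value mentioned above, and both failure cases are folded into a single false return for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for HT_MAX_SIZE, assuming the INT_MAX + 1 value from the text. */
#define SKETCH_MAX_SIZE UINT64_C(0x80000000)

/* Store the number of elements of the inclusive range in *count.
 * Returns false if min > max or if the range is too large. */
bool range_element_count(int64_t min, int64_t max, uint64_t *count)
{
    if (min > max) {
        return false;
    }
    /* Unsigned arithmetic wraps modulo 2^64, so this yields the exact
     * distance even for INT64_MIN..INT64_MAX, where a signed subtraction
     * would overflow. */
    uint64_t size = (uint64_t) max - (uint64_t) min;

    /* size is still one less than the element count here. */
    if (size >= SKETCH_MAX_SIZE - 1) {
        return false;
    }
    *count = size + 1; /* safe: size is far below UINT64_MAX */
    return true;
}
```

For the widest possible range the distance is already the largest unsigned 64-bit value, so incrementing before the bound check would wrap around to zero; checking first rules that out.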

Now we set the type of the tmp zval to IS_LONG. Then result is initialized with the array_init_size macro. This macro sets the type of result to IS_ARRAY_EX, allocates memory for the zend_array structure (the hash table), and sets up the corresponding hash table. After that, the zend_hash_real_init function allocates memory for the Bucket structures that hold each element of the array. The second argument, 1, indicates that we want a packed hash table.

Aside. A packed hash table is essentially a real array, that is, an array accessed with integer keys (as opposed to PHP's typical associative arrays). This optimization was introduced in PHP 7, because many PHP arrays are indexed by integers in ascending order. Packed hash tables allow the hash table's buckets to be accessed directly by index. If you are interested in the details of the new hash table implementation, see the article by Nikita.

Aside. The _zend_array structure has two aliases: zend_array and HashTable.

The array is filled with the ZEND_HASH_FILL_PACKED macro, which essentially keeps track of the current bucket for the next insertion. While the array is being generated, the intermediate result (the array element) is stored in the tmp zval. The ZEND_HASH_FILL_ADD macro creates a copy of tmp, inserts it into the current bucket of the hash table, and moves on to the next bucket for the next iteration.

Finally, the ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION macro (introduced in ZE 3 as a replacement for the separate CHECK_EXCEPTION() and ZEND_VM_NEXT_OPCODE() calls used in ZE 2) checks whether an exception has occurred. If none has, the virtual machine moves on to the next opcode.

Let's now consider the else if block:

    long double min, max, size, i;

    if (Z_TYPE_P(op1) == IS_LONG) {
        min = (long double) Z_LVAL_P(op1);
        max = (long double) Z_DVAL_P(op2);
    } else if (Z_TYPE_P(op2) == IS_LONG) {
        min = (long double) Z_DVAL_P(op1);
        max = (long double) Z_LVAL_P(op2);
    } else {
        min = (long double) Z_DVAL_P(op1);
        max = (long double) Z_DVAL_P(op2);
    }

    if (min > max) {
        zend_throw_error(NULL, "Min should be less than (or equal to) max");
        HANDLE_EXCEPTION();
    }

    size = max - min;

    if (size >= HT_MAX_SIZE - 1) {
        zend_throw_error(NULL, "Range size is too large");
        HANDLE_EXCEPTION();
    }

    // we cast the size to an integer to get rid of the decimal places,
    // since we only care about whole number sizes
    size = (int) size + 1;

    Z_TYPE_INFO(tmp) = IS_DOUBLE;

    array_init_size(result, size);
    zend_hash_real_init(Z_ARRVAL_P(result), 1);
    ZEND_HASH_FILL_PACKED(Z_ARRVAL_P(result)) {
        for (i = 0; i < size; ++i) {
            Z_DVAL(tmp) = min + i;
            ZEND_HASH_FILL_ADD(&tmp);
        }
    } ZEND_HASH_FILL_END();
    ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION();

Aside. We use long double because integer and floating-point operands may be mixed. A double has only 53 bits of precision, so integers greater than 2^53 cannot all be represented exactly with it. Where long double is wider than double (for example, the x87 extended format), its precision is at least 64 bits, which is enough to represent 64-bit integers exactly.
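The 53-bit limit is easy to observe with a round trip through double. This is a small standalone sketch; survives_double is a name invented here, and it assumes the usual IEEE 754 binary64 representation of double.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when converting v to double and back gives v again.
 * Intended for values well inside the int64_t range, so that the
 * conversion back to an integer is always defined. */
bool survives_double(int64_t v)
{
    /* volatile forces the value to be stored as a real 64-bit double,
     * rather than being kept in a wider floating-point register. */
    volatile double d = (double) v;
    return (int64_t) d == v;
}
```

Under that assumption, 9007199254740992 (2^53) survives the round trip, while 9007199254740993 (2^53 + 1) does not, because the nearest representable double is 2^53 itself.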

The logic of this code is very similar to the previous block. The main difference is that the data is now handled as floating-point numbers. This affects:

  1. reading the values with the Z_DVAL_P macro,
  2. assigning the IS_DOUBLE type to tmp,
  3. and setting the value of the zval (of type double) with the Z_DVAL macro.

Finally, we need to handle the cases in which min, max, or both are neither integers nor floats. As stated in the second point of our range operator's semantics, only integers and floats are supported as operands. In all other cases, an Error exception should be thrown. Let's put the following code in the else block:

    zend_throw_error(NULL, "Unsupported operand types - only ints and floats are supported");
    HANDLE_EXCEPTION();

Now that the opcode is fully defined, it is time to regenerate the virtual machine. To do this, run the Zend/zend_vm_gen.php script, which uses Zend/zend_vm_def.h to regenerate Zend/zend_vm_opcodes.h, Zend/zend_vm_opcodes.c and Zend/zend_vm_execute.h.

Rebuild PHP to make sure that the range operator works:

    var_dump(1 |> 1.5);
    var_dump(PHP_INT_MIN |> PHP_INT_MIN + 1);

Output:

    array(1) {
      [0]=>
      float(1)
    }
    array(2) {
      [0]=>
      int(-9223372036854775808)
      [1]=>
      int(-9223372036854775807)
    }

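But that is not all! One more thing needs updating: the AST pretty printer (which turns an AST back into code). The pretty printer does not yet support our new operator, which becomes apparent when the operator is used inside assert():

    assert(1 |> 2); // segfaults

assert() uses the pretty printer to output the asserted expression as part of its error message when the assertion fails. This happens only when the asserted expression is not passed as a string (otherwise the pretty printer would not be needed), and it is new in PHP 7.

To fix this, update the Zend/zend_ast.c file so that it can turn a ZEND_AST_RANGE node into a string. First, update the precedence table comment (around line 520), giving our operator a priority of 170 (this should match zend_language_parser.y):

    * 170 non-associative == != === !== |>

Then add a case for ZEND_AST_RANGE to the zend_ast_export_ex function (for example, just above case ZEND_AST_GREATER):

    case ZEND_AST_RANGE: BINARY_OP(" |> ", 170, 171, 171);
    case ZEND_AST_GREATER: BINARY_OP(" > ", 180, 181, 181);
    case ZEND_AST_GREATER_EQUAL: BINARY_OP(" >= ", 180, 181, 181);

The pretty printer is now updated, and assert() works correctly again: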

 assert(false && 1 |> 2); // Warning: assert(): assert(false && 1 |> 2) failed... 
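How priorities like 170 and 171 drive the placement of parentheses can be shown with a toy exporter. This is only a sketch of the general technique: pp_node, pp_append and pp_export are invented names, the buffer handling is simplified, and the priorities used for + (200, 200, 201) are illustrative assumptions rather than values taken from this article.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum { PP_NUM, PP_ADD, PP_RANGE } pp_kind;

typedef struct pp_node {
    pp_kind kind;
    long value;                      /* used by PP_NUM only */
    const struct pp_node *lhs, *rhs; /* used by binary nodes */
} pp_node;

/* Append text to a NUL-terminated buffer, truncating if it would overflow. */
void pp_append(char *buf, size_t cap, const char *text)
{
    size_t used = strlen(buf);
    if (used < cap) {
        snprintf(buf + used, cap - used, "%s", text);
    }
}

/* Export a node as source text. `outer` is the priority required by the
 * surrounding context; a node with a lower priority gets parentheses. */
void pp_export(char *buf, size_t cap, const pp_node *n, int outer)
{
    if (n->kind == PP_NUM) {
        char digits[32];
        snprintf(digits, sizeof digits, "%ld", n->value);
        pp_append(buf, cap, digits);
        return;
    }

    int is_range = (n->kind == PP_RANGE);
    int self = is_range ? 170 : 200;  /* priority of this operator */
    int left = is_range ? 171 : 200;  /* required of the left operand */
    int right = is_range ? 171 : 201; /* required of the right operand */

    if (self < outer) {
        pp_append(buf, cap, "(");
    }
    pp_export(buf, cap, n->lhs, left);
    pp_append(buf, cap, is_range ? " |> " : " + ");
    pp_export(buf, cap, n->rhs, right);
    if (self < outer) {
        pp_append(buf, cap, ")");
    }
}
```

Requiring 171 on both sides of |> means a nested range operand would always be parenthesized, which mirrors the operator being non-associative.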

Conclusion


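This article has covered a lot of ground, if only briefly. It showed the stages the Zend engine goes through when running PHP scripts, and how each of those stages can be modified to add a new operator to PHP.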

Source: https://habr.com/ru/post/276331/
