
Decorators in PHP. Implementation of the extension

Following the results of the poll in the first article, I decided to write up how the extension is implemented. In the meantime the syntax has changed slightly to suit existing IDEs, which was perhaps the most contentious point.
This is not yet another hello-world extension article: those who want the basics can easily find plenty of material both on Habr itself and in the Russian-language RTFG.
This article covers the background, the implementation, and the pitfalls. There will be a little PHP and mostly C.


Prerequisites


If the background does not interest you (tl;dr), you can skip straight to Implementation.

I'll start from afar
I like Python, especially some of its syntax. Since lately I write mostly in PHP, I want my working tool to be more convenient and functional. PHP has been developing well in recent versions, but it still has no decorators, so I had to take matters into my own hands. First came the idea of unifying the naming and parameter order of the core functions (another extension of mine, which is still at the idea stage; I can write about its creation if there is interest), and now here are decorators and a few other things.

Function decorators (methods are not mentioned here or below, since decorating them is no different) let you change how a function works by wrapping its calls in an additional layer of logic. A list of wrappers is declared, and in the end they must return something that replaces the original function. In Python syntax this looks as follows:

    @decomaker(argA, argB, ...)
    def func(arg1, arg2, ...):
        # ...

    # is equivalent to:
    func = decomaker(argA, argB, ...)(func)

PHP has no such feature. At first I decided to take this syntax as it is and carry it over unchanged (except for how the decorator's parameters are passed when it is called; see below). The first article describes exactly that syntax. However, IDEs with syntax checking and one of the two camps of commenters made me reconsider. As a result, the syntax became more portable: a decorator is now declared in a single-line # comment.

With the declaration format settled, the next question is how the decorators themselves are implemented. A decorator function must return a function that replaces the original, decorated one. This is where anonymous functions and closures come in.
Some PHP code:

    <?php
    function decor($func) {
        echo "Decorator!\n";
        return function () use ($func) {
            return call_user_func_array($func, func_get_args());
        };
    }

    function a($v) {
        var_dump($v);
    }

    $b = decor('a');
    $b(42);

    /* Output:
    Decorator!
    int(42)
    */

An indirect function call that forwards its arguments is admittedly verbose in PHP; there is no getting around that.

The result is the syntax from the image at the beginning of the article:

    <?php
    function decor($func) {
        return function () {};
    }

    # @decor
    function original() {
        // ...
    }

I wanted to achieve this without rewriting the Zend lexer, so that PHP itself would not have to be rebuilt (if it works, don't touch it).

Implementation


There are two options for doing this:

The second option looked dubious in terms of compatibility with the various opcode caches and optimizers. Besides, the initial version of the decorator syntax (without the # comment) would not work in that case.
The first option was chosen.
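Zend has two entry points through which source code arrives: zend_compile_file, for source read from a file, and zend_compile_string, for source passed as a string.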

In both cases we have pointers to functions with a concrete implementation. The standard implementations, compile_file and compile_string, can be found by looking at how the pointers are initialized in zend_startup.

Both functions take the source in one form or another and return an array of opcodes as a _zend_op_array. Unfortunately, despite the similarity of their tasks, they are implemented differently, so we will hook both.

Overriding function pointers like these is routine practice in Zend and in PHP extensions. For example, the same zend_compile_file is replaced in ZendAccelerator and in phar, not counting third-party extensions.

To substitute a function, you only need to implement your own counterpart and replace the pointer while keeping the original. Everything as usual.
The result looks roughly like this:

    PHP_MINIT_FUNCTION(decorators);
    PHP_MSHUTDOWN_FUNCTION(decorators);

    zend_module_entry decorators_module_entry = {
        // ...
        decorators_functions,
        PHP_MINIT(decorators),
        PHP_MSHUTDOWN(decorators),
        // ...
    };

    zend_op_array *(*decorators_orig_zend_compile_string)(zval *source_string, char *filename TSRMLS_DC);
    zend_op_array *(*decorators_orig_zend_compile_file)(zend_file_handle *file_handle, int type TSRMLS_DC);

    zend_op_array* decorators_zend_compile_string(zval *source_string, char *filename TSRMLS_DC);
    zend_op_array* decorators_zend_compile_file(zend_file_handle *file_handle, int type TSRMLS_DC);

    /* {{{ PHP_MINIT_FUNCTION */
    PHP_MINIT_FUNCTION(decorators)
    {
        decorators_orig_zend_compile_string = zend_compile_string;
        zend_compile_string = decorators_zend_compile_string;

        decorators_orig_zend_compile_file = zend_compile_file;
        zend_compile_file = decorators_zend_compile_file;

        return SUCCESS;
    }
    /* }}} */

    /* {{{ PHP_MSHUTDOWN_FUNCTION */
    PHP_MSHUTDOWN_FUNCTION(decorators)
    {
        zend_compile_string = decorators_orig_zend_compile_string;
        zend_compile_file = decorators_orig_zend_compile_file;

        return SUCCESS;
    }
    /* }}} */

    zend_op_array* decorators_zend_compile_string(zval *source_string, char *filename TSRMLS_DC) /* {{{ */
    {
        return decorators_orig_zend_compile_string(source_string, filename TSRMLS_CC);
    }
    /* }}} */

    zend_op_array* decorators_zend_compile_file(zend_file_handle *file_handle, int type TSRMLS_DC) /* {{{ */
    {
        return decorators_orig_zend_compile_file(file_handle, type TSRMLS_CC);
    }
    /* }}} */

During initialization of the module (our extension) the pointers are replaced, and on shutdown we do not forget to put everything back. Inside the substituted functions we simply call the original implementation.
Not everything can be done during module initialization, but in our case it is quite enough.
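
The save, replace, and restore sequence does not depend on Zend, so it can be tried in isolation. Below is a minimal sketch in plain C. All the names in it (op_array, default_compile, compile_hook, my_compile, module_startup, module_shutdown) are hypothetical stand-ins and not part of the Zend API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for zend_op_array: just remembers what was compiled. */
typedef struct {
    char source[64];
} op_array;

/* Default implementation, playing the role of compile_string. */
op_array *default_compile(const char *source)
{
    static op_array result;
    size_t n;

    assert(source != NULL);
    n = strlen(source);
    if (n >= sizeof(result.source)) {
        n = sizeof(result.source) - 1;   /* truncate overly long input */
    }
    memcpy(result.source, source, n);
    result.source[n] = '\0';
    return &result;
}

/* Replaceable pointer, playing the role of zend_compile_string. */
op_array *(*compile_hook)(const char *source) = default_compile;

/* Saved original, like decorators_orig_zend_compile_string. */
op_array *(*orig_compile)(const char *source) = NULL;

int hook_calls = 0;   /* how many times our replacement was entered */

/* Our replacement: do some extra work, then delegate to the saved original. */
op_array *my_compile(const char *source)
{
    hook_calls++;
    return orig_compile(source);
}

/* What MINIT does: remember the current pointer, then install ours. */
void module_startup(void)
{
    orig_compile = compile_hook;
    compile_hook = my_compile;
}

/* What MSHUTDOWN does: put the original pointer back. */
void module_shutdown(void)
{
    compile_hook = orig_compile;
}
```

After module_startup() every call through compile_hook passes through my_compile first; after module_shutdown() the pointer refers to default_compile again. Because the previous pointer is saved rather than assumed to be the default, several such hooks can be stacked, provided they are removed in the reverse order of installation.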

With compile_string everything is more or less clear (the source arrives as a string), but with compile_file things are not so rosy: we do not have the source code, only a description of the source in zend_file_handle, and different sets of its fields are used in different cases.
The actual reading of the source is buried fairly deep:

    ZEND_API zend_op_array *compile_file(zend_file_handle *file_handle, int type TSRMLS_DC)
    {
        // ...
        open_file_for_scanning(file_handle TSRMLS_CC)
        // ...
    }

    ZEND_API int open_file_for_scanning(zend_file_handle *file_handle TSRMLS_DC)
    {
        // ...
        zend_stream_fixup(file_handle, &buf, &size TSRMLS_CC)
        // ...
    }

The most interesting part for us here is zend_stream_fixup, a function that unifies all the ways source code can arrive and hands back the buffer that was read and its size. That seems to be exactly what we need, but we cannot influence how zend_stream_fixup and open_file_for_scanning work; we only control compile_file. Some people have copy-pasted these functions and all their dependencies into their own code, but we will do it the easy way. If you look at the source of zend_stream_fixup, you can see that every handle type is reduced to a single one, ZEND_HANDLE_MAPPED, which keeps the source code and its length in file_handle->handle.stream.mmap.buf and file_handle->handle.stream.mmap.len. Moreover, if file_handle already has this type, almost nothing is changed and everything is returned as is.
So if we pass compile_file() a zend_file_handle *file_handle that is already in the ZEND_HANDLE_MAPPED format with all fields set correctly, compile_file will accept it as is. We can achieve this by calling zend_stream_fixup (which is a Zend API function, not a replaceable pointer) one extra time before calling compile_file. The repeated call inside open_file_for_scanning then simply changes nothing.
Let's try:

    zend_op_array* decorators_zend_compile_file(zend_file_handle *file_handle, int type TSRMLS_DC) /* {{{ */
    {
        char *buf;
        size_t size;

        if (zend_stream_fixup(file_handle, &buf, &size TSRMLS_CC) == FAILURE) {
            return NULL;
        }

        // file_handle is now of type ZEND_HANDLE_MAPPED

        return decorators_orig_zend_compile_file(file_handle, type TSRMLS_CC);
    }
    /* }}} */

Hooray, it works. Moreover, the source is now available in file_handle->handle.stream.mmap.buf / len wherever PHP got it from: stdin, a file descriptor, an include over an http stream, and so on. It remains to put our modified version of the code there and call the original zend_compile_file.
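
Why the extra fixup call is harmless can be shown on a toy model. This is only a sketch under simplifying assumptions: file_handle, stream_fixup, original_compile_file and hooked_compile_file are hypothetical miniatures written for this illustration, the "file content" is hard-coded, and the real zend_file_handle has more types and fields.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much simplified model of zend_file_handle. */
typedef enum { HANDLE_FILENAME, HANDLE_MAPPED } handle_type;

typedef struct {
    handle_type type;
    const char *filename;   /* used while type == HANDLE_FILENAME */
    const char *buf;        /* valid once type == HANDLE_MAPPED */
    size_t len;
} file_handle;

int reads_performed = 0;    /* counts how often the "file" is really read */

/* Stand-in for zend_stream_fixup: normalise any handle to HANDLE_MAPPED.
 * A handle that is already mapped is handed back unchanged. */
int stream_fixup(file_handle *fh, const char **buf, size_t *len)
{
    assert(fh != NULL && buf != NULL && len != NULL);
    if (fh->type != HANDLE_MAPPED) {
        /* Pretend to read the file; the content is hard-coded here. */
        fh->buf = "echo 'original';";
        fh->len = strlen(fh->buf);
        fh->type = HANDLE_MAPPED;
        reads_performed++;
    }
    *buf = fh->buf;
    *len = fh->len;
    return 0;
}

/* Stand-in for the original compile_file: it calls the fixup itself. */
const char *original_compile_file(file_handle *fh)
{
    const char *buf;
    size_t len;

    if (stream_fixup(fh, &buf, &len) != 0) {
        return NULL;
    }
    return buf;   /* "compiling" simply returns the source it saw */
}

/* Our hook: fix up first, swap the buffer, then delegate. */
const char *hooked_compile_file(file_handle *fh)
{
    const char *buf;
    size_t len;

    if (stream_fixup(fh, &buf, &len) != 0) {
        return NULL;
    }
    fh->buf = "echo 'preprocessed';";   /* output of a hypothetical preprocessor */
    fh->len = strlen(fh->buf);
    return original_compile_file(fh);
}
```

Calling hooked_compile_file() on a handle of type HANDLE_FILENAME reads the source once; the second stream_fixup() inside original_compile_file() sees an already mapped handle and returns the replaced buffer untouched.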

I will not go into how decorators_preprocessor() works: it obviously receives a string, passes it to the preprocessor, and returns the resulting string. Pieces of code from this function appear below anyway.

It remains to look at the preprocessor itself.
Passing the different kinds of input to a single function:

    void preprocessor(zval *source_zv, zval *return_value TSRMLS_DC)
    {
        // process source_zv and store the result in return_value
    }

    /* {{{ DECORS_CALL_PREPROCESS */
    #define DECORS_CALL_PREPROCESS(result_zv, buf, len) \
        do { \
            zval *source_zv; \
            ALLOC_INIT_ZVAL(result_zv); \
            ALLOC_INIT_ZVAL(source_zv); \
            ZVAL_STRINGL(source_zv, (buf), (len), 1); \
            preprocessor(source_zv, result_zv TSRMLS_CC); \
            zval_dtor(source_zv); \
            FREE_ZVAL(source_zv); \
        } while (0)
    /* }}} */

    /* {{{ proto string decorators_preprocessor(string $code) */
    PHP_FUNCTION(decorators_preprocessor)
    {
        char *source;
        int source_len;
        zval *result;

        if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &source, &source_len) == FAILURE) {
            return;
        }

        DECORS_CALL_PREPROCESS(result, source, source_len);
        // ...
    }
    /* }}} */

    zend_op_array* decorators_zend_compile_string(zval *source_string, char *filename TSRMLS_DC) /* {{{ */
    {
        zval *result;
        DECORS_CALL_PREPROCESS(result, Z_STRVAL_P(source_string), Z_STRLEN_P(source_string));
        // ...
    }
    /* }}} */

    zend_op_array* decorators_zend_compile_file(zend_file_handle *file_handle, int type TSRMLS_DC) /* {{{ */
    {
        // ...
        zval *result;
        DECORS_CALL_PREPROCESS(result, file_handle->handle.stream.mmap.buf, file_handle->handle.stream.mmap.len);
        // ...
    }
    /* }}} */


Find and redo!


The task of the preprocessor is to find decorator declarations and to modify the code of the functions they apply to. This is best done by working with the tokens of the source text. In order not to reinvent the wheel, the native Zend lexical scanner lex_scan is used; an example of using it for one's own purposes can be seen in the implementation of token_get_all and of tokenize, which token_get_all calls.
  1. Save the current state of the scanner inside which our code is running:
    zend_lex_state original_lex_state;
    zend_save_lexical_state(&original_lex_state TSRMLS_CC);

  2. Prepare the source string for scanning:
    zend_prepare_string_for_scanning(&source_z, "" TSRMLS_CC);

  3. Set the initial state of the lexer (all the options are here):
    LANG_SCNG(yy_state) = yycST_IN_SCRIPTING;

    Unlike token_get_all, we are already scanning PHP code, so an opening tag is not required. Accordingly, the initial state is yycST_IN_SCRIPTING rather than yycINITIAL.
  4. In a loop, fetch all the tokens of the source string:
    zval token_zv;
    int token_type;
    while ((token_type = lex_scan(&token_zv TSRMLS_CC))) {
    // ...
    }

    token_type is the type of the token:
    • < 256: the character code of a single-character token;
    • >= 256: the value of a T_* constant. The name corresponding to a token_type can be obtained via PHP_FUNCTION(token_name) / get_token_type_name.

    token_zv holds the value of the lexeme itself. Alternatively, you can use the yy_text and yy_leng fields of the zend_lex_state structure, which store the address of the first byte of the current token and its length. Like many things in Zend, these fields are accessed through the corresponding macros:
    #define zendtext LANG_SCNG(yy_text)
    #define zendleng LANG_SCNG(yy_leng)

    From here on we use char *zendtext and unsigned int zendleng.

    To avoid memory leaks, keep in mind that the value in token_zv is sometimes taken directly from the source buffer and sometimes has memory allocated for it, which then has to be released. Those interested can look at the lex_scan() code; for now we simply borrow the necessary piece of logic from token_get_all.
  5. Restore the state of the scanner inside which our code is running:
    zend_restore_lexical_state(&original_lex_state TSRMLS_CC);
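
The save, prepare, loop, restore sequence from the list can be sketched without Zend. The toy scanner below is an assumption-laden illustration, not the Zend lexer: lex_state, toy_lex_scan, the TOY_T_* constants and count_tokens are all hypothetical, and step 3 (choosing the initial lexer state) is left out because the toy has only one state. It does keep the same convention for token types: below 256 for single characters, 256 and above for named tokens.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token constants; real T_* values come from the Zend parser. */
enum { TOY_T_WORD = 256, TOY_T_NUMBER = 257 };

/* Toy counterpart of the scanner globals behind LANG_SCNG(...). */
typedef struct {
    const char *cursor;   /* next unread byte */
    const char *text;     /* start of the current token, like zendtext */
    size_t leng;          /* length of the current token, like zendleng */
} lex_state;

lex_state scanner;        /* one global scanner, as in Zend */

void save_lexical_state(lex_state *saved)          { *saved = scanner; }
void restore_lexical_state(const lex_state *saved) { scanner = *saved; }

void prepare_string_for_scanning(const char *source)
{
    assert(source != NULL);
    scanner.cursor = source;
    scanner.text = source;
    scanner.leng = 0;
}

/* Returns 0 at end of input, a character code below 256 for a
 * single-character token, or a TOY_T_* constant (>= 256) otherwise. */
int toy_lex_scan(void)
{
    const char *p = scanner.cursor;
    int type;

    while (*p == ' ' || *p == '\t' || *p == '\n') {
        p++;                                  /* skip whitespace */
    }
    scanner.text = p;
    if (*p == '\0') {
        scanner.cursor = p;
        scanner.leng = 0;
        return 0;
    }
    if (isalpha((unsigned char)*p)) {
        while (isalnum((unsigned char)*p)) p++;
        type = TOY_T_WORD;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p)) p++;
        type = TOY_T_NUMBER;
    } else {
        type = (unsigned char)*p;             /* single-character token */
        p++;
    }
    scanner.leng = (size_t)(p - scanner.text);
    scanner.cursor = p;
    return type;
}

/* Steps 1, 2, 4 and 5 from the list: save, prepare, loop, restore. */
size_t count_tokens(const char *source, size_t *named_tokens)
{
    lex_state saved;
    size_t total = 0;
    int type;

    *named_tokens = 0;
    save_lexical_state(&saved);
    prepare_string_for_scanning(source);
    while ((type = toy_lex_scan()) != 0) {
        total++;
        if (type >= 256) {
            (*named_tokens)++;
        }
    }
    restore_lexical_state(&saved);
    return total;
}
```

For the input "foo(42);" count_tokens() sees five tokens, two of them named (the word and the number), and whatever the global scanner was doing before the call is left exactly as it was.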


That's it: we have lexical analysis of the source code. But a few more points about parsing are worth highlighting.

When PHP runs into an error during parsing, the handler raises an error or an exception whose message contains a file name and a line number taken from the _zend_compiler_globals state. The file name, for example, comes from the compiled_filename field, which is set when zend_prepare_string_for_scanning() is called. That field is used inside zend_error() (which raises any E_* error and which this extension also uses to raise E_PARSE). However, zend_error() uses compiled_filename only when Zend is in the compiling state (zend_bool in_compilation, in the same _zend_compiler_globals), and that state is not switched on by itself when we merely scan the source.
So before scanning we switch to "compiling":
    zend_bool original_in_compilation = CG(in_compilation);
    CG(in_compilation) = 1;

And at the end we put everything back:
    CG(in_compilation) = original_in_compilation;
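
The effect of the flag can be modelled in a few lines. This is a rough sketch, assuming a made-up compiler_globals struct and a made-up format_error() in place of CG(...) and zend_error(); the fallback text is a placeholder chosen for the illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical miniature of the compiler globals reached through CG(...). */
struct compiler_globals {
    int in_compilation;
    const char *compiled_filename;
    int lineno;
} cg = { 0, "decorated.php", 7 };

/* Toy error formatter: it trusts the compiler globals only while
 * in_compilation is set; otherwise it falls back to a placeholder
 * location, because it has no reason to believe the file name. */
void format_error(char *out, size_t out_size, const char *message)
{
    assert(out != NULL && out_size > 0 && message != NULL);
    if (cg.in_compilation) {
        snprintf(out, out_size, "%s in %s on line %d",
                 message, cg.compiled_filename, cg.lineno);
    } else {
        snprintf(out, out_size, "%s in Unknown on line 0", message);
    }
}

/* Scan with the flag forced on, then put the previous value back. */
void scan_and_report(char *out, size_t out_size)
{
    int original_in_compilation = cg.in_compilation;

    cg.in_compilation = 1;
    format_error(out, out_size, "syntax error");   /* pretend the scan failed */
    cg.in_compilation = original_in_compilation;
}
```

Restoring the saved value instead of hard-coding 0 matters: the preprocessor may itself be invoked while a real compilation is in progress, and that outer state must survive.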

Now, if we pass the correct file name to zend_prepare_string_for_scanning, any errors will be much more informative. The current file name can be obtained via zend_get_compiled_filename(), which, however, may return NULL; passing NULL to zend_prepare_string_for_scanning makes PHP segfault.
It remains to set the correct file name in decorators_preprocessor and decorators_zend_compile_file:

    PHP_FUNCTION(decorators_preprocessor)
    {
        // ...
        char *prev_filename = zend_get_compiled_filename(TSRMLS_C)
            ? zend_get_compiled_filename(TSRMLS_C)
            : "";
        zend_set_compiled_filename("-" TSRMLS_CC);

        DECORS_CALL_PREPROCESS(result, source, source_len);

        zend_set_compiled_filename(prev_filename TSRMLS_CC);
        // ...
    }

    zend_op_array* decorators_zend_compile_file(zend_file_handle *file_handle, int type TSRMLS_DC) /* {{{ */
    {
        // ...
        char *prev_filename = zend_get_compiled_filename(TSRMLS_C)
            ? zend_get_compiled_filename(TSRMLS_C)
            : "";
        const char* filename = (file_handle->opened_path)
            ? file_handle->opened_path
            : file_handle->filename;
        zend_set_compiled_filename(filename TSRMLS_CC);

        zval *result;
        DECORS_CALL_PREPROCESS(result, file_handle->handle.stream.mmap.buf, file_handle->handle.stream.mmap.len);

        zend_set_compiled_filename(prev_filename TSRMLS_CC);
        // ...
    }

In decorators_zend_compile_string, the file name is already known.


Source code modification


With everything needed for preprocessing in place, it remains to actually do it. Turning text made of pieces (tokens) into the final text could be tedious in C because of all the string concatenation. However, /PHP/ext/standard/php_smart_str.h contains an implementation of smart strings, which is very useful here.
In brief:

    smart_str str = {0};
    smart_str str2 = {0};

    smart_str_appendc(&str, '!');
    smart_str_appendl(&str, "hello", 5);
    smart_str_append(&str, &str2);
    smart_str_append_long(&str, 42);
    // etc.

    // the length is in size_t str.len, the data in char* str.c

    // free the memory:
    smart_str_free(&str);
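
For readers who have not met this kind of structure, here is a minimal growable-buffer sketch in plain C. It is not the smart_str implementation: toy_str and its functions are hypothetical, the growth policy (doubling from 16 bytes) is an arbitrary choice, and unlike smart_str this toy keeps the buffer NUL-terminated after every append.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of smart_str: data pointer, used length, capacity. */
typedef struct {
    char *c;
    size_t len;
    size_t cap;
} toy_str;

/* Append n bytes, growing geometrically; keeps the buffer NUL-terminated. */
void toy_str_appendl(toy_str *s, const char *data, size_t n)
{
    assert(s != NULL && (data != NULL || n == 0));
    if (s->len + n + 1 > s->cap) {
        size_t new_cap = s->cap ? s->cap : 16;
        char *grown;

        while (new_cap < s->len + n + 1) {
            new_cap *= 2;
        }
        grown = realloc(s->c, new_cap);
        if (grown == NULL) {
            abort();   /* sketch only: no graceful recovery */
        }
        s->c = grown;
        s->cap = new_cap;
    }
    if (n > 0) {
        memcpy(s->c + s->len, data, n);
    }
    s->len += n;
    s->c[s->len] = '\0';
}

/* Append a single character. */
void toy_str_appendc(toy_str *s, char ch)
{
    toy_str_appendl(s, &ch, 1);
}

/* Release the buffer and reset the structure to its empty state. */
void toy_str_free(toy_str *s)
{
    free(s->c);
    s->c = NULL;
    s->len = 0;
    s->cap = 0;
}
```

A toy_str starts out as {0}, exactly like the smart_str above, and each token is added with toy_str_appendl(&s, text, length). Growing geometrically keeps the total cost of many small appends linear in the size of the output.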

While looping over the tokens, we glue the resulting string together from the token texts (zendtext, zendleng), changing or adding our own text where needed. The decorator replacement algorithm itself is, IMHO, not that interesting. One potentially interesting detail is the check that a T_COMMENT token looks like a decorator declaration: its text is matched against the pattern '^#[ \t]*@' (with a simple loop, no regexp), and the address of the '@' is returned.
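
Such a check might look like the sketch below. The function name find_decorator_at and its exact signature are my own for this illustration, not taken from the extension; only the pattern '^#[ \t]*@' and the idea of returning the address of '@' come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Returns a pointer to the '@' if the comment text matches ^#[ \t]*@,
 * or NULL otherwise. `len` is the token length, so the text does not
 * have to be NUL-terminated (a token taken from the middle of a source
 * buffer is not). */
const char *find_decorator_at(const char *text, size_t len)
{
    size_t i = 1;

    assert(text != NULL || len == 0);
    if (len == 0 || text[0] != '#') {
        return NULL;                  /* not a '#' comment at all */
    }
    while (i < len && (text[i] == ' ' || text[i] == '\t')) {
        i++;                          /* skip blanks between '#' and '@' */
    }
    return (i < len && text[i] == '@') ? text + i : NULL;
}
```

With zendtext and zendleng as arguments, a non-NULL result marks the start of the decorator declaration, and everything after it up to the end of the token is the decorator name and its optional arguments.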


A little PHP to finish


When decorators are processed, the source code of the decorated function changes slightly: the body of the function is wrapped in an anonymous function, which is passed as an argument to the nearest decorator. That is, for the code

    // comment
    @a(A)
    @b
    @c(C, D)
    /**
     * yet another comment
     */
    function x(X)
    {
        Y
    }

preprocessing produces the following code:

    // comment
    /**
     * yet another comment
     */
    function x(X)
    {
        return call_user_func_array(a(b(c(function(X) { Y }, C, D)), A), func_get_args());
    }

A, C, D and X stand for arbitrary code that is copied as is.
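
The nesting order in the generated expression is easy to get backwards, so here is a small sketch that builds it. It assumes the decorators have already been parsed into name and argument text; the decorator struct and build_wrapped_call are hypothetical, and the fixed 512-byte scratch buffer is a simplification that the extension, with its smart strings, does not need.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed decorator line: its name and raw argument text (NULL if none). */
typedef struct {
    const char *name;
    const char *args;
} decorator;

/* Builds the wrapped expression into `out`. Decorators are applied from the
 * last one (closest to the function) to the first, so the first decorator
 * ends up outermost. Returns 0 on success, -1 if a buffer is too small. */
int build_wrapped_call(char *out, size_t out_size,
                       const decorator *decs, size_t count,
                       const char *params, const char *body)
{
    char tmp[512];
    size_t i;
    int n;

    assert(out != NULL && decs != NULL && params != NULL && body != NULL);

    /* Innermost part: the original body as an anonymous function. */
    n = snprintf(out, out_size, "function(%s) { %s }", params, body);
    if (n < 0 || (size_t)n >= out_size) {
        return -1;
    }

    for (i = count; i-- > 0; ) {
        if (decs[i].args != NULL) {
            n = snprintf(tmp, sizeof tmp, "%s(%s, %s)",
                         decs[i].name, out, decs[i].args);
        } else {
            n = snprintf(tmp, sizeof tmp, "%s(%s)", decs[i].name, out);
        }
        if (n < 0 || (size_t)n >= sizeof tmp || (size_t)n >= out_size) {
            return -1;
        }
        memcpy(out, tmp, (size_t)n + 1);   /* copy including the NUL */
    }
    return 0;
}
```

For the decorators a(A), b and c(C, D) with parameters X and body Y, the function produces the same expression that is passed to call_user_func_array in the example above: the wrapped function always goes first, and each decorator's own arguments follow it.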
This has the following consequences:


Well, that's all. If you really read this far, I hope it was interesting.

I will include the link to GitHub in this article as well.

Source: https://habr.com/ru/post/180939/

