
Internal representation of values in PHP 7 (part 2)

Kore Nordmann

In the first part, we looked at the high-level differences in the internal representation of values between PHP 5 and PHP 7. As you may remember, the main difference is that a zval is no longer allocated separately and no longer stores a refcount itself. Simple values, such as integers or floating-point numbers, are stored directly in the zval, while complex values are represented by a pointer to a separate structure.

All of these additional structures share a standard header defined by zend_refcounted:

    struct _zend_refcounted {
        uint32_t refcount;
        union {
            struct {
                ZEND_ENDIAN_LOHI_3(
                    zend_uchar    type,
                    zend_uchar    flags,
                    uint16_t      gc_info)
            } v;
            uint32_t type_info;
        } u;
    };

This header now contains the refcount, the type of the data, information for the garbage collector (gc_info), and a slot for type-specific flags. Next, we will look at the individual complex types and compare them with their implementation in PHP 5. References, which are also one of the complex types, were already covered in the first part of the article. We will not touch resources either, since I do not find them interesting enough to discuss here.
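To make the role of the shared header more concrete, here is a minimal sketch of reference counting through such a header. All `demo_` names are hypothetical, simplified stand-ins and not PHP's own definitions; the sketch only assumes that the header is the first member of every heap-allocated value.

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for zend_refcounted: every heap value starts with it. */
typedef struct demo_refcounted {
    uint32_t refcount;
    uint8_t  type;     /* which kind of value follows the header */
    uint8_t  flags;    /* type-specific flags */
    uint16_t gc_info;  /* garbage collector bookkeeping (unused in this sketch) */
} demo_refcounted;

/* A hypothetical value type that embeds the header as its first member. */
typedef struct demo_value {
    demo_refcounted gc;
    long payload;
} demo_value;

/* Allocate a value with an initial refcount of 1; returns NULL on failure. */
demo_value *demo_value_new(long payload) {
    demo_value *v = malloc(sizeof *v);
    if (v == NULL) {
        return NULL;
    }
    v->gc.refcount = 1;
    v->gc.type = 1;
    v->gc.flags = 0;
    v->gc.gc_info = 0;
    v->payload = payload;
    return v;
}

/* Sharing a value only bumps the counter in the header. */
void demo_addref(demo_refcounted *rc) {
    rc->refcount++;
}

/* Drop one reference. Returns 1 if it was the last one and the value was freed. */
int demo_release(demo_refcounted *rc) {
    if (--rc->refcount == 0) {
        free(rc);  /* the header is the first member, so this is the allocation */
        return 1;
    }
    return 0;
}
```

Because the counting code only touches the header, the same two functions would work for any structure that begins with it.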

Strings


In PHP 7, strings are represented using the zend_string type:

    struct _zend_string {
        zend_refcounted   gc;
        zend_ulong        h;        /* hash value */
        size_t            len;
        char              val[1];
    };

In addition to the refcounted header, a string contains the hash cache h, the length len and the value val. The hash cache avoids recomputing the hash of the string every time it is used to look up a key in a HashTable. On first use, it is initialized to the (non-zero) hash.
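The caching idea can be sketched as follows. This is an illustration only: `demo_str`, `demo_hash` and `demo_hash_runs` are hypothetical names, the hash is a plain "times 33" loop, and forcing the top bit is just one way to guarantee that a computed hash never collides with the "not computed yet" value 0.

```c
#include <stddef.h>

typedef struct demo_str {
    unsigned long h;   /* cached hash, 0 means "not computed yet" */
    size_t len;
    const char *val;
} demo_str;

int demo_hash_runs = 0;  /* counts how often the hash was really computed */

/* Return the hash of s, computing it only on first use. */
unsigned long demo_hash(demo_str *s) {
    if (s->h == 0) {
        unsigned long hash = 5381UL;
        for (size_t i = 0; i < s->len; i++) {
            hash = hash * 33UL + (unsigned char)s->val[i];
        }
        /* Set the top bit so the cached value can never be 0. */
        s->h = hash | ~(~0UL >> 1);
        demo_hash_runs++;
    }
    return s->h;
}
```

Repeated lookups with the same string then pay for the loop over the characters only once.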
If you are not too familiar with the various hacks in C, the definition of val may seem strange: it is declared as an array of characters with a single element, but we certainly want to store strings longer than one character. This uses a technique called the "struct hack": although the array is declared with one element, when creating a zend_string we allocate enough memory to hold a longer string. The longer string can then still be accessed through val.

Technically, this is undefined behavior, because we read and write past the end of a single-character array. However, C compilers know not to interfere with code that does this. C99 explicitly supports this technique in the form of "flexible array members"; however, thanks to our friends from Microsoft, C99 cannot be used by developers who need cross-platform compatibility.
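A minimal sketch of the struct hack, using a hypothetical `demo_string` type rather than the real zend_string, could look like this. The header and the characters live in one allocation, and the allocation size is computed by hand.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct demo_string {
    size_t len;
    char val[1];  /* struct hack: the real storage extends past this member */
} demo_string;

/* Copy len bytes of src into a single allocation (header plus characters).
 * A C99 flexible array member (char val[];) would express the same layout. */
demo_string *demo_string_init(const char *src, size_t len) {
    /* bytes before val, plus the characters, plus a terminating NUL */
    demo_string *s = malloc(offsetof(demo_string, val) + len + 1);
    if (s == NULL) {
        return NULL;
    }
    s->len = len;
    memcpy(s->val, src, len);
    s->val[len] = '\0';
    return s;
}
```

Since the characters are NUL-terminated, `s->val` can be handed to any function that expects a C string, while going in the other direction always requires a copy like the one made here.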

The new string implementation has a number of advantages over ordinary C strings. First, the length is now part of the string itself instead of "dangling" somewhere nearby. Second, the string has its own reference-counting header, so strings can be shared in different places without going through a zval. This is especially important for sharing hash table keys.

But there is one big downside. It is easy to get a C string from a zend_string (just use str->val), but you cannot directly get a zend_string from a C string. To do this, you have to copy the value of the string into a newly created zend_string. This is especially annoying when working with literal strings, that is, constant strings that occur in the C source code.

A string can have various flags, which are stored in the flags field of the GC header:

    #define IS_STR_PERSISTENT   (1<<0) /* allocated using malloc */
    #define IS_STR_INTERNED     (1<<1) /* interned string */
    #define IS_STR_PERMANENT    (1<<2) /* interned string surviving request boundary */

Persistent strings use the regular system allocator instead of the Zend memory manager (ZMM), and can therefore live longer than one request. Because the allocator used is recorded as a flag, persistent strings can be used transparently in a zval. In PHP 5, this required copying the string into the ZMM first.

Interned strings are strings that are not destroyed until the end of the request and therefore do not need a reference counter. They are deduplicated, so when creating a new interned string, the engine first checks whether one with the same value already exists. In general, all strings that occur in PHP code (including variable names, function names, etc.) are interned. Permanent strings are interned strings created before the start of a request. Unlike ordinary interned strings, they are not destroyed at the end of the request.

If OPCache is used, interned strings are stored in shared memory (SHM) and shared by all PHP processes. In this case, the notion of permanent strings becomes irrelevant, since interned strings are never destroyed anyway.
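The deduplication step of interning can be sketched as below. This is a deliberately naive illustration with hypothetical names (`demo_intern`, a fixed-size table, linear search over plain C strings); a real engine would use a hash table and its own string type.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_INTERNED 64

char *demo_interned[DEMO_MAX_INTERNED];  /* canonical copies seen so far */
size_t demo_interned_count = 0;

/* Return the canonical copy of s, creating it the first time s is seen.
 * Returns NULL if the table is full or the allocation fails. */
const char *demo_intern(const char *s) {
    for (size_t i = 0; i < demo_interned_count; i++) {
        if (strcmp(demo_interned[i], s) == 0) {
            return demo_interned[i];  /* already interned: reuse it */
        }
    }
    if (demo_interned_count == DEMO_MAX_INTERNED) {
        return NULL;
    }
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy == NULL) {
        return NULL;
    }
    memcpy(copy, s, n);
    demo_interned[demo_interned_count++] = copy;
    return copy;
}
```

Two equal strings always map to the same pointer, so no per-string reference counter is needed: the table owns every copy until it is torn down as a whole.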

Arrays


I will not go into details about the new array implementation and will only mention immutable arrays. They are the array equivalent of interned strings: they also do not use a reference counter and are not destroyed until the end of the request. Due to some memory management details, immutable arrays are used only when OPCache is enabled. The following example shows what difference this makes:

    for ($i = 0; $i < 1000000; ++$i) {
        $array[] = ['foo'];
    }
    var_dump(memory_get_usage());

With OPCache enabled, 32 MB of memory is used, and without it as much as 390 MB, since in that case each element of $array receives a new copy of ['foo']. Why make a copy instead of increasing the reference count? The reason is that literal VM operands do not use a reference counter, so as not to corrupt the SHM. I hope that in the future this catastrophic case will be improved so that it also works well without OPCache.

Objects in PHP 5


Before we talk about the implementation of objects in PHP 7, let's recall how they worked in PHP 5 and what the drawbacks were. The zval stored a zend_object_value, defined as follows:

    typedef struct _zend_object_value {
        zend_object_handle handle;
        const zend_object_handlers *handlers;
    } zend_object_value;

handle is a unique object ID used to look up the object's data. handlers is a VTable of function pointers that implement the various behaviors of the object. For "normal" objects, this handler table is always the same. But objects created by PHP extensions can use custom handler sets that change the behavior of objects (for example, by overloading operators).

The object handle is used as an index into the "object store", which is an array of buckets defined as follows:

    typedef struct _zend_object_store_bucket {
        zend_bool destructor_called;
        zend_bool valid;
        zend_uchar apply_count;
        union _store_bucket {
            struct _store_object {
                void *object;
                zend_objects_store_dtor_t dtor;
                zend_objects_free_object_storage_t free_storage;
                zend_objects_store_clone_t clone;
                const zend_object_handlers *handlers;
                zend_uint refcount;
                gc_root_buffer *buffered;
            } obj;
            struct {
                int next;
            } free_list;
        } bucket;
    } zend_object_store_bucket;

There is a lot going on here. The first three members are metadata (whether the object's destructor was called, whether this bucket is in use at all, and how many times the object was visited by some recursive algorithm). The union distinguishes between a bucket that is currently in use and one that is on the free list. The case that matters to us is the one where struct _store_object is used.

object is a pointer to the actual object. It is not embedded into the object store bucket, since objects do not have a fixed size. The pointer is followed by three handlers responsible for destruction, freeing and cloning. Note that in PHP, destroying and freeing an object are distinct steps, and the first one may be skipped in some cases (unclean shutdown). The clone handler is virtually never used. Since these store handlers are not part of the regular object handlers, they are duplicated for each object instead of being shared.

These store handlers are followed by a pointer to the ordinary object handlers. They are stored here in case the object is destroyed without a zval being known (the zval is where the handlers are usually stored).

The bucket also contains a refcount, which is somewhat odd given that in PHP 5 the reference counter is already stored in the zval. Why do we need two counters? Usually a zval is "copied" by simply increasing its counter. But sometimes a real copy is made, that is, a completely new zval is created for the same zend_object_value. As a result, two different zvals use the same object store bucket, which therefore requires its own reference counting. This "double counting" is a characteristic feature of the zval implementation in PHP 5. The buffered pointer into the GC root buffer is duplicated for the same reasons.

Now consider the object that the object store points to. Ordinary userland objects are defined as follows:

    typedef struct _zend_object {
        zend_class_entry *ce;
        HashTable *properties;
        zval **properties_table;
        HashTable *guards;
    } zend_object;

ce is a pointer to the class entry of the class that the object is an instance of. The next two members store the object's properties in two different ways. For dynamic properties (that is, those that are added at run time and are not declared in the class), the properties hash table is used, which maps property names to their values.

For declared properties, an optimization is used. During compilation, each such property is assigned an index, and its value is stored at that index in properties_table. The mapping between names and indexes is stored in a hash table in the class entry. This saves individual objects the memory overhead of a hash table. Moreover, the property index is polymorphically cached at run time.
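The split between a per-class name table and a per-object value table can be sketched like this. The `demo_` types are hypothetical and heavily simplified: property values are plain `long`s, the slot table has a fixed size, and the name lookup is a linear search rather than a hash table.

```c
#include <stddef.h>
#include <string.h>

/* Class-level data: names of declared properties, position = slot index. */
typedef struct demo_class {
    const char *const *prop_names;
    size_t prop_count;
} demo_class;

/* Object-level data: only the values, one per declared property. */
typedef struct demo_object {
    const demo_class *ce;
    long slots[4];  /* fixed size for brevity */
} demo_object;

/* Resolve a property name to its slot using the class, not the object.
 * Returns -1 if the property is not declared. */
int demo_prop_slot(const demo_class *ce, const char *name) {
    for (size_t i = 0; i < ce->prop_count; i++) {
        if (strcmp(ce->prop_names[i], name) == 0) {
            return (int)i;
        }
    }
    return -1;
}
```

Once a slot index is known, it is valid for every object of that class, which is what makes caching the index at the call site worthwhile.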

The guards hash table is used to implement recursion protection for "magic" methods like __get, but I will not cover it here.

In addition to the double reference counting mentioned above, this object representation also requires a lot of memory. A minimal object with one property takes 136 bytes (not counting zvals). Moreover, there is a lot of indirection. For example, to fetch a property from an object zval, you first have to fetch the object store bucket, then the Zend object, then the property table, and finally the zval it points to. That is at least four levels of indirection, and in practice there will be no fewer than seven.

Objects in PHP 7


PHP 7 tries to fix all of the above shortcomings. In particular, it drops the double reference counting and reduces both memory consumption and the amount of indirection. This is the new zend_object structure:

    struct _zend_object {
        zend_refcounted   gc;
        uint32_t          handle;
        zend_class_entry *ce;
        const zend_object_handlers *handlers;
        HashTable        *properties;
        zval              properties_table[1];
    };

This structure is now almost all that is left of an object. The zend_object_value has been replaced by a direct pointer to the object, and the object store, while not entirely gone, plays a much smaller role.

In addition to the usual zend_refcounted header, the handle and the handlers have "moved" into the zend_object. properties_table now also uses the struct hack, so the zend_object and the property table are allocated in one block. And of course, the property table now directly embeds zvals rather than pointers to them.

The guards table has been removed from the object structure and is now stored in the first properties_table slot if the object uses __get and similar methods. If these "magic" methods are not used, the guards table is not present at all.

The dtor, free_storage and clone handlers that were previously stored in the object store have moved to the handlers table:

    struct _zend_object_handlers {
        /* offset of real object header (usually zero) */
        int offset;
        /* general object functions */
        zend_object_free_obj_t free_obj;
        zend_object_dtor_obj_t dtor_obj;
        zend_object_clone_obj_t clone_obj;
        /* individual object functions */
        // ... rest is about the same in PHP 5
    };

The offset member is obviously not really a handler. It is related to the way internal objects are represented: an internal object always embeds a standard zend_object, but usually also adds a number of additional members. In PHP 5, they were added after the standard object:

    struct custom_object {
        zend_object std;
        uint32_t something;
        // ...
    };

That is, you can simply cast a zend_object* to your custom struct custom_object*. This is the usual way of implementing structure inheritance in C. However, this approach causes a problem in PHP 7: since zend_object uses the struct hack to store the property table, PHP stores properties past the end of the zend_object, which would overwrite the additional internal members. Therefore, in PHP 7 the additional members are stored before the standard object:

    struct custom_object {
        uint32_t something;
        // ...
        zend_object std;
    };

As a result, it is no longer possible to convert between zend_object* and struct custom_object* with a simple cast, because there is an offset between the two. This offset is what is stored in the first member of the object handlers table. At compile time, the offset can be determined using the offsetof() macro.
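The conversion can be sketched as follows. `demo_object` is a hypothetical, much reduced stand-in for zend_object, and `custom_from_obj` is an illustrative helper name; the only technique shown is stepping back from the embedded member to the enclosing structure with offsetof().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for zend_object. */
typedef struct demo_object {
    uint32_t handle;
} demo_object;

typedef struct custom_object {
    uint32_t something;
    /* ... more extension-specific members ... */
    demo_object std;  /* kept last so that a property table could follow it */
} custom_object;

/* Step back from the embedded standard object to the enclosing structure. */
custom_object *custom_from_obj(demo_object *obj) {
    return (custom_object *)((char *)obj - offsetof(custom_object, std));
}
```

The engine only ever sees the pointer to `std`; extension code applies this subtraction whenever it needs its own members back.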

You probably wonder why PHP 7 still has a handle. After all, the zval now holds a direct pointer to the zend_object, so there is no longer any need to use a handle to look up the object in the store. However, the handle is still needed, because the object store still exists, albeit in a much reduced form. It is now a simple array of pointers to objects. When an object is created, a pointer to it is placed in the store at index handle, and it is removed from there when the object is freed.

Why is the object store still needed? During request shutdown, there comes a point where it is no longer safe to run userland code, because the executor has already been partially shut down. To avoid this situation, PHP runs all object destructors at an early stage of shutdown. For this, it needs a list of all active objects.
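A reduced store of this kind can be sketched as a plain array of pointers. Everything here is hypothetical and simplified: the store has a fixed size, free slots are found by a linear scan, and "running the destructor" is modeled by setting a flag.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_STORE_SIZE 8

typedef struct demo_object {
    uint32_t handle;
    int destructed;
} demo_object;

/* Slot i holds the live object with handle i, or NULL if the slot is free. */
demo_object *demo_store[DEMO_STORE_SIZE];

/* Register an object; the first free slot becomes its handle.
 * Returns the handle, or -1 if the store is full. */
int demo_store_put(demo_object *obj) {
    for (uint32_t i = 0; i < DEMO_STORE_SIZE; i++) {
        if (demo_store[i] == NULL) {
            demo_store[i] = obj;
            obj->handle = i;
            return (int)i;
        }
    }
    return -1;
}

/* Remove an object from the store when it is freed. */
void demo_store_del(demo_object *obj) {
    demo_store[obj->handle] = NULL;
}

/* Early shutdown step: visit every live object once.
 * Returns the number of destructors that were run. */
int demo_store_call_destructors(void) {
    int count = 0;
    for (uint32_t i = 0; i < DEMO_STORE_SIZE; i++) {
        if (demo_store[i] != NULL && !demo_store[i]->destructed) {
            demo_store[i]->destructed = 1;
            count++;
        }
    }
    return count;
}
```

The array is never consulted on ordinary property or method access; it exists only so that all live objects can be enumerated.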

The handle is also useful for debugging, because it gives each object a unique ID. This lets you see immediately whether two objects are the same. HHVM also still stores an object handle, even though it has no object store.

Unlike in PHP 5, only one reference counter is now used (the zval itself no longer has one). Memory consumption has decreased significantly: 40 bytes are now enough for the base object, plus 16 bytes for each declared property, which already includes its zval. There is also much less indirection, since many intermediate structures have been removed or embedded into other structures. As a result, reading a property now involves only one level of indirection instead of four.

Indirect zval


Let's now look at the special zval types used in special cases. One of them is IS_INDIRECT. An indirect zval indicates that its value is stored elsewhere. Note that this type differs from IS_REFERENCE in that it points directly to another zval, rather than to a zend_reference structure that embeds a zval.

When can this zval type be useful? Let's first consider how variables are implemented in PHP. All variables that are known at compile time are assigned an index, and their values are stored at that index in the compiled variables (CV) table. But PHP also allows us to reference variables dynamically, using variable variables or, if you are in the global scope, using $GLOBALS. When such access occurs, PHP creates a symbol table for the function or script, containing a map from variable names to their values.

The question arises: how can both forms of access be supported at the same time? For normal variable fetches we need access through the CV table, and for variable variables we need access through the symbol table. In PHP 5, the CV table used doubly indirect zval** pointers. Normally, these pointers point into a second table of zval* pointers, which in turn point to the zvals:

    +------ CV_ptr_ptr[0]
    | +---- CV_ptr_ptr[1]
    | | +-- CV_ptr_ptr[2]
    | | |
    | | +-> CV_ptr[0] --> some zval
    | +---> CV_ptr[1] --> some zval
    +-----> CV_ptr[2] --> some zval

Once a symbol table comes into use, the second table with the single zval* pointers is no longer used, and the zval** pointers are updated to point into the hash table buckets instead. A small illustration with three variables $a, $b and $c:

    CV_ptr_ptr[0] --> SymbolTable["a"].pDataPtr --> some zval
    CV_ptr_ptr[1] --> SymbolTable["b"].pDataPtr --> some zval
    CV_ptr_ptr[2] --> SymbolTable["c"].pDataPtr --> some zval

In PHP 7, this approach is no longer possible, because a pointer into a hash table bucket is invalidated when the hash table is resized. Instead, the reverse approach is used: for the variables stored in the CV table, the symbol table contains an INDIRECT entry pointing to the CV entry. The CV table is never reallocated as long as the symbol table exists, so there is no longer a problem with invalidated pointers.

If you take a function with the CVs $a, $b and $c, as well as a dynamically created variable $d, then the symbol table might look like this:

    SymbolTable["a"].value = INDIRECT --> CV[0] = LONG 42
    SymbolTable["b"].value = INDIRECT --> CV[1] = DOUBLE 42.0
    SymbolTable["c"].value = INDIRECT --> CV[2] = STRING --> zend_string("42")
    SymbolTable["d"].value = ARRAY --> zend_array([4, 2])

An indirect zval can also point to an IS_UNDEF zval. In this case, it is treated as if the hash table did not contain the associated key. So if unset($a) writes the UNDEF type into CV[0], the symbol table behaves as if it no longer had the key "a".
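The lookup rule can be sketched with a reduced, hypothetical zval type. Only three types are modeled, and `demo_resolve` is an illustrative helper name: it follows an INDIRECT entry to its target and reports an UNDEF target as "no such variable".

```c
#include <stddef.h>

typedef enum { DEMO_UNDEF, DEMO_LONG, DEMO_INDIRECT } demo_type;

typedef struct demo_zval {
    demo_type type;
    union {
        long lval;              /* used by DEMO_LONG */
        struct demo_zval *zv;   /* used by DEMO_INDIRECT: points at a CV slot */
    } value;
} demo_zval;

/* Resolve a symbol table entry to the zval that really holds the value.
 * Returns NULL if the variable should be treated as not existing. */
demo_zval *demo_resolve(demo_zval *entry) {
    if (entry->type == DEMO_INDIRECT) {
        entry = entry->value.zv;
    }
    return entry->type == DEMO_UNDEF ? NULL : entry;
}
```

Note that unsetting the variable only changes the CV slot; the INDIRECT entry in the symbol table stays in place and simply resolves to nothing.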

Constants and AST


Finally, there are two more special zval types, available in both PHP 5 and PHP 7: IS_CONSTANT and IS_CONSTANT_AST. To understand their purpose, consider an example:

    function test($a = ANSWER,
                  $b = ANSWER * ANSWER) {
        return $a + $b;
    }

    define('ANSWER', 42);
    var_dump(test()); // int(42 + 42 * 42)

The default values of the parameters of test() use the constant ANSWER, but it is not yet defined at the time the function is declared. The value of the constant is known only after define() has been called. Therefore, default values of parameters and properties, as well as constants and everything else that accepts a "static expression", can postpone the evaluation of the expression until first use.

If the value is a constant (or a class constant), a zval of type IS_CONSTANT holding the name of the constant is used. If the value is an expression, a zval of type IS_CONSTANT_AST is used, which points to an abstract syntax tree (AST).
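The deferred evaluation of a plain constant can be sketched as follows. The types and the function `demo_update_constant` are hypothetical, constants are limited to `long` values, and the constant table is a simple array; the sketch only shows a placeholder zval being replaced by the real value at first use.

```c
#include <stddef.h>
#include <string.h>

typedef enum { DEMO_LONG, DEMO_CONSTANT } demo_type;

typedef struct demo_zval {
    demo_type type;
    union {
        long lval;          /* used by DEMO_LONG */
        const char *name;   /* used by DEMO_CONSTANT: name to look up later */
    } value;
} demo_zval;

typedef struct demo_const {
    const char *name;
    long value;
} demo_const;

/* Replace a constant placeholder by its value.
 * Returns 0 on success (or if zv is not a placeholder),
 * -1 if the constant is not defined yet. */
int demo_update_constant(demo_zval *zv, const demo_const *table, size_t n) {
    if (zv->type != DEMO_CONSTANT) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        if (strcmp(table[i].name, zv->value.name) == 0) {
            zv->type = DEMO_LONG;
            zv->value.lval = table[i].value;
            return 0;
        }
    }
    return -1;
}
```

An expression such as ANSWER * ANSWER cannot be handled by a single name lookup like this, which is why it needs a syntax tree that is evaluated once all of its constants are known.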

* * *

That concludes this overview of the internal representation of values in PHP 7.

Source: https://habr.com/ru/post/261131/

