
Learning PHP from the inside: Zval

This article is based on the Zvals chapter of the PHP Internals Book, which I am currently translating into Russian [ 1 ]. The book is aimed primarily at C programmers who want to write their own PHP extensions, but I am sure it will also be useful to PHP developers, since it describes the internal logic of the interpreter. In this article I kept only the basic theory, which should be clear to any developer (even one unfamiliar with PHP or C). For a more complete treatment of the material, refer to the book.

A puzzle to get your attention: what will the following code output?
    $obj1 = new StdClass();
    $obj2 = new StdClass();
    $obj1->value = 1;
    $obj2->value = 1;

    function f1($o) {
        $o = 100;
    }

    function f2($o) {
        $o->value = 100;
    }

    f1($obj1);
    f2($obj2);

    var_dump($obj1);
    var_dump($obj2);


Answer
    object(stdClass)#1 (1) { ["value"]=> int(1) }
    object(stdClass)#2 (1) { ["value"]=> int(100) }

If you got the answer right and can explain why it comes out this way, you will probably not learn anything new here. Otherwise, this article is worth reading to deepen your knowledge.

Basic structure

The basic data structure in PHP is zval (short for “Zend value”). Each zval stores several fields, two of which are the value and the type of this value. This is necessary because PHP is a language with dynamic typing and therefore the type of variables is known only at runtime and not at compile time. In addition, the type of a variable can be changed during the life of a zval, that is, a zval previously stored as an integer can later contain a string.
The type of the variable is stored as an integer label (a type tag of type unsigned char). The tag can take one of 8 values, corresponding to the 8 data types available in PHP. These values should be assigned using constants of the form IS_TYPE. For example, IS_NULL corresponds to the null type and IS_STRING to strings.

zvalue_value

The actual value of the variable is stored in a union, which is defined as follows:
    typedef union _zvalue_value {
        long lval;
        double dval;
        struct {
            char *val;
            int len;
        } str;
        HashTable *ht;
        zend_object_value obj;
    } zvalue_value;

A short explanation for those unfamiliar with unions. A union defines several members of different types, but at any given time only one of them can be in use. For example, if the value.lval member has been assigned a value, you can only use value.lval to access the data; accessing the other members is invalid and can lead to unpredictable program behavior. The reason is that a union stores all of its members in the same memory area and interprets that memory differently depending on which member name you use. The memory allocated for a union is sized to fit its largest member.
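To make the point about shared storage concrete, here is a minimal C sketch. The demo_value union and both helper functions are hypothetical names of my own; it only mirrors the scalar members of zvalue_value and is not the PHP source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for zvalue_value, scalar members only. */
typedef union demo_value {
    long   lval;
    double dval;
    struct {
        char *val;
        int   len;
    } str;
} demo_value;

/* Size of the largest member, computed by hand for comparison. */
size_t largest_member_size(void)
{
    size_t size = sizeof(long);

    if (sizeof(double) > size) {
        size = sizeof(double);
    }
    if (sizeof(((demo_value *)0)->str) > size) {
        size = sizeof(((demo_value *)0)->str);
    }
    return size;
}

/* Writes lval and reads the same member back: the only valid access. */
long store_and_read_long(long n)
{
    demo_value v;

    assert(sizeof(v) >= sizeof(v.lval));
    v.lval = n;
    return v.lval;
}
```

The union is at least as large as its largest member (the compiler may add padding), yet smaller than the sum of all members, because they overlap.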

When working with zvals, a type tag is used to determine which kind of data the union currently holds. Before turning to the API, let's look at which data types PHP supports and how they are stored.

The simplest data type is IS_NULL : it should not store any value, since it is just null.

For storing numbers, PHP provides 2 types: IS_LONG and IS_DOUBLE, which use the members long lval and double dval respectively. The first stores integers, the second stores floating point numbers.

There are a few things you should know about the long data type. First, it is a signed integer, that is, it can hold both positive and negative values, which makes it a poor fit for bitwise operations. Second, long has different sizes on different platforms: on 32-bit systems it is 32 bits (4 bytes), but on 64-bit systems it can be either 4 or 8 bytes. On 64-bit Unix systems it is usually 8 bytes, while on 64-bit versions of Windows it is only 4 bytes.

For this reason, you should not rely on a specific size of the long type. The minimum and maximum values that a long can hold are available as the LONG_MIN and LONG_MAX constants, and the size of the type can be determined with the SIZEOF_LONG macro (unlike sizeof(long), this macro can be used in #if directives).
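The portability advice above can be sketched in plain C. SIZEOF_LONG is defined by the PHP build, so this standalone example assumes only the standard <limits.h> and tests LONG_MAX in the preprocessor instead; DEMO_LONG_IS_64BIT and both functions are hypothetical names.

```c
#include <assert.h>
#include <limits.h>

/* LONG_MAX, unlike sizeof(long), can be tested by the preprocessor. */
#if LONG_MAX > 2147483647L
#define DEMO_LONG_IS_64BIT 1
#else
#define DEMO_LONG_IS_64BIT 0
#endif

/* Width of long in bits on the platform this was compiled for. */
int long_bits(void)
{
    return (int)(sizeof(long) * CHAR_BIT);
}

/* Returns 1 if a + b would overflow long, 0 otherwise.
 * The check itself never overflows. */
int add_would_overflow(long a, long b)
{
    assert(LONG_MIN < 0 && LONG_MAX > 0);
    if (b > 0) {
        return a > LONG_MAX - b;
    }
    return a < LONG_MIN - b;
}
```

Checking against LONG_MIN and LONG_MAX before the addition keeps the code correct whether long turns out to be 4 or 8 bytes.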

The double data type is intended for floating point numbers and usually follows the IEEE-754 specification, which makes it 8 bytes in size. The details of this format will not be discussed here, but you should at least be aware that this type has limited precision and often does not store exactly the value you expect.
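A classic illustration of the limited precision, as a small C sketch that assumes IEEE-754 doubles. Both function names are hypothetical, and the tolerance is something the caller has to choose.

```c
#include <assert.h>

/* Returns 1 if 0.1 + 0.2 compares exactly equal to 0.3.
 * volatile forces the sum to be rounded to a real double. */
int sum_is_exact(void)
{
    volatile double a = 0.1;
    volatile double b = 0.2;
    volatile double sum = a + b;

    return sum == 0.3;
}

/* Compares two doubles with an absolute tolerance instead of ==. */
int nearly_equal(double x, double y, double tolerance)
{
    double diff = x - y;

    assert(tolerance >= 0.0);
    if (diff < 0.0) {
        diff = -diff;
    }
    return diff <= tolerance;
}
```

With IEEE-754 doubles, sum_is_exact() returns 0, while nearly_equal(0.1 + 0.2, 0.3, 1e-9) returns 1.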

Boolean variables use the IS_BOOL tag and are stored in the long lval member as 0 (false) or 1 (true). Since this type has only 2 values, a smaller type (for example, zend_bool) would in theory be enough. But zvalue_value is a union sized to its largest member, so a more compact variable for booleans would not save any memory. That is why lval is reused here.

Strings (IS_STRING) are stored in struct { char *val; int len; } str; that is, a string is stored as a char * pointer to the data plus an int length. Strings in PHP must store their length explicitly so that they can contain NUL bytes (\0) and be binary safe. Even so, strings used in PHP are still NUL-terminated, for compatibility with library functions that take no length argument and instead expect a zero byte at the end of the string. Of course, in such cases the strings are no longer binary safe and will be cut off at the first zero byte. Many file system functions and most string functions from libc behave this way.

The string length is measured in bytes (not in Unicode characters) and does not include the terminating zero byte: the length of the string foo is 3, even though 4 bytes are used to store it. If you determine the length of a string with sizeof, you need to subtract one: strlen("foo") == sizeof("foo") - 1.
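The difference between the stored length and what a NUL-terminated API sees can be sketched in C as follows. The demo_str type and the DEMO_STR macro are hypothetical and only mirror the shape of the str member.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mirror of the str member: pointer plus explicit length. */
typedef struct demo_str {
    const char *val;
    int         len;   /* bytes, not counting the trailing NUL */
} demo_str;

/* Builds a demo_str from a string literal; sizeof counts the final NUL. */
#define DEMO_STR(literal) { (literal), (int)(sizeof(literal) - 1) }

/* Length taken from the stored field: binary safe. */
int stored_length(demo_str s)
{
    assert(s.len >= 0);
    return s.len;
}

/* Length as a NUL-terminated libc function sees it: stops at the first \0. */
int libc_length(demo_str s)
{
    assert(s.val != NULL);
    return (int)strlen(s.val);
}
```

For DEMO_STR("foo") both functions return 3. For DEMO_STR("foo\0bar") the stored length is 7, but strlen stops at the embedded zero byte and reports 3.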

It is very important to understand that the length of the string is stored as an int, not as a long or some similar type. This is a historical artifact that limits the length of a string to 2,147,483,647 bytes (2 gigabytes). Longer strings will cause an overflow (which makes their length negative).

The remaining three types will be mentioned only superficially.

Arrays use the IS_ARRAY label and are stored in the data member HashTable *ht . How the HashTable data structure works is covered in another article.

The objects ( IS_OBJECT ) use the data member zend_object_value obj , which consists of an “object handle” (an integer ID used to search for real data) and a set of “object handlers” that determine the behavior of the object. The system of classes and objects in PHP will be described in the “Classes and Objects” chapter.

Resources ( IS_RESOURCE ) are similar to objects, since they also store a unique ID used to look up the value. This ID is stored in the long lval member. Resources will be described in the relevant chapter, which has not yet been written.

To summarize so far, the table below lists all the available type tags and where the corresponding value is stored:

Type tag    | Storage location
IS_NULL     | none
IS_BOOL     | long lval
IS_LONG     | long lval
IS_DOUBLE   | double dval
IS_STRING   | struct { char *val; int len; } str
IS_ARRAY    | HashTable *ht
IS_OBJECT   | zend_object_value obj
IS_RESOURCE | long lval

zval

Let's now see what the zval data structure looks like:

    typedef struct _zval_struct {
        zvalue_value value;
        zend_uint refcount__gc;
        zend_uchar type;
        zend_uchar is_ref__gc;
    } zval;

As already mentioned, a zval contains members for storing a value and its type. The value is stored in the zvalue_value union described above. The type is stored in zend_uchar type. In addition, the structure contains 2 more members whose names end in __gc, which are used by the garbage collection mechanism. They are discussed in more detail in the next section.
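The pairing of a union with a type tag can be sketched in a few lines of C. This is a simplified illustration and not the PHP API: the DEMO_* tags, demo_zval, and all the functions are hypothetical names, and only the scalar types are covered.

```c
#include <assert.h>

/* Hypothetical tags; the real IS_* constants live in the PHP headers. */
typedef enum demo_type { DEMO_NULL, DEMO_BOOL, DEMO_LONG, DEMO_DOUBLE } demo_type;

typedef struct demo_zval {
    union {
        long   lval;
        double dval;
    } value;
    unsigned char type;
} demo_zval;

demo_zval make_long(long n)
{
    demo_zval z;
    z.value.lval = n;
    z.type = DEMO_LONG;
    return z;
}

demo_zval make_bool(int b)
{
    demo_zval z;
    z.value.lval = b ? 1 : 0;   /* booleans reuse lval */
    z.type = DEMO_BOOL;
    return z;
}

demo_zval make_double(double d)
{
    demo_zval z;
    z.value.dval = d;
    z.type = DEMO_DOUBLE;
    return z;
}

/* Names the union member that is valid for the current tag. */
const char *active_member(const demo_zval *z)
{
    switch (z->type) {
    case DEMO_BOOL:
    case DEMO_LONG:   return "lval";
    case DEMO_DOUBLE: return "dval";
    default:          return "none";
    }
}

/* Reads the integer payload; only valid for tags stored in lval. */
long read_long(const demo_zval *z)
{
    assert(z->type == DEMO_LONG || z->type == DEMO_BOOL);
    return z->value.lval;
}
```

Every read goes through the tag first, which is what keeps access to the union safe.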

Memory management

The zval data structure plays 2 roles. First, as described in the previous section, it stores a value and its type. Second (the subject of this section), it is used to manage values in memory efficiently.

In this section, we look at the concepts of reference counting and copy-on-write.

Value and reference semantics

In PHP, all values have value semantics unless you explicitly ask for a reference. This means that when you pass a value to a function or perform an assignment, you end up working with 2 separate copies of the value. A couple of examples will help confirm this:
    <?php
    $a = 1;
    $b = $a;
    $a++;

    // Only $a was incremented by 1, $b stayed as it was:
    var_dump($a, $b); // int(2), int(1)

    function inc($n) {
        $n++;
    }

    $c = 1;
    inc($c);

    // $c outside the function and $n inside it are two different values
    var_dump($c); // int(1)

The example above is very simple and obvious, but it is important to understand that this is the basic rule that applies everywhere. It also applies to objects:
    <?php
    $obj = (object) ['value' => 1];

    function fnByVal($val) {
        // Changes the local variable from an object to an integer
        $val = 100;
    }

    function fnByRef(&$ref) {
        $ref = 100;
    }

    // fnByVal receives $obj by value, so it has no effect on $obj:
    fnByVal($obj);
    var_dump($obj); // stdClass(value => 1), fnByVal did not change the object

    fnByRef($obj);
    var_dump($obj); // int(100)

You often hear that in PHP 5 objects are automatically passed by reference, but the example above shows that this is not the case. A function that receives a value cannot change the variable that was passed to it; only a function that receives a reference can do that.

That said, objects really do behave as if they were passed by reference. You cannot assign a different value to the caller's variable, but you can change the properties of the object. This is possible because the value of an object is just an ID that is used to look up the “real data” of the object. Pass-by-value semantics will not let you replace this ID with another one or change the type of the variable, but it does not prevent you from changing the “real data” of the object.

Let's slightly change the example above:

    <?php
    $obj = (object) ['value' => 1];

    function fnByVal($val) {
        // Now we do not replace the variable, we change a property of the object
        $val->value = 100;
    }

    var_dump($obj); // stdClass(value => 1)
    fnByVal($obj);
    var_dump($obj); // stdClass(value => 100), fnByVal changed the object's data

The same applies to resources, since they also store only an ID that is used to look up the data. Again, pass-by-value semantics does not let you change the ID or the type of the zval, but it does not prevent you from changing the resource data (for example, moving the position in a file).
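The handle idea can be modeled in C with a tiny lookup table. This is only a sketch under my own assumptions: demo_store, demo_object_value, and the functions are hypothetical, and the real object store in PHP is considerably more involved.

```c
#include <assert.h>

#define DEMO_STORE_SIZE 4

/* Hypothetical object store: the "real data" lives here, keyed by handle. */
long demo_store[DEMO_STORE_SIZE];

/* The value held by a variable is only the integer handle. */
typedef struct demo_object_value {
    int handle;
} demo_object_value;

/* Like f1(): overwrites the local copy of the handle, nothing else. */
void reassign_local(demo_object_value o)
{
    o.handle = DEMO_STORE_SIZE - 1;
    (void)o;
}

/* Like f2(): follows the copied handle and changes the shared data. */
void modify_through_handle(demo_object_value o, long new_value)
{
    assert(o.handle >= 0 && o.handle < DEMO_STORE_SIZE);
    demo_store[o.handle] = new_value;
}

long read_value(demo_object_value o)
{
    assert(o.handle >= 0 && o.handle < DEMO_STORE_SIZE);
    return demo_store[o.handle];
}
```

Both functions receive a copy of the handle. Overwriting the copy is invisible to the caller, while writing through it changes the data every holder of that handle sees.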

Reference counting and copy-on-write

If you think about what was said above, you will conclude that PHP must perform a huge number of copy operations: every time a variable is passed to a function, its value has to be copied. This may not be a problem for an integer or a double, but imagine passing an array of ten million values to a function. Copying millions of values on every function call is unacceptably slow.

To avoid this, PHP uses the copy-on-write paradigm. A zval can be shared by many variables, functions, and so on, but only as long as its data is read and not modified. As soon as someone wants to change the zval's data, the zval is copied before the change is applied.

Since one zval can be used in several places, PHP must be able to determine the moment when the zval is no longer used by anyone and destroy it (free the memory it occupies). PHP does this by simply counting references. Note that a “reference” here is not a PHP reference (the kind written with &), but simply an indication that someone (a variable, a function, etc.) uses this zval. The number of such references is called the refcount and is stored in the refcount__gc member of the zval.

To understand how this works, let's take an example:

    <?php
    $a = 1;    // $a = zval_1(value=1, refcount=1)
    $b = $a;   // $a = $b = zval_1(value=1, refcount=2)
    $c = $b;   // $a = $b = $c = zval_1(value=1, refcount=3)

    $a++;      // $b = $c = zval_1(value=1, refcount=2)
               // $a = zval_2(value=2, refcount=1)

    unset($b); // $c = zval_1(value=1, refcount=1)
               // $a = zval_2(value=2, refcount=1)

    unset($c); // zval_1 is destroyed, because refcount=0
               // $a = zval_2(value=2, refcount=1)

The logic here is simple: when a reference is added, the refcount is incremented by one; when a reference is removed, the refcount is decremented. When the refcount reaches 0, the zval is destroyed.
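The counting rule and the copy before a write can be sketched in C like this. It is a deliberately simplified model, not how the engine is written: demo_zval and the demo_* functions are hypothetical names, and the value is a plain long.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct demo_zval {
    long         value;
    unsigned int refcount;
} demo_zval;

/* Creates a zval owned by exactly one user. */
demo_zval *demo_new(long value)
{
    demo_zval *z = malloc(sizeof(*z));
    if (z == NULL) {
        abort();
    }
    z->value = value;
    z->refcount = 1;
    return z;
}

/* Models "$b = $a": no copy, just one more user of the same zval. */
demo_zval *demo_share(demo_zval *z)
{
    z->refcount++;
    return z;
}

/* Models unset(): drops one user and frees the zval at refcount 0. */
void demo_release(demo_zval *z)
{
    assert(z->refcount > 0);
    if (--z->refcount == 0) {
        free(z);
    }
}

/* Copy-on-write: call before modifying. Returns a zval that the
 * caller uses alone, copying only if the zval is shared. */
demo_zval *demo_separate(demo_zval *z)
{
    if (z->refcount == 1) {
        return z;
    }
    z->refcount--;
    return demo_new(z->value);
}
```

Replaying the PHP example: after two demo_share calls the refcount is 3, demo_separate before the increment gives $a its own zval, and the original drops back to a refcount of 2.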

However, this method will not work in the case of circular references:

    <?php
    $a = [];    // $a = zval_1(value=[], refcount=1)
    $b = [];    // $b = zval_2(value=[], refcount=1)

    $a[0] = $b; // $a = zval_1(value=[0 => zval_2], refcount=1)
                // $b = zval_2(value=[], refcount=2)
                // The refcount of zval_2 is incremented, because it
                // is used in the array of zval_1

    $b[0] = $a; // $a = zval_1(value=[0 => zval_2], refcount=2)
                // $b = zval_2(value=[0 => zval_1], refcount=2)
                // The refcount of zval_1 is incremented, because it
                // is used in the array of zval_2

    unset($a);  //      zval_1(value=[0 => zval_2], refcount=1)
                // $b = zval_2(value=[0 => zval_1], refcount=2)
                // The refcount of zval_1 is decremented, but the zval
                // stays alive because zval_2 still refers to it

    unset($b);  //      zval_1(value=[0 => zval_2], refcount=1)
                //      zval_2(value=[0 => zval_1], refcount=1)
                // The refcount of zval_2 is decremented, but the zval
                // stays alive because zval_1 still refers to it

After the code above has run, we are left with two zvals that are not reachable through any variable but still exist in memory, because they refer to each other. This is the classic problem with reference counting.
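The same dead end can be reproduced in C with a bare-bones counted node. This is an illustrative model only; demo_node and the node_* functions are hypothetical, and node_clear_child simply breaks the cycle by hand so the sketch can clean up after itself.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct demo_node {
    unsigned int      refcount;
    struct demo_node *child;   /* counted reference to another node, or NULL */
} demo_node;

demo_node *node_new(void)
{
    demo_node *n = malloc(sizeof(*n));
    if (n == NULL) {
        abort();
    }
    n->refcount = 1;
    n->child = NULL;
    return n;
}

/* Stores child inside parent, which counts as one more user of child. */
void node_set_child(demo_node *parent, demo_node *child)
{
    child->refcount++;
    parent->child = child;
}

/* Drops one user; frees the node (and releases its child) at 0.
 * Returns 1 if the node was freed, 0 if it is still alive. */
int node_release(demo_node *n)
{
    assert(n->refcount > 0);
    if (--n->refcount > 0) {
        return 0;
    }
    if (n->child != NULL) {
        node_release(n->child);
    }
    free(n);
    return 1;
}

/* Removes the edge from n to its child and releases the child. */
void node_clear_child(demo_node *n)
{
    demo_node *child = n->child;
    n->child = NULL;
    if (child != NULL) {
        node_release(child);
    }
}
```

With a and b pointing at each other, releasing both variables leaves each refcount at 1, so nothing is ever freed. Only removing one edge of the cycle lets the counts reach 0.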

To solve this problem, PHP implements an additional garbage collection mechanism, the cycle collector. We can ignore it for now, because the cycle collector (unlike reference counting) is transparent to developers of PHP extensions. If you are interested in this topic, refer to the PHP documentation, which describes the algorithm.

There is one more thing to consider: PHP references (the ones written as &$var, not the ones discussed above). To indicate that a zval is used as a PHP reference, the is_ref__gc flag in the zval structure is set.

If is_ref=1, this signals that the zval should not be copied before modification; instead, the zval's value should be changed in place:

    <?php
    $a = 1;   // $a = zval_1(value=1, refcount=1, is_ref=0)
    $b =& $a; // $a = $b = zval_1(value=1, refcount=2, is_ref=1)

    $b++;     // $a = $b = zval_1(value=2, refcount=2, is_ref=1)
              // Because is_ref=1, PHP changes the zval directly
              // instead of making a copy first

In the example above, the zval of the variable $a has refcount=1 before the reference is created. Now consider a similar example in which the refcount is greater than 1:

    <?php
    $a = 1;   // $a = zval_1(value=1, refcount=1, is_ref=0)
    $b = $a;  // $a = $b = zval_1(value=1, refcount=2, is_ref=0)
    $c = $b;  // $a = $b = $c = zval_1(value=1, refcount=3, is_ref=0)

    $d =& $c; // $a = $b = zval_1(value=1, refcount=2, is_ref=0)
              // $c = $d = zval_2(value=1, refcount=2, is_ref=1)
              // $d is a reference to $c, but *not* to $a and $b, so
              // the zval has to be copied here. We now have the same
              // zval once with is_ref=0 and once with is_ref=1.

    $d++;     // $a = $b = zval_1(value=1, refcount=2, is_ref=0)
              // $c = $d = zval_2(value=2, refcount=2, is_ref=1)
              // Because there are 2 separate zvals, $d++ does not
              // change $a and $b (as expected).

As you can see, creating a reference to a zval with is_ref=0 and refcount>1 requires a copy. Likewise, using a zval with is_ref=1 and refcount>1 in a by-value context requires a copy. For this reason, using PHP references usually slows code down. Almost all functions in PHP use pass-by-value semantics, so they create a copy when they receive a zval with is_ref=1.
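The two copy rules from the paragraph above fit into a pair of small C predicates. This is just my condensed restatement of the rule, with hypothetical names, and not a description of how the engine implements it.

```c
#include <assert.h>

/* Only the two fields that matter for the decision. */
typedef struct demo_zval_info {
    unsigned int  refcount;
    unsigned char is_ref;
} demo_zval_info;

/* Taking a PHP reference (=&) to a zval that is shared by value:
 * the other by-value users must not see later changes. */
int copy_needed_for_reference(demo_zval_info z)
{
    assert(z.refcount > 0);
    return z.is_ref == 0 && z.refcount > 1;
}

/* Using a PHP-referenced zval in a by-value context, for example
 * passing it to a function that takes its argument by value. */
int copy_needed_for_value_use(demo_zval_info z)
{
    assert(z.refcount > 0);
    return z.is_ref == 1 && z.refcount > 1;
}
```

In the last PHP example, zval_1 has refcount=3 and is_ref=0 when $d =& $c runs, so copy_needed_for_reference reports that a copy is required.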

Conclusion

In this article, I summarized the Zvals chapter of the PHP Internals Book. I tried to keep only the material that would be useful to PHP developers and cut a lot of text related to extension development (otherwise the article would have been 3 times longer). If you want to learn more about developing PHP extensions, you can refer to the book or to my translation. At the moment only the Zvals chapter is translated, but I am continuing the work. In the near future I will take on the most interesting chapters, the ones about hash tables and classes.



[ 1 ] The translation of the book is done with the permission of the authors, but it is unofficial. You can read my translation here: romka.gitbooks.io/php-internals-book-ru , help translate here: github.com/romka/phpinternalsbook-ru

Source: https://habr.com/ru/post/226707/