In December 2015 , PHP 7.0 was released. Companies that have switched to the "seven" have noted that productivity has increased, and the load on the servers has decreased. The first went to the seven Vebia and Etsy, and we have Badoo, Avito and OLX. For Badoo, the transition to the seven cost $ 1 million in savings on servers. Thanks to PHP 7 in OLX, the average server load has decreased by 3 times, the efficiency and resource savings have increased.
Dmitry Stogov from Zend Technologies at HighLoad ++ explained how his performance improved. In the decoding: about the internal structure of PHP, about the ideas in the basis of version 7.0, about changes in the basic data structures and algorithms that determined success.
Disclaimer: As of March 2019,80% of the siteswork on PHP, and70% of them workon PHP 5, althoughthis version is not supportedsince January 1, 2019.Dmitry's report from 2016 on the principles that have led to a two-fold performance jump between PHP 5 and 7 is also relevant in March 2019. For half of the sites, for sure. About the speaker: Dmitry Stogov began programming in the 80s: “Electronics B3-34”, Basic, assembler. In 2002, Dmitry met PHP and soon began working on improving it: he developed Turck MMCache for PHP, managed the PHPNG project and played an important role in working on JIT for PHP. The last 14 years of Principal Engineer at Zend Technologies. ')
Zend Technologies develops PHP and commercial solutions based on it. In 1999, it was founded by Israeli programmers Andy Gutmans and Zeev Sourasky, who two years ago created PHP 3. These people were at the very beginning of PHP development and in many respects determined the current look of the language and the success of the technology.
Zend Technologies develops the PHP core and applications for it, and during the work I had to write extensions, get into all subsystems, and even engage in commercial projects, sometimes not related to PHP at all. But the most interesting topic for me has always been performance .
I started looking for ways to speed up PHP even before joining Zend, working on my own project that competed with the company. During the work on the project, I thoroughly understood the language and realized that working not with the mainstream project can only affect certain aspects of the script execution, and all the most interesting and effective can be created only in the core . This understanding and coincidence led me to Zend.
A little excursion into the history of PHP
PHP is not really and not just a programming language . PHP is translated as Personal Home Page - a tool for creating personal web pages and dynamic websites. Language is only one of its main parts. PHP is a huge library of functions, many extensions for working with other third-party libraries, for example, for accessing databases or XML parsers, as well as a set of modules for communicating with various web servers.
Danish programmer Rasmus Lerdorf introduced PHP in June 1995 . At that time, it was just a collection of CGI scripts written in Perl . In April 96, Rasmus introduced PHP / FI, and in June the version of PHP / FI 2.0 was released. Subsequently, this version was significantly reworked by Andy Gutmans and Zeev Suraski, and in 1998 they released PHP 3.0. By 2000, the language came to the form that we are used to seeing today, both in terms of language and internal architecture - PHP 4, based on the Zend Engine.
Since the 4th version of PHP is evolving evolutionary. The turning point was the release of PHP 5 in 2004, when the object model was completely updated . It was she who opened the era of PHP frameworks and raised the issue of performance to a new level. Anticipating this, immediately after the release of 5.0, we at Zend thought about accelerating PHP and began to work on improving performance.
Version 7.1, which was released in November 2016 on synthetic tests is 25 times faster than the 2002 version . According to the schedule of performance changes in different branches, the main breakthroughs are visible in 5.1 and 7.0.
In version 5.1, we just started working on performance, and everything we took for was good, but after 5.3 we ran into the wall, all attempts to improve the interpreter did not lead to anything.
Nevertheless, we found where to dig, and got even more than expected - 2.5-fold acceleration compared to the previous version 5.6 on the tests. But the most interesting thing is that we got the same 2.5-fold acceleration on unchanged real-world applications. This is a phenomenon, because we have accumulated the previous factor 2 throughout the life of the five for 10 years.
A huge jump in 5.1 on synthetic tests, on real applications is not noticeable. The reason is that with different uses, PHP's performance rests on the brakes associated with different subsystems.
The history of PHP 7 begins with a three-year stagnation , which began in 2012, and ended in 2015 with the release of the seventh version. Then we realized that we could no longer increase the performance with minor improvements to our interpreter and turned in the direction of JIT.
Wandering around jit
For almost two years, we spent on a JIT prototype for PHP-5.5. First, we generated a very simple code - a sequence of calls for standard handlers, something like a stitched Fort code. Then they wrote their own Runtime Assembler , inline a separate code for detours, but realized that such low-level optimizations did not give a practical effect even on tests.
Then we thought about the derivation of types of variables using static analysis methods. Implementing the conclusion, immediately received a 2-fold acceleration on the tests. Encouraged, they tried to write a global register alocator, but failed. We used a fairly high-level view, and for register allocation it was almost impossible to use it.
To avoid problems with a low level, we decided to try LLVM, and a year later we had a 10-fold acceleration for bench.php , but nothing on real applications. In addition, the compilation of real applications now took minutes, for example, the first requester to Wordpress took 2 minutes and did not give acceleration. Of course, it was completely unsuitable for real practice.
Good code is possible with proper type prediction, which in real applications does not work well, and the use of PHP data structures makes the generated code ineffective.
What inhibits?
We rethought the reasons for the failures and decided once again to see why PHP slows down. In the picture, the result of profiling multiple requests to the Wordpress home page.
Less than 30% is spent on the interpretation of the byte-code, 20% is the overhead of the memory manager, 13% is working with hash tables, and 5% is working with regular expressions.
Working at JIT, we got rid of only the first 30%, and everything else was a dead weight. Almost everywhere, we were forced to use standard PHP data structures that entailed overhead: memory allocation, reference counting, and so on. This understanding has led to the conclusion that it is necessary to replace key data structures in PHP. With this substitution of the foundation , the PHPNG project began.
PHPNG. New generation
The project was developed after unsuccessful attempts to create a JIT for PHP. The main goal is to achieve a new level of performance and lay the foundation for future improvements .
We promised ourselves some time not to use synthetic tests to measure performance - these are usually small computational programs that use a limited amount of data that fits completely into the processor’s cache. Real applications, on the contrary, are subject to the brakes associated with the subsystem memory, and a single reading from memory can cost 100 computational instructions. The PHPNG project is a refactoring of key PHP data structures to optimize memory access . No innovations, 100% compatibility with PHP 5.
How to change these structures was clear. But the volume of dependent changes was huge because the core of PHP itself is 150,000 lines , and almost every third needed to be changed. Add another hundred extensions that are included in the base distribution, a dozen modules for different web servers, and you will understand the grandeur of the project.
We were not even sure that we would complete the project. Therefore, they launched the project in secret and opened it only when the first optimistic results appeared. It took two weeks to simply compile the kernel . Two weeks later, earned bench.php. Spent a month and a half to make Wordpress work. A month later, we opened the project - it was May 2014. At that time, we had a 30% acceleration on Wordpress . This already seemed like a big deal.
PHPNG immediately caused a wave of interest, and in August 2014 was adopted as the basis for the future of PHP 7 . It was a different project, with a different set of goals, where productivity was only one of them.
PHP 7.0
Version 7 itself was questionable. The previous version was the fifth. And the sixth was developed several years ago and was completely devoted to native support for Unicode , but the unsuccessful decisions made in the early stages of development led to an excessive complication of the kernel code and each extension. In the end, it was decided to freeze the project.
By this time, a lot of material on PHP 6 had already been accumulated: presentations at conferences, published books. To not confuse anyone, we called the project PHP 7, skipping PHP 6. This version was lucky much more - PHP 7 was released in December 2015, almost according to plan.
In addition to performance, in PHP 7, some long-awaited innovations have appeared:
Ability to define scalar types of parameters and return values.
Exceptions instead of errors - now we can catch and handle them.
Zero-cost assert() , anonymous classes, cleaning inconsistencies, new operators and functions (<=>, ??) have appeared.
Innovation is good, but back to the internal changes. Let's talk about the path that PHP 7 went through, and where this path might lead us.
zval
This is the main data structure of PHP. It is used to represent any value in PHP . Since the language we have is dynamically typed and the type of variables can change in the execution of the program, we need to keep the field type (zend_uchar type), which can be IS_NULL, IS_BOOL, IS_LONG, IS_DOUBLE, IS_ARRAY, IS_OBJECT, etc., and the actual value , represented by a union (value), where an integer, a real number, a string, an array, or an object can be stored.
zval in php 5
Memory for each such structure was allocated separately in Heap. In addition to the type and value, it also stored a reference count to the structure. So the structure occupied 24 bytes, not counting the overhead of the memory-manager and a pointer to it.
The picture on the right above shows the data structures that were created in PHP 5 memory for a simple script.
The stack has allocated memory for 4 variables, represented by pointers. The values ​​themselves (zval) are in the heap. In our case, these are just two zval, each of which is referred to by two variables, and, accordingly, their reference counters are set to 2.
To access a type or scalar value, you need at least two readings: first read the value of the pointer, and then the value of the structure. If you need to read not a scalar value, but for example, a part of a string or array, then you will need at least one more reading more.
zval in php 7
Where we used pointers before, in the seven we began to embed zval. We have gone from reference counting for scalar types. The type and value fields remained without significant changes, but some more flags and a reserved place were added, which I will discuss later.
On the left, what it looked like in PHP 5, and on the right in PHP 7.
Now on the stack themselves are zval. To read types and scalar values, all you need is a single machine instruction. All values ​​are grouped in one memory area, which means that when working with local variables, we will have practically no losses due to processor cache misses. But the real power of the new view is included when copying is necessary.
Copy Record
In the top line of the script one more assignment was added.
In PHP5, we allocated from memory a new zval from the heap, initialized its int (2), changed the value of the pointer to the variable b, and decreased the reference counter of the value to which b referred earlier.
In PHP 7, we simply initialized the variable b directly in place with a few instructions , while in PHP 5 it required hundreds of instructions. So zval looks now in memory.
These are two 64-bit words. The first word is a value: integer, real or pointer. In the second word, the type (he tells how to interpret the meaning), the flags, and the reserved space that would still be added during alignment. But it does not disappear, but is used by different subsystems for storing indirectly related values.
Flags are a set of bits , where each bit tells whether zval supports a protocol. For example, if IS_TYPE_REFCOUNTED , then when working with this zval, the engine should take care of the value of the reference count. On assignment — increment; when going out of scope — decreasing; if the reference counter reaches zero — destroy the dependent structure.
Of the types, compared with PHP 5, there are several new ones.
IS_UNDEF is a marker for an uninitialized variable.
The single IS_BOOL replaced by separate IS_FALSE and IS_TRUE .
Added a separate type for links and a few more magical types.
Types from IS_UNDEF to IS_DOUBLE are scalar and do not require additional memory. To copy them, just copy the first machine 64-bit word with the value and half the second with the type and flags.
Refcounted
With other types harder. All of them are represented by a subordinate structure, and zval simply stores a reference to this structure. For each type, this structure has its own, but in OOP terms, they all have a common abstract ancestor or zend_refcounted structure. It defines the format of the first 64-bit word where the reference counter and other information for the garbage collector is stored.
This word can be viewed simply as information for the garbage collector, and structures for specific types add their fields after this first word.
Strings
In the seven for the string, we store the computed value of the hash function, its length, and the characters themselves. The size of this structure is variable and depends on the length of the string. The hash function is calculated for the string once, when necessary. In PHP 5, it was re-computed at every need.
Now the lines have become reference countable, and if in PHP 5 we copied the characters themselves, now it is enough to increase the reference count to this structure.
As in PHP 5, we are left with the notion of immutable or interned strings . They usually exist in one instance, live to the end of the query, and can behave like scalar values. We have no need to take care of the counter of links to them, and to copy it is enough to copy only the zval itself using four machine instructions.
Arrays
Arrays are represented by a built-in hash table and are not much different from PHP 5. The hash table itself has changed, but about this separately.
Arrays are now an adaptive structure that slightly changes its internal structure and behavior depending on the stored data. If we store only elements with close numeric keys, we get access to the elements directly by index with a speed comparable to the speed of arrays in C. But it is necessary to add an element with a string key to the same array - it turns into a real hash with collision resolution.
So the hash table looks like in PHP 5.
This is a classic implementation of a hash table with collision resolution using linear lists (shown in the upper right corner). Each item is represented by a bucket. All Buckets are linked by doubly linked lists for resolving collisions, and are also linked by another doubly linked list for iteration in order. Values ​​for each zval are allocated separately - in the Bucket we store only a link to it. Also, string keys can be allocated separately.
Thus, for each hash table you need to allocate a lot of small blocks of memory, and then to find something, you have to run along the pointers. Each such transition can cause cahce miss and a delay of ~ 10-100 processor cycles.
This is what happened in PHP 7.
The logical structure has remained unchanged, only the physical has changed. Now for a hash table, memory is allocated using a single operation.
In the picture, at the bottom of the base index, there are elements, and at the top - a hash array, which is addressed by a hash function. For flat or packaged arrays, when we store only elements with numeric indexes, the upper part is not allocated at all, and we refer to buckets directly by number.
To bypass the elements, we sequentially go through them from top to bottom or from bottom to top, which modern processors do flawlessly. The values ​​are built into the buckets, but the reserved space in them is just used to resolve collisions. There is stored the index of another Bucket with the same hash function or a marker at the end of the list.
Memory for string values ​​of keys is allocated separately, but these are all the same zend_string. When inserting into an array, it is enough to increase the reference counter of the line, although earlier we had to copy the characters directly, and when searching we can now compare not the characters, but the pointers to the lines themselves.
Immutable arrays
Previously, we had immutable strings, and now there are also immutable arrays. Like strings, they do not use a reference count and are not destroyed until the end of the request. This is a simple script that creates an array of one million elements, and each element is one and the same array with a single “hello” element.
In PHP 5, at each iteration of the loop, a new empty array was created, “hello” was written to it, and all this was added to the resulting array. In PHP 7, at compile time, we create only one immutable array that behaves like a scalar, and add it to the resulting one. In the example presented, this makes it possible to achieve more than 10-fold reduction in memory consumption and almost 10-fold acceleration.
Constant arrays of millions of elements in real applications, of course, are not often, but small - quite often. On each of them you get a small, but a gain.
Objects
Links to all objects in PHP 5 lay in a separate repository, and in zval there was only a handle - a unique object ID.
To get to the object, we made at least 3 readings. In addition, the memory for the value of each property of the object is distributed separately, and we needed at least 2 more readings to read it.
In PHP 7, we were able to go to direct addressing.
Now the zend_object address is accessible with one machine instruction. And Property are built in and for their reading only one additional reading is needed. They are also grouped together, which improves the data locality and helps modern processors not to stumble.
In addition to the predefined properties, a reference to the class of this object is also stored here, some handlers are analogous to tables of virtual methods, and a hash table for property that has not been defined. In PHP, you can add properties to any object that were not originally defined, and if several machine instructions are enough to access predefined properties, then for non-predefined properties you will have to refer to a hash table, which will require dozens of machine instructions. Of course, it is much more expensive.
Reference
Finally, we had to introduce a separate type to represent PHP links.
This is an absolutely transparent type. It is not visible to PHP scripts. Scripts see another zval, which is built into the structure of zend_reference. It is implied that we refer to one such structure from at least two places, and the reference counter of this structure is always greater than 1. As soon as the counter drops to 1, the link becomes a normal scalar value. The zval embedded in the link is copied to the last zval that references it, and the structure itself is deleted.
It seems that now working with reference is much more difficult than with other types (and this is true), but in fact in PHP 5 we had to perform work of comparable complexity when referring to any value (even a simple integer). Now we apply more complex protocols to only one type and thereby accelerated work with all others, especially with scalar values.
IS_FALSE and IS_TRUE
I already said that the single type IS_BOOL was split into separate IS_FALSE and IS_TRUE. This idea was spied in the implementation of LuaJIT, and was made to speed up one of the most frequent operations - the conditional transition.
If PHP 5 was required to read a type, check for a boolean, read a value, find out if it is true or false and make a transition based on this, then now it is enough just to check the type and compare it with true:
if it is true, then we go on one branch;
if it is less than true, go to another branch;
if it is more true, go to the so-called slow path (slow path) and there we check what type it came and what to do with it: if it is an integer, then we should compare its value with 0, if float - again with 0 ( but real), etc.
Calling convention
A change in the Calling Convention or function call convention is an important optimization that affects not only data structures, but also basic algorithms. The picture on the left shows a small script consisting of the foo () function and its call. Below is the bytecode in which this script was compiled by PHP 5.
First I’ll tell you how this worked in PHP 5.
Calling Convention in PHP 5
The first instruction SEND_VAL should have sent the value “3” to the function foo. To do this, she was forced to allocate a new zval on the heap, copy the value (3) there and write the value of the pointer to this structure on the stack.
Similarly with the second instruction. Further DO_FCALL initialized CALL FRAME , reserved space for local and temporary variables, and transferred control to the called function.
The first RECV checked the first argument and initialized the slot of the corresponding local variable ($ a) on the stack. Here we did without copying and simply increased the reference count of the corresponding parameter (zval with a value of 3). Similarly, the second RECV established a connection between the $ b variable and parameter 5.
Next is the function body. There was an addition of 3 + 5 - it turned out 8. This is a temporary variable and its value was stored directly on the stack.
RETURN and we return from the function.
On return, we release all variables and arguments that are out of scope. To do this, we go through all the zval that slots from the freed frame refer to, and for each we reduce the reference count. If it reaches 0, then destroy the corresponding structure.
As you can see, even such a simple operation as sending a constant to a function requires allocating a new memory, copying and increasing the reference count, and then double decrement and delete.
Calling Convention in PHP 7
In PHP 7, these problems have been fixed - now we are not storing pointers to zvals on the stack, but zvals themselves.
We also introduced the new INIT_FCALL instruction, which is now responsible for initializing and allocating memory for CALL FRAME , and reserving space for arguments and temporary variables.
SEND_VAL 3 now simply copies the argument to the first slot in the CALL FRAME . Next SEND_VAL 5 in the second slot.
Next is the most interesting. It would seem that DO_FCALL should transfer control to the first instruction of the function being called. But the arguments are already in the slots that are reserved for the variable parameters $ a and $ b, and the RECV instructions just do nothing. Therefore, you can simply skip them. We sent two parameters, so we skip two instructions. If they sent three, they would have missed three.
So we go directly to the function body, perform addition and return.
When returning, we clear all local variables, but now only for two slots, and since we have scalars there, we do not need to do anything again.
My story is slightly simplified; it does not take into account functions with a variable number of arguments and the need for type checking and some other points.
The new Calling Convention broke compatibility a bit . PHP has functions such as func_get_arg and func_get_args . If earlier they returned the original value of the sent parameter, they now return the current value of the corresponding local variable, because we simply do not store the original values. Just like debuggers from C.
In addition, the function can no longer have several parameters with the same name. There was no sense in it before, but I met such PHP code foo($_, $_) . What does it look like? (I learned Prolog)
New memory manager
Having finished with the optimization of data structures and basic algorithms, we once again paid attention to all the braking subsystems. The PHP 5 memory manager took up almost 20% of CPU time on Wordpress.
After we got rid of a lot of allocations, his overhead costs became less, but still significant - not because he did some substantial work, but because he stumbled on the cache. This was due to the fact that we used the classical algorithm of Doug Lea's malloc, which meant finding suitable free memory areas with the help of traveling through links and trees, and all these journeys inevitably caused cache misses.
Today, there are new memory management algorithms that take into account the features of modern processors. For example: jemalloc and ptmalloc from Google . At first, we tried to use them in an unchanged form, but did not win, because the lack of PHP-specific functionality made it more expensive to completely release memory at the end of the request. As a result, we abandoned dlmalloc and wrote something of our own, combining ideas from the old memory manager and jemalloc.
We reduced the Memory Manager overhead by up to 5% , reduced memory overhead for service information, and improved CPU cache utilization. Suitable memory blocks are now searched by bitmaps, memory for small blocks is allocated from individual pages and cached when released, and specialized functions are added for frequently used block sizes.
Many minor improvements
I told only about the most important improvements, but there were more small ones. I can mention some of them.
A quick API for parsing the parameters of internal functions and a new API for iterating over the HashTable.
New VM instructions: string concatenation, specialization, super instructions.
Some internal functions have been turned into VM instructions: strlen, is_int.
Use of CPU registers for VM: IP and FP registers.
Optimization of duplicate and delete arrays.
Use reference counters instead of copying wherever possible.
PCRE JIT.
Optimization of internal functions and serialize ().
Reducing the size of the code and the data being processed.
Some were very simple, for example, it took only three lines of code to enable JIT in regular Perlovsky expressions, and this immediately brought visible (2-3%) acceleration to almost all applications. Other optimizations have touched upon some narrow aspects of certain PHP functions, and are not particularly interesting, although the total contribution of all these minor improvements is quite significant.
What came to
This is a contribution of various subsystems on WordPress / PHP 7.0.
The contribution of the virtual machine increased to 50%.Memory Manager already consumes less than 5% - and mostly not by optimizing the Memory Manager itself, but by reducing the number of calls to it. If earlier memory was allocated 130 million times on the same test, now it is only 10 million. It may seem that all the main acceleration is achieved by reducing the memory manager overhead and reducing the number of calls to it by improving data structures, but in fact all subsystems have been significantly improved.
Main sources of acceleration:
The interpreter began to work better 2 times.
MM overhead decreased 17 times.
Hash tables began to work 4 times faster.
The overall performance of WordPress increased by 3.5 times.
At the beginning of the article we talked about 2.5-fold real acceleration, and now the numbers are different. Why is that?The fact is that we measured the real speed in requests per second, and here the speed is measured by the profiler in terms of CPU time, in fact, by the processor clock, when it is not idle. When PHP waits for a response from the database, the processor is worth and this time is not counted here.
PHP 7 performance
WordPress 3.6 was our main benchmark - we monitored performance on it from the first days of work. At some point, when the mysql extension was dropped from PHP 7, we had to specifically support it, just to continue this schedule.
The graph shows that the main breakthroughs occurred in the first months of work on PHPNG. By August, there were 2/3 improvements. Then we moved in small steps, and scored the remaining third.
Of course, we measured the performance not only on WordPress, but also on other popular applications, and almost everywhere we see - from 1.5 to 2-fold acceleration.
PHP 7 and HHVM
According to our version, we almost everywhere overtook even the current versions of HHVM.
But comparison with a third-party product is a thankless task. Always win in favor of measuring. The version of the Facebook team shows other results. On the graph, HHVM is proportionally faster everywhere. Perhaps this is due to different measurement procedures, testing on different hardware platforms, the difference in fine-tuning, and maybe subjective factors have also influenced.
Apotheosis of PHP 7 - the beginning of the use of large sites. The pioneers were Chinese Vebia, American Etsy and Badoo. The highload check revealed several significant problems, but they were quickly localized and fixed.
The transition to PHP 7.0 for Etsy and Badoo allowed to shut down almost half of the servers in web farms. Badoo ratedsaving a million dollars.
The graphs are indicative that at the time of the transition, the total processor utilization decreased by 2 times, and the memory consumption - by as much as 7 times.
On this joyful note, we’ll finish today's talk about PHP 7.0. But we will continue it very soon with PHP 7.1, in the optimization of which we went significantly further than data structures.