How we made PHP 7 twice as fast as PHP 5. Part 2: Bytecode optimization in PHP 7.1

In the first part of the story, based on the presentation of Dmitry Stogov from Zend Technologies on HighLoad ++, we understood the internal structure of PHP. We learned in detail and first-hand what changes in the basic data structures allowed PHP 7 to be more than doubled in speed. We could stop at this, but in version 7.1 the developers went much further, since they had many more ideas for optimization.

The accumulated experience of working on JIT before the Seven can now be interpreted by looking at the results in 7.0 without JIT and on the results of HHVM with JIT. In PHP 7.1, it was decided not to work with JIT, but again to turn to the interpreter. If before optimization concerned the interpreter, then in this article we will look at bytecode optimization, using type inference, which was implemented for our JIT.

')
Under the cut, Dmitry Stogov will show how this all works, using a simple example.

Optimize bytecode

Below is the bytecode in which the standard PHP compiler compiles the function. It is single-pass - fast and stupid, but capable of doing its work on each HTTP request again (if OPcache is not connected).

OPcache Optimizations

With the advent of OPcache, we began to optimize it. Some optimization methods have long been built into OPcache , for example, slit optimization methods — when we look at the code as if through an eye, look for familiar patterns, and replace them with heuristics. These methods continue to be used in 7.0. For example, we have two operations: addition and assignment.

They can be combined into a single compound assignment operation, which performs the addition directly on the result: ASSIGN_ADD $sum, $i . Another example is a post-increment variable, which theoretically can return some kind of result.

It may not be a scalar value and must be removed. The following FREE instruction is used for this. But if it is changed to a pre-increment, then the FREE instruction is not required.

In the end, there are two RETURN statements: the first is a direct reflection of the RETURN statement in the source text, and the second one is added with a blunt compiler on the closing bracket. This code will never be reached and can be removed.
There are only four instructions left in the loop. It seems that there is nothing more to optimize, but not for us.
Look at the $i++ and its corresponding instruction — the pre-increment of PRE_INC . Every time it is executed:

need to check what type of variable came;
is_long it;
increment;
check for overflow;
go to the next;
possibly check the exception.

But a person, just looking at the PHP code, will see that the $i variable is in the range from 0 to 100, and there can be no overflow, type checks are not needed, and there can be no exceptions either. In PHP 7.1, we tried to teach the compiler to understand this .

Control Flow Graph Optimization

To do this, you need to deduce the types, and in order to introduce the types, you must first build a formal representation of data streams, understandable to the computer. But we will begin with the construction of the Control Flow Graph - a control dependency graph. Initially, we break the code into basic-blocks - a set of instructions with one input and one output. Therefore, we cut the code in the places where the transition occurs, that is, the labels L0, L1. We also cut it after the conditional and unconditional transition operators, and then we connect them with arcs that show the control dependencies.

So we got a CFG.

Static Single Assignment Form Optimization

Well, now we need a data dependency. To do this, we use the Static Single Assignment Form - a popular representation in the world of optimizing compilers. It implies that the value of each variable can only be assigned once.

For each variable, we add an index, or reincarnation number. In each place where assignment takes place, we put a new index, and where we use them, there are still question marks, because it is not always known everywhere. For example, in the instruction IS_SMALLER $ i can come from both the L0 block number 4 and the first block number 2.

To solve this problem, SSA introduces the pseudo-function Phi , which, if necessary, is inserted at the beginning of the basic-> block, takes all sorts of indices of one variable that come to the basic-block from different places, and creates a new reincarnation of the variable. It is these variables that are then used to eliminate ambiguity.

Replacing all the question marks in this way, we will build the SSA.

Optimization by type

Now we deduce types - as if we are trying to execute this code directly by control.

In the first block, the variables are assigned the values of constants - zeros, and we know for sure that these variables will be of type long. Next is the Phi function. Long comes to the input, and we don’t yet know the values of other variables coming from other branches.

We believe that at the output of phi () we will have a long.

Spread further. We come to specific functions, for example, ASSIGN_ADD and PRE_INC . We fold two long. As a result, you can get either long or double, if an overflow occurs.

These values again fall into the function Phi, the sets of possible types coming in different branches are merged. Well, and so on, we continue spreading until we reach a fixed point and everything is settled down.

We obtained a possible set of type values at each point of the program. This is already good. The computer already knows that $i can only be long or double, and can eliminate some unnecessary checks. But we know that double $i cannot be. How do we know? And we see a condition that limits the growth of $i in the cycle to a possible overflow. Teach and computer to see it.

Range Propagation Optimization

In the PRE_INC instruction PRE_INC we never learned that i can only be an integer - it is worth a long or double. This happens because we did not try to infer possible ranges. Then we could answer the question whether overflow will occur or not.

This range output is produced in a similar but slightly more complicated way. As a result, we obtain a fixed range of variables $i with indices 2, 4, 6 7, and now we can confidently say that the increment $i will not lead to overflow.

Combining these two results, we can say for sure that the double variable $i never become.

All we got is not yet optimization, this is information for optimization! Consider the ASSIGN_ADD instruction. In general, the old value of the amount that came to this instruction could be, for example, an object. Then, after the addition, the old value should have been removed. But in our case, we know for sure that there is a long or double, that is, a scalar value. No destruction is required, we can replace ASSIGN_ADD with ADD - a simpler instruction. ADD uses the sum variable as both an argument and a value.

For pre-increment operations, we know for sure that the operand is always long, and overflow cannot occur. We use a highly specialized handler for this instruction, which will perform only the necessary actions without any checks.

Now compare the variable at the end of the loop. We know that the value of the variable will be only long - you can immediately check this value by comparing it with a hundred. If earlier we wrote the result of the check into a temporary variable, and then once again checked the temporary variable for true / false value, now it can be done with a single instruction, that is, it can be simplified.

The result of the bytecode compared to the original.

There are only 3 instructions left in the loop, and two of them are highly specialized. As a result, the code on the right is 3 times faster than the original.

Highly specialized handlers

Any PHP crawl handler is just a C function . On the left is the standard handler, and on the upper right is the highly specialized one. The left one checks: the type of the operand, whether overflow has occurred, whether the exception has occurred. The right one just adds one and that's it. It is broadcast in 4 machine instructions. If we went further and did JIT, we would only need a one-time incl instruction.

What's next?

We continue to increase the speed of PHP branch 7 without JIT. PHP 7.1 will again be 60% faster on typical synthetic tests, but on real applications, this practically does not give a win - only 1-2% on WordPress. This is not very interesting. From August 2016, when the 7.1 branch was frozen for significant changes, we again started working on JIT for PHP 7.2 or rather PHP 8.

In a new attempt we use to generate the DynAsm code, which was developed by Mike Paul for LuaJIT-2 . It is good in that it generates the code very quickly : the fact that the minutes were compiled in the JIT version on the LLVM now takes about 0.1-0.2 s. Already today, acceleration on bench.php on JIT is 75 times faster than PHP 5.

On real applications, there is no acceleration, and this is our next challenge. Partly, we got the optimal code, but by compiling too many PHP scripts, we clogged the processor cache, so it did not work faster. And not the speed of the code was a bottleneck in real applications ...

Perhaps DynAsm can be used to compile only certain functions that will be chosen either by a programmer or by heuristics based on counters — how many times the function has been called, how many times the cycles are repeated in it, etc.

Below is the machine code that our JIT generates for the same example. Many instructions are compiled optimally: the increment is in one CPU instruction, the initialization of a variable to constants is two. Where types are not derived, you have to work a little more.

Returning to the title picture, PHP compared to similar languages in the Mandelbrot test shows very good results (although the data are relevant at the end of 2016).

The diagram shows the execution time in seconds, less is better.

Perhaps Mandelbrot is not the best test. It is computational, but simple and implemented in all languages equally. It would be nice to know how fast Wordpress would work in C ++, but there is hardly any eccentric ready to rewrite it just to check it out, and even repeat all the perversions of the PHP code. If you have ideas for a more adequate set of benchmarks - offer.

We will meet on PHP Russia on May 17 , discuss the prospects and development of the ecosystem and the experience of using PHP for really complex and cool projects. Already with us:

Nikita Popov , one of the most prominent developers of the PHP kernel, will tell you what to expect in the upcoming release of PHP 7.4.
Dmitry Stogov will tell about the grandiose plans in PHP 8 .
Anton Titov , author of the RoadRunner application server, will speak on the topic “ Developing PHP / Go hybrid applications using RoadRunner ”.

Of course, this is not all. Yes, and Call for Papers is still closed, until April 1, we are waiting for applications from those who know how to apply modern approaches and best practices in order to implement cool PHP services. Do not be afraid of competition with eminent speakers - we are looking for experience in using what they do in real projects and will help show the benefits of your cases.

Source: https://habr.com/ru/post/445512/

All Articles