High-performance applications depend on optimization. The closer to the hardware the code is tuned, the more can be gained, and even more impressive results are possible when the hardware design itself takes the characteristics of the code into account. Today we will talk about the work on accelerating PHP 7, which Intel and the PHP developer community carry out jointly and almost continuously.

One aspect of this work is that Intel shares knowledge about its technologies, while developers bring new features into the PHP core and improve the code. The performance of fresh builds is tested, and the results are used to draw conclusions and find ways to solve problems.
For example, on the Languages Performance website you can track PHP performance trends through daily updated measurements. The site also covers other platforms, among them Python, Node.js, and HHVM.
As for PHP, these efforts have led to PHP 7 delivering almost twice the performance of PHP 5 on servers with Intel Xeon processors, while PHP's memory requirements have dropped significantly. This is a very important achievement for an interpreted language used in projects with very high loads. Note that these are results achievable on real projects. In addition, PHP 7 performance grows with each new generation of the Intel architecture (IA), which brings new features that make code execution more efficient in hardware.
Improved memory management and faster execution
One of the main factors affecting the performance of applications is how they work with memory.
Modern microprocessors have several levels of cache memory for code and data, and each new generation of Intel Xeon processors has a larger cache than the previous one. Even so, the cache remains a scarce resource: it is still very small compared with the RAM installed in modern systems. Meanwhile, classical interpreters such as PHP spend most of their time on intensive memory work, copying and moving large amounts of data.
Applications with a large code size that need a lot of memory and use it intensively are more likely to keep the processor from taking full advantage of the first- and second-level caches. This limits performance and prevents the application from fully using the available computing resources.
Most of the work Intel has done to speed up PHP is aimed at making memory use more efficient.
50% reduction in memory consumption
The first major improvement in PHP 7 is reduced memory consumption. Drawing on Intel's research and on its own extensive experience, the PHP community substantially shrank the data structures in PHP 7. In particular, the zval was reduced by 33%, the bucket data structure (part of the hash table) was also reduced by 33%, and the memory manager was optimized, which increased memory efficiency by 75%. Overall, PHP 7 needs about 50% less memory than PHP 5. This alone gave, for example, a 30% performance improvement to WordPress projects running on servers with Intel processors.
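To see where a saving of about one third can come from, here is a simplified sketch of two value layouts. The struct and field names are hypothetical, the layouts only illustrate the general idea of moving bulky payloads out of the value itself, and they are not the actual Zend definitions. The sizes quoted afterwards assume a typical 64-bit platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "fat" value: the string length is stored inside the union and
 * the reference count and type tag are separate fields, so the compiler has
 * to add padding at the end of the structure. */
typedef struct {
    union {
        int64_t lval;
        double  dval;
        struct { char *val; int32_t len; } str; /* two words on 64-bit */
        void   *ptr;
    } value;
    uint32_t refcount;
    uint8_t  type;
    uint8_t  is_ref;
} old_style_value;

/* Hypothetical "slim" value: the union holds a single machine word and
 * complex payloads (strings, arrays) are reached through a pointer. */
typedef struct {
    union {
        int64_t lval;
        double  dval;
        void   *ptr;
    } value;
    uint32_t type_info;
    uint32_t extra;
} new_style_value;

/* Percentage saved when a structure shrinks from old_size to new_size
 * (integer division, so the result is rounded down). */
static unsigned percent_saved(size_t old_size, size_t new_size)
{
    assert(old_size > 0 && new_size <= old_size);
    return (unsigned)((old_size - new_size) * 100 / old_size);
}
```

On a common 64-bit ABI, `sizeof(old_style_value)` is 24 bytes and `sizeof(new_style_value)` is 16 bytes, so `percent_saved(24, 16)` returns 33. Smaller values also mean that more of them fit into each cache line.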
About PHP 7 Performance
To study PHP performance and look for optimization opportunities, Intel uses Zend reference workloads and tests. Once such testing reveals bottlenecks, Intel passes the information to the community along with guidance on how to speed up the most important modules of the system, and the community optimizes the code. As a result, PHP 7 has achieved a performance gain of about two times.
In terms of achieved performance, I would like to note two important points.
First, optimization and testing used workloads as close as possible to those created by real applications. The performance increase was demonstrated not on synthetic tests but on real-world applications such as WordPress, Drupal, and MediaWiki. For example, on a server with an Intel processor WordPress runs, on average, twice as fast on PHP 7 as on PHP 5. Highly specialized tests aimed at exploring the Intel architecture were also run.
Secondly, PHP 7 enhancements affect all applications. The transition to PHP 7 does not require additional efforts to optimize applications to work on processors with Intel architecture. In other words, when testing performance, standard tests were used that were not specifically optimized for analyzing the performance of PHP 7. After upgrading PHP to the seventh version, on servers equipped with Intel Xeon processors, you can immediately see an increase in application performance.
Recommendations for further improving PHP performance
Moving to PHP 7 improves performance by itself, other factors aside, but two things will increase it even further.
- The first is using servers built on Intel Xeon processors.
- The second is building PHP with Profile-Guided Optimization (PGO).
Intel Xeon and performance improvements
Intel invests heavily in optimizing both hardware and software. Each new generation of the Intel microarchitecture contains improvements that increase the efficiency of hardware components and the performance of software. Simply moving to a new processor generation can produce a significant, sometimes very large, performance increase. In addition, Intel studies and optimizes software so that applications run fast on IA processors and make full use of the available hardware capabilities.
One of the main prerequisites for improving PHP performance was Intel's extensive research into how programs written in dynamic, interpreted programming languages behave on IA. Based on a thorough analysis, Intel made low-level optimizations to Intel Xeon processors so that they handle interpreters such as PHP, HHVM, Python, and Node.js better.
To further improve performance, Intel and the PHP community optimized the PHP core to take advantage of the microarchitecture improvements introduced in the new generation of Intel processors.
When new hardware capabilities emerge, Intel identifies specific areas in PHP that can be optimized. Optimizations of this kind, for example those related to the libc library, increase performance for a wide range of applications.
Here is a summary table of the major improvements that help increase PHP performance.

| Hardware optimization | Description | Performance benefits |
|---|---|---|
| Improved memory management | Larger caches and translation lookaside buffers (TLB), faster opcode handlers. | Helps reduce the performance impact of PHP's large memory needs. Although PHP 7 needs less memory than previous versions, its requirements are still significant. |
| More advanced algorithms to reduce branch mispredictions | More advanced algorithms reduce the stalls at the front end of the processor pipeline that occur when an instruction is not in the cache or a branch is mispredicted. As a result, fewer CPU cycles are lost during speculative execution. | All interpreted programming languages suffer from long stalls at the front end of the pipeline. The new algorithms make it possible to execute PHP programs and other interpreted code more efficiently. |
| Intel Advanced Vector Extensions 2 (AVX2) and Intel Streaming SIMD Extensions (SSE) instruction sets | A new set of 256-bit instructions in Intel Xeon processors. | Switching to SSE/SSE4 or AVX implementations of libc functions such as memcpy(), memset(), and memcmp() speeds up the PHP operations that depend on memory operations. For example, compiling HHVM with AVX2 enabled gives an immediate 5% performance increase for multi-threaded WordPress. |
| Manual optimization of low-level code | The assembly code of critical routines, such as memset() and memcpy() in HHVM, is optimized by hand. Similar improvements have been made to PHP 7, Python (both v2.7 and v3.5), and other platforms. | By manually rewriting assembly code in HHVM, Intel obtained significant performance gains on servers. One example of such an optimization in the PHP interpreter is the Intel patch to the fast_memcpy() function, which extended it with specialized Intel SSE2 instructions. This approach helps interpreted languages run better and faster and make full use of hardware capabilities such as a large last-level cache (LLC). |
▍Increased cache sizes and non-cacheable writes
Choosing a suitable platform, in this case a server based on Intel Xeon, is very important for high-performance PHP solutions. The most advanced Intel servers, based on Xeon v3 and v4, make better use of their large caches. Code execution efficiency and speed also benefit from Intel's ongoing work on the libc library and other components, which includes support for new hardware features available on Intel Xeon processors.
For example, consider the copying of a large number of instructions that takes place during PHP 7 initialization. On desktop computers, non-cacheable writes are preferable for such operations: the data goes directly to RAM, bypassing the cache. Writing directly to memory can significantly reduce the time needed to copy large amounts of data during initialization. However, given the overall speedup of PHP 7, there is no noticeable difference in run time between this approach and regular writes.
Intel research has shown that switching to regular memory writes on servers with Intel Xeon v3 increases PHP 7 performance by 2.9% for WordPress and by 5.9% for MediaWiki. What looks like “cache pollution” on desktop systems whose processors have a 4 MB LLC turns out to be “cache warming” on server processors with a 45 MB cache. This shows how the large cache of Intel Xeon processors can improve the performance of some tasks.
▍Performance impact of non-cacheable write operations
This does not mean that non-cacheable writes are irrelevant in server applications. They are still needed in special cases, for example when pollution of the L1 and L2 caches must be avoided. For the initialization phase of PHP 7, however, regular writes are preferable. That is why the Intel patch uses regular writes in fast_memcpy(), as was done earlier for memcpy() in libc.
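The difference between the two kinds of writes can be sketched with SSE2 intrinsics. This is only an illustration under stated assumptions, not the code of the Intel patch: the function names are hypothetical, the target is assumed to be an x86 processor with SSE2, and both buffers are assumed to be 16-byte aligned with a size that is a multiple of 16.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics; assumes an x86 target */
#include <stddef.h>
#include <stdint.h>

/* True when both pointers and the size are multiples of 16 bytes. */
static int is_16_byte_aligned(const void *dst, const void *src, size_t n)
{
    return (((uintptr_t)dst | (uintptr_t)src | (uintptr_t)n) & 15u) == 0;
}

/* Regular (cacheable) copy: the written lines stay in the cache hierarchy,
 * which "warms" a large last-level cache for the code that runs next. */
static void copy_regular(void *dst, const void *src, size_t n)
{
    assert(is_16_byte_aligned(dst, src, n));
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i++)
        _mm_store_si128(d + i, _mm_load_si128(s + i));
}

/* Non-temporal copy: streaming stores go to memory without filling the
 * cache, which avoids evicting other data from a small cache. */
static void copy_non_temporal(void *dst, const void *src, size_t n)
{
    assert(is_16_byte_aligned(dst, src, n));
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(d + i, _mm_load_si128(s + i));
    _mm_sfence(); /* order the streaming stores before later stores */
}
```

Both functions produce the same bytes in the destination buffer. They differ only in whether the destination lines end up in the cache, which is why the better choice depends on the cache size of the target processor.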
The fast_memcpy() function in PHP 7 is faster than memcpy() because it copies chunks of memory directly with SSE2 instructions, without extra operations for address alignment or size checking. Instead, fast_memcpy() relies on the memory areas being 64-byte aligned. This holds when running on virtually any processor. Some additional software transformations are also involved here.
The alignment requirements of fast_memcpy() are usually met only during initialization. For the frequent copy operations that occur during normal program execution (mostly string copies), the alignment requirement is not met, and memcpy() from libc should be used instead. At present, PHP 7 does not use non-cacheable writes.
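The split between an aligned fast path and the general libc routine can be sketched as follows. All names here are hypothetical, and the run-time check in `sketch_copy` is an addition that only makes the sketch safe to call. As described above, the real fast path performs no such checks and is simply used where the alignment is already guaranteed.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics; assumes an x86 target */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True when both pointers and the size are multiples of 64 bytes. */
static int fits_fast_path(const void *dst, const void *src, size_t n)
{
    return (((uintptr_t)dst | (uintptr_t)src | (uintptr_t)n) & 63u) == 0;
}

/* Fast path: copies one 64-byte block per iteration with four aligned
 * SSE2 loads and stores. There is no handling of unaligned heads or tails;
 * the caller guarantees the alignment (checked only in debug builds). */
static void sketch_fast_copy(void *dst, const void *src, size_t n)
{
    assert(fits_fast_path(dst, src, n));
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i += 4) {
        __m128i x0 = _mm_load_si128(s + i);
        __m128i x1 = _mm_load_si128(s + i + 1);
        __m128i x2 = _mm_load_si128(s + i + 2);
        __m128i x3 = _mm_load_si128(s + i + 3);
        _mm_store_si128(d + i, x0);
        _mm_store_si128(d + i + 1, x1);
        _mm_store_si128(d + i + 2, x2);
        _mm_store_si128(d + i + 3, x3);
    }
}

/* Uses the fast path when the alignment requirement is met and falls back
 * to memcpy() from libc otherwise, for example for ordinary string copies. */
static void *sketch_copy(void *dst, const void *src, size_t n)
{
    if (fits_fast_path(dst, src, n)) {
        sketch_fast_copy(dst, src, n);
        return dst;
    }
    return memcpy(dst, src, n);
}
```

A copy of two 64-byte-aligned buffers of 128 bytes takes the fast path, while a copy whose source starts one byte later, or whose length is 100 bytes, goes through memcpy().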
Using PGO when compiling PHP: speeding up for specific tasks
Profile-Guided Optimization (PGO) is a technique currently used by only a few open source projects. Most administrators deploy PHP 7 on servers by simply installing precompiled packages. However, PGO is easy to use and can bring a quite noticeable performance improvement. We recommend using PGO when compiling PHP 7.
▍Hints for the compiler
When building any program, the compiler uses heuristics to guess how the program will be executed and optimizes the code according to those assumptions. PGO replaces the standard heuristics with concrete statistics about execution. This data is collected during a profiling stage, when the target code runs as it would in real use. The compiler then uses the collected data as optimization hints to produce executable code that is better tuned to the tasks it has to solve.
With PGO, the GCC compiler produces better executable code that takes into account how intensively the various parts of the program are used. Rarely used code is moved to a memory area that is seldom accessed. As a result, memory pages and caches on the Intel architecture are used more efficiently. This also improves performance thanks to the large L3 cache of Intel Xeon v3 and v4 processors. Since program execution paths are optimized, this approach may also slightly reduce branch mispredictions.
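The kind of information a profile supplies can be approximated by hand with GCC's `__builtin_expect` built-in and the `cold` function attribute. The sketch below is only an illustration with hypothetical function names. With PGO, the compiler derives comparable hints from the measured profile instead of from manual annotations.

```c
#include <assert.h>
#include <stddef.h>

/* Tells the compiler that the condition is expected to be false. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Rarely executed error path. The cold attribute lets GCC optimize it for
 * size and place it away from the frequently executed code. */
__attribute__((cold, noinline))
static long report_bad_input(void)
{
    return -1;
}

/* Hot path: sums the values and treats a negative value as an error. */
static long sum_non_negative(const int *values, size_t count)
{
    assert(values != NULL || count == 0);
    long total = 0;
    for (size_t i = 0; i < count; i++) {
        if (UNLIKELY(values[i] < 0))
            return report_bad_input();
        total += values[i];
    }
    return total;
}
```

Manual hints like these have to be maintained by the programmer and can become wrong as the code changes, whereas a profile is collected again from a representative run at build time.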
Rigorous, easily reproducible testing in the Intel 0-Day Lab made it possible to check the effectiveness of PGO in real-world PHP usage scenarios. In particular, the 0-Day Lab compared the performance of a regular PHP 7 build running WordPress with that of a PGO build prepared from profiling data collected while running WordPress. The results showed a performance increase of 7-8% with no changes to the source code. As a result of Intel's research, PHP 7 now supports profile-guided builds.
▍PGO preparation
Deploying profile-guided builds is relatively simple. Here is how to prepare profiled PHP 7 executables and train the compiler so that it generates code optimized for a particular task.
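$ make prof-gen    # build PHP with profiling instrumentation
# ... run a representative training workload against the instrumented PHP build ...
$ make prof-clean  # remove the instrumented object files, keeping the collected profile data
$ make prof-use    # rebuild PHP using the collected profile data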
▍Micro-benchmarks for PGO training
Based on the results of its PGO experiments, Intel has begun creating micro-benchmarks that reflect real-world scenarios. These scripts can be used, for example, by creators of Linux distributions to prepare installation packages optimized for various real-world tasks. A draft version of these scripts is available on GitHub.
The Intel 0-Day Lab project
Intel provides the PHP community with comprehensive information about IA for code optimization. The Intel 0-Day Lab project is one of Intel's most important resources for PHP optimization.
At the Intel 0-Day Lab, the PHP source code is downloaded, built, and measured daily. Test results highlighting recent changes (made within the last 24 hours) are sent to interested developers. The reports describe performance changes relative to the previous day's tests and include warnings if some parts of the code could not be built. Daily notifications also compare the current build with a baseline release. This lets core developers see whether their patches improve or degrade the overall performance of PHP and gives them a general picture of how PHP performance is evolving.
Sample results from Intel 0-Day Lab for a PHP script on a server with an Intel Xeon processor
For interpreted languages, every percent of performance gained or lost matters. Because the Intel 0-Day Lab uses an exceptionally stable measurement platform, it can detect performance changes of around 1% with a relative standard deviation of less than 0.2%. Over time, such small improvements can add up to a significant performance increase.
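To illustrate why a stable platform matters, here is a minimal sketch of the two quantities involved: the relative standard deviation of repeated measurements and the percentage change between two builds. The function names and the sample numbers used in the note below are hypothetical and are not 0-Day Lab data.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when the relative standard deviation (sample standard deviation
 * divided by the mean) of the measurements is below limit_percent.
 * Both sides are compared as squares, so no square root is needed. */
static int rsd_below(const double *samples, size_t n, double limit_percent)
{
    assert(n > 1);
    double mean = 0.0;
    for (size_t i = 0; i < n; i++)
        mean += samples[i];
    mean /= (double)n;
    assert(mean > 0.0); /* throughput-style measurements are positive */

    double sum_sq = 0.0;
    for (size_t i = 0; i < n; i++) {
        double diff = samples[i] - mean;
        sum_sq += diff * diff;
    }
    double variance = sum_sq / (double)(n - 1);
    double limit = limit_percent / 100.0 * mean;
    return variance < limit * limit;
}

/* Change of the current result relative to the baseline, in percent. */
static double percent_change(double baseline, double current)
{
    assert(baseline > 0.0);
    return 100.0 * (current - baseline) / baseline;
}
```

For five hypothetical runs of 1000, 1001, 999, 1000, and 1000 requests per second, `rsd_below(samples, 5, 0.2)` returns 1, so a change of `percent_change(1000.0, 1010.0)`, which is 1%, stands well clear of the noise. With runs that scatter between 950 and 1050, the same 1% change could not be told apart from measurement noise.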
The quality of the tests performed is indicated by the fact that in February 2016 Zend integrated the MySQL configuration methodology from the 0-Day Lab into its own internal testing system. This, for example, allowed WordPress to improve performance by 30%.
Here you can learn more about the Intel 0-Day Lab project and subscribe to the newsletter.
Developer Tools for PHP Code Optimization
In addition to collaborating with the PHP developer community, Intel also works more broadly, offering the following tools for studying and optimizing code performance:
- VTune Amplifier. A toolkit that is very useful for profiling code on Intel architectures. PHP-based software systems are good candidates for profiling with VTune.
- Branch Hinting Tool. An open source analysis package. It provides additional information about branches in C/C++ applications, based on gcov statistics collected during a representative run of instrumented executables. The tool is used to study the PHP core, but it can be applied to any C/C++ project.
- PHP PGO Training Scripts. A set of open source PHP scripts created by Intel. They can be used as training scripts that mimic a real load when building Zend PHP with PGO. The scripts are expected to be included in the official PHP package.
Results
To help identify performance problems in PHP and other languages with large memory needs, Intel began an in-depth analysis of the processor pipeline. This analysis has already shown that workloads such as WordPress waste 40% to 50% of cycles at the front end of the pipeline, where instructions are fetched and decoded. The two main causes of these stalls are instruction cache (ICACHE) misses and branch mispredictions.
This is a very high share of stalled cycles. For an optimal workload it should not exceed 20%. The problem is common to interpreted languages such as PHP because, again, their executable code often needs more memory. Intel is currently optimizing its microarchitecture to speed up the execution of interpreted languages. Specifically, the plan is to:
- Reduce front-end stalls and delays caused by ICACHE misses and branch mispredictions.
- Improve memory management for PHP, HHVM, Python, and Node.js.
- Reduce the processor cycles lost to speculative execution (when instructions are executed ahead of time based on branch predictions and the work is then discarded).
- Improve the overall execution speed of interpreted applications.
Application speed depends on both the code and the hardware, or rather on how well they match and how closely they are adapted to each other. In essence, the cooperation between Intel and PHP developers is a search for ways to make all components of server solutions work together as efficiently as possible. Judging by what has been achieved in optimizing PHP 7 for IA, this approach produces impressive results.
In conclusion, I would like to note that the work on improving the interaction between interpreted code and Intel hardware continues, and it concerns not only PHP but other software platforms as well.