
JPHP: How It Works and How It Was Created

In this article I describe the history of the JPHP project in more detail and explain how it was built from a technical point of view. It should be of interest both to ordinary PHP developers and to compiler enthusiasts. I have tried to keep the language simple.


JPHP is a compiler of the PHP language for the Java VM. Two weeks ago I wrote an article about the project. Similar projects are JRuby for Ruby and Jython for Python. After the first article was published, the project collected 500 stars on GitHub in two days, got noticed not only on the Russian-speaking internet but also on foreign sites, and reached first place in the GitHub ranking.

But first, a little news.

Latest project news




The project will soon get its own website, where trying JPHP will be very easy and various compiled builds of the engine will be published. Now, on to the topic.

Project start


The project started spontaneously. Before that I had looked for similar projects, that is, PHP implementations for the JVM. One of them is Quercus from Resin, a translator from PHP to Java code that is itself written in Java. That did not suit me, all the more so because its authors state that their implementation runs at the same speed as Zend PHP + APC. There used to be other PHP implementations for the JVM (p8 and projectzero, for example), but they have all died or been closed.

The main motivation for starting the project was performance and JIT. I have been working with Java for quite a while; the high performance of the JVM, its many goodies, a huge community, and high-quality libraries are what attract me to it. After some thought I had an idea: take the best of Java and implement a PHP engine on top of it. Having gathered my scattered thoughts, I started writing a trial version and decided that if JPHP turned out at least 2 times faster than the original Zend PHP, I would continue development.

First I went through the repositories of the well-known JVM languages (Groovy, JRuby, Scala) to find out which libraries they use to generate JVM bytecode. It turned out there is a well-known third-party library, ASM. It is actively developed, has adequate documentation in PDF, and apparently even supports Dalvik (Android) bytecode (more on this below).

Introduction to the Java VM


The Java Virtual Machine (JVM) is a rather powerful tool. How JVM bytecode is organized is described in the ASM library documentation. The main features of the VM, briefly:

1. It is a virtual stack machine
2. Local variables can be stored by index (something like registers)
3. GC (garbage collection) is implemented at the VM level
4. Objects and classes are implemented at the VM level
5. There is a large set of standard operations: POP, PUSH, DUP, INVOKE, JMP, etc.
6. try/catch is supported at the bytecode level; finally is supported only partially
7. The VM has several kinds of values: int32, int64, float, double, objects, arrays of scalars, and arrays of objects; bool, short, byte, and char are represented as int32.

Thus, I realized that I would not have to implement the GC itself and the object class system from scratch.

Selection of goals and priorities


Before starting development, I took a close look at the main advantages of PHP, not only as a language but also as a platform. The following seemed the most obvious to me:



These advantages are also disadvantages. On every request all classes are loaded again, which is not great. Bytecode caches solve the problem partially, but not completely. I thought I could solve it while keeping the advantages listed above. What I ended up with:



Dynamic typing


The Java VM has no dynamic typing at its level, but PHP needs it. To many this will look like a big problem, but it is not.

To store values, I implemented the abstract class Memory. JPHP stores values as Memory objects. I then implemented the classes StringMemory, NullMemory, DoubleMemory, LongMemory, TrueMemory, FalseMemory, ArrayMemory, and ObjectMemory. As the names suggest, each of them is responsible for one type: numbers, strings, and so on. All of them inherit from Memory.

Memory consists of abstract methods needed to implement operations on values. For example, for the plus and minus operators there are plus() and minus() methods that every Memory subclass must implement. These are virtual methods. There are a lot of them, but for a reason: various optimizations. To show how this works, here is some pseudo-code:

Pseudo-code example

    $var + 20;
    // is turned into roughly this, where $var is a Memory object:
    $var->plus(20);   // plus is a method of the Memory class

    $x + 20 - $y
    /* becomes */
    $x->plus(20)->minus($y);

Naturally, this is pseudo-code, not real PHP code. All of this happens under the hood, in bytecode.
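To make the idea more concrete, here is a minimal Java sketch of such a hierarchy. It is an illustration only, not JPHP's actual code: the classes are heavily simplified, only two value types are modeled, and details such as integer overflow and string conversion are ignored.

```java
// Hypothetical, simplified stand-ins for JPHP's value classes.
abstract class Memory {
    abstract Memory plus(Memory other);
    abstract Memory minus(Memory other);
    abstract long toLong();
    abstract double toDouble();
}

final class LongMemory extends Memory {
    private final long value;
    LongMemory(long value) { this.value = value; }

    @Override Memory plus(Memory other) {
        // long + long stays long; any other operand promotes the result to double
        if (other instanceof LongMemory) return new LongMemory(value + other.toLong());
        return new DoubleMemory(value + other.toDouble());
    }
    @Override Memory minus(Memory other) {
        if (other instanceof LongMemory) return new LongMemory(value - other.toLong());
        return new DoubleMemory(value - other.toDouble());
    }
    @Override long toLong() { return value; }
    @Override double toDouble() { return value; }
    @Override public String toString() { return Long.toString(value); }
}

final class DoubleMemory extends Memory {
    private final double value;
    DoubleMemory(double value) { this.value = value; }

    @Override Memory plus(Memory other) { return new DoubleMemory(value + other.toDouble()); }
    @Override Memory minus(Memory other) { return new DoubleMemory(value - other.toDouble()); }
    @Override long toLong() { return (long) value; }
    @Override double toDouble() { return value; }
    @Override public String toString() { return Double.toString(value); }
}

class MemoryDemo {
    public static void main(String[] args) {
        Memory x = new LongMemory(5);
        Memory y = new DoubleMemory(1.5);
        // PHP: $x + 20 - $y
        Memory result = x.plus(new LongMemory(20)).minus(y);
        System.out.println(result); // prints 23.5
    }
}
```

Each operator becomes one virtual call, and the receiving subclass decides the result type.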


A Memory object consumes no more memory than a zval in Zend PHP, which is partly why JPHP and Zend PHP are roughly equal in memory consumption. For many values (for example false, true, null, and small numbers), the objects are cached and not duplicated. Does it make sense to create a new TrueMemory object for every true? Of course not.
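Caching of this kind can be sketched in a few lines of Java. The class names and the cached range of -128 to 127 are assumptions made for the example (the range mirrors what java.lang.Long.valueOf does); the article does not say which range JPHP actually caches.

```java
// Hypothetical sketch: shared instances for true and for small integers.
final class CachedTrue {
    static final CachedTrue INSTANCE = new CachedTrue(); // the only instance ever created
    private CachedTrue() { }
}

final class CachedLong {
    private static final int MIN = -128; // assumed range, for illustration only
    private static final int MAX = 127;
    private static final CachedLong[] CACHE = new CachedLong[MAX - MIN + 1];

    static {
        for (int i = 0; i < CACHE.length; i++) {
            CACHE[i] = new CachedLong(i + MIN);
        }
    }

    final long value;

    private CachedLong(long value) { this.value = value; }

    // Returns a shared object for small values and a fresh one otherwise.
    static CachedLong valueOf(long value) {
        if (value >= MIN && value <= MAX) {
            return CACHE[(int) (value - MIN)];
        }
        return new CachedLong(value);
    }
}
```

Because the objects are immutable, sharing them is safe, and hot code that produces small numbers or booleans allocates nothing.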

A dynamic typing mistake and its fix


As described above, all values are objects of the Memory class. At the very beginning I implemented autoboxing for simple constant values, for example:

    // PHP source
    $y = $x + 2;
    // what it was compiled into (pseudo-code)
    $y->assign( $x->plus( new LongMemory(2) ) );


As you can see, "2" turned into an object. This was convenient from a programming point of view, but a nightmare for performance. With this implementation the engine was no faster than Zend PHP.

I decided to see it through. To avoid creating so many objects for simple values (imagine this code in a loop), I decided to implement plus() and the other similar methods for the basic Java types as well (double, long, boolean, and so on), not only for Memory. I admit it was tedious work and it smelled of bad code. I reworked the compiler so that it understood types and knew what to do with them. The compiler learned to pick different types and different methods for an operation depending on the types of the items on the stack. Yes, the stack is computed at compile time. These constants could have been kept in a constant table instead, but that would still be overhead, albeit a smaller one.

As it turned out, the effort was worth it: on synthetic tests (loops, variables, and so on) performance began to exceed that of Zend PHP by a factor of 2, 3, 4, and even 10.
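The difference between the two approaches can be shown with a small Java sketch. The class names are made up for the example and only long values are modeled; the point is the extra overload that takes a primitive, which a compiler that tracks operand types can select when the right-hand side is a known long constant.

```java
// Hypothetical sketch: an operator method overloaded for a primitive operand.
abstract class Value {
    abstract Value plus(Value other); // general case: both operands are boxed
    abstract Value plus(long other);  // fast path: right operand is a primitive long
    abstract long toLong();
}

final class LongValue extends Value {
    private final long value;
    LongValue(long value) { this.value = value; }

    @Override Value plus(Value other) { return new LongValue(value + other.toLong()); }
    @Override Value plus(long other) { return new LongValue(value + other); }
    @Override long toLong() { return value; }
}

class OverloadDemo {
    public static void main(String[] args) {
        Value x = new LongValue(40);

        // First version: the constant 2 is boxed, two allocations per operation.
        Value boxed = x.plus(new LongValue(2));

        // Reworked version: the constant stays primitive, only the result is allocated.
        Value direct = x.plus(2L);

        System.out.println(boxed.toLong() + " " + direct.toLong()); // prints 42 42
    }
}
```

Both calls compute the same result; the second simply never creates an object for the constant, which matters inside loops.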

Magic of variables


PHP is truly a magical language. I am not entirely sure, but in what other language can you access a variable at runtime by a name held in a string? A simple example:

    $var = "foobar";
    ${'var'} = 100500;
    $name = 'var';
    echo $$name;


To implement this magic you need a named table of variables, that is, a hash table. The JVM lets you store variables by index, and turning variable names into indexes at compile time is of course the more logical step: it gives very fast access to variables, faster than a hash table. Just compare a hash table lookup with an indexed array access.

At first I forgot about this and implemented variables on indexes. Variable access performance was excellent. When I remembered about the magic, I had to switch to a hash table and ... everything got bad. Performance of variable access dropped by a factor of 2-3.

Without much hesitation I implemented two compilation modes for variables in the compiler: indexed and hash-table based. An analyzer detects code that needs to access variables by string name. It does so using the following criteria:

  1. If the code contains expressions: $$var , ${...}
  2. There are functions eval, include, require, get_defined_vars, extract, compact
  3. Global scope


Why keep variable names in a hash table for code that contains require or include? It is simple: PHP has to pass the variables into the included script, and the same goes for eval. The other functions work with variable names directly.

Very often you do not need this kind of variable magic, which means your code will run faster in JPHP. There is also the global scope. There the hash-table mechanism is always used, since global variables are assumed to be accessible by name through the $GLOBALS array at any time.
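The decision between the two modes can be sketched as follows. This is a deliberately crude Java illustration: it scans raw source text with a regular expression, whereas a real analyzer would work on tokens or the syntax tree and would not be fooled by strings or comments. The class and method names are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the check that selects hash-table storage for variables.
final class VariableModeAnalyzer {
    // $$name, ${...}, or a call that needs variable names at run time
    private static final Pattern DYNAMIC_ACCESS = Pattern.compile(
            "\\$\\$|\\$\\{|\\b(eval|include|require|get_defined_vars|extract|compact)\\b");

    static boolean needsHashTable(String body, boolean globalScope) {
        // The global scope always uses the hash table because of $GLOBALS.
        return globalScope || DYNAMIC_ACCESS.matcher(body).find();
    }
}

class VariableModeDemo {
    public static void main(String[] args) {
        // Plain local variables: names can become slot indexes at compile time.
        System.out.println(VariableModeAnalyzer.needsHashTable("$a = $b + 1;", false));
        // A variable variable: names must survive until run time.
        System.out.println(VariableModeAnalyzer.needsHashTable("echo $$name;", false));
    }
}
```

When the check returns false, a variable such as $a can live in a fixed slot; otherwise every access goes through a name lookup.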

Superglobal variables
$GLOBALS, $_SERVER, $_ENV, and so on: such variables do not need the global keyword. Their implementation is quite simple. The compiler knows the names of the superglobal variables in advance and, when it meets one, emits different code to access it. A pseudo-code example:

Example of a superglobal variable

    function test() {
        $a = $GLOBALS['a'];
        ...
        $GLOBALS['x'] = 33;
    }

    // is compiled into roughly the following
    function test() {
        $GLOBALS =& getGlobalVar('GLOBALS'); // inserted by the compiler
        $a->assign($GLOBALS['a']);
        ...
        $GLOBALS['x']->assign(33);
    }



Arrays, References, Immutable Values, and GC


In PHP, arrays are passed by value, not by reference. However, the array is not copied at the moment of assignment (=), since that would create a lot of overhead. Inside the JPHP engine, as in Zend PHP, an assigned array is shared by reference and is copied only when it is modified (and only if the number of references to it is greater than 1).

JPHP does not use reference counting; it uses the standard Java GC, which can also collect circular references. This makes such arrays harder to implement. So I implemented a special mechanism that turns any Memory value into an immutable value. First, the pseudo-code:

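    $x = array(1, 2, 3);
    $y = $x;
    $y[0] = 100; // changes only $y, not $x

    // roughly what happens under the hood (pseudo-code):
    $x->assign( array(1,2,3) );
    $y->assign($x->toImmutable()); // $y gets a copy-on-write copy of $x
    $y[0]->assign(100);

    // assignment by reference:
    $y =& $x;
    // no ->toImmutable call here
    $y->assign($x);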

JPHP has one more kind of Memory object, ReferenceMemory; these objects simply hold a reference to another Memory object. However, variables are not always stored as reference objects: in some cases local variables can do without them and write a new value to a slot directly with bytecode, which is naturally faster than the usual ->assign() method.

Reference objects return their real value from the toImmutable method, and arrays in turn return a special reference copy of the array that is copied on first modification. So to assign a variable by reference, the compiler only has to skip the toImmutable call.
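A copy-on-write array without reference counting can be sketched like this. It is a simplified Java illustration under my own assumptions, not JPHP's ArrayMemory: the class name, the `shared` flag, and the list-backed storage are all invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a copy-on-write array that needs no reference counter.
final class CowArray {
    private List<Object> items;
    private boolean shared; // true while the backing list may be visible to another CowArray

    private CowArray(List<Object> items, boolean shared) {
        this.items = items;
        this.shared = shared;
    }

    static CowArray of(Object... values) {
        return new CowArray(new ArrayList<>(Arrays.asList(values)), false);
    }

    // Cheap "copy": both arrays point to the same storage until one of them is written.
    CowArray toImmutable() {
        shared = true;
        return new CowArray(items, true);
    }

    Object get(int index) {
        return items.get(index);
    }

    void set(int index, Object value) {
        if (shared) {
            // First write after sharing: detach by making a private copy.
            items = new ArrayList<>(items);
            shared = false;
        }
        items.set(index, value);
    }
}
```

Without a counter, neither side can tell whether the other still exists, so each side copies once on its first write after sharing. That can cost one unnecessary copy, which is the price of relying on the garbage collector instead of counting references.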

Implementing classes and functions


Classes already exist at the JVM level. Generally speaking, Java, Scala, Groovy, and JRuby all generate the same JVM classes, just with different signatures. When compiling PHP classes, JPHP generates JVM classes whose methods have a special signature:

 Memory methodName(Environment env, Memory[] args) 

An Environment object is passed to every method and function; it lets the code learn a lot about the environment in which it runs. Classes and functions are compiled the same way for all environments. Memory[] is the array of arguments passed to the method. A method must always return a value, never void, because in PHP even a function that returns nothing actually returns null.

PHP functions are also compiled into classes, since the JVM has no notion of a standalone function. The output is a class with a single static method, which is in effect our function. Why not compile functions into methods of one class? It is a good question, and most likely this should be redone to avoid producing extra classes, but for now it is more convenient this way.
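Written as Java source, the result of compiling two small PHP functions might look roughly like the sketch below. Everything here is an assumption made for illustration: the stand-in Environment and Memory classes are stripped to the bare minimum, and the generated class and method names (Function_sum, call) are invented, since the article does not show the real ones.

```java
// Hypothetical stand-ins; JPHP's real Environment and Memory classes are far richer.
final class Environment { }

final class Memory {
    static final Memory NULL = new Memory(null);
    final Long value; // null models PHP null in this sketch

    Memory(Long value) { this.value = value; }
}

// PHP: function sum($a, $b) { return $a + $b; }
final class Function_sum {
    static Memory call(Environment env, Memory[] args) {
        return new Memory(args[0].value + args[1].value);
    }
}

// PHP: function nothing() { }
final class Function_nothing {
    static Memory call(Environment env, Memory[] args) {
        // No return statement in PHP still yields null, so the method is never void.
        return Memory.NULL;
    }
}
```

Every compiled function has the same shape, so the engine can call any of them uniformly with an environment and an argument array.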

The JVM can easily load classes from memory at runtime; you only have to write your own Java class loader. So JPHP compiles at runtime and can load classes on demand rather than all at once.

Slow startup and how it was solved


The engine also makes it possible to write classes in Java itself. However, it is not that simple: every method of such a class must have the required signature and carry some auxiliary annotations (for example, for type hinting). Here is an example of such a class:

The php\lang\System class

    import php.runtime.Memory;
    import php.runtime.env.Environment;
    import php.runtime.lang.BaseObject;
    import php.runtime.reflection.ClassEntity;

    import static php.runtime.annotation.Reflection.*;

    @Name("php\\lang\\System")
    final public class WrapSystem extends BaseObject {
        public WrapSystem(Environment env, ClassEntity clazz) {
            super(env, clazz);
        }

        @Signature
        private Memory __construct(Environment env, Memory... args) {
            return Memory.NULL;
        }

        @Signature(@Arg("status"))
        public static Memory halt(Environment env, Memory... args) {
            System.exit(args[0].toInteger());
            return Memory.NULL;
        }
    }


As the number of native classes for JPHP grew, I noticed that the time spent registering classes grew too. The more extensions and classes there were, the longer the delay before the engine started. That bothered me. Then I had an idea for how to fix it.

PHP as a language has a lazy class loading mechanism, as everyone knows. I simply used the same mechanism for registering native classes. When a native Java class is registered, it is only entered into a table of names; the actual registration happens the first time the class is used. With this in place I got good results: engine initialization time dropped by a factor of 2-3, and the test run time dropped from 24 seconds to 13 seconds. Thanks to this, the number of native classes has almost no effect on engine initialization time.
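The lazy scheme can be sketched in Java as a registry that stores only a name and a loader at registration time. The class and member names are invented for the example, and a plain Object stands in for whatever JPHP builds when it really registers a class.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of lazy registration of native classes.
final class LazyClassRegistry {
    private final Map<String, Supplier<Object>> pending = new HashMap<>();
    private final Map<String, Object> resolved = new HashMap<>();
    int resolveCount = 0; // exposed only to make the laziness observable

    // Cheap: records the name and how to build the entity, nothing more.
    void register(String name, Supplier<Object> loader) {
        pending.put(key(name), loader);
    }

    // Expensive work happens here, once, on first use.
    Object lookup(String name) {
        String key = key(name);
        Object entity = resolved.get(key);
        if (entity == null) {
            Supplier<Object> loader = pending.remove(key);
            if (loader == null) {
                throw new IllegalArgumentException("Unknown class: " + name);
            }
            entity = loader.get();
            resolved.put(key, entity);
            resolveCount++;
        }
        return entity;
    }

    // PHP class names are case-insensitive.
    private static String key(String name) {
        return name.toLowerCase(Locale.ROOT);
    }
}
```

Startup then costs one map insertion per class, and classes that a script never touches are never fully registered.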

Engine startup speed is especially important in GUI applications.

Problems with the JVM


1. Naming of JVM classes. The JVM enforces naming rules for classes. If a class in the bytecode is given a path, the JVM checks that the name follows the same conventions as in the Java language. This is somewhat like the PSR-0 standard. However, if the class is placed in the JVM's global package, the check does not happen. A PHP file can hold any number of classes and functions, and they can have arbitrary names. So inside the bytecode I had to decouple PHP class names from JVM class names. But that is not the only reason for this choice ...

2. Unique class names. JPHP must be able to save the generated bytecode to a file and load it in any environment, so all classes must have unique names at the JVM level to avoid conflicts. The class name cannot be changed while JVM bytecode is being loaded; at least, I have not tried that yet. For now, as a workaround, I generate a random JVM class name based on a UUID plus a few other things. I do not consider this very elegant and hope to find something better in the future. The name of the file containing the class cannot be used, because the code may not come from a file at all, and a bytecode file may be moved from machine to machine and renamed.

3. Limitations of reflection. Java reflection cannot call a method in the context of a parent class, that is, something like super.call() in Java or parent:: in PHP. Java 7 did introduce invokeDynamic, which makes this possible, but surprisingly it is slower than reflection. I rejected invokeDynamic primarily because of its poor performance in Java 7; the problem was fixed in Java 8, where the two are now equal in speed (or maybe I am just not cooking it right?). In addition, I wanted Java 6 support and an easier port to Android, which I suspect has no invokeDynamic but does have reflection.

My solution is not very elegant: I had to abandon the standard JVM mechanism for overriding methods, so inherited methods get different names at the JVM class level, built as method_name + $ + index. This solution has not caused any problems so far, and it solves the problem described above.
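Both the limitation and the workaround can be reproduced with plain Java reflection. The sketch below is illustrative only: the class names are invented, and the suffixes $0 and $1 merely follow the method_name + $ + index pattern described above, without claiming to match the indexes JPHP actually assigns.

```java
import java.lang.reflect.Method;

// Ordinary overriding: reflection always dispatches to the most specific override.
class ParentPlain { public String hello() { return "parent"; } }
class ChildPlain extends ParentPlain { @Override public String hello() { return "child"; } }

// Workaround: each level of the hierarchy gets its own JVM method name,
// so nothing is overridden and the parent's version stays reachable.
class ParentRenamed { public String hello$0() { return "parent"; } }
class ChildRenamed extends ParentRenamed { public String hello$1() { return "child"; } }

class ReflectionDemo {
    static String invoke(Class<?> declaring, String methodName, Object target) {
        try {
            Method method = declaring.getMethod(methodName);
            return (String) method.invoke(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Looking the method up on the parent class does not help: still virtual dispatch.
        System.out.println(invoke(ParentPlain.class, "hello", new ChildPlain()));      // prints child
        // With distinct names, a parent:: call can target the parent's method directly.
        System.out.println(invoke(ParentRenamed.class, "hello$0", new ChildRenamed())); // prints parent
    }
}
```

With this naming, the engine itself has to decide which method a PHP call resolves to, since the JVM's own override mechanism is no longer involved.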

How Traits were implemented


Traits are a multiple-inheritance mechanism introduced in PHP 5.4. In effect they work like copy-paste. JPHP does the same, except that it copies compiled bytecode rather than the AST. The JVM has no traits, of course, so JPHP compiles traits into ordinary JVM classes and enforces the trait restrictions itself; for example, it does not allow instantiating a trait or inheriting from one.

As a result, you can reuse a trait compiled to JVM bytecode without having its source code. There is nothing hard about copying at the JVM level; the ASM library handles it easily. The only catch is that in some places the bytecode has to be slightly different from that of regular classes. This is the case, for example, with the __CLASS__ constant in traits.

    trait Foobar {
        function test() {
            echo __CLASS__;
        }
    }

    class A {
        use Foobar;
    }

    $a = new A();
    $a->test();


Normally JPHP replaces the __CLASS__ constant at compile time with a string holding the name of the class that contains the code. This cannot be done for traits, so when the constant occurs in a trait its value has to be computed dynamically at run time. The __TRAIT__ constant in traits, however, works the same way as __CLASS__ does in classes. The same issue arises with the self and self::class expressions. Copying properties is simple enough that there is no point in describing it.

Abandoning the Zend runtime libraries


By libraries I mean extensions written in C against the Zend API, including the standard PHP functions. For about a month I implemented them: functions for strings, arrays, and so on, just as in PHP. The most annoying part was that for some functions, even after reading the description two or three times, I could not work out what would happen for different combinations of arguments or what result to expect. PHP functions are too universal: a huge amount of functionality is crammed into a single function, which makes such functions very hard to implement.

At some point I realized that I was not implementing these functions at a level that would let, say, WordPress or Symfony run on JPHP. And outsiders would see the project roughly like this:

A conservative developer: "Why would I need JPHP if I cannot run WordPress, Symfony, Yii, or another well-known project on it? Once you implement all the Zend libraries, I will think about it. In the meantime, I would rather look at HHVM."

A progressive developer: "You implemented JPHP and reproduced the whole crooked and ugly PHP runtime, with all its inconsistent functions and classes. Why did you have to spend time on that?"


I realized that giving up the Zend runtime was a very good idea. PHP is often criticized for its crooked runtime and its inconsistent functions. So I decided to write my own runtime. I think active developers who like to try new things and experiment will not turn away from the project.

New functions and classes to replace the Zend runtime


I decided to put all the core classes and functions that ship with JPHP out of the box into a separate namespace, php, so as not to clutter the global namespace. Among them are:

  1. php\lang\Module - a mechanism for loading source code without include and require. It lets you load a file (like include) without executing it immediately, only when the programmer asks for it. The class also lets you find out which classes and functions the file contains, and it will be able to load source code from any Stream object.
    With class autoloading available, I see no reason to use include and require anywhere except in the class loader handler.
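  2. php\io\Stream - an object-oriented counterpart to fopen, fwrite, .. with typehinting on Stream.
  3. php\lang\Thread, php\lang\ThreadGroup - classes for working with threads.
  4. php\io\File - a class for working with files, similar to File in Java.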
  5. Java , , , , java Java
  6. php\lang\Environment - isolated environments for code execution. Supported options: HOT_RELOAD for hot-swapping code and CONCURRENT for using the same environment from several threads.


This list is not complete yet; I have not had time to think it through properly, which is why there are so few classes.

JPHP testing


Many probably wonder how a project this complex can be developed single-handedly without things breaking along the way. Unit tests solve this problem almost 100%. A programming language engine is very easy to test: it is immediately clear what to test and how.

At first I wrote my own simple tests, because at that point I could not run the complex Zend language tests. As JPHP developed, I gradually started to bring in the Zend tests that ship with the PHP source code. They are not perfect either; sometimes I had to edit them because they use unrelated functions. For example, the test for set_error_handler uses the fopen function. In my opinion this is quite wrong: why should an extension function take part in a test for one of the basic parts of the language? A typical Zend unit test consists of several sections, for example:

Unit test for traits

    --TEST--
    Use instead to solve a conflict.
    --FILE--
    <?php
    error_reporting(E_ALL);

    trait Hello {
        public function saySomething() {
            echo 'Hello';
        }
    }

    trait World {
        public function saySomething() {
            echo 'World';
        }
    }

    class MyHelloWorld {
        use Hello, World {
            Hello::saySomething insteadof World;
        }
    }

    $o = new MyHelloWorld();
    $o->saySomething();
    ?>
    --EXPECTF--
    Hello


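The sections are: --TEST-- is the description of the test, --FILE-- is the code to run, and --EXPECTF-- is the output the code is expected to produce.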
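Splitting such a test file into its sections takes only a few lines. The Java sketch below is my own illustration of the idea, not the runner used by JPHP or by PHP itself: it treats any line of the form --NAME-- as a section marker and collects the text that follows it. A real runner would also have to interpret the format placeholders that --EXPECTF-- allows, which this sketch does not do.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: split a .phpt-style test into its named sections.
final class PhptParser {
    static Map<String, String> parse(String text) {
        Map<String, String> sections = new LinkedHashMap<>();
        String current = null;
        StringBuilder body = new StringBuilder();

        for (String line : text.split("\\R", -1)) {
            if (line.matches("--[A-Z_]+--")) {
                // A marker line closes the previous section and opens a new one.
                if (current != null) {
                    sections.put(current, body.toString().trim());
                }
                current = line.substring(2, line.length() - 2);
                body.setLength(0);
            } else if (current != null) {
                body.append(line).append('\n');
            }
        }
        if (current != null) {
            sections.put(current, body.toString().trim());
        }
        return sections;
    }
}
```

A test runner built on this would execute the FILE section with the engine, capture its output, and compare it with the EXPECTF section.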


With the help of the Zend tests I was able to fix a huge number of bugs and inconsistencies with the PHP language, especially in OOP (believe me, there is a lot of non-trivial behavior there). Traits were also implemented by bringing in the Zend tests, so when I write that a feature has been implemented, it means that it passes both the JPHP and the Zend unit tests.

Android and Dalvik


Dalvik is a completely different kind of virtual machine from the JVM: it executes a different bytecode format and is register-based rather than stack-based. JPHP is a compiler for the JVM, so the bytecode it produces is of course incompatible with Dalvik. However, Google kindly provides developers with a utility that converts JVM bytecode into Dalvik bytecode. There is also the Ruboto project from JRuby, which can serve as a reference for where to go.

Android is certainly a promising direction, but it is not yet time to port JPHP to this platform. Only when the project reaches a stable version 1.0 will that make sense to me.

Where to go next?


WEB
Yes, it is possible. JPHP lets you write a web server entirely in PHP, just as is done in Node.js, Ruby, and other languages. It will also offer, out of the box, a flexible and configurable hot-reload mode for swapping code on the fly.

JPHP will let you write very fast servers by providing mechanisms for sharing memory between requests. This will make it possible to write projects of a completely different kind in PHP. If you have heard of Phalcon, this is something similar, except that Phalcon is written in C. JPHP will let you write such a framework, with complex logic and high performance, in PHP rather than as a complicated C or C++ extension. In a pinch, you can write a Java extension for the bottlenecks, which is much easier than writing an extension in C or C++.

GUI
Yes, many people write off the desktop because everything is moving to the web. But this area is still relevant and in demand. JPHP already lets you write GUI applications: there is an extension for Swing, and in the future there may be one for JavaFX, which supports HTML5 and CSS3.

Android
Of course, this is still a distant prospect, but it is there. In addition, JPHP can probably already run on ARM devices that have an Oracle VM, for example the Raspberry Pi.

Conclusion


PHP turned out to be quite a capricious patient, but in the end the operation was a success =). I understand that many outside developers dislike this language and that it is often criticized, but it is worth looking at it from another angle. I use PHP myself when I need to quickly build a prototype or a front end for something, and I often use it to write functional tests.

Thank you for your attention.

Source: https://habr.com/ru/post/218021/

