Introduction to compilers, interpreters and JITs

Ever since PHP 7 arrived, there has been no end to the talk about abstract syntax trees, just-in-time compilers, static analysis and so on. But what do all these terms mean? Are they magical properties that make PHP much faster? And if so, how does it all work? In this article we look at the basics of how programming languages work and explain what has to happen before the computer can run, for example, your PHP script.

Interpreting the code

But before we talk about how this all works, let's look at one simple example. Imagine that we have a new programming language (think up any name). The language is pretty simple:

each line is an expression
each expression consists of a command (operator)
and any number of values (operands) that the command operates on.

Example:

set a 1
set b 2
add a b c
print c

This is a simple language, so we can safely assume that this code just prints 3 to the screen. The set operator takes a variable and assigns a number to it (just like $a = 1 in PHP). The add operator takes two variables, adds them and stores the result in a third. The print operator displays a variable on the screen.
Now let's write a program that reads each “expression”, finds the operator and operands, and then does something with them, depending on the particular operator. This is pretty simple to implement in PHP, as you can see in Listing 1.

Listing 1

    <?php

    // Read the program: one expression per line.
    $lines = file($argv[1]);
    $vars = [];

    $linenr = 0;
    foreach ($lines as $line) {
        $linenr++;
        // Split the line into the command and its operands.
        $operands = explode(" ", trim($line));
        $command = array_shift($operands);

        switch ($command) {
            case 'set':
                $vars[$operands[0]] = $operands[1];
                break;
            case 'add':
                $vars[$operands[2]] = $vars[$operands[0]] + $vars[$operands[1]];
                break;
            case 'print':
                print $vars[$operands[0]] . "\n";
                break;
            default:
                throw new Exception(sprintf("Unknown command in line %s\n", $linenr));
        }
    }

This is a very simple program, and nobody expects you to write your next web application in the new language. But the example shows how easy it is to create a new language and a program that can read and execute it. In our case, the program reads the source file line by line and executes code depending on the current operator. To run the application we do not need to convert it to assembler or binary code; it works fine as it is. This method of execution is called interpretation. Basic programs, for example, are often executed this way: each expression is read and immediately executed.

But there are a number of problems. One of them is that, while such a language processor is quite easy to write, it executes the new language very slowly. After all, for each line we have to check:

Which operator needs to be executed?
Is it a valid operator?
Does it have the right number of operands?

And there is more to check. For example, can the set operator assign only numeric values to variables, or strings too? Or even the values of other variables? To process each expression correctly, all of these questions have to be answered on every line. What happens if you write set 1 4? In short, it is almost impossible to create fast-running applications this way.
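To make these per-line checks concrete, here is a minimal sketch of a validator for the toy language. The ARITY table, the validateLine() name and the rule that set needs a variable name as its first operand are assumptions made for illustration; the article does not define them.

```php
<?php
// Hypothetical arity table for the toy language: command => operand count.
const ARITY = ['set' => 2, 'add' => 3, 'print' => 1];

// Returns null when the line is valid, or an error message otherwise.
function validateLine(string $line): ?string
{
    $operands = preg_split('/\s+/', trim($line));
    $command = array_shift($operands);

    // Which operator is it, and is it a valid one?
    if (!array_key_exists($command, ARITY)) {
        return "unknown command '$command'";
    }
    // Does it have the right number of operands?
    if (count($operands) !== ARITY[$command]) {
        return "'$command' expects " . ARITY[$command] . ' operands, got ' . count($operands);
    }
    // Assumption: the first operand of 'set' must be a variable name, not a number.
    if ($command === 'set' && !preg_match('/^[a-z]\w*$/i', $operands[0])) {
        return "'{$operands[0]}' is not a valid variable name";
    }
    return null;
}
```

An interpreter that works line by line has to repeat this work every time a line is executed, which is part of the reason it is slow.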

But despite being slow, interpretation has an advantage: we can run the program immediately after every change. Attentive readers will notice: when I change something in a PHP script, I can execute it right away and see the changes; does this mean PHP is an interpreted language? For now, let's assume it does. A PHP script is interpreted like our hypothetical simple language. But we will come back to this in the following sections!

Transcompiling

How can we make our program "work fast"? There are different ways to do this. One of them, developed at Facebook, is called HipHop (I mean the “old” HipHop system, not the HHVM used today). HipHop converted one language (PHP) into another (C++). The result could then be turned into binary code by a C++ compiler, and binary code is something the computer can understand and execute without the extra overhead of an interpreter. As a result, a HUGE amount of computational resources is saved and the application runs much faster.

This method is called source-to-source compiling, or transcompiling, or even transpiling. In fact, it is not compiling into binary code, but a conversion to something that can be compiled into machine code by existing compilers.

Transcompiling lets you execute binary code directly, which improves performance. However, the method has a downside: before we can run the code, we first have to transcompile it and then do a real compilation. But this only needs to happen when the application changes, that is, only during development.

Transcompiling is also used to make “hard” languages simpler and more dynamic. For example, browsers do not understand code written in LESS, SASS or SCSS. But it can be transcompiled into CSS, which browsers do understand. Code in these languages is easier to maintain than plain CSS, at the price of an extra transformation step.
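As an illustration of source-to-source conversion, here is a minimal sketch that transcompiles the toy language from the first section into PHP source text. The function names transpileLine() and transpile() are hypothetical, and the sketch assumes well-formed input with integer values.

```php
<?php
// Translates one line of the toy language into one line of PHP source.
function transpileLine(string $line): string
{
    $operands = preg_split('/\s+/', trim($line));
    $command = array_shift($operands);

    switch ($command) {
        case 'set':
            return sprintf('$%s = %d;', $operands[0], $operands[1]);
        case 'add':
            return sprintf('$%s = $%s + $%s;', $operands[2], $operands[0], $operands[1]);
        case 'print':
            return sprintf('print $%s . "\n";', $operands[0]);
        default:
            throw new InvalidArgumentException("Unknown command: $command");
    }
}

// Translates a whole toy program, skipping empty lines.
function transpile(string $source): string
{
    $lines = array_filter(array_map('trim', explode("\n", $source)), 'strlen');
    return implode("\n", array_map('transpileLine', $lines));
}

echo transpile("set a 1\nset b 2\nadd a b c\nprint c"), "\n";
```

The output is ordinary PHP, so everything that already exists for PHP can take over from there. HipHop followed the same idea with PHP as the input and C++ as the output.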

Compiling

To make everything work even faster, we need to get rid of the transcompiling stage. That is, compile our language straight into binary code that can be executed immediately, without the extra work of interpretation or transcompiling.

Unfortunately, writing a compiler is one of the most difficult tasks in computer science. For example, when compiling into binary code, you need to consider which computer it will run on: 32-bit Linux, 64-bit Windows, or perhaps OS X. An interpreted script, by contrast, can easily be run anywhere. With PHP, we do not need to worry about where our script is executed. There may be code written for a specific OS, which makes it impossible to run the script on other systems, but that is not the interpreter's fault.

But even if we get rid of the transcompiling stage, we cannot escape compiling. For example, large programs written in C (a compiled language) can take almost an hour to compile. Imagine that you wrote an application in PHP and had to wait another ten minutes before you could see whether your changes work.

Using the best of both worlds

If interpretation means slow execution, and compiling is difficult to implement and takes more time to develop, how do languages like PHP, Python or Ruby work? They are pretty quick!

This is because they use both interpretation and compilation. Let's see how it turns out.

What if we could transform our fictional language not directly into binary code, but into something very similar to it (this is called “bytecode”)? And what if this bytecode were so close to how the computer works that it could be interpreted very quickly (for example, millions of bytecode instructions per second)? This would make our application almost as fast as a compiled one, while retaining all the advantages of interpreted languages. Most importantly, we would not have to compile scripts by hand after every change.

It looks very tempting. In fact, many languages work in a similar way - PHP, Ruby, Python and even Java. Instead of reading and interpreting lines of source code one by one, these languages use a different approach:

Step 1. Read the entire script (PHP) into memory.
Step 2. Completely convert / compile the script into bytecode.
Step 3. Run the bytecode using the interpreter (PHP).

In fact, there are more steps, and in reality the whole process is much more complicated. But in general, these three steps are enough to run the script from the command line or to execute the request through your web server.

The process can be easily optimized: suppose we run a web server and each request executes the script index.php. Why load it into memory every time? It is better to cache the file so that it can be converted quickly on each request.

Another optimization: once the bytecode has been generated, we can reuse it for all subsequent requests. So it can be cached (the important part is making sure the bytecode is recompiled when the source file changes). This is what opcode caches do, like the OPcache extension in PHP: they cache compiled scripts so that they can be executed quickly on subsequent requests without being loaded and compiled into bytecode all over again.
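The caching idea can be sketched in a few lines. This is only a loose analogy and not how OPcache is implemented: the CompileCache class, its fields and the use of the file modification time as the invalidation signal are assumptions chosen for illustration.

```php
<?php
// Hypothetical in-memory cache: path => ['mtime' => int, 'compiled' => mixed].
final class CompileCache
{
    private array $entries = [];
    public int $compilations = 0;

    // $compile is any callable that turns source text into a "compiled" form.
    public function get(string $path, callable $compile)
    {
        clearstatcache(true, $path);
        $mtime = filemtime($path);

        // Recompile only when the file is new to us or has changed on disk.
        if (!isset($this->entries[$path]) || $this->entries[$path]['mtime'] !== $mtime) {
            $this->entries[$path] = [
                'mtime' => $mtime,
                'compiled' => $compile(file_get_contents($path)),
            ];
            $this->compilations++;
        }
        return $this->entries[$path]['compiled'];
    }
}
```

Calling get() twice for an unchanged file runs the compile step once; changing the file triggers a recompilation on the next call.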

Finally, the last step towards high speed is the execution of the bytecode by our PHP interpreter. In the next section we will compare this with ordinary interpreters. To avoid confusion: such a bytecode interpreter is often called a “virtual machine”, because to a certain extent it imitates the work of a machine (a computer). Do not confuse this with the virtual machines that run on computers, like VirtualBox or VMware. We are talking about things like the JVM (Java Virtual Machine) in the Java world and HHVM (HipHop Virtual Machine) in the PHP world. Python and Ruby have their own virtual machines. In a way, they are all highly specialized and efficient bytecode interpreters.

Each VM executes its own bytecode generated by a specific language, and they are incompatible with each other. You cannot execute PHP bytecode on a Python VM, and vice versa. However, it is theoretically possible to create a program that compiles PHP scripts into bytecode, which will be understandable by the Python VM. So in theory, you can run PHP scripts in Python (a serious challenge!).

Bytecode

What does bytecode look like, and how does it work? Consider two examples. Take the PHP code:

    $a = 3;
    echo "hello world";
    print $a + 1;

You can view its bytecode using 3v4l.org or by installing the VLD extension. We get the following:

Now take a similar example in Python:

    def foobar():
        a = 1
        print "hello world",
        print a + 4

Python can generate the operation codes directly with dis.dis(func):

We have two simple scripts and their bytecode. Note that the bytecode resembles the language we “created” at the beginning of the article: each line is an operator with any number of operands. In PHP bytecode, variables are prefixed with !, so !0 means variable 0. The bytecode does not care that you called the variable $a: during compilation, variable names lose their meaning and are converted to numbers. This makes them easier and faster for the virtual machine to process. Most of the necessary “checks” are performed at the compilation stage, which also takes load off the virtual machine and makes it faster.

Since bytecode consists of simple instructions, interpreting it is very fast. Instead of the thousands of binary instructions that have to be processed for each expression of an interpreted language, each bytecode expression takes a few hundred instructions (sometimes even fewer). That is why virtual machines run much faster than plain interpreters.

In other words, the virtual machine takes the best of both worlds. Although we still need to compile from source code to bytecode, this process is fast and transparent. And once the bytecode is there, the virtual machine interprets it quickly and efficiently, without unnecessary overhead. As a result, we get a high-performance application.
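The same split between compiling and executing can be sketched for the toy language from the first section. The opcode constants, the instruction layout and the function names compileToy() and runToy() are all made up for this illustration; real PHP bytecode is far richer.

```php
<?php
// Hypothetical opcodes for the toy language.
const OP_SET = 0;
const OP_ADD = 1;
const OP_PRINT = 2;

// Compile step: parse and check every line once, and replace variable
// names with numeric slots. Returns [instructions, number of slots].
function compileToy(string $source): array
{
    $slots = [];
    $slot = function (string $name) use (&$slots): int {
        if (!isset($slots[$name])) {
            $slots[$name] = count($slots);
        }
        return $slots[$name];
    };

    $code = [];
    foreach (preg_split('/\R/', trim($source)) as $nr => $line) {
        $operands = preg_split('/\s+/', trim($line));
        $command = array_shift($operands);
        switch ($command) {
            case 'set':
                $code[] = [OP_SET, $slot($operands[0]), (int) $operands[1]];
                break;
            case 'add':
                $code[] = [OP_ADD, $slot($operands[0]), $slot($operands[1]), $slot($operands[2])];
                break;
            case 'print':
                $code[] = [OP_PRINT, $slot($operands[0])];
                break;
            default:
                throw new RuntimeException(sprintf('Unknown command in line %d', $nr + 1));
        }
    }
    return [$code, count($slots)];
}

// Execution step: a tight loop over instructions that were already checked.
function runToy(array $code, int $slotCount): string
{
    $vars = array_fill(0, $slotCount, 0);
    $out = '';
    foreach ($code as $ins) {
        switch ($ins[0]) {
            case OP_SET:   $vars[$ins[1]] = $ins[2]; break;
            case OP_ADD:   $vars[$ins[3]] = $vars[$ins[1]] + $vars[$ins[2]]; break;
            case OP_PRINT: $out .= $vars[$ins[1]] . "\n"; break;
        }
    }
    return $out;
}

[$code, $slotCount] = compileToy("set a 1\nset b 2\nadd a b c\nprint c");
echo runToy($code, $slotCount);
```

After compilation the names a, b and c no longer exist: the add instruction only refers to slots 0, 1 and 2, much like !0 in the PHP bytecode described above.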

From source code to bytecode

Now that we can efficiently execute the generated bytecode, there remains the task of compiling the source code into this bytecode.

Consider the following PHP expressions:

    $a = 1;
    $a=1;
    $a
    =
    1;

All of them are equally valid and must be converted to the same bytecode. But how do we read them? After all, our own interpreter parses commands by splitting them on spaces. This means the programmer must always write code in the same style, unlike PHP, where you can use tabs or spaces, put braces on the same line or on the next one, and so on. A compiler will therefore first try to convert your source code into tokens. This process is called lexing or tokenization.

Lexing

Tokenization (lexing) means converting the PHP source code, without understanding its meaning, into a long list of tokens. This is a complex process, but in PHP you can do something quite similar yourself. The code in Listing 2 yields the following result:

    T_OPEN_TAG <?php
    T_VARIABLE $a
    T_WHITESPACE
    =
    T_WHITESPACE
    T_LNUMBER 3
    ;
    T_WHITESPACE
    T_ECHO echo
    T_WHITESPACE
    T_CONSTANT_ENCAPSED_STRING "hello world"
    ;

The source string is converted to tokens:

<?php is converted to the T_OPEN_TAG token,
$a is converted to the T_VARIABLE token, which contains the value $a.

The tokenizer recognizes a variable when, reading the code, it encounters a $ sign followed by a letter such as a, after which any number of letters and digits may follow. Numbers are tokenized as T_LNUMBER and can consist of one or more digits. Tokenization lets us represent the source code in a more structured form without forcing the programmer to do it. But, as already mentioned, the tokenizer does not understand the meaning of the tokens. It will happily tokenize both $a = 1 and 1 = $a. In the next part we will learn to parse, that is, to give meaning to the stream of tokens.
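The listing that produced the token dump in this section is not reproduced here. The following is a minimal sketch, not the author's listing, that prints a similar dump with PHP's built-in token_get_all() and token_name() functions; the dumpTokens() helper name is an assumption.

```php
<?php
// Returns one line per token: the token name followed by its text.
function dumpTokens(string $source): array
{
    $lines = [];
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            // Array tokens have the shape [token id, text, line number].
            $lines[] = rtrim(token_name($token[0]) . ' ' . trim($token[1]));
        } else {
            // Single-character tokens such as = and ; are plain strings.
            $lines[] = $token;
        }
    }
    return $lines;
}

echo implode("\n", dumpTokens('<?php $a = 3; echo "hello world";')), "\n";
```

Note that whitespace is a token of its own, which is why differently formatted versions of the same statement differ only in their T_WHITESPACE tokens.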

Parsing

When parsing tokens, we must follow a set of “rules” that make up our language. For example, there may be a rule that the first token in the program must be T_OPEN_TAG (corresponding to <?php).

Another possible rule: an assignment consists of any T_VARIABLE followed by the symbol =, and then a T_LNUMBER, T_VARIABLE or T_CONSTANT_ENCAPSED_STRING. In other words, we allow $a = 1, or $a = $b, or $a = 'foobar', but not 1 = $a. If the parser encounters a series of tokens that does not satisfy any of the rules, a syntax error is generated automatically. In general, parsing is the process that defines a language and allows us to create syntax rules.
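The assignment rule just described can be checked against a token stream in a few lines. This is a sketch of that single rule only, with a hypothetical function name and an added requirement that the statement ends with a semicolon; a real parser is generated from a full grammar.

```php
<?php
// Checks one rule:
// T_VARIABLE '=' (T_LNUMBER | T_VARIABLE | T_CONSTANT_ENCAPSED_STRING) ';'
function isSimpleAssignment(string $statement): bool
{
    $tokens = [];
    foreach (token_get_all('<?php ' . $statement) as $token) {
        $id = is_array($token) ? $token[0] : $token;
        // The opening tag and whitespace carry no meaning for this rule.
        if ($id !== T_OPEN_TAG && $id !== T_WHITESPACE) {
            $tokens[] = $id;
        }
    }

    return count($tokens) === 4
        && $tokens[0] === T_VARIABLE
        && $tokens[1] === '='
        && in_array($tokens[2], [T_LNUMBER, T_VARIABLE, T_CONSTANT_ENCAPSED_STRING], true)
        && $tokens[3] === ';';
}

var_dump(isSimpleAssignment('$a = 1;'));  // accepted by the rule
var_dump(isSimpleAssignment('1 = $a;'));  // rejected by the rule
```

Because whitespace tokens are dropped before the rule is applied, all differently spaced spellings of the same assignment are treated alike.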

See the list of rules used in PHP at . If your PHP script satisfies the syntax rules, additional checks are performed to confirm that the code is not only syntactically correct but also meaningful: the definition public abstract final final private class foo() {} may pass the syntax rules, but it makes no sense from PHP's point of view.

Tokenization and parsing are tricky processes, and third-party tools are often used to perform them. Tools like flex and bison are popular choices (PHP itself relies on bison for its parser and on re2c, a flex-like tool, for its lexer). They can also be regarded as transcompilers: they turn your rules into C code, which is compiled automatically when you compile PHP.

Parsers and tokenizers are useful in other areas as well. For example, databases use them to parse SQL expressions, and quite a few parsers and tokenizers are written in PHP itself. The Doctrine object-relational mapper has its own parser for DQL expressions, as well as a “transcompiler” for converting DQL to SQL. Many template engines, including Twig, use their own tokenizers and parsers to “compile” template files into PHP scripts. In fact, these engines are transcompilers too!

Abstract syntax tree

After tokenizing and parsing our language, we can generate bytecode. Up to PHP 5.6, bytecode was generated during parsing. But it is more common to add a separate stage to the process: the parser generates not bytecode but a so-called abstract syntax tree (AST). This is a tree structure that represents the entire program in an abstract form. An AST not only simplifies bytecode generation, it also lets us make changes to the tree before it is transformed. The tree is always built according to fixed rules. A tree node that represents an if statement, for instance, always has three elements under it:

the first contains the condition (like $a == true);
the second contains the expressions to execute if the condition is true;
the third contains the expressions to execute if the condition is false (the else part).

Even if the else is missing, there are still three elements; the third one is simply empty.
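The fixed shape of such a node can be sketched as a small class. The IfNode name and its fields are hypothetical and only mirror the three elements listed above; the real engine uses its own internal structures.

```php
<?php
// Hypothetical AST node for an if statement: always three children.
final class IfNode
{
    public $condition;   // expression node, e.g. "$a == true"
    public array $then;  // statements to run when the condition is true
    public array $else;  // statements to run otherwise; empty without an else

    public function __construct($condition, array $then, array $else = [])
    {
        $this->condition = $condition;
        $this->then = $then;
        $this->else = $else;
    }

    // The node always exposes three children, even without an else branch.
    public function children(): array
    {
        return [$this->condition, $this->then, $this->else];
    }
}

$node = new IfNode('$a == true', ['echo 1;']);
var_dump(count($node->children()));
```

Keeping the shape fixed means that code walking the tree never has to ask whether the third element exists; it only has to check whether it is empty.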

As a result, we can “rewrite” the program before it is converted to bytecode. Sometimes this is used to optimize the code. If we find that the developer recalculates a variable again and again inside a loop, and we know that the variable always has the same value, the optimizer can rewrite the AST to create a temporary variable that does not have to be recalculated each time. The tree can also be used for small reorganizations that make the code run faster, such as removing unnecessary variables. This is not always possible, but when we have a tree of the entire program, such checks and optimizations are much easier to perform. In an AST you can see whether variables are declared before they are used, or whether an assignment is used in a condition ( if ($a = 1) {} ), and issue a warning when such potentially erroneous constructs are detected. With the help of the tree, you can even analyze the code from a security point of view and warn users during script execution.

All this is called static analysis. It makes it possible to create new features, optimizations and validation systems that help developers write consistent, secure and fast code.

PHP 7.0 introduced a new engine (Zend 3.0) whose parser also generates an AST. Since it is quite new, not much can be done with it yet. But the very fact that it exists means that we can expect various new capabilities in the near future. The token_get_all() function already accepts a new, undocumented TOKEN_PARSE constant, which in the future may be used to return not only tokens but also the parsed AST. Third-party extensions like php-ast let you view and edit the tree directly in PHP. The complete redesign of the Zend engine and the AST implementation open PHP up to a variety of new tasks.
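To show what rewriting a tree before code generation can look like, here is a minimal sketch of constant folding, which is a simpler optimization than the loop example above. The tree format (an integer literal, a variable name as a string, or an array of operator, left and right) and the foldConstants() name are assumptions for this illustration.

```php
<?php
// Hypothetical AST: an int literal, a variable name string,
// or a three-element array [operator, left, right].
function foldConstants($node)
{
    if (!is_array($node)) {
        return $node; // literal or variable: nothing to fold
    }

    [$op, $left, $right] = $node;
    $left = foldConstants($left);
    $right = foldConstants($right);

    // Both sides are known at compile time, so compute the result now.
    if ($op === '+' && is_int($left) && is_int($right)) {
        return $left + $right;
    }
    return [$op, $left, $right];
}

// $a + (2 + 3): the inner addition is folded, the outer one is kept.
var_dump(foldConstants(['+', '$a', ['+', 2, 3]]));
```

The bytecode generated from the rewritten tree then contains one addition instead of two.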
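A very small static check can even be written without a tree. The sketch below flags if conditions that contain a plain assignment, such as if ($a = 1). It works on the token stream rather than on an AST, so it is only a rough approximation of what a real analyzer does, and the function name is hypothetical.

```php
<?php
// Returns the line numbers of if conditions that contain a plain '='.
function findAssignmentsInConditions(string $source): array
{
    $tokens = token_get_all($source);
    $count = count($tokens);
    $hits = [];

    for ($i = 0; $i < $count; $i++) {
        if (!is_array($tokens[$i]) || $tokens[$i][0] !== T_IF) {
            continue;
        }
        // Walk the parenthesized condition that follows the if keyword.
        $depth = 0;
        for ($j = $i + 1; $j < $count; $j++) {
            $token = $tokens[$j];
            if ($token === '(') {
                $depth++;
            } elseif ($token === ')') {
                if (--$depth === 0) {
                    break; // end of the condition
                }
            } elseif ($token === '=' && $depth > 0) {
                // '==' and '===' are separate tokens, so this is an assignment.
                $hits[] = $tokens[$i][2];
                break;
            }
        }
    }
    return $hits;
}

print_r(findAssignmentsInConditions("<?php\nif (\$a = 1) {}\nif (\$b == 2) {}\n"));
```

The check never runs the script; it only inspects its structure, which is what makes it static.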

JIT

Besides virtual machines running highly optimized bytecode generated from an AST, there is another technique for increasing speed. But it is one of the hardest things to implement.

How is an application executed? A lot of time goes into setting it up: for example, you need to boot the framework, parse routes, process environment variables, and so on. After all these procedures are completed, the program is usually still not running.

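This is where JIT (just-in-time) compilation comes in: machine code is produced at runtime, while the program is already running, rather than ahead of time.

HHVM uses a JIT compiler: PHP code (and Hack code) is first converted to bytecode, which HHVM then executes.

PHP 7 itself does not ship with a JIT compiler.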

Source: https://habr.com/ru/post/304748/
