Interpreter bottlenecks

This note is intended for young programmers who have been using interpreted programming languages for some time but have not yet studied how the language itself works.

Nowadays, thanks to fairly good salaries and office-type work, programming has become quite popular among young people. The languages in demand are also the ones that are easy enough to get started with: JavaScript, PHP, Perl, Python, Java, C#, Basic, ... (as you can see, they all belong to the same family: interpreters). As a result, the industry now has a fairly large number of people who never formally studied programming. A programmer was needed for language “X”, so someone bought the book “X in 2 weeks”, and after 3 weeks we are already writing a project in “X”. Then, after a few thousand lines of code, or once the database fills up with real data, the project starts to slow down mercilessly. You can, of course, “go play the drums” until the hardware catches up with your project, but that option is not always available, and not to everyone.

What is usually the main problem? Usually it is a lack of understanding of what actually happens when command “Y” is executed. Programming languages are like human languages: the same thing can be explained in different words. With computers, though, it is better when the explanation is as concise as possible, “better” in the sense of execution speed. Moreover, the brevity has to be at the level of the language the central processor understands, not the one you understand. I mean that the brevity of the name of a function you call in a particular programming language does not affect performance (there are exceptions); performance is affected by what that function does in its depths. That is why it is worth understanding how the computer actually works with numbers, strings, arrays, functions, and so on.
To keep this note from growing too much, I will not describe here what happens at a low level and how; the explanation can also differ from language to language. Detailed information can be found on the Internet or in books, or you can figure it out yourself. If you wish, I can answer your questions in the comments and collect them all in a separate note. For now, I will only explain where the common mistakes come from and what you should pay attention to.

First, let's sort out the classes of programming languages. I would divide them into 3 groups (perhaps that is how they are usually divided anyway):

Assembler
Compilers
Interpreters

Assembler is, by and large, the language of the CPU itself. What the programmer writes in it, although written in a human-readable syntax, stays the same at the output, only in a syntax the processor understands. Some assemblers, in addition to this simple translation, also analyze your code and try to optimize it, but that does not change the essence. If you know how to program in assembler, you know all the nuances of how the hardware works and can therefore implement any task as optimally as possible.

Compilers are more programmer-friendly languages: C, C++, Pascal, ... They are much easier to write in because things like conditional jumps, loops, and work with variables and functions are built into the syntax of the language. As a result, you no longer need to write a lot of CPU instructions to implement a complex loop. On top of that, they provide various constructs that the CPU (central processing unit) knows nothing about but that make it much easier to structure the logic of a program (classes, objects, records, arrays, ...). During compilation the program is translated into the language understood by a specific CPU. Since assembler and the language understood by the CPU are essentially the same, a compiled program can always be translated into assembler (disassembly). Translating a compiled program back into a compiled language is a much harder task, because some constructs of the processor's language cannot always be mapped cleanly onto the simplified syntax of compiled languages. In addition, all names of variables and functions are lost during compilation, and they cannot be restored (except when the program was compiled in debug mode).

Interpreters are the highest stage of this evolution: JavaScript, PHP, Perl, Python, Java, C#, Basic, ... Their distinguishing feature is that an application can potentially be independent of the platform it runs on.

Programs written in assembler can only run on the type of processor they were written for, since they are written in the instructions of that processor and other processors simply will not understand them.

Programs written in compiled languages work only on the platforms they were compiled for. In theory a program can be compiled for different platforms, but in practice, if that is needed, the peculiarities of every target platform have to be taken into account while the program is still being written.

Programs written in interpreted languages are executed by an intermediate layer, a program that reads your code in real time and translates it into the language the CPU understands. As a result, the developers of the interpreter have taken over the question of your application's portability. Now they are the ones who have to build this layer for different systems so that your program works on all of them. But since the systems are quite different, absolute independence cannot always be achieved. If there is a function “Z” under Linux and it does not exist under Windows, you will either have to do without it, or your program will work only under Linux (for example, functions for working with the file system).

The main disadvantage of interpreted languages is their execution speed. It is fairly obvious that a program compiled into the language the CPU understands is executed by the CPU right away, while a program written in an interpreted language must first be parsed and translated into that language, and only then can the CPU start executing it. Modern interpreters have acquired a number of measures against this shortcoming. In addition to a reasonably good optimizer and a caching system, they translate your program into bytecode (either on the fly or in a separate step that resembles compilation). Now the intermediate layer does not need to parse your “handwritten text” every time. It does so either only once or not at all (if the program has already been translated into bytecode). Instead of your “manuscript”, it works with the bytecode of your program. Bytecode is very similar to the language of the CPU, but it is not the language of the CPU (it is more platform-independent). It still has to be translated into the CPU's language. That is why, in my view, the rumors about Java being faster than C++ are noticeably exaggerated, and they will remain so until processors learn to understand Java bytecode.
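To make the idea of bytecode a little more tangible, here is a minimal sketch in Python. It assumes the CPython interpreter and its standard `dis` module; the function `add_tax` is made up for illustration. The exact instructions differ between Python versions, so no output is shown.

```python
import dis


def add_tax(price, rate):
    """A tiny made-up function whose bytecode we want to look at."""
    return price + price * rate


# By the time the function exists, the source text has already been
# translated into bytecode; the raw bytes live on the code object.
raw_bytecode = add_tax.__code__.co_code

# dis.get_instructions decodes those bytes into named instructions.
opnames = [instruction.opname for instruction in dis.get_instructions(add_tax)]

print(type(raw_bytecode).__name__, len(raw_bytecode))
print(opnames)
```

The list of instruction names reads a lot like assembler, but these are instructions for the interpreter's virtual machine, not for the CPU.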

Now, after this short general description of interpreters, I would like to point out 3 topics that can be ignored when writing a small project but that can sometimes give a significant performance boost when they are understood and used correctly.

Language functions that are already compiled

A programming language is not only syntax. It is also a set of ready-made libraries of functions for working with various data and devices. In compiled languages these libraries do not differ much, but among interpreters there is a difference. In some languages, such as Java, these functions are written in the interpreted language itself. In others, such as JavaScript or PHP, they are written in compiled languages, which means that by the time your program runs they are already compiled. Calling them requires no additional processing, so they execute much faster than the same thing written by you in the interpreted language. Therefore, if a built-in function solves your problem, even if it also does something superfluous, try using it instead of writing your own constructions, complex or not. For example, to split a string into substrings by a complex condition, it is better to use a regular expression than to write your own loop that processes the same string by hand.
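The string-splitting example can be sketched in Python as follows. This is only an illustration under a few assumptions: the input string and the separator rule (commas or semicolons with optional spaces) are made up, and the remark about compiled code refers to CPython, where the `re` engine is implemented in C.

```python
import re
import timeit

# Made-up input: fields separated by commas or semicolons,
# with optional whitespace around the separator.
TEXT = "alpha, beta;gamma ;  delta,epsilon"


def split_manual(text):
    """Hand-written loop: every character passes through interpreted code."""
    parts = []
    current = []
    for ch in text:
        if ch in ",;":
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


# Built-in route: the regular expression engine scans the string for us.
SEPARATOR = re.compile(r"\s*[,;]\s*")


def split_builtin(text):
    return SEPARATOR.split(text.strip())


# Both approaches must agree before it makes sense to compare their speed.
assert split_manual(TEXT) == split_builtin(TEXT)

print(timeit.timeit(lambda: split_manual(TEXT), number=2000))
print(timeit.timeit(lambda: split_builtin(TEXT), number=2000))
```

The timings depend on the machine, the interpreter version, and the input, so measure on your own data before drawing conclusions.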

Frameworks: complex inside, easy to use

In addition to libraries of functions, some enthusiasts also try to transform the syntax and logic of a language, introducing ideas of their own that simplify work with structures and/or data (jQuery, LINQ, ORM, ...). If the language is compiled, this is not so scary. In interpreters, however, blindly diving into third-party abstractions is harmful. Yes, such wrappers are often really more convenient, but that convenience is almost always paid for with speed. Just look at the source code of these “helpers” and you will see that it is sometimes much more efficient to call a couple of functions built into the language that do exactly what you need than one universal third-party function that runs a “ton” of code internally before it works out what you want from it and finally does it. For example, in JavaScript, to retrieve all DIVs you can call the built-in function document.getElementsByTagName("DIV") directly, which immediately returns what you need, or you can call the pretty jQuery function $("DIV"), which runs a couple of regular expressions, a few checks, and a “manual” merge of arrays, and only then returns the result.
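The same effect can be shown with a small Python sketch. Everything here is hypothetical: `ELEMENTS` stands in for a page, `by_tag` and `by_id` play the role of built-in functions, and `select` imitates a “universal” helper. It is not how jQuery is actually implemented; it only shows the kind of extra work such a helper has to do before it reaches the direct call.

```python
import re

# Hypothetical stand-in for a page: a flat list of (tag, id) pairs.
ELEMENTS = [("div", "header"), ("p", "intro"), ("div", "content"), ("span", "note")]


def by_tag(tag):
    """Direct call: one pass over the data, nothing else."""
    return [element for element in ELEMENTS if element[0] == tag]


def by_id(element_id):
    """Direct call: look up elements by their id."""
    return [element for element in ELEMENTS if element[1] == element_id]


ID_SELECTOR = re.compile(r"#([\w-]+)")
TAG_SELECTOR = re.compile(r"[\w-]+")


def select(selector):
    """Hypothetical universal helper: it must first work out what was asked."""
    if not isinstance(selector, str):
        raise TypeError("selector must be a string")
    selector = selector.strip()
    id_match = ID_SELECTOR.fullmatch(selector)
    if id_match:
        return by_id(id_match.group(1))
    if TAG_SELECTOR.fullmatch(selector):
        return by_tag(selector.lower())
    raise ValueError(f"unsupported selector: {selector!r}")


# Same result, but select() ran a type check, a strip, and up to two
# regular expression matches before reaching by_tag().
print(select(" DIV ") == by_tag("div"))
```

In a real framework the dispatch logic is far larger, which is why the overhead becomes noticeable when such a call sits inside a loop.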

Working with strings

And finally, the last thing I wanted to draw your attention to is working with strings. In interpreted languages, working with strings has become so transparent that it is not at all obvious that these are among the most resource-intensive operations. This is usually known only to those who have worked with strings by hand at least once in a compiled language (although those languages also have functions that make the work easier). The problem is that almost any string operation (creating a string, concatenating strings, splitting into substrings, deleting a substring, replacing a substring) involves finding a free block of memory of the required length for the new string and copying the resulting data to the new location. Even operations that look simple at first glance, such as searching within a string, are not particularly fast now that complex formats like UTF-8 have arrived, compared with working in ASCII. Therefore, do not overuse strings where you can do without them. Take associative arrays, for example: if you can get by with a numerically indexed array, do so!
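A minimal Python sketch of the points above. It assumes CPython, where strings are immutable; the word list is made up. The first part contrasts repeated concatenation with a single `join`; the second shows that in UTF-8 the number of bytes is not the number of characters.

```python
# Made-up data: a few thousand short pieces to glue together.
WORDS = ["word%d" % i for i in range(5000)]


def build_with_concat(words):
    """Each += may allocate a new string and copy the old contents into it."""
    result = ""
    for word in words:
        result += word
    return result


def build_with_join(words):
    """join works out the final length first and copies each piece once."""
    return "".join(words)


# Both functions produce the same string; only the amount of copying differs.
assert build_with_concat(WORDS) == build_with_join(WORDS)

# In UTF-8 a character may take more than one byte, so byte offsets
# and character positions no longer coincide as they do in ASCII.
ascii_text = "hello"
cyrillic_text = "привет"
print(len(ascii_text), len(ascii_text.encode("utf-8")))
print(len(cyrillic_text), len(cyrillic_text.encode("utf-8")))
```

Some interpreters optimize simple cases of repeated concatenation, so the difference is worth measuring on your own data rather than assuming.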

It is worth noting that on modern processors, in a function that does almost nothing, you may not feel any difference in performance between optimized code and quick-and-dirty code. The difference becomes more obvious where the quick-and-dirty code is executed many times (in a loop, in a frequently called function) or where there is a lot of such code.

Good luck!

Source: https://habr.com/ru/post/102405/
