
Java code optimization in Android Marshmallow

Improving system performance and the user experience of applications: these are the directions in which Android is evolving. Android Marshmallow brings many new features and capabilities, in particular serious improvements to the Android Runtime (ART) focused on performance, memory consumption, and multitasking.

A new platform release is out? The Android virtual machine has changed? Either of these events means that developers need to understand the substance of the changes quickly. In particular, they need to understand which techniques that delivered high performance in the past are no longer as effective, and to find new approaches that give better results. Almost no one writes about such subtleties, so developers have to figure them out through trial and error.

Today we will talk about how to write fast, well-behaved programs that take the features of the updated Android Runtime into account. Our tips are aimed at improving performance and at improving the quality of the machine code generated from Java sources. We will also touch on low-level optimizations that do not always depend on the application's Java source code.

Java code, performance enhancement and ART


Android is a system with a rather complex internal structure. One of its components is a compiler that converts Java code into machine instructions, for example for devices based on Intel processors. Android Marshmallow includes a new Optimizing compiler. It optimizes Java applications and produces more efficient code than the Quick compiler used in Android Lollipop.
In Marshmallow, almost all applications are prepared for execution with the Optimizing compiler. However, methods of the Android system framework are still compiled with the Quick compiler. This is done to give Android developers more debugging capabilities.

Accuracy, speed and math libraries


Floating-point calculations can be organized using different implementations of the same operations. For example, Java provides the Math and StrictMath libraries, which offer different levels of accuracy for floating-point calculations. Although StrictMath produces more predictable results, in most cases the Math library is quite sufficient. Here is a method that calculates a cosine:

public float myFunc(float x) {
    float a = (float) StrictMath.cos(x);
    return a;
}

Here we used the corresponding method from the StrictMath library; if some loss of computational accuracy is acceptable for a particular project, Math.cos(x) can be used instead. When choosing a library, keep in mind that the Math class is optimized to use the Android Bionic library on Intel architecture. As a result, operations from Math run up to 3.5 times faster than those from StrictMath.
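The choice between the two libraries can be sketched as a pair of methods. This is a minimal illustration of the API difference, not a benchmark; the 3.5x figure from the text is platform-dependent, and the class and method names here are our own.

```java
public class CosExample {
    // Faster variant: Math.cos may be mapped to an optimized
    // platform implementation (e.g. Bionic on Intel).
    public static float fastCos(float x) {
        return (float) Math.cos(x);
    }

    // Stricter variant: StrictMath.cos guarantees reproducible,
    // fdlibm-defined results on every platform.
    public static float strictCos(float x) {
        return (float) StrictMath.cos(x);
    }
}
```

For typical inputs the two results differ by at most a unit or two in the last place, which is why Math is usually sufficient.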

In some situations, of course, StrictMath is indispensable and replacing it with Math is undesirable. But, again, in most cases Math can be used. Choosing a specific library is not an abstract trade-off between speed and accuracy: you should take the project's requirements into account as fully as possible, from the specifics of the algorithm and its implementation to the hardware platform on which the application is expected to run.

Recursive algorithm support


In Marshmallow, recursive calls are more efficient than in Lollipop. When recursive code runs on Lollipop, data is always loaded from the DEX cache. In Marshmallow, the same data is taken from the incoming argument list rather than reloaded from the cache. Naturally, the deeper the recursion, the more noticeable the performance difference between Lollipop and Marshmallow. Still, if an algorithm implemented recursively can be rewritten iteratively, its performance in Marshmallow will be even higher.
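As a hypothetical example of such a rewrite, here is the same computation (the sum 1 + 2 + ... + n) expressed both recursively and iteratively; the names are ours, not from the article.

```java
public class SumExample {
    // Recursive version: each call passes n through the argument list.
    // Cheaper on Marshmallow than on Lollipop, but it still pays for
    // call overhead and consumes stack space proportional to n.
    public static int sumRecursive(int n) {
        if (n <= 0) {
            return 0;
        }
        return n + sumRecursive(n - 1);
    }

    // Iterative rewrite: no call overhead, no stack-depth limit,
    // and a simple loop is a good target for the Optimizing compiler.
    public static int sumIterative(int n) {
        int total = 0;
        for (int i = 1; i <= n; i++) {
            total += i;
        }
        return total;
    }
}
```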

Eliminating bounds checks for a single array: array.length


Marshmallow's Optimizing compiler can, in some cases, eliminate array bounds checks (Bounds Check Elimination, BCE).

Take a look at the empty loop:

for (int i = 0; i < max; i++) { }

The variable i is called an induction variable (IV). If such a variable is used to index into an array and the loop iterates over all of its elements, the bounds check can be omitted, provided the variable max is explicitly set to a value corresponding to the array's length.

Consider an example in which the code uses a variable size that serves as the upper bound for the induction variable.

int sum = 0;
for (int i = 0; i < size; i++) {
    sum += array[i];
}

In this example, the element index i is compared with the variable size. Suppose size is either defined outside the method and passed in as an argument, or computed somewhere inside the method. In either case, the compiler cannot conclude whether the value of size matches the array's length. Because of this ambiguity, the compiler has to generate a check that i does not go beyond the array bounds; this check ends up in the executable file and runs on every array access.

Let's rewrite the loop so that the compiler removes the bounds check from the executable code.

int sum = 0;
for (int i = 0; i < array.length; i++) {
    sum += array[i];
}

Eliminating bounds checks for multiple arrays


In the previous section, we looked at the simplest optimization case and what must be done for the compiler to generate code without array bounds checks. However, some algorithms process several arrays of the same length in a single loop. For example, if you process two arrays in a loop, the compiler has to emit both a null check and a bounds check for each of them.

Let's take a closer look at this optimization technique as applied to processing several arrays in one loop. Often, to allow the compiler to optimize the loop, the code needs to be rewritten. Here is the original version.

for (int i = 0; i < age.length; i++) {
    totalAge += age[i];
    totalSalary += salary[i];
}

There is a problem with it: the program never checks the length of the salary array, so there is a risk of an ArrayIndexOutOfBoundsException. Such a check should be added, for example in the loop condition.

for (int i = 0; i < age.length && i < salary.length; i++) {
    totalAge += age[i];
    totalSalary += salary[i];
}

The Java code now looks much better, but with this approach the machine code will still contain bounds checks for the arrays.
In this loop, the programmer works with two one-dimensional arrays, age and salary. Although the induction variable is compared against the lengths of both arrays, the condition is compound, and the compiler cannot eliminate the bounds checks from the executable code.

The arrays we access in loops are data structures stored in different memory areas, so the logical operations performed on their properties are independent. Let's rewrite the code, splitting one loop into two.

for (int i = 0; i < age.length; i++) {
    totalAge += age[i];
}
for (int i = 0; i < salary.length; i++) {
    totalSalary += salary[i];
}

After splitting the loops, the Optimizing compiler eliminates the bounds checks for both the age array and the salary array. As a result, the Java code can run three to four times faster.

The optimization was successful, but the function now contains two loops instead of one. This approach can lead to unjustified growth of the source code. Depending on the target architecture, the loop sizes, and the number of similar optimizations, it can also affect the size of the generated executable code.

Methods of multi-threaded programming


In a multithreaded program, care should be taken when working with data structures.

Suppose a program starts four identical threads before entering the loop shown below. Each thread then works with an array of integers called thread_array_sum. Each thread modifies one cell of the array, whose index (the thread's unique integer identifier) is stored in the myThreadIdx variable.

for (int i = 0; i < number_tasks; i++) {
    thread_array_sum[myThreadIdx] += doWork(i);
}

On some hardware architectures, for example the Intel Atom x5-Z8000 processor line, the last-level cache (LLC) is not shared by all processor cores. A split LLC can reduce response time, since each core (or pair of cores) has its own slice of the cache reserved for it. However, cache coherence still has to be maintained. So if a thread running on core A changes data that a thread running on core B uses but never changes, the corresponding lines in core B's cache still have to be updated. This can lead to a drop in performance and problems with scaling across cores.

Because of these cache architecture peculiarities, if several threads write to a single array, there is a risk of a drop in system performance due to a potentially high cache miss rate. In this situation, the programmer should use a local variable to accumulate intermediate results and write the final value to the array only after the calculations are complete. With this approach, the loop above can be transformed:

int tmp = 0;
for (int i = 0; i < number_tasks; i++) {
    tmp += doWork(i);
}
thread_array_sum[myThreadIdx] += tmp;

In this case, the thread_array_sum[myThreadIdx] element is not touched inside the loop. The total produced by the doWork() calls is stored in the array element outside the loop. This greatly reduces the risk of cache misses. A miss can still occur on the single array access in the last line of code, but that is far less likely.
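A runnable sketch of the whole pattern might look like this. The names thread_array_sum, myThreadIdx, and doWork follow the article (converted to Java naming), but the thread setup and the body of doWork are our assumptions.

```java
public class ThreadSumExample {
    // Stand-in for real per-iteration work; the article does not define it.
    static int doWork(int i) {
        return i;
    }

    public static void main(String[] args) throws InterruptedException {
        final int numThreads = 4;
        final int numberTasks = 1000;
        final int[] threadArraySum = new int[numThreads];

        Thread[] threads = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final int myThreadIdx = t;
            threads[t] = new Thread(() -> {
                int tmp = 0; // local accumulator: no shared line touched per iteration
                for (int i = 0; i < numberTasks; i++) {
                    tmp += doWork(i);
                }
                threadArraySum[myThreadIdx] += tmp; // single write to shared memory
            });
            threads[t].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println(threadArraySum[0]); // prints 499500 (sum of 0..999)
    }
}
```

Note that the adjacent cells of threadArraySum can still share a cache line, so the single final write is where the remaining (much smaller) contention lives.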

It is not recommended to use shared data structures inside loops unless the values stored in them must be visible to other threads after each iteration. In such cases, you usually need at least fields or variables declared with the volatile keyword, but that topic is beyond the scope of this article.

Optimization for low-memory devices


Android devices vary widely in the amount of RAM and persistent storage. Java programs should be written so that they can be optimized regardless of the amount of memory. On low-memory devices, code size becomes one of the optimization factors, and this is one of the things ART takes into account.

In Marshmallow, methods larger than 256 bytes are not precompiled, in order to save persistent storage on the device. Therefore, Java applications containing frequently used methods of large size will run in the interpreter, which hurts performance. To achieve a higher level of application performance on Android Marshmallow, implement frequently used code as small methods so that the compiler can fully optimize them.
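One way to follow this advice is to factor the hot path out of a large method into a small helper, as in the hedged sketch below. All names here are hypothetical, and the 256-byte threshold applies to the method's compiled DEX size, not to lines of source.

```java
public class SmallMethodExample {
    // A large "do everything" method risks exceeding the precompilation
    // size threshold and falling back to the interpreter. Keeping the
    // hot loop in its own small method makes it a good AOT candidate.
    public static long processAll(int[] data) {
        long checksum = hotLoop(data); // small, frequently executed
        logResult(checksum);           // cold path kept separate
        return checksum;
    }

    // Small hot method: easy for the Optimizing compiler to precompile.
    private static long hotLoop(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    private static void logResult(long checksum) {
        System.out.println("checksum = " + checksum);
    }
}
```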

Conclusions


Each Android release brings not only a new name but also new system components and technologies. This was true for KitKat and Lollipop, and it also applies to the transition to Marshmallow, which brought significant changes to how applications are compiled.

As in Lollipop, ART uses Ahead-of-Time compilation, and applications are usually converted to native code during installation. However, instead of Lollipop's Quick compiler, Marshmallow uses the new Optimizing compiler. Although in some cases the Optimizing compiler hands work over to the old one, the new compiler is the main subsystem for producing binary code from Java sources on Android.

Each compiler has its own features and optimization tools, so each can generate different machine code depending on how the Java application is written. In this article, we discussed a few key features found in Marshmallow and what developers need to know when building applications for the new platform.

The Optimizing compiler contains many innovations. Some of them are hard to demonstrate with examples, since most of the improvements happen, so to speak, under the hood. What we do know is that as the Android compiler matures, the technologies underlying it become more and more advanced, and it is gradually catching up with other optimizing compilers.

The Android compiler is constantly evolving, and developers can be confident that the code they write will be well optimized, so their applications will delight users with speed and a pleasant experience. We are sure that those who work with such programs will appreciate it.

Source: https://habr.com/ru/post/300934/

