
annotation
The article reveals the features of high-level optimization of computational algorithms in Java on the example of the cubic matrix multiplication algorithm.
Step 0. Set a reference point!
Decide on the environment:
- Hardware: 1-socket / 2-cores Intel Core 2 Duo T7300 2GHz, 2Gb ram;
- Java: HotSpot (TM) Client VM (build 17.0-b17, mixed mode, sharing);
Consider the simplest cubic matrix multiplication algorithm. Perhaps, any student or even a schoolboy knows him. In addition, the principle of the algorithm is perfectly illustrated by a picture from Wikipedia at the beginning of the article, so we will not dwell on its features.
')
for (int i = 0; i < A.rows(); i++) { for (int j = 0; j < A.columns(); j++) { double summand = 0.0; for (int k = 0; k < B.columns(); k++) { summand += A[i][k] * B[k][j]; } C[i][j] = summand; } } 
As workload, we take two dense square double precision N x N matrices (double precision), where N = 1000.
In this case, the operating time of a simple cubic algorithm: 
18.921 s . We will consider this number as a point of reference (baseline) for subsequent optimizations.
Step 1. Know the rule 20/80!
It is known that 80% of the time works 20% of the code. The task of the developer is to calculate these same 20% of the total code. In our case, everything is obvious - the required 20% are concentrated in the internal computing cycle. From the point of view of the JVM, this code is the most likely candidate for a just-in-time compilation. You can check whether the bytecode is compiled into native by using the options 
- XX: + PrintCompilation -XX: + CITime .
 $ java -XX:+PrintCompilation la4j.Demo 1 java.lang.String::hashCode (64 bytes) 2 java.lang.String::charAt (33 bytes) 3 java.lang.String::indexOf (151 bytes) 4 java.lang.String::indexOf (166 bytes) 5 java.util.Random::next (47 bytes) 6 java.util.concurrent.atomic.AtomicLong::get (5 bytes) 7 java.util.concurrent.atomic.AtomicLong::compareAndSet (13 bytes) 8 java.util.Random::nextDouble (24 bytes) 9 la4j.matrix.DenseMatrix::set (10 bytes) 1% la4j.matrix.MatrixFactory::createRandomDenseMatrix @ 30 (68 bytes) 10 la4j.matrix.MatrixFactory::createRandomDenseMatrix (68 bytes) 2% la4j.matrix.DenseMatrix::multiply @ 81 (152 bytes) 
It can be seen from the listing that the matrix multiplication method (la4j.matrix.DenseMatrix :: multiply) was successfully compiled. This gives us a certain amount of room for further action. First, let's take advantage of the final variables. It is known that when compiled into native code, they are translated directly into values. Theoretically, replacing matrix boundaries with final values ​​would reduce memory access by 1000x1000x1000 times. Instead, the values ​​will be substituted directly. Let's check the theory.
 final int aColumns = A.columns(); final int aRows = A.rows(); final int bColumns = B.columns(); final int bRows = B.rows(); for (int i = 0; i < aRows; i++) { for (int j = 0; j < aColumns; j++) { double summand = 0.0; for (int k = 0; k < bColumns; k++) { summand += A[i][k] * B[k][j]; } C[i][j] = summand; } } 
The running time of the algorithm: 
16.996 s ;
Performance boost: 
~ 11% ;
Step 2. Remember Kesh!
In fact, such an implementation will always work slowly. Iteration over “k” refers to the elements of matrix B in columns. Such reading is very expensive, because the processor has to prefetch the data from the memory each time, instead of taking it ready from the cache.
It is noteworthy that Fortran does not have such a problem. Multidimensional arrays are stored in it in columns. Therefore, the cache will be operated as it should and the regular implementation of the cubic algorithm will work many times faster.
On the platform in question, the first-level cache line size (L1) - 64 bytes - this is the cheapest memory for the processor. In our case, 64 bytes of almost free memory give the potential to receive as many as 8 cells of the matrix (sizeof (double) = 8) in the place of one more followed by the overhead from prefetching.
The problem of the algorithm is that it practically does not use this feature. At the same time, random access to memory (by columns) resets the cache line and initializes the update procedure for the cache line on each access.
Let's try to fix this and implement sequential access to the elements of the matrices in order to get the maximum benefit from the cache. To do this, simply transpose the matrix B and we will refer to its elements in rows.
 final int aColumns = A.columns(); final int aRows = A.rows(); final int bColumns = B.columns(); final int bRows = B.rows(); double BT[][] = new double[bColumns][bRows]; for (int i = 0; i < bRows; i++) { for (int j = 0; j < bColumns; j++) { BT[j][i] = B[i][j]; } } for (int i = 0; i < aRows; i++) { for (int j = 0; j < aColumns; j++) { double summand = 0.0; for (int k = 0; k < bColumns; k++) { summand += A[i][k] * BT[j][k]; } C[i][j] = summand; } } 
The running time of the algorithm: 
7.334 s ;
Performance boost: 
~ 232% ;
Step 3. Think!
The code has become more, however, the operation time has been reduced by almost 2.5 times. Not bad. However, when viewing the code, there are obvious shortcomings. The main one is a lot of cycles. Let's try to reduce the amount of code and make it more elegant by combining the operations of transposition and multiplication into one computational cycle. At the same time, the transposition will be carried out not for the whole matrix but by columns, as they are needed.
 final int aColumns = A.columns(); final int aRows = A.rows(); final int bColumns = B.columns(); final int bRows = B.rows(); double thatColumn[] = new double[bRows]; for (int j = 0; j < bColumns; j++) { for (int k = 0; k < aColumns; k++) { thatColumn[k] = B[k][j]; } for (int i = 0; i < aRows; i++) { double thisRow[] = A[i]; double summand = 0; for (int k = 0; k < aColumns; k++) { summand += thisRow[k] * thatColumn[k]; } C[i][j] = summand; } } 
The running time of the algorithm: 
3.976 s ;
Performance boost: 
~ 190% ;
Step 4. Throw away!
Let's take advantage of one more Java - exceptions. We will try to replace the check for exceeding the bounds of the matrix by a try {} catch {} block. This will reduce the number of comparisons by 1000 in our case. Indeed, why compare 1000 times what will always return false and return true for 1001 times.
On the one hand, we reduce the number of comparisons, on the other hand, additional overhead costs for handling exceptions appear. Anyway, this approach gives some gain.
 final int aColumns = A.columns(); final int aRows = A.rows(); final int bColumns = B.columns(); final int bRows = B.rows(); double thatColumn[] = new double[bRows]; try { for (int j = 0; ; j++) { for (int k = 0; k < aColumns; k++) { thatColumn[k] = B[k][j]; } for (int i = 0; i < aRows; i++) { double thisRow[] = A[i]; double summand = 0; for (int k = 0; k < aColumns; k++) { summand += thisRow[k] * thatColumn[k]; } C[i][j] = summand; } } } catch (IndexOutOfBoundsException ignored) { } 
The running time of the algorithm: 
3.594 s ;
Productivity gain: 
~ 10% ;
Step 5. Do not stop!
At this stage, I have so far stopped, because I have achieved the desired goal. My goal was to overtake the most popular library for working with matrices - JAMA (the first line in Google on the query “java matrix”). Now my implementation works about 30% faster than JAMA. In addition, relatively small optimizations resulted in an increase of almost 
600% , relative to the initial version.
Instead of a conclusion (this is not an advertisement)
The code considered is part of the open-source 
laj4 project. This is a library for solving problems of linear algebra in Java. I started working on the project in the fourth course in the process of studying a course in computational mathematics. Now la4j shows excellent performance and works several times faster than analogs on many tasks.
If someone is interested and wants to spend the evening in a similar way - write in a personal :)