For those who studied mathematics or programming at university, this article will probably contain nothing new, but I became curious to measure the speed of different algorithms. It can also be read as a kind of optimization guide, though such optimization should be carried out only when it is really necessary: the readability of the code crumbles before your eyes, and debugging becomes much harder.
Most readers will probably be too lazy to read the whole article, so I suggest at least scrolling down to the conclusions, where I share some interesting figures.
So, the problem: multiply two large matrices of doubles (sizes on the order of a thousand rows and columns). For simplicity we will consider square matrices, although all the algorithms work for rectangular ones as well. The code is written in C++, but I did not use classes anywhere, so it can be considered C-compatible (only cout is C++-specific).
I will not explain here what a matrix is or how to multiply matrices: those who do not know are unlikely to be interested in speeding up multiplication anyway.
To begin with, the matrices will be stored in a one-dimensional array of size n * n. The element A(i, j) is addressed as a[i * n + j]. Storing the matrix in a one-dimensional array is faster than in a two-dimensional one, because one integer multiplication plus one memory access is cheaper than two memory accesses.
First, the most straightforward way to multiply:
int multMatrixSqBad(double *a, const int n, double *b, double *c) {
    int i, j, k;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            c[i * n + j] = 0;
            for (k = 0; k < n; k++) {
                c[i * n + j] += a[i * n + k] * b[k * n + j];
            }
        }
    }
    return 0;
}
Note that, first, we multiply i * n far too many times inside the loop. The repeated memory access when adding into c[i * n + j] is also unnecessary: we can accumulate the sum in a local variable and assign it to c[i * n + j] only once at the end. Note also that modern compilers optimize such code very well, but problems arise at branch points (if, for, etc.). To improve performance we should keep the instruction pipeline full, i.e. lengthen the linear parts of the algorithm (where computation proceeds sequentially, without conditions or loops). Therefore we unroll the main loops so that the innermost loop performs not one but four multiplications per iteration. We also try to touch memory and multiply as rarely as possible, keeping everything in local variables and pointers, which speeds things up further. The code is duplicated for even and odd matrix sizes so that parity is checked once rather than inside the loop. We get the following function:
int multMatrixSq(double *a, const int n, double *b, double *c) {
    double s1 = 0, s2 = 0, s3 = 0, s4 = 0;
    double *f, *g, *h;
    int i, j, k;
    if (n % 2 == 0) {
        for (i = 0; i < n - 1; i += 2) {
            f = a + i * n;
            for (j = 0; j < n - 1; j += 2) {
                g = b + j;
                for (k = 0; k < n; k++) {
                    s1 += f[k] * g[n * k];
                    s2 += f[n + k] * g[n * k];
                    s3 += f[k] * g[n * k + 1];
                    s4 += f[n + k] * g[n * k + 1];
                }
                h = c + i * n + j;
                h[0] = s1; s1 = 0;
                h[n] = s2; s2 = 0;
                h[1] = s3; s3 = 0;
                h[n + 1] = s4; s4 = 0;
            }
        }
    }
    else {
        for (i = 0; i < n - 1; i += 2) {
            f = a + i * n;
            for (j = 0; j < n - 1; j += 2) {
                g = b + j;
                for (k = 0; k < n; k++) {
                    s1 += f[k] * g[n * k];
                    s2 += f[n + k] * g[n * k];
                    s3 += f[k] * g[n * k + 1];
                    s4 += f[n + k] * g[n * k + 1];
                }
                h = c + i * n + j;
                h[0] = s1; s1 = 0;
                h[n] = s2; s2 = 0;
                h[1] = s3; s3 = 0;
                h[n + 1] = s4; s4 = 0;
            }
            if (j == n - 1) {      /* last (odd) column */
                g = b + j;
                for (k = 0; k < n; k++) {
                    s1 += f[k] * g[n * k];
                    s2 += f[n + k] * g[n * k];
                }
                h = c + i * n + j;
                h[0] = s1; s1 = 0;
                h[n] = s2; s2 = 0;
            }
        }
        if (i == n - 1) {          /* last (odd) row */
            f = a + i * n;
            for (j = 0; j < n - 1; j += 2) {
                g = b + j;
                for (k = 0; k < n; k++) {
                    s1 += f[k] * g[n * k];
                    s3 += f[k] * g[n * k + 1];
                }
                h = c + i * n + j;
                h[0] = s1; s1 = 0;
                h[1] = s3; s3 = 0;
            }
            if (j == n - 1) {      /* bottom-right corner element */
                g = b + j;
                for (k = 0; k < n; k++) {
                    s1 += f[k] * g[n * k];
                }
                h = c + i * n + j;
                h[0] = s1; s1 = 0;
            }
        }
    }
    return 0;
}
Well, the final touch. When we walk through a matrix in a loop, the data has to come from RAM (especially when reading a column). Memory is brought into the cache in lines, i.e. when we read one element, several elements following it are loaded into the cache as well, so when we walk down a column we hit main memory on every access. To work mostly with the cache rather than with RAM, we do the following: split the matrix into submatrices of roughly 100 by 100 (the size is passed as a function argument). If the size does not divide evenly, that is fine: rectangular pieces remain at the bottom and on the right. Call the submatrices A(1,1), ..., A(n,n), and multiply the matrix of submatrices by the usual rule. It is not hard to prove mathematically that the answer is the same as with ordinary multiplication, but now, when we multiply two small matrices, both fit into the cache and multiply much faster.

For this multiplication we also need the functions "extract a submatrix" and "write a submatrix back, adding to what was there", a function that zeroes a matrix, and a multiplication function for rectangular matrices. Since there are few rectangular multiplications, I did not optimize that one as heavily as the square case. In hindsight it would probably have been cleaner to write one good function for rectangular matrices and make the square case an alias for it; I no longer remember why I didn't. So, we write the final function:
int multMatrixBlock(double *a, const int n, double *b, const int m, double *res, double *c, double *d, double *e) {
    int n_block = n / m;
    int m_small = n - n_block * m;
    int i, j, k;
    clearMatrix(res, n * n);
    for (i = 0; i <= n_block; i++) {
        for (j = 0; j <= n_block; j++) {
            for (k = 0; k <= n_block; k++) {
                if (i < n_block && j < n_block && k < n_block) {
                    getSubMatrix(a, n, c, m, i, k);
                    getSubMatrix(b, n, d, m, k, j);
                    multMatrixSq(c, m, d, e);
                    setSubMatrixAdded(res, n, e, m, i, j);
                }
                else if (i < n_block && j < n_block && k == n_block) {
                    getSubMatrix(a, n, c, m, i, k);
                    getSubMatrix(b, n, d, m, k, j);
                    multMatrixRect(c, m, d, m_small, m, e);
                    setSubMatrixAdded(res, n, e, m, i, j);
                }
                else if (i == n_block && j < n_block && k < n_block) {
                    getSubMatrix(a, n, c, m, i, k);
                    getSubMatrix(b, n, d, m, k, j);
                    multMatrixRect(c, m_small, d, m, m, e);
                    setSubMatrixAdded(res, n, e, m, i, j);
                }
                else if (i == n_block && j < n_block && k == n_block) {
                    getSubMatrix(a, n, c, m, i, k);
                    getSubMatrix(b, n, d, m, k, j);
                    multMatrixRect(c, m_small, d, m_small, m, e);
                    setSubMatrixAdded(res, n, e, m, i, j);
                }
                else if (i < n_block && j == n_block && k < n_block) {
                    getSubMatrix(a, n, c, m, i, k);
                    getSubMatrix(b, n, d, m, k, j);
                    multMatrixRect(c, m, d, m, m_small, e);
                    setSubMatrixAdded(res, n, e, m, i, j);
                }
                else if (i < n_block && j == n_block && k == n_block) {
                    getSubMatrix(a, n, c, m, i, k);
                    getSubMatrix(b, n, d, m, k, j);
                    multMatrixRect(c, m, d, m_small, m_small, e);
                    setSubMatrixAdded(res, n, e, m, i, j);
                }
                else if (i == n_block && j == n_block && k < n_block) {
                    getSubMatrix(a, n, c, m, i, k);
                    getSubMatrix(b, n, d, m, k, j);
                    multMatrixRect(c, m_small, d, m, m_small, e);
                    setSubMatrixAdded(res, n, e, m, i, j);
                }
                else if (i == n_block && j == n_block && k == n_block) {
                    getSubMatrix(a, n, c, m, i, k);
                    getSubMatrix(b, n, d, m, k, j);
                    multMatrixSq(c, m_small, d, e);
                    setSubMatrixAdded(res, n, e, m, i, j);
                }
            }
        }
    }
    return 0;
}
As the initial matrix we take the Hilbert matrix, since all its elements are at most one, which makes it convenient to multiply (there will be no overflow).
double formula(int i, int j, int n) {
    return 1.0 / (i + j + 1);
}
The full text of the program is linked from the original article. The matrix_test.cpp file was compiled like this:
g++ matrix_test.cpp -O3
Here are the test results on my computers (the submatrix size is 200 for the iMac and the Toshiba, which seems to be optimal; for the Athlon the optimum is 115, though even with blocking it was not the fastest):

Test computers:

As you can see, the performance gain is colossal (more than a 10x speedup on the most modern processor). Keep in mind that everything ran in a single process, so the second core of the Core2Duo played no role; a natural next step would be to parallelize the program with threads or MPI and gain roughly another 1.8-1.9x. It is also worth noting that progress does not stand still, and modern processors comfortably beat a model from 5 years ago. The code, admittedly, looks terrible, but it is worth it; and what matters here is not the matrix multiplication itself but the optimization techniques it demonstrates.