C ++ code optimization (C ++ 11) using a specific task as an example

I would like to tell you a little about the optimization of C / C ++ programs, because there is not enough information on the Internet. In the post there will be explanations of some optimizations on the example of the task that I was given to solve at the institute (the task is implemented in C ++ 11, the explanation is given only for 64 bit systems).
The first article was written to learn more about optimization, tips on which, I hope, you leave in the comments.

The condition of the task and purpose

The task is rather trivial:

In the city of M, N new microdistricts are being built, between which M main roads are planned. Each road has its own construction cost Pij. Last year, the municipality managed to build K of these M roads. Unfortunately, this year it was decided to reduce the financing of construction, and now from the planned but not built MK roads, it is necessary to leave such that at least one way exists from any neighborhood to any other, but the cost of their construction is minimal (guaranteed that such a set exists).
What is the minimum cost P of the completion of the road network construction under the new plan, and how many new roads on it are to be built?

input Output

Input Format:
The first line contains two numbers separated by a space:
N (2 ≤ N ≤ 10 ^ 5) is the number of residential areas under construction,
M (1 ≤ M ≤ min (N (N - 1), 2 * 10 ^ 6)) is the number of planned roads.
The following M lines describe the planned roads, sorted by a pair of numbers of their ends, and each contains three integers separated by a space:
i (1 ≤ i <N) - number of the microdistrict in which the road starts,
j (i <j ≤ N) is the number of the microdistrict in which the road ends,
Pij (0 ≤ Pij ≤ 103) - the cost of road construction. 0 if the road is already built.

Output Format:
The output must contain two numbers separated by a space: the minimum cost P of the completion of the road network construction under the new plan and L is the number of roads to be built on it.

and is solved by the Kruskal algorithm (building the minimum spanning tree). Initially, the contest was limited to 2 seconds for execution and 24 MB of memory. Not serious. I set myself the goal to optimize the program as much as possible and to keep within 100ms for the most difficult test, I want to note that you can only use the STL library and the fixed compiler settings (as this is in any way a problem from the contest). The task was not fully accomplished - I managed to fit 122 ms and 11mb of memory. For the most impatient I quote the program code:

Program code

#include <iostream> #include <vector> #include <algorithm> /* *       */ template <class T> struct Triplet { T first; // ,     T second; // ,     T third; //    }; /* *   UnionFind   */ template <class T> class UF_element { public: T parent; //   T size; //   UF_element() {} UF_element(T id) : size(1) , parent(id) { } }; template <class T> class UnionFind { private: std::vector<UF_element<T>> Elements; public: /* *     ,    */ UnionFind(T n) : Elements(n + 1) { for (register T i = 1; i < n + 1; ++i) { Elements[i] = UF_element<T>(i); } } /* *         . *    (log n), *    O(1) */ T find(T id){ UF_element<T>& current = Elements[id]; while (current.parent != Elements[current.parent].parent){ current.parent = Elements[current.parent].parent; } return current.parent; } void merge(T a, T b) { UF_element<T>& UF_elem_A = Elements[find(a)]; UF_element<T>& UF_elem_B = Elements[find(b)]; if(UF_elem_A.parent == UF_elem_B.parent) return; if(UF_elem_A.size < UF_elem_B.size) { UF_elem_A.parent = b; UF_elem_B.size += UF_elem_A.size; } else { UF_elem_B.parent = a; UF_elem_A.size += UF_elem_B.size; } } }; template <typename T> bool comp(const Triplet<T>& a, const Triplet<T>& b) { return (a.third < b.third); } template <typename T> void kruskal(std::vector<Triplet<T>>& data, size_t& cost, size_t& count, const size_t& n) { UnionFind<T> N(n); for (size_t i = 0; i < data.size(); ++i) { if (N.find(data[i].first) != N.find(data[i].second)) { if (data[i].third) { //     0 (.    ) cost += data[i].third; //     ++count; //   ,    } N.merge(data[i].first, data[i].second); } } } /* *       */ template <typename T> void read_unlocked(T &out) { T c = getchar_unlocked(); for (; '0' <= c && c <= '9' ; c = getchar_unlocked()) { if (c == ' ' || c == '\n') return; out = out * 10 + c - '0'; } } int main() { std::ios_base::sync_with_stdio(false); size_t count = 0, cost = 0; //    .  @ilammy unsigned int n = 0, m = 0; read_unlocked(n); read_unlocked(m); std::vector<Triplet<unsigned int>> input(m); for (size_t i = 0; i < m; ++i) { read_unlocked(input[i].first); read_unlocked(input[i].second); read_unlocked(input[i].third); } std::sort(input.begin(), input.end(), comp<unsigned int>); kruskal(input, cost, count, n); std::cout << cost << " " << count << '\n'; }

Data alignment

You can learn more about what data alignment is, here , I will just explain with specific examples. Unfortunately, this is not explicitly used anywhere in my code, but let's look at this structure:

 struct A { char a; long int b; int c; };

This structure is 24 bytes (for this we will display the result of sizeof (A)).
Now consider the following structure:

 struct B { char a; int c; long int b; };

It takes only 16 bytes. Why is that? The fact is that on a 64-bit OS, the memory is read in chunks of 8 bytes and to speed up processing by the processor, the data is aligned, specifically for class A, the structure of the memory location looks like this:
(1 byte for a + 7 empty bytes) + (8 byte for b) + (4 bytes for c + 4 empty bytes) = 24 bytes.
Not very economical, right? For class B, the memory occupied is as follows:
(1 byte for a + 3 empty bytes + 4 bytes for c) + (8 bytes for b) = 16 bytes.
It is worth paying attention to such a trifle, because moving data “back and forth” costs time, and the less data moves, the faster your program accelerates, and then, swapping a couple of lines is not the most difficult optimization.

Initialization lists

Why a constructor should look like this:

 UF_element(T id) : size(1) , parent(id) { }

Not so:

 UF_element(T id) { size = 1; parent = id; }

You can read here . When using initialization lists, the constructor is called with the passed values, in contrast to the second case, when the default constructor is called for the class fields and the assignment is made in the body of the class constructor.
')

Register specifier

A bit of theory:

register is an automatic memory class specifier. Applies to objects located in local memory by default. It is a “request” (optional for execution) to the compiler about placing the values of objects declared with the register specifier in one of the available registers, and not in the local memory.

In my code I used the following structure:

 for (register T i = 1; i < n + 1; ++i) { idElements[i] = UF_element<T>(i); }

It is necessary to use this specifier correctly.
If you need to score an array with numbers from 1 to n (almost as I have in the example above), then using the specifier will increase in speed, if the loop body is not so trivial, then most likely register will not only not change performance, but may reduce it, allocating a variable case.

Work with memory

Very short and well written here . In general, when you begin to understand how the program uses memory and, in particular, you are looking for leaks, some rules are developed.

For myself, I realized one thing: until the programmer is able to tell straight away when and where the memory allocated using the new operator is freed and how convenient it is for the memory manager to access the free memory after calling delete, it should be beat with rags for use new / delete. Therefore, in order not to be beaten, I got rid of the use of these operators architecturally:

  UnionFind(T n) : Elements(n + 1) //        { for (register T i = 1; i < n + 1; ++i) { Elements[i] = UF_element<T>(i); //    } }

Input and output data

Guided by this article , input was greatly accelerated. Since in this task only 2 numbers are output, the use of putchar (putchar_unlocked) does not bring a large profit, but for those who do not believe, until they check, I give below the output implementation using putchar_unlocked.

Output function

 /* *      : *   -  ,    *   -  ,      */ template <typename T> void write_unlocked(T inp, char last=' ') { if (!inp) { putchar_unlocked('0'); putchar_unlocked(last); return; } T out = 0; unsigned char sz_inp = 0; //   for (; inp; inp /= 10, ++sz_inp) { out = (out << 3) + (out << 1) + inp % 10; } //   for (; out; out /= 10, --sz_inp) { putchar_unlocked(out % 10 + '0'); } //  ,      for (;sz_inp; --sz_inp) { putchar_unlocked('0'); } putchar_unlocked(last); }

Source: https://habr.com/ru/post/256929/

All Articles