
How to pass an elephant through the eye of a needle: processing the maximum amount of data in the shortest time



The things you hear from apologists of Java or C# about development in C/C++! Supposedly the language is outdated and nobody writes in it anymore. Yet as soon as you need to build a no-latency or low-latency service, or to save memory and execution time in a bottleneck that processes large amounts of data, you immediately run for help to those "archaic" C/C++ developers. Simply because these people can manage memory by hand and know exactly what is under the hood of any high-level operation. Today our task is to get one step closer to them.

Under the hood of a race car


The business logic of a network application is usually not the place where C++ is the preferred development language. As a rule, higher-level languages are chosen for the networking between application clients and database servers. But sooner or later an application grows to the point where optimizing it is cheaper than buying new servers. Then comes the fascinating exercise of embedding parts of the business logic implemented in C/C++ into well-established logic in C#, Java, Python, Ruby or JavaScript. Here an amusing surprise probably awaits you: in C++ you have to know how to handle large amounts of data efficiently. Habits carried over from Java or C# will quickly nullify all optimization attempts if you simply try to write roughly the same code in C++.

The point is that `new` should be used as sparingly as possible, and the careless use of containers that do not quite fit the situation will make perfectly logical code utterly unusable in practice. It may well happen that after an "optimization" performed by an employee who is not fully qualified specifically in C++, the execution time stays about the same or even grows. One person will shrug: well, we tried, everything is already as optimized as it gets. Another will try to convince colleagues of the incredible speed of a high-level language. Our task is not to waste the company's money, so that the resources spent on embedding C/C++ into critical parts of the code are not merely not wasted but pay for themselves many times over. What matters is not the specialists who shrug and say "well, it just can't be done", but those who try to achieve the impossible. It only seems impossible; there will be nothing complicated in this lesson. All that is required is to remember a few important things that come in handy when processing data in C++.

We choose the tools carefully


If you have not yet had the luck to read Scott Meyers' excellent book Effective STL, I highly recommend it. Without a detailed understanding of which container suits which situation in C++, using the STL is akin to patching asphalt in the rain. I will give some basic advice below, but it is important to understand thoroughly what the different containers are for and how their methods are built.

The first thing to always remember: `std::vector` is not an array, it is a vector. Use this container when you need exactly that: a contiguous block of memory vectorized into elements of the same type. If you need regular insertion and removal with a size you do not control, `std::vector` is unlikely to help. When you need an array with element-wise access and cheap growth, look instead at `std::deque`. If a `std::vector` runs out of reserved memory, it first allocates a new contiguous (!) block and then moves the data from the old storage to the new one element by element. Since we are talking about processing large amounts of data in the shortest time, redistributing memory for objects that already exist is the last thing you want to spend CPU time on.
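A quick way to see this reallocation cost in action is to watch how often `std::vector` moves its storage while growing. The helper below is illustrative and not from the article:

```cpp
#include <cstddef>
#include <vector>

// Counts how many times push_back forces std::vector to move to a new
// contiguous block. Every such move copies all elements stored so far;
// std::deque would instead add a fresh block and leave old data in place.
std::size_t count_vector_reallocations(std::size_t n)
{
    std::vector<int> v;
    std::size_t reallocations = 0;
    int const* old_data = v.data();
    for (std::size_t i = 0; i < n; ++i)
    {
        v.push_back(static_cast<int>(i));
        if (v.data() != old_data)   // the storage has been reallocated
        {
            ++reallocations;
            old_data = v.data();
        }
    }
    return reallocations;
}
```

Calling `reserve(n)` up front collapses the count to a single allocation, which is exactly why a size known in advance favors `std::vector`.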

The second prerequisite is to choose the container for a unique key-to-value relation carefully. With big data it is usually simplest to build a `std::unordered_map` right away and try to change it as little as possible. Lookup by key in `std::unordered_map` is much more efficient than in `std::map`; again, for large amounts of data you do not want to build and keep in memory a red-black tree with an exorbitant number of nodes. But if the key-value relation changes frequently (relations are removed, new ones added, and intensively so), then it is easier to accept the cost of maintaining a `std::map` than to keep rebuilding the internal representation of `std::unordered_map`. Inside, `std::unordered_map` is essentially an array of value chains, and the more often we change the key-value relation, the less effective it becomes. Even the faster key lookup will not save you here: rebuilding large arrays is always expensive.
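For the read-mostly case, here is a minimal sketch (the helper name is mine) of building the hash map once, with bucket space reserved up front so the bulk load never triggers a rehash:

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Build an index once from prepared rows; reserve() sizes the bucket
// array in one go, so the emplace loop never rebuilds the table.
std::unordered_map<std::string, int>
build_read_mostly_index(std::vector<std::pair<std::string, int>> const& rows)
{
    std::unordered_map<std::string, int> index;
    index.reserve(rows.size());   // one allocation for buckets, no rehashing later
    for (auto const& kv : rows)
        index.emplace(kv.first, kv.second);
    return index;
}
```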

The third important point is logic. First write the most efficient algorithm, then see which container logically fits the data it works on. Try to choose the container in the one obvious way. You need a set of unique values: take `std::set`. You need a dictionary that rarely changes: feel free to use `std::unordered_map`. You need an array whose size you do not know in advance: you most likely need `std::deque`; and if the array size is known in advance, a plain `std::vector` may do.

The fourth is performance measurement. Always verify your choice of container or algorithm with benchmarks that measure the running time on comparable containers. It may well turn out that a sorted `std::vector` of key-value pairs is more efficient in processing than the logically appropriate `std::map` built on the same relation.
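A minimal sketch of that sorted-vector dictionary (names are illustrative): binary search with `std::lower_bound` over contiguous pairs instead of chasing `std::map` nodes around the heap:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using kv = std::pair<std::string, int>;

// Lookup in a vector of pairs kept sorted by key; returns a pointer to
// the value, or nullptr if the key is absent.
int const* find_value(std::vector<kv> const& sorted_pairs, std::string const& key)
{
    auto it = std::lower_bound(
        sorted_pairs.begin(), sorted_pairs.end(), key,
        [](kv const& p, std::string const& k) { return p.first < k; });
    if (it != sorted_pairs.end() && it->first == key)
        return &it->second;
    return nullptr;
}
```

Whether this beats `std::map` on your data is exactly the question the benchmark has to answer.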

You may hold any ideological views on program code, but the only authority you should trust is the profiler, which measures the execution time of different versions of the code.

Text Costs


First and foremost, when working with text: `std::string` is not the only way to store and process text or parts of it. To handle a substring, there is no need to create millions of new containers for every piece of a large `std::string`; it is enough to mark the beginning and end of each substring. It is better to have your own structure with a pair of begin/end iterators into the source string than to build a new contiguous memory block for every substring and copy into it a part of the text already stored in the source string.

Example: we find all the words in the text represented as a pointer to a null-terminated string. For simplicity, let our words be separated by a semicolon.

```cpp
template <typename word_type>
void find_words(char const* text, std::deque<word_type>& result)
{
    char const* start = text;
    char const* finish = text + std::strlen(text);
    while (start < finish)
    {
        char const* last = std::strchr(start, ';');
        if (!last)
            last = finish;
        result.push_back(word_type(start, last));
        start = last + 1;
    }
}
```

As a `word_type` we will try both the standard `std::string` and our own type that stores pointers to the beginning and end of the substring within the original string.

```cpp
struct one_word
{
    one_word(char const* begin, char const* end)
        : m_begin(begin), m_end(end)
    {
    }

    char const* m_begin;
    char const* m_end;
};
```

Simple comparative measurements show that if you do not spend time generating utterly unnecessary intermediate strings in `std::string` containers, the code runs 15-20 times faster. If most words do not fit into the 16-char buffer that `std::string` reserves internally, we additionally have to allocate new blocks of dynamic memory to store the substrings. Our `one_word` class only fills two pointer-to-character fields during initialization, and that is enough to walk the split later.
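For completeness, here is the experiment assembled into one self-contained snippet; the `count_words` helper is added for illustration and was not in the original:

```cpp
#include <cstddef>
#include <cstring>
#include <deque>

// Pointer-pair word: initialization touches no heap at all.
struct one_word
{
    one_word(char const* begin, char const* end) : m_begin(begin), m_end(end) {}
    char const* m_begin;
    char const* m_end;
};

// Same splitter as in the article: words separated by semicolons.
template <typename word_type>
void find_words(char const* text, std::deque<word_type>& result)
{
    char const* start = text;
    char const* finish = text + std::strlen(text);
    while (start < finish)
    {
        char const* last = std::strchr(start, ';');
        if (!last)
            last = finish;
        result.push_back(word_type(start, last));
        start = last + 1;
    }
}

std::size_t count_words(char const* text)
{
    std::deque<one_word> words;   // no per-word allocations
    find_words(text, words);
    return words.size();
}
```

Swapping `one_word` for `std::string` in `count_words` is all it takes to reproduce the comparison from the text.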

The Boost library is notorious for its disregard of this kind of optimization, so if you decide to use `boost::split` or `boost::regex`, remember this solution when the profiler shows performance melting away while parsing a string, with massive implicit creation of all sorts of unnecessary garbage.

JSON, XML and all the rest


Text protocols should be handled as carefully as possible. As the previous example shows, the slightest miscalculation in creating auxiliary objects can kill the performance of code that looks completely harmless at first glance.

For each of the common protocols there is a whole pile of libraries, but a few simple truths are worth remembering.

First and foremost: you almost never need to build a full tree of the XML/JSON/YAML structure when reading it. Usually everything comes down to extracting a series of similar values.

Second: do not be afraid to reinvent the wheel if the task is performance-critical and your case is so specific that leaving the beaten path is the right decision.
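As an example of such a narrow wheel (the key name and input format are purely illustrative): extracting every integer `"id"` value from a JSON-like buffer with nothing but `strstr` and `strtol`, no tree in sight:

```cpp
#include <cstdlib>
#include <cstring>
#include <vector>

// Scans the buffer for the exact token "id": and parses the integer that
// follows it. Assumes well-formed input for this one private case.
std::vector<long> extract_ids(char const* json)
{
    std::vector<long> ids;
    char const* p = json;
    while ((p = std::strstr(p, "\"id\":")) != nullptr)
    {
        p += 5;                                  // skip the key token
        ids.push_back(std::strtol(p, nullptr, 10));
    }
    return ids;
}
```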

Third: when serializing, try to get by with generating into a buffer on the stack. If that is impossible, write into a `std::string` with `reserve` called in advance. The code will never be efficient if you first trash the entire RAM with a `std::stringstream` and then additionally assemble a string out of it, gluing together what could have been assembled into the result right away.

As homework, compare the performance of generating a large text configuration in, say, XML with and without `std::stringstream`. RAM that looks like cheese full of fragmentation holes is no friend of speed.

The database will answer


When querying a database, we do not know at compile time what kind of values will come back. More precisely, we can build the query string, we can even wrap it in a simple SQL-like ORM on the C++ side, but the key point is that at compile time we almost never know what the database will say in response.

In this respect, dynamically typed languages with on-the-fly attribute generation, such as Python, Ruby and JavaScript, have a distinct advantage over compiled, statically typed languages.

You can, of course, create all sorts of types like `IntField`, `FloatField` and other `*Field` with a common ancestor like `BaseField`, and then walk all the code branches through references and pointers. But this fragments what is, in essence, a single database response: it ends up scattered across many small memory cells.

However, recalling the first three lessons of our academy, we can easily bypass the limitations of the C++ language and still get a digestible API. All that remains is to minimize the cost of dynamic memory allocation for each field of each record. That is not so hard to do.

A classic DBMS answers a SQL query with tabular data, which means we know the format of every record, albeit at execution time rather than compile time. From the record structure we can allocate memory for all the data of all the fields up front. Then good old placement new will help us lay the field values out over the prepared memory cells. It looks like this:

  1. From the query metadata we learn the data type of each field in the query result. For each type on the database side we have a corresponding type on the business-logic side. Scalar data has a fixed size; allocating memory for it just means leaving room in the buffer, and it does not even need a constructor. Things are slightly more complicated with data that allocates extra memory on the heap, such as `std::string` representing the `text` type. Still, `std::string` itself has a fixed size, so we can say that for any field type in the database we know both the type and the size on the business-logic side.
  2. Next we simply add up the sizes of the record's field types and get the size of the data block for one record. To provide memory for the entire query result, we can allocate it once for all fields. It turns out we go to the heap for memory only once, right after reading the field list from the query-result metadata. The complexity of the operation is proportional to the number of fields in the result: even if a thousand fields are requested, that is far cheaper than allocating memory 1000 × (number of records in the result) times for opaque `IField*` pointers.
  3. For convenient data processing we still have to build a certain class `field`. It will be a container for dynamically typed data, as in the first two lectures of the "C++ Academy". Each field stores its value in place, along with an optionally stored type that corresponds to the type of the matching cell in the query result, or NULL.
  4. Since the vast majority of data types stored in databases are textual, it may make sense to inline small strings whose size is limited on the database side not into a `std::string`, which will go to the heap, but directly into the memory of the `field` itself. We then get a nice saving on memory allocation, but we will have to sweat over implementing the necessary methods, since a bare `char m_text[SIZE]` can do nothing on its own, and the string and memory facilities of plain C are poorly suited for working with a database.
  5. Now the main thing: having allocated memory for each type in the record, we create the fields with the `new (<address>) Type(<parameters>)` construction.
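The last step can be sketched in isolation; the buffer size, offsets and field set below are hypothetical, just enough to show the mechanics:

```cpp
#include <cstddef>
#include <new>
#include <string>

// One raw buffer for a record holding an int field followed by a text
// field, both built with placement new: the heap is touched only for the
// string's own payload (if any).
bool placement_record_demo()
{
    alignas(std::max_align_t) unsigned char buffer[64];

    int* id = new (buffer) int(42);             // scalar field: no heap at all
    // round the next offset up to the string's alignment requirement
    std::size_t const text_offset =
        (sizeof(int) + alignof(std::string) - 1) / alignof(std::string) * alignof(std::string);
    std::string* name = new (buffer + text_offset) std::string("alice");

    bool const ok = (*id == 42) && (*name == "alice");
    name->~basic_string();   // fields built with placement new are destroyed by hand
    return ok;
}
```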

Here is how it might look in an implementation. The main class is `field`, a container dynamically typed over whatever types you use with the database. If you only have scalar types, text data and NULL, you will end up with something like this:

```cpp
class field
{
public:
    template <typename value_type>
    field(void* address, value_type const& value);

    template <typename value_type>
    field& operator = (value_type const& value);

    template <typename value_type>
    value_type get_as() const;

    bool is_null() const;

protected:
    class data;

private:
    data* m_data;
};

class field::data { /* ... */ };

template <typename value_type>
class data_holder : public field::data
{
public:
    data_holder();                          // represents NULL
    data_holder(value_type const& value);
    /* ... */

private:
    value_type m_value;
};

template <typename value_type>
field::field(void* address, value_type const& value)
    : m_data(new (address) data_holder<value_type>(value))
{
}
```

If swaps and deletions of fields within a query result are rare, feel free to lay the memory for the whole block out as one vector. If you actively shuffle the data inside the results, then your choice is an array of deques, with the memory for each cell allocated separately. In that case the model from the second lecture of the "C++ Academy" fits better: we store the memory for the object implementation inside the class data and initialize it via `placement new` inside the implementation. On the one hand, this lets a record be used as a full-fledged `std::deque` of homogeneous objects; on the other, it limits the amount of data an object can hold. It will no longer be possible to inline strings into the record's memory, but it becomes easy to vary the presence and order of fields, which matters, for example, for non-relational databases, or for proxy logic that post-processes results produced by other business logic we do not directly control. The field type itself will then look slightly different:

```cpp
class field
{
public:
    // the value is constructed in place, inside m_buffer
    template <typename value_type>
    field(value_type const& value);
    /* ... */

private:
    data* m_data;
    uint8_t m_buffer[MAX_FIELD_SIZE];
};
```

Now you understand the thoroughness with which I described the mechanisms of dynamic typing at the beginning of the Academy course, and why in the second lesson we talked about saving memory when placing data of a previously unknown type inside a class.

Whichever way you choose, a single block of memory for the whole record (or even for the whole query result) on the business-logic side, or flexible control over a deque of fields in each record, will still be better than mass memory allocation for every single field and working with the fields through interface pointers with endless type conversions in between.

Quiet! Recording!


In both cases the record class will work like this:

  1. It is initialized with a set of field metadata.
  2. Memory allocation based on that metadata happens only once.
  3. Then, in a loop, the list of record fields is filled in, each subsequent field at its own offset within the shared memory.
  4. If the field information arrived together with the values, the fields are initialized with those values right away, directly at the target address.

The implementation of the record will differ slightly depending on whether you have a monolithic data block per record or a set of interchangeable same-type elements with dynamically typed content. Here is an example of what a record implementation over a single memory block might look like:

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

class record
{
public:
    record(query_result const& result, size_t row);

    // field access: overloads for int, char const* and std::string const&
    // keys, in const and non-const variants
    field& operator[](size_t index);
    field const& operator[](size_t index) const;

private:
    std::vector<uint8_t> m_buffer;
    std::vector<field>   m_fields;
};

record::record(query_result const& result, size_t row)
{
    size_t buffer_size = std::accumulate(
        result.types().begin(), result.types().end(), size_t(0),
        [](size_t init, field::type type) { return init + type.size(); });
    m_buffer.resize(buffer_size);
    m_fields.reserve(result.types().size());
    for (size_t offset = 0, index = 0; index < result.types().size(); ++index)
    {
        // construct each field directly at its offset in the shared buffer
        m_fields.push_back(field(m_buffer.data() + offset, result[row][index]));
        offset += result.type_of(index).size();
    }
}
```

If we go for maximum speed and use one block of memory for the entire query result, the record code changes slightly:

```cpp
class record
{
public:
    // the memory block for the fields is provided from outside
    record(void* address, query_result const& result, size_t row);
    /* ... */

private:
    std::vector<field> m_fields;
    /* ... */
};
```

The most important optimization


It is important to remember: laying out the fields from the query result into presentation fields on the business-logic side is nice, but very often unnecessary. If you need to generate JSON directly from the result of a database API call for the client, there is no need to create a multitude of `record` objects with a pile of `field` objects inside each. Even optimized, that step is superfluous: serialize straight from the query result into JSON, XML or YAML.

If the client ultimately needs a single value, say a bool, or the string hello that the wire format represents as `[["hello"]]`, pass it through directly instead of materializing records and fields along the way.

Embedding C/C++ into services written in Java, C# or Python pays off only in genuinely critical paths, and even there every allocation counts: a stack buffer such as `uint8_t packet_buffer[MAX_PACKET_SIZE]` beats a `std::vector<uint8_t>` for fixed-size packets, `std::map` loses to a `vector of vector` or a plain `T**` where those suffice, and stray `std::string` temporaries add up. In C++, unlike languages such as Java, nothing stands between you and the memory: use that, and let the profiler confirm the result.


First published in Hacker magazine #195.
Author: Qualab, C++ developer at Parallels



Source: https://habr.com/ru/post/258961/

