
Serialization, sir! Byte porridge for dinner today, cooked from C++ objects



Variables and types are all well and good as long as we stay inside the logic of a C++ program. Sooner or later, though, we need to pass information between programs, between servers, or simply show the types and values of variables to a human being. At that point we have to strike a deal with the evil Serializer and pay with the performance of our code. In this last lecture of the C++ Academy we finally reach the main boss, and we need to learn to beat him with minimal losses in execution speed. Let's go!

A long time ago...


In ancient times the wise founding fathers of programming taught sequences of bytes to turn into numbers, into logical conditions, into sequences of actions, and even into objects of the real world. Every language had many such entities, but each built them out of bytes in its own way. Sometimes reproducing in one language what another offers out of the box is a challenge for the strong in spirit. It is much easier to pass an object of a class to a method written in another language better suited for the job. But what if the desired method lives on a remote machine or runs in a separate process? Then we need to turn our logic into bytes and hand it to the receiving side, which is ready to execute the needed method. This is how the idea of serialization was born: a single byte-level representation of data and program logic that is understood both by the sender and by the receiver.

At that moment chaos began. There were attempts to build the one uniquely true, reference serialization mechanism. Incompatible data transfer protocols started to appear and multiply. Naturally, there were reasons for this: they served different purposes and were optimized for different kinds of data. The first wave swept the Java world; later C# joined the party, artificially limited at birth to Microsoft's platform (fortunately that evil has been partly defeated, and C# now famously runs on all platforms). There were naive attempts from web developers, from awkward PHP to the increasingly popular JavaScript on the server, on the client, and in your coffee maker. In every case the attempt to declare itself the one true way was doomed by design. Every developer and every task is too individual, and the diversity of languages and their internal types does not allow lossless data transfer between two languages or technologies that are diametrically opposed in nature. At all times, languages and platforms have been united by perhaps only one thing: almost all of them were written in C/C++.
Attempts to cover everything with the next universal language or technology will be made again and again. But the wisest developers long ago learned to negotiate with each other about how to pack an entity into bytes so that on arrival it can be unpacked into a similar entity on the recipient's side. The main thing to remember is that on the shore of the byte streams the same figure always awaits you: the insatiable ferryman, the Serializer. Time after time it will eat up your program's execution time, cycle after cycle, and in exchange hand back text encoded as XML instead of your configuration object, or a byte sequence per ASN.1 instead of a file-system directory structure. And the harder its task, the longer it takes, consuming valuable time and dragging down application performance...

What serialization costs, or feeding the Serializer


Generally speaking, the performance of our application usually has three scary enemies:

  1. Uncontrolled copying of objects and subobjects.
  2. Unwarranted dynamic heap allocation.
  3. Thoughtless and inefficient use of serialization.

We fought the first two effectively in previous lessons: their damage is obvious and indisputable, so the main battle is fought against them, and as a rule successfully.

But the most dangerous is our third enemy. What could be simpler, it would seem, than converting a few integers to a string and sending it over the network? At this point the programmer usually forgets what it costs to convert, store and transmit large arrays of bytes as strings instead of the few lightweight bytes of the numbers he started with.

The Serializer is our worst enemy. We cannot do without him, yet he keeps whispering that if we leave everything as strings, we get something like dynamic typing. "There is no point in struggling," he says, "because the client side expects a JSON or XML string anyway. Who needs an integer? Keep a set of strings in your class!" Terrible things are created by unhappy developers enslaved by that mysterious whisper, with strings everywhere in their code. Those who are stronger in spirit, but not in mind, instead convert the data back and forth in vain. The Serializer rubs his hands in glee as efficiency suffers. Today you and I will defeat this evil together!

How the code was tempered


To tame the Serializer, we must first study its weaknesses. For that we will need, first of all, the skills acquired at the previous levels:


Secondly, as we remember, strings become an ordinary set of bytes when data is transferred, so any protocol, whether text or binary, ultimately operates on bytes. A text protocol, however, is usually expected to be "human readable", which means extra work for the Serializer when representing scalar values in byte form. After all, simply turning the integer -123 into the bytes of a string with its decimal representation "-123" is not such a trivial operation. C/C++ itself provides nothing for this conversion, and the standard library's offering is not much to celebrate:
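For reference, a quick sketch of the usual library-based options (std::snprintf and std::to_string are standard; itoa is a common but non-standard extension):

#include <cstdio>
#include <string>

void library_conversions(int value)
{
    // std::snprintf: has to parse the format string on every call
    char buffer[16];
    std::snprintf(buffer, sizeof(buffer), "%d", value);

    // std::to_string: convenient, but constructs a std::string
    std::string text = std::to_string(value);

    // itoa: non-standard; its signature and availability vary between compilers
}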


There is also the Boost library, whose boost::lexical_cast is even less suited to converting a number into a string efficiently.

But first, let's build the simplest possible bicycle of our own that solves the problem, representing an integer in decimal form, and race it against the library functions. The code of our function will be as plain as possible, no assembler:

// Converts value to its decimal representation in output.
// maxlen is the maximum number of characters (the terminating zero is extra).
// Returns the number of characters written, or 0 if the value did not fit.
// Note: INT_MIN is not handled (value = -value would overflow).
size_t decstr(char* output, size_t maxlen, int value)
{
    if (!output || !maxlen)
        return 0;
    char* tail = output;
    // remember the sign and continue with the absolute value
    if (value < 0) {
        *tail++ = '-';
        value = -value;
    }
    // write the digits in reverse order, least significant first
    size_t len = (tail != output) ? 1 : 0;
    while (len < maxlen) {
        *tail++ = value % 10 + '0';
        ++len;
        if (!(value /= 10))
            break;
    }
    // the number did not fit into maxlen characters
    if (value)
        return 0;
    // null-terminate if there is room left
    if (len < maxlen)
        *tail = '\0';
    // skip the minus sign before reversing
    char* head = output;
    if (*head == '-')
        ++head;
    // reverse the digits into their proper order
    for (--tail; head < tail; ++head, --tail) {
        char tmp = *head;
        *head = *tail;
        *tail = tmp;
    }
    // return the number of characters written
    return len;
}

Now let's see how our straightforward implementation repaid the time spent on it compared with the library functions (100 million iterations, time in seconds, i5-2410M CPU, 4 GB DDR3 RAM, Windows 8.1 x64).
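A minimal sketch of the kind of timing loop behind such a comparison, assuming std::chrono and the decstr function above; the iteration count and the volatile sink that keeps the optimizer from discarding the calls are illustrative choices:

#include <chrono>
#include <cstddef>
#include <cstdio>

int main()
{
    constexpr int iterations = 100000000;
    char buffer[16];
    volatile size_t sink = 0;   // keep the calls from being optimized away

    auto const start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        sink = sink + decstr(buffer, sizeof(buffer), i - iterations / 2);
    auto const elapsed = std::chrono::steady_clock::now() - start;

    std::printf("%.3f s\n", std::chrono::duration<double>(elapsed).count());
    return 0;
}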


In tests with optimization enabled, our function, snprintf and itoa finish in a time ratio of roughly 1 : 8 : 2, so we beat the standard library functions comfortably.

This is no accident. The itoa algorithm is parameterized by the numeric base, which is not always 10; 16 and others are often needed too. As for snprintf, the function does not know in advance what exactly it is converting, because it first has to parse the format string. Our function does practically nothing superfluous.

Of course, this does not mean we should now reinvent this function everywhere, but the example is telling: at a hot spot of mass serialization, conversions may be needed far more often than 100 million times per second, and even spread across all cores we can easily fail to keep up if we pick the wrong algorithm for turning numbers into strings.

Underestimating such a basic cost as serialization can come back to haunt you as a bill for a new batch of servers (the old Serializer is cunning and has surely made deals with the server hardware vendors :)). It would seem that such things are easily found with a profiler. In practice it turns out that data serialization logic is smeared in a very thin layer across the whole application and hides well behind calls to library functions.

Diving into the byte stream


Turning the entities of program logic back into their primordial nature of byte sequences is never easy. Still, for binary protocols the conversion of an integer into bytes usually boils down to transmitting the number byte by byte, and by convention the high byte of the number goes first. In other words, on the receiving side, on the overwhelming majority of machines, whose architecture stores integers starting from the low byte, we cannot simply take the four or eight received bytes and cast them to int32_t or int64_t by reinterpreting a pointer at the right offset, like `*reinterpret_cast<int32_t*>(value_pointer)`.

To make this clearer, let me illustrate how a 32-bit signed integer travels from an x86 machine in most binary network protocols. The number itself, for example -123456789, lives in the program's memory at the processor level like this (x86 stores the low byte first):

  | 0    | 1    | 2    | 3    |
  | 0xEB | 0x32 | 0xA4 | 0xF8 |

While over the network this value is transmitted in the reverse order, high byte first:

  | 0    | 1    | 2    | 3    |
  | 0xF8 | 0xA4 | 0x32 | 0xEB |

If we simply take the received byte array and interpret those four bytes as a 32-bit integer, we get an entirely different number: -349002504, which has nothing in common with the original. To recover the original number we must either apply `ntohl` (net-to-host long) to the value we read, or simply assemble the desired integer directly from the pointer into the byte array:

inline int32_t splice_bytes(uint8_t const* data_ptr)
{
    // assemble a big-endian (network order) 32-bit integer byte by byte;
    // the casts to uint32_t keep the shift of the high byte well defined
    return static_cast<int32_t>((static_cast<uint32_t>(data_ptr[0]) << 24) |
                                (static_cast<uint32_t>(data_ptr[1]) << 16) |
                                (static_cast<uint32_t>(data_ptr[2]) << 8)  |
                                 static_cast<uint32_t>(data_ptr[3]));
}

This function is a bit cheaper and more efficient than first assembling an int32_t in the wrong byte order only to flip its bytes afterwards. Among other things, it works on any ARM platform, whereas dereferencing bytes as an int32_t at an offset that is not a multiple of four will, on some ARM platforms, terminate the process. In general, when deserializing it is better not to do anything extra.
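For comparison, a sketch of the ntohl route mentioned above: memcpy into a local value avoids the unaligned access, and ntohl (declared in <arpa/inet.h> on POSIX systems, <winsock2.h> on Windows) flips network order into host order:

#include <cstdint>
#include <cstring>
#include <arpa/inet.h>

inline int32_t read_net_int32(uint8_t const* data_ptr)
{
    uint32_t raw = 0;
    std::memcpy(&raw, data_ptr, sizeof(raw)); // safe at any alignment
    return static_cast<int32_t>(ntohl(raw));  // big-endian -> host order
}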

Now, briefly, why almost every binary protocol transmits the high byte first. As a rule, the most important information goes over the network first: first the header with the meta-information, then the data itself. The transmitted data begins with a description of what follows in the next bytes, simply because that data has to be read correctly, and initially the receiver does not know what to read, how much, or how. If detailed instructions about what is coming are sent first, reading goes smoothly. Also, to save precious bytes of traffic, the meta-information about an individual number is very often spliced into the number itself, embedded in its rarely used high-order bits, so that from the same bytes one can pull out both the number and the important meta-information bits.

We analyzed a simple example of this in the lesson on encodings. In the UTF-8 text encoding, the first byte carries the meta-information about how many bytes must be read to obtain the encoded character. For Cyrillic characters, which use two bytes per Unicode code point, the first three bits of the first byte are always 110, followed by a piece of the character code.

Likewise with integers: the high bytes of unsigned integers are rarely used to their full extent and are perfect for carrying meta-information, for which a few bits are often enough, for example the width of the encoded number: 1, 2, 4 or 8 bytes. The simplest scheme is to read the first byte, learn from it how many bytes the number occupies, read those bytes as an integer, AND it with a mask that clears the meta-information bits, and obtain the desired number in a minimum of operations.
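A minimal sketch of that idea, under an assumed (hypothetical) encoding where the two high bits of the first byte give the width of the big-endian unsigned integer that follows and the remaining six bits are its most significant bits:

#include <cstddef>
#include <cstdint>

// hypothetical width codes in the two high bits: 00 -> 1 byte, 01 -> 2, 10 -> 4, 11 -> 8
inline uint64_t read_packed_uint(uint8_t const* data, size_t& consumed)
{
    static size_t const widths[4] = { 1, 2, 4, 8 };
    size_t const width = widths[data[0] >> 6];
    uint64_t value = data[0] & 0x3F;          // drop the meta bits
    for (size_t i = 1; i < width; ++i)
        value = (value << 8) | data[i];       // append the remaining bytes
    consumed = width;
    return value;
}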

Character data inside a binary packet is encoded as bytes according to one of the generally accepted encodings; converting bytes to strings and back is something we have already covered.

Binary protocols with a fixed format are, as a rule, the easiest to serialize using structures prepared at compile time. Simply casting a pointer to the data to a pointer to the structure is very cheap, but you have to account for structure field packing, or else keep offsets into the received packet and turn them into fields lazily, on request.
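A minimal sketch of that trick, assuming a hypothetical fixed-format header; #pragma pack keeps the compiler from inserting padding between the fields (remember the alignment caveat above, and that multi-byte fields are still in network byte order):

#include <cstdint>

#pragma pack(push, 1)
struct packet_header          // hypothetical fixed layout
{
    uint8_t  version;
    uint8_t  type;
    uint16_t payload_size;    // still big-endian here, convert on access
};
#pragma pack(pop)

inline packet_header const* as_header(uint8_t const* data)
{
    return reinterpret_cast<packet_header const*>(data);
}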

Structures formed dynamically are an order of magnitude harder to take apart. There is less room for maneuver, and only the protocol and the conditions of the problem at hand can help. For example, for a data packet in the ASN.1 format you must first read the header with its description of the data fields, their sizes and offsets, parse that header, and only then take apart the fields themselves at offsets you could not know in advance, because they are determined when the data is filled in.

In this case we cannot know the data structure beforehand and will have to extract the fields encoded in the packet dynamically. Here knowledge of the protocol itself and one simple rule help us: do not deserialize anything ahead of time, until it is actually needed. For most binary protocols it is enough to work out the offsets within the packet and keep the packet bytes themselves, reading only the required fields on demand and hiding all the parsing logic inside the implementation of the accessor methods of the entity encoded in the packet.

To make it a little clearer, let's look at an example. Suppose we receive information about a payment in the following format:


Since fields 3 and 4 are most likely of arbitrary length, we cannot simply take the address of the fourth field; computing it eagerly is too expensive for us. If the logic of our application ALWAYS reads and processes these fields, we can compute the start of the fourth field right away and refer to it as an ordinary C string. Under no circumstances do we materialize values into std::string without real need: first, that almost certainly means another dynamic allocation for yet another string; second, it needlessly copies data that is already present as a UTF-8 encoded string. There is no point in skimping on parsing the payment amount, where our splice_bytes function helps us, and the payment ID can be interpreted as a UUID simply by referring to the first bytes of the packet. In fact, if integers arrive not in the usual network order starting from the high byte but in the order native to the server logic, then what we receive is not merely a data packet but an almost ready-to-use data set, with pointers to values that C/C++ can operate on directly.
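A sketch of such lazy access, under an assumed (hypothetical) layout of a 16-byte UUID, a 4-byte big-endian amount and a NUL-terminated UTF-8 purpose string; the packet bytes stay where they are, and fields are decoded only when asked for:

#include <cstdint>

class payment_view              // does not own or copy the packet bytes
{
public:
    explicit payment_view(uint8_t const* packet) : m_data(packet) {}

    uint8_t const* id() const                 // UUID: just the first 16 bytes
    { return m_data; }

    int32_t amount() const                    // decoded only on request
    { return splice_bytes(m_data + 16); }

    char const* purpose() const               // points straight into the packet
    { return reinterpret_cast<char const*>(m_data + 20); }

private:
    uint8_t const* m_data;
};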

Now a few words about real numbers. The C/C++ types float and double are the same size as uint32_t and uint64_t and are encoded according to the IEEE 754 standard (recall the lesson "Everything, point, sailed!"). As a rule, binary protocols transmit and process single- and double-precision floating-point numbers exactly like integers of the corresponding width, abstracting away from their contents. After all, the bits in a byte do not care what they mean, an integer or a floating-point real number.
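A minimal sketch of that "abstracting from the contents" step: memcpy the double into a same-sized integer and send that integer like any other 64-bit value (std::bit_cast would do the same job, but it postdates this article):

#include <cstdint>
#include <cstring>

inline uint64_t double_to_bits(double value)
{
    uint64_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits)); // reinterpret the IEEE 754 bits
    return bits;
}

inline double bits_to_double(uint64_t bits)
{
    double value = 0.0;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}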

Representation as a numerator and denominator is somewhat less common; the NTP protocol, for example, transmits time as whole seconds plus a fraction with a fixed denominator. Yet whatever the representation of a real number in a protocol, one thing never changes: protocols avoid computing with floating-point numbers during data transfer. A floating-point calculation produces an error that differs between receiver and sender, so a real number is passed along as-is, usually simply because that is how it existed in the first place, for example stored in a database as a field with a real value.
Binary protocols are introduced precisely to make data serialization efficient, and the specification usually makes it obvious how to build an effective parser. Far more profitable for the Serializer are the human-readable protocols so beloved by mere mortals.

The Serializer's harvest


Let's talk about XML, JSON and YAML, where numbers become strings and byte sequences are additionally escaped so they can be passed as strings. What could be more expensive than encoding even a one-kilobyte file into a JSON string via, say, Base64? Even simple escaping of quotes in ordinary strings placed into JSON is already a costly operation, and deserialization, of course, requires the inverse one. The same goes for escaping angle brackets in strings inside XML, as in SOAP. Here the Serializer rules unchallenged and gathers a harvest binary protocols never dreamed of.
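A minimal single-pass sketch of quote escaping, assuming the output string has been reserve()d up front; it handles only quotes and backslashes, not the full set of JSON control-character escapes:

#include <cstddef>
#include <string>

inline void append_json_escaped(std::string& out, char const* text, size_t length)
{
    for (size_t i = 0; i < length; ++i)
    {
        char const c = text[i];
        if (c == '"' || c == '\\')
            out += '\\';              // escape prefix, no temporary strings
        out += c;
    }
}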

WARNING


If you value efficiency in data transfer, then prefer binary protocols (if, of course, you have a choice).

With text protocols the losses are monstrous and inevitable. The only way to reduce them at all is to stick to a number of simple rules. There are just ten of them; in plain words they go like this:

  1. Minimize the number of conversions between serialized and deserialized values. Ideally we read or write each value once, and only at the moment the field is actually needed. If the data flows through us and we merely perform some additional processing on a packet before passing it on, it makes sense to keep the original packet and forward it with the necessary changes, minimizing the cost of intermediate serialization/deserialization.
  2. Do not drag strings in everywhere on the pretext that the whole JSON or XML packet is essentially one big string. Data almost always arrives typed, and it is typed for a reason. Handling height/age/weight/amount as a string is inconvenient, especially since the string is almost certainly stored in std::string/std::wstring, which means copying the string representation of the number and probably a heap allocation, instead of converting it once into an integer, a UUID, or a boolean true/false.
  3. Optimize to the utmost the escaping of strings and the conversion of integers, real numbers and boolean constants to strings and back. Serialization and deserialization should be the place in the code you are sure of: you should know that not a moment is wasted on an unnecessary transformation. Replacing `\"` with `"` in a string certainly does not call for an algorithm of cubic complexity! It is also worth minimizing intermediate std::string objects for temporary results; pointers into the source string and into the output string are enough.
  4. Kill the urge to use `std::stringstream`. Remember that in the end you will have to call str() or run an iterator over everything it has accumulated, not to mention the memory fragmentation after `std::stringstream` has been used liberally everywhere serialization is needed. A sketch of an alternative follows after this list.
  5. Once more: prefer pointers to characters within a string over any intermediate std::string with temporary results!
  6. If you use Boost functions, measure their running time against the simplest hand-rolled bicycle. If it turns out the Boost functions are 35 times slower than the direct approach that does nothing superfluous, they are being used in vain!
  7. Do not be afraid of ugly code when the performance of the hottest part of the system is at stake. Better a switch spread over two pages of code that does the work of 100 million iterations in two seconds than polymorphism with a crowd of visitors and a huge call stack that does the same work in five. Remember, that is five servers instead of two!
  8. Do not stuff files and other binary data into XML/JSON/YAML strings; it makes sense to request and transfer them separately. The silliest thing you can do is pack a large binary packet into a string, re-encoding every byte, only to transmit it again as bytes, but now inside a text protocol.
  9. There is nothing wrong with dropping something unnecessary or optional. For example, it is not obligatory to write to the log, or to generate War and Peace for everyone, during serialization. Minimize everything the Serializer costs you; do not feed it beyond what it must receive as an unavoidable evil.
  10. There are no authorities: check everything, experiment, measure, and trust no one, neither the library developers, nor book authors, nor the author of this article, nor even yourself. Trust the results of your code; trust only the numbers you are trying to improve.
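The alternative promised in rule 4, as a minimal sketch: append into a single std::string whose capacity is reserved up front, using the decstr function from earlier (the JSON shape and the size estimate are illustrative assumptions):

#include <string>

std::string serialize_point(int x, int y)
{
    char number[16];
    std::string out;
    out.reserve(32);                 // one allocation up front, rough estimate
    out += "{\"x\":";
    out.append(number, decstr(number, sizeof(number), x));
    out += ",\"y\":";
    out.append(number, decstr(number, sizeof(number), y));
    out += '}';
    return out;
}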

Do not do anything extra beyond what you should do.

FIN


And so our Serializer, if not defeated, will at least stay underfed. From the inevitable devourer of application efficiency he has turned into your obedient servant. We have played the game called C++ Academy through to the end. Time for the credits.

Together we wrestled with templates and metaprogramming, overcame static typing and obtained dynamic typing in C++, optimized the creation of new objects by getting rid of unnecessary trips to the heap, learned to work with bytes and strings as distinct entities, and got to the bottom of how floating-point numbers work. Today's story about efficient serialization completes this six-month series of lectures at our C++ Academy.

I hope you enjoyed it, even if it was a bit difficult at times. We had to wade through compile-time predicate machinery, the standards for representing real numbers, and string encodings, but it was still great to end a lecture with a small high-level library of our own.

Feel free to apply your new knowledge! Only through trial and error do you gain invaluable and uniquely your own experience. No book and no magazine article will ever replace the bumps and bruises you collect yourself. Be bold, experiment! The STL might never have happened if Alexander Stepanov had not decided that the C++ world lacked a library of generic algorithms and convenient containers with shared logic. Do not think that experience is given at birth: it is directly proportional to the road you have traveled mastering new possibilities. The main thing is to like what you do and what you create; then you are on the right track. Keep it up!


First published in Hacker Magazine # 194.
Author: Vladimir Qualab Kerimov, Lead C ++ Developer, Parallels


Source: https://habr.com/ru/post/258959/

