
Greetings, dear reader. In this article we will look at the most popular data serialization formats and run some tests with them. This is the first article on the topic of data serialization, and here we cover simple serializers, the ones that do not require the developer to make major code changes to integrate them.
Sooner or later you, like our company, may face a situation where the number of services in your product grows dramatically, and all of them also turn out to be very "talkative". Whether this happened because of a transition to the currently hyped microservice architecture, or because you received a pack of orders for minor improvements and implemented them as a handful of services, does not matter. What matters is that from now on your product has two new problems: what to do with the increased amount of data flowing between individual services, and how to prevent chaos in the development and support of that many services.

A few words about the second problem: when the number of your services grows into the hundreds or beyond, one development team can no longer develop and maintain them all, so you distribute packs of services across different teams. The key thing then is for all those teams to use one format for their RPC; otherwise you run into the classic problems where one team cannot maintain another team's services, or two services simply do not fit together without lavishly sealing the junction with crutches. But that is a topic for a separate article; today we will focus on the first problem, the increased volume of data, and think about what we can do about it. And, thanks to our time-honored laziness, we do not want to do much of anything: we want to add a couple of lines to the common code and get the benefit immediately. So this is where we begin, namely with serializers whose integration does not require major changes to our beautiful RPC.
The format question is actually rather painful for our company, because our current products use XML to exchange information between components. No, we are not masochists; we are well aware that XML as a data exchange format is about 10 years out of date, and that is precisely the point: the product itself is already 10 years old and contains many legacy architectural decisions that are hard to cut out quickly. After some thought and debate we decided that we would use JSON for storing and transmitting data, but we needed to choose one of the JSON packing options, since the size of the transmitted data is critical for us (I will explain why below).
We compiled a list of criteria for choosing the format that suits us:
- Efficiency of data compression. Our product handles a huge number of events coming in from various sources. Each event is triggered by some user action. Most events are small and contain meta information about what happened (a letter was sent, something was posted in a chat on Facebook, and so on), but they may also carry payloads of considerable size. On top of that, the number of such events is very large: a few dozen TB can easily pass through per day, so keeping events small is critically important for us.
- Ability to work from different languages. Since our new project is written in C++, PHP and JS, we were primarily interested in support for those languages, but given that a microservice architecture allows a heterogeneous development environment, support for additional languages is a plus. Go, for example, is quite interesting to us, and it is entirely possible that some services will be implemented in it.
- Support for versioning / evolving data structures. Our products live for quite a long time without client updates (the update process is not simple at all), so at some point there will be many different versions to support, and it is important that we can evolve the storage format without losing compatibility with already packed data (a small sketch of what we mean follows this list).
- Ease of use. We have experience using the Thrift protocol for communication between components, and honestly, it is not always easy for developers to figure out how its RPC works and how to add something to existing code without breaking anything old. So the simpler the serialization format is to use, the better, since a C++ developer and a JS developer have completely different comfort levels with such things :)
- Random-access reads / writes. Since we intend to use the chosen format for data storage as well, it would be great if it supported partial deserialization, so that we do not have to read the whole object (which is often not small) every time. Beyond reading, the ability to modify data without parsing all of the content would be a big plus.
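To illustrate the versioning criterion, here is a minimal sketch in Go with plain JSON. The event types are hypothetical, not our real payloads; the point is that schema-less formats tolerate added fields, so old data keeps decoding:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EventV1 is the old shape of the event (hypothetical).
type EventV1 struct {
	Type string `json:"type"`
}

// EventV2 adds a field; old payloads still decode,
// the new field simply stays at its zero value.
type EventV2 struct {
	Type string `json:"type"`
	User string `json:"user,omitempty"`
}

func main() {
	old, _ := json.Marshal(EventV1{Type: "mail.sent"})

	var ev EventV2
	if err := json.Unmarshal(old, &ev); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", ev) // {Type:mail.sent User:}
}
```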
After analyzing a decent number of options, we short-listed the following candidates:
- JSON
- BSON
- MessagePack
- CBOR
These formats do not require an IDL schema describing the transferred data: the schema travels inside the data itself. This greatly simplifies the work and in most cases lets you add support by writing no more than ten lines of code.
We also understand perfectly well that many properties of a protocol or serializer depend heavily on the implementation: what packs beautifully in C++ may pack badly in JavaScript. Therefore, for our experiments we will use the JS and Go implementations and run the tests there. For good measure, the JS implementation will be exercised both in the browser and in Node.js.
So, let's get started.
JSON
The simplest of the interaction formats under consideration. We will use it as the baseline when comparing the other formats, since in our current projects it has proven its effectiveness and also shown all of its downsides.
Pros:
- It supports almost all the data types we need. One could quibble about the lack of binary data support, but base64 gets the job done.
- Human-readable, which makes debugging easy.
- Supported by a ton of languages (although anyone who has used JSON in Go will understand that I am being a bit sly here).
- Versioning can be implemented via JSON Schema.
Cons:
- Despite JSON being compact compared to XML, in a project like ours, where gigabytes of data are transferred every day, it is still quite wasteful both on the wire and in storage. The only place where native JSON really shines is PostgreSQL storage, with its jsonb capabilities.
- No support for partial deserialization. To get something from the middle of a JSON document, you first have to deserialize everything that precedes the required field; you cannot skip an element without reading its value to the end. This also prevents using the format for stream processing, which can be useful in network interaction.
Let's see how things stand with performance. To account for JSON's size drawback right away, we will also run the tests with JSON packed using zlib. The specific libraries used for each test can be found in the test sources linked below.
You can find the source code and all test results at the following links:

- Go - https://github.com/KyKyPy3/serialization-tests
- JS (Node) - https://github.com/KyKyPy3/js-serialization-tests
- JS (browser) - http://jsperv.com/serialization-benchmarks/5

We found out experimentally that the test data should be as close to the real thing as possible, because the results differ dramatically depending on the test data. So if you care about picking the right format, always test it on data closest to your reality. We will be testing on data close to ours; you can look at it in the test sources.
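To make it concrete, here is a minimal sketch of what the JSON + zlib roundtrip in these tests boils down to, written in Go against the standard library only. The Event structure is a made-up stand-in, not our actual test payload:

```go
package main

import (
	"bytes"
	"compress/zlib"
	"encoding/json"
	"fmt"
	"io"
)

// Event is a made-up stand-in for the real test payload.
type Event struct {
	Type string `json:"type"`
	User string `json:"user"`
	Data string `json:"data"`
}

func main() {
	ev := Event{Type: "mail.sent", User: "alice", Data: "hello"}

	// Encode: marshal to JSON, then compress with zlib.
	raw, err := json.Marshal(ev)
	if err != nil {
		panic(err)
	}
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		panic(err)
	}
	zw.Close() // flushes the zlib stream
	packed := buf.Bytes()

	// Decode: decompress with zlib, then unmarshal the JSON.
	zr, err := zlib.NewReader(bytes.NewReader(packed))
	if err != nil {
		panic(err)
	}
	plain, err := io.ReadAll(zr)
	if err != nil {
		panic(err)
	}
	var out Event
	if err := json.Unmarshal(plain, &out); err != nil {
		panic(err)
	}

	fmt.Printf("json: %d bytes, json+zlib: %d bytes, roundtrip ok: %v\n",
		len(raw), len(packed), out == ev)
}
```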
Here is what we got for JSON speed; below are the benchmark results for each platform:
| JS (Node) | Result |
|---|---|
| JSON encode | 21,507 ops/sec (86 runs sampled) |
| JSON decode | 9,039 ops/sec (89 runs sampled) |
| JSON roundtrip | 6,090 ops/sec (93 runs sampled) |
| JSON compressed encode | 1,168 ops/sec (84 runs sampled) |
| JSON compressed decode | 2,980 ops/sec (93 runs sampled) |
| JSON compressed roundtrip | 874 ops/sec (86 runs sampled) |
| JS (browser) | Result |
|---|---|
| JSON roundtrip | 5,754 ops/sec |
| JSON compressed roundtrip | 890 ops/sec |
| Go | Iterations | ns/op | MB/s | B/op | allocs/op |
|---|---|---|---|---|---|
| JSON encode | 5000 | 391100 | 24.37 | 54520 | 1478 |
| JSON decode | 3000 | 392785 | 24.27 | 76634 | 1430 |
| JSON roundtrip | 2000 | 796115 | 11.97 | 131150 | 2908 |
| JSON compressed encode | 3000 | 422254 | 0.00 | 54790 | 1478 |
| JSON compressed decode | 3000 | 464569 | 4.50 | 117206 | 1446 |
| JSON compressed roundtrip | 2000 | 881305 | 0.00 | 171795 | 2915 |
And here is what we got for data size:
| JS (Node) | Size |
|---|---|
| JSON | 9482 bytes |
| JSON compressed | 1872 bytes |

| JS (Browser) | Size |
|---|---|
| JSON | 9482 bytes |
| JSON compressed | 1872 bytes |
At this stage we can already conclude that although compressing JSON gives excellent size results, the loss in processing speed is simply catastrophic. Another observation: JS handles JSON very well, which cannot be said about Go, and it is quite possible that JSON processing in other languages will show results nowhere near JS. For now, let's set the JSON results aside and see how the other formats do.
BSON
This data format came out of MongoDB and is actively promoted by its authors. The format was originally designed for data storage, not for transmission over the network. Honestly, after some searching on the internet we did not find a single serious product that uses BSON internally, but let's see what the format can give us.
Pros:
- Support for additional data types.
According to the BSON specification, in addition to the standard JSON data types, BSON supports types such as Date, ObjectId, Null and binary data. Some of them (ObjectId, for example) are mostly used in MongoDB and may not be useful to everyone, but some of the additional types give us real bonuses. If we store a date in our object, then with JSON we have only one option: one of the ISO-8601 variants, in string representation. If we then want to filter a collection of JSON objects by date, we must parse those strings into dates before comparing them. BSON instead stores dates (the Date type) as Int64 and takes over all the work of serializing to and from the Date format, so we can compare dates without deserialization, as plain numbers, which is clearly faster than the classic JSON approach. MongoDB makes heavy use of this (see the sketch after this list).
- BSON supports so-called random read / write access to its data.
BSON stores the lengths of strings and binary data, which lets you skip attributes you are not interested in. JSON, by contrast, is read sequentially, and you cannot skip an element without reading its value to the end. So if we store large binary payloads inside the format, this feature can matter a lot.
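Here is a quick Go sketch of the date point. We use the official mongo-driver bson package purely for illustration; it is not necessarily the library used in our tests:

```go
package main

import (
	"fmt"
	"time"

	"go.mongodb.org/mongo-driver/bson"
)

func main() {
	doc := bson.M{"created": time.Date(2016, 5, 1, 0, 0, 0, 0, time.UTC)}

	data, err := bson.Marshal(doc)
	if err != nil {
		panic(err)
	}

	// Look the field up in the raw bytes without decoding the whole
	// document: on the wire the date is just an int64 of UTC milliseconds,
	// so it can be compared like a plain number.
	val := bson.Raw(data).Lookup("created")
	fmt.Println(val.DateTime()) // 1462060800000
}
```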
Cons:
- The size of the data.
Here things are ambiguous: in some situations the resulting object is smaller, in others larger, depending on what the BSON object contains. The specification explains why: to speed up access to the elements of an object, the format stores extra information, such as the sizes of large elements.
So, for example, the JSON object

```
{"hello": "world"}
```

turns into this:

```
\x16\x00\x00\x00           // total document size
\x02                       // 0x02 = type String
hello\x00                  // field name
\x06\x00\x00\x00world\x00  // field value
\x00                       // 0x00 = type EOO ('end of object')
```
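You can reproduce this dump yourself. A tiny Go sketch, again assuming the mongo-driver bson package:

```go
package main

import (
	"fmt"

	"go.mongodb.org/mongo-driver/bson"
)

func main() {
	data, err := bson.Marshal(bson.M{"hello": "world"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", data)
	// 16 00 00 00 02 68 65 6c 6c 6f 00 06 00 00 00 77 6f 72 6c 64 00 00
}
```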
The specification says that BSON was designed as a format with fast serialization / deserialization, not least because it stores numbers as binary Int types instead of parsing them from strings. Let's check that. The BSON libraries used for the tests are listed in the test sources.
And here are the results we got (for clarity, I also included the JSON results):
| JS (Node) | Result |
|---|---|
| JSON encode | 21,507 ops/sec (86 runs sampled) |
| JSON decode | 9,039 ops/sec (89 runs sampled) |
| JSON roundtrip | 6,090 ops/sec (93 runs sampled) |
| JSON compressed encode | 1,168 ops/sec (84 runs sampled) |
| JSON compressed decode | 2,980 ops/sec (93 runs sampled) |
| JSON compressed roundtrip | 874 ops/sec (86 runs sampled) |
| BSON encode | 93.21 ops/sec (76 runs sampled) |
| BSON decode | 242 ops/sec (84 runs sampled) |
| BSON roundtrip | 65.24 ops/sec (65 runs sampled) |
| JS (browser) | Result |
|---|---|
| JSON roundtrip | 5,754 ops/sec |
| JSON compressed roundtrip | 890 ops/sec |
| BSON roundtrip | 374 ops/sec |
| Go | Iterations | ns/op | MB/s | B/op | allocs/op |
|---|---|---|---|---|---|
| JSON encode | 5000 | 391100 | 24.37 | 54520 | 1478 |
| JSON decode | 3000 | 392785 | 24.27 | 76634 | 1430 |
| JSON roundtrip | 2000 | 796115 | 11.97 | 131150 | 2908 |
| JSON compressed encode | 3000 | 422254 | 0.00 | 54790 | 1478 |
| JSON compressed decode | 3000 | 464569 | 4.50 | 117206 | 1446 |
| JSON compressed roundtrip | 2000 | 881305 | 0.00 | 171795 | 2915 |
| BSON encode | 10000 | 249024 | 40.42 | 70085 | 982 |
| BSON decode | 3000 | 524408 | 19.19 | 124777 | 3580 |
| BSON roundtrip | 2000 | 712524 | 14.13 | 195334 | 4562 |
And here is what we got for data size:
| JS (Node) | Size |
|---|---|
| JSON | 9482 bytes |
| JSON compressed | 1872 bytes |
| BSON | 112710 bytes |

| JS (Browser) | Size |
|---|---|
| JSON | 9482 bytes |
| JSON compressed | 1872 bytes |
| BSON | 9618 bytes |
Although BSON gives us additional data types and, most importantly, partial reads / writes, in terms of data size things are very sad, so we had to keep looking.
MessagePack
The next format to land on our table is MessagePack. It has become quite popular lately; I personally discovered it while tinkering with Tarantool.
On the format's website you can:
- Learn that the format is actively used by products such as Redis and Fluentd, which inspires confidence.
- See the loud tagline "It's like JSON. but fast and small."
We will have to check how true that is, but first let's see what the format offers.
By tradition, let's start with the pros:
- The format is fully compatible with JSON.
Converting data from MessagePack to JSON loses nothing, which cannot be said of the BSON format. There are, however, a number of limits on the various data types:
- Integer values are limited to the range from −(2^63) to (2^64)−1;
- The maximum length of a binary object is (2^32)−1;
- The maximum byte size of a string is (2^32)−1;
- An array can hold at most (2^32)−1 elements;
- An associative array (map) can hold at most (2^32)−1 elements.
- It packs data fairly well.
For example, {"a": 1, "b": 2} takes 13 bytes in JSON, 19 bytes in BSON and only 7 bytes in MessagePack, which is pretty good (the sketch after the pros and cons below checks these numbers).
- The supported data types can be extended.
MsgPack allows you to extend its type system with your own types. Since a type in MsgPack is encoded by a number, and values from −1 to −128 are reserved by the format (the specification says so), values from 0 to 127 are available for use. So we can register extensions that denote our own data types.
- It is supported by a huge number of languages.
- There is an RPC package (though that is not so important for us).
- There is a streaming API (a short sketch follows).
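The streaming API is worth a quick illustration: objects can be written back to back into one stream and read out the same way, with no extra framing. A Go sketch, assuming the vmihailenco/msgpack package (which may differ from the library used in our tests):

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/vmihailenco/msgpack/v5"
)

func main() {
	var buf bytes.Buffer

	// Stream several objects into one buffer, back to back.
	enc := msgpack.NewEncoder(&buf)
	for i := 0; i < 3; i++ {
		if err := enc.Encode(map[string]int{"seq": i}); err != nil {
			panic(err)
		}
	}

	// Read them back one by one, without length prefixes.
	dec := msgpack.NewDecoder(&buf)
	for {
		var m map[string]int
		if err := dec.Decode(&m); err != nil {
			break // io.EOF once the stream is exhausted
		}
		fmt.Println(m)
	}
}
```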
Cons:
- No support for partial data modification.
Unlike BSON, even though MsgPack stores the size of each field, you cannot partially modify data in place. Suppose we have the serialized representation of {"a": 1, "b": 2}. BSON spends a fixed 5 bytes on the value of key 'a' (a type byte plus an int32), so we can change the value from 1 to 2000 without any problem. MessagePack, however, stores 1 in a single byte, and since 2000 occupies 3 bytes, we cannot change the value of 'a' without shifting the data of the 'b' field.
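A small Go sketch checks both the size claim above and the in-place modification problem. We assume the vmihailenco/msgpack package here, which, again, may differ from the library used in our tests:

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/vmihailenco/msgpack/v5"
)

func main() {
	obj := map[string]int{"a": 1, "b": 2}

	j, _ := json.Marshal(obj)
	m, err := msgpack.Marshal(obj)
	if err != nil {
		panic(err)
	}
	fmt.Printf("json: %d bytes, msgpack: %d bytes\n", len(j), len(m)) // 13 vs 7

	// The flip side of compact, variable-width encoding: a value cannot
	// grow in place. 1 fits into a one-byte fixint, 2000 needs 3 bytes.
	one, _ := msgpack.Marshal(1)
	big, _ := msgpack.Marshal(2000)
	fmt.Println(len(one), len(big)) // 1 3
}
```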
Now let's see how fast it is and how well it packs data. The libraries used for the tests are listed in the test sources.
We got the following results:
| JS (Node) | Result |
|---|---|
| JSON encode | 21,507 ops/sec (86 runs sampled) |
| JSON decode | 9,039 ops/sec (89 runs sampled) |
| JSON roundtrip | 6,090 ops/sec (93 runs sampled) |
| JSON compressed encode | 1,168 ops/sec (84 runs sampled) |
| JSON compressed decode | 2,980 ops/sec (93 runs sampled) |
| JSON compressed roundtrip | 874 ops/sec (86 runs sampled) |
| BSON encode | 93.21 ops/sec (76 runs sampled) |
| BSON decode | 242 ops/sec (84 runs sampled) |
| BSON roundtrip | 65.24 ops/sec (65 runs sampled) |
| MsgPack encode | 4,758 ops/sec (79 runs sampled) |
| MsgPack decode | 2,632 ops/sec (91 runs sampled) |
| MsgPack roundtrip | 1,692 ops/sec (91 runs sampled) |
| JS (browser) | Result |
|---|---|
| JSON roundtrip | 5,754 ops/sec |
| JSON compressed roundtrip | 890 ops/sec |
| BSON roundtrip | 374 ops/sec |
| MsgPack roundtrip | 1,048 ops/sec |
| Go | Iterations | ns/op | MB/s | B/op | allocs/op |
|---|---|---|---|---|---|
| JSON encode | 5000 | 391100 | 24.37 | 54520 | 1478 |
| JSON decode | 3000 | 392785 | 24.27 | 76634 | 1430 |
| JSON roundtrip | 2000 | 796115 | 11.97 | 131150 | 2908 |
| JSON compressed encode | 3000 | 422254 | 0.00 | 54790 | 1478 |
| JSON compressed decode | 3000 | 464569 | 4.50 | 117206 | 1446 |
| JSON compressed roundtrip | 2000 | 881305 | 0.00 | 171795 | 2915 |
| BSON encode | 10000 | 249024 | 40.42 | 70085 | 982 |
| BSON decode | 3000 | 524408 | 19.19 | 124777 | 3580 |
| BSON roundtrip | 2000 | 712524 | 14.13 | 195334 | 4562 |
| MsgPack encode | 5000 | 306260 | 27.36 | 49907 | 968 |
| MsgPack decode | 10000 | 214967 | 38.98 | 59649 | 1690 |
| MsgPack roundtrip | 3000 | 547434 | 15.31 | 109754 | 2658 |
And here is what we got for data size:
| JS (Node) | Size |
|---|---|
| JSON | 9482 bytes |
| JSON compressed | 1872 bytes |
| BSON | 112710 bytes |
| MsgPack | 7628 bytes |

| JS (Browser) | Size |
|---|---|
| JSON | 9482 bytes |
| JSON compressed | 1872 bytes |
| BSON | 9618 bytes |
| MsgPack | 7628 bytes |
Of course, MessagePack does not pack data as tightly as we would like, but at least it behaves consistently in both JS and Go. At the moment it is probably the most attractive candidate for our tasks, but one last patient remains to be examined.
CBOR
To be honest, this format is very similar to MessagePack in its capabilities, and it looks as if it was designed as a replacement for MessagePack. It also supports data type extensions and is fully compatible with JSON. Among the differences I noticed only support for arrays and strings of indefinite length (handy when the total length is not known up front), but to my mind that is a rather strange feature. If you want to know more about the format, there was a good article about it on Habr: habrahabr.ru/post/208690. Well then, let's see how CBOR does on performance and data size.
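Usage-wise, CBOR looks just like the other schema-less formats. A minimal Go sketch, assuming the fxamacker/cbor package (not necessarily one of the libraries from our tests):

```go
package main

import (
	"fmt"

	"github.com/fxamacker/cbor/v2"
)

func main() {
	obj := map[string]int{"a": 1, "b": 2}

	data, err := cbor.Marshal(obj)
	if err != nil {
		panic(err)
	}
	// Same idea as MessagePack: compact map/int headers, 7 bytes here.
	fmt.Printf("cbor: %d bytes\n", len(data))

	var out map[string]int
	if err := cbor.Unmarshal(data, &out); err != nil {
		panic(err)
	}
	fmt.Println(out) // map[a:1 b:2]
}
```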
The libraries actually used for the tests are, as before, listed in the test sources.
And, finally, here are the combined results of our tests for all the formats considered:
| JS (Node) | Result |
|---|---|
| JSON encode | 21,507 ops/sec ± 1.01% (86 runs sampled) |
| JSON decode | 9,039 ops/sec ± 0.90% (89 runs sampled) |
| JSON roundtrip | 6,090 ops/sec ± 0.62% (93 runs sampled) |
| JSON compressed encode | 1,168 ops/sec ± 1.20% (84 runs sampled) |
| JSON compressed decode | 2,980 ops/sec ± 0.43% (93 runs sampled) |
| JSON compressed roundtrip | 874 ops/sec ± 0.91% (86 runs sampled) |
| BSON encode | 93.21 ops/sec ± 0.64% (76 runs sampled) |
| BSON decode | 242 ops/sec ± 0.63% (84 runs sampled) |
| BSON roundtrip | 65.24 ops/sec ± 1.27% (65 runs sampled) |
| MsgPack encode | 4,758 ops/sec ± 1.13% (79 runs sampled) |
| MsgPack decode | 2,632 ops/sec ± 0.90% (91 runs sampled) |
| MsgPack roundtrip | 1,692 ops/sec ± 0.83% (91 runs sampled) |
| CBOR encode | 1,529 ops/sec ± 4.13% (89 runs sampled) |
| CBOR decode | 1,198 ops/sec ± 0.97% (88 runs sampled) |
| CBOR roundtrip | 351 ops/sec ± 3.28% (77 runs sampled) |
| JS (browser) | Result |
|---|---|
| JSON roundtrip | 5,754 ops/sec ± 0.63% |
| JSON compressed roundtrip | 890 ops/sec ± 1.72% |
| BSON roundtrip | 374 ops/sec ± 2.22% |
| MsgPack roundtrip | 1,048 ops/sec ± 5.40% |
| CBOR roundtrip | 859 ops/sec ± 4.19% |
| Go | Iterations | ns/op | MB/s | B/op | allocs/op |
|---|---|---|---|---|---|
| JSON encode | 5000 | 391100 | 24.37 | 54520 | 1478 |
| JSON decode | 3000 | 392785 | 24.27 | 76634 | 1430 |
| JSON roundtrip | 2000 | 796115 | 11.97 | 131150 | 2908 |
| JSON compressed encode | 3000 | 422254 | 0.00 | 54790 | 1478 |
| JSON compressed decode | 3000 | 464569 | 4.50 | 117206 | 1446 |
| JSON compressed roundtrip | 2000 | 881305 | 0.00 | 171795 | 2915 |
| BSON encode | 10000 | 249024 | 40.42 | 70085 | 982 |
| BSON decode | 3000 | 524408 | 19.19 | 124777 | 3580 |
| BSON roundtrip | 2000 | 712524 | 14.13 | 195334 | 4562 |
| MsgPack encode | 5000 | 306260 | 27.36 | 49907 | 968 |
| MsgPack decode | 10000 | 214967 | 38.98 | 59649 | 1690 |
| MsgPack roundtrip | 3000 | 547434 | 15.31 | 109754 | 2658 |
| CBOR encode | 20000 | 71203 | 117.48 | 32944 | 12 |
| CBOR decode | 3000 | 432005 | 19.36 | 40216 | 2159 |
| CBOR roundtrip | 3000 | 531434 | 15.74 | 73160 | 2171 |
And here is what we got for data size:
| JS (Node) | Size |
|---|---|
| JSON | 9482 bytes |
| JSON compressed | 1872 bytes |
| BSON | 112710 bytes |
| MsgPack | 7628 bytes |
| CBOR | 7617 bytes |

| JS (Browser) | Size |
|---|---|
| JSON | 9482 bytes |
| JSON compressed | 1872 bytes |
| BSON | 9618 bytes |
| MsgPack | 7628 bytes |
| CBOR | 7617 bytes |
Comments are hardly needed here; everything is clearly visible in the results: in our JS tests CBOR turned out to be one of the slowest formats.
Conclusions
So what conclusions did we draw from this comparison? After some thought and a long look at the results, we concluded that none of these formats satisfied us. Yes, MsgPack proved to be quite a good option: easy to use and fairly consistent. But after consulting with colleagues, we decided to take a fresh look at binary data formats that are not JSON-based: Protobuf, FlatBuffers, Cap'n Proto and Avro. What came of it and what we ultimately chose will be covered in the next article.
Posted by:
KyKyPy3uK