
Comparison of serialization formats

When choosing a serialization format for messages that will be written to a queue, a log, or some other store, a number of questions arise that affect the final choice. Two of the key ones are serialization speed and the size of the resulting message. Since there are many formats for such purposes, I decided to test some of them and share the results.

Preparation for testing


The following formats will be tested:

  1. Java serialization
  2. JSON
  3. Avro
  4. Protobuf
  5. Thrift (binary, compact)
  6. Msgpack


Scala was chosen as the programming language.
The main testing tool is Scalameter.
The following parameters will be measured and compared: the time spent on serialization and deserialization, and the size of the resulting files.

Ease of use, the possibility of schema evolution, and other important criteria are out of scope for this comparison.
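The actual benchmarks use Scalameter, which handles JVM warm-up and repeated runs. As a rough illustration of what is being measured, a naive timing loop might look like the sketch below (`serialize` and `records` are placeholders, not code from the benchmark):

```scala
// Naive illustration of the measurement: time a serialization function
// over a batch of records. The real benchmarks use Scalameter, which
// adds JVM warm-up and multiple runs; serialize/records are placeholders.
def timeMillis[A](body: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = body
  val elapsed = (System.nanoTime() - start) / 1e6
  (result, elapsed)
}

val records: Seq[String] = Seq.fill(1000)(java.util.UUID.randomUUID().toString)
val serialize: String => Array[Byte] = _.getBytes("UTF-8")

val (bytes, ms) = timeMillis(records.map(serialize))
println(f"serialized ${bytes.map(_.length).sum} bytes in $ms%.2f ms")
```

A single measurement like this is exactly what makes benchmark results unstable, which is why a harness with warm-up and statistics is used instead.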

Input Generation


For the purity of the experiment, the data set is pre-generated. The input data format is a CSV file. Numeric values are generated with the simple `Random.next [...]` methods and string values with `UUID.randomUUID()`. The generated data is written to the CSV file using kantan. In total, 3 data sets of 100k records each were generated:

  1. Mixed data - 28 MB

    MixedData
    final case class MixedData(
      f1: Option[String], f2: Option[Double], f3: Option[Long], f4: Option[Int],
      f5: Option[String], f6: Option[Double], f7: Option[Long], f8: Option[Int],
      f9: Option[Int], f10: Option[Long], f11: Option[Float], f12: Option[Double],
      f13: Option[String], f14: Option[String], f15: Option[Long], f16: Option[Int],
      f17: Option[Int], f18: Option[String], f19: Option[String], f20: Option[String]
    ) extends Data

  2. Only strings - 71 MB

    OnlyStrings
    final case class OnlyStrings(
      f1: Option[String], f2: Option[String], f3: Option[String], f4: Option[String],
      f5: Option[String], f6: Option[String], f7: Option[String], f8: Option[String],
      f9: Option[String], f10: Option[String], f11: Option[String], f12: Option[String],
      f13: Option[String], f14: Option[String], f15: Option[String], f16: Option[String],
      f17: Option[String], f18: Option[String], f19: Option[String], f20: Option[String]
    ) extends Data

  3. Only numbers (Long) - 20 MB

    OnlyLongs
    final case class OnlyLongs(
      f1: Option[Long], f2: Option[Long], f3: Option[Long], f4: Option[Long],
      f5: Option[Long], f6: Option[Long], f7: Option[Long], f8: Option[Long],
      f9: Option[Long], f10: Option[Long], f11: Option[Long], f12: Option[Long],
      f13: Option[Long], f14: Option[Long], f15: Option[Long], f16: Option[Long],
      f17: Option[Long], f18: Option[Long], f19: Option[Long], f20: Option[Long]
    ) extends Data


Each entry consists of 20 fields. The value of each field is optional.
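A minimal sketch of the generation step (names are illustrative; the article's actual generator writes CSV via kantan): each field is wrapped in `Option` and randomly left empty, which is what makes the data sets sparse.

```scala
import java.util.UUID
import scala.util.Random

// Illustrative generator: UUIDs for string fields, Random for numbers,
// and a coin flip to leave a field empty (the real data set is sparse).
def optOf[A](gen: => A): Option[A] =
  if (Random.nextBoolean()) Some(gen) else None

// A 3-field stand-in for the article's 20-field case classes.
final case class SampleRow(f1: Option[String], f2: Option[Long], f3: Option[Double])

def randomRow(): SampleRow =
  SampleRow(
    f1 = optOf(UUID.randomUUID().toString),
    f2 = optOf(Random.nextLong()),
    f3 = optOf(Random.nextDouble())
  )

val rows = Seq.fill(5)(randomRow())
```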

Testing


The characteristics of the machine on which testing took place, and the Java and Scala versions:
PC: 1.8 GHz Intel Core i5-5350U (2 physical cores), 8 GB 1600 MHz DDR3, SSD SM0128G
Java version: 1.8.0_144-b01; Hotspot: build 25.144-b01
Scala version: 2.12.8

Java serialization


                      Mixed data   Only longs   Only strings
Serialization, ms        3444.53      2586.23        5548.63
Deserialization, ms       852.62       617.65        2006.41
Size, MB                      36           24             86

JSON


                      Mixed data   Only longs   Only strings
Serialization, ms        5280.67      4358.13        5958.92
Deserialization, ms      3347.20      2730.19        4039.24
Size, MB                      52           36            124

Avro


The Avro schema is generated on the fly right before testing. The avro4s library was used for this.
                      Mixed data   Only longs   Only strings
Serialization, ms        2146.72      1546.95        2829.31
Deserialization, ms       692.56       535.96         944.27
Size, MB                      22           11             73

Protobuf


Protobuf schema

    syntax = "proto3";
    package protoBenchmark;

    option java_package = "protobufBenchmark";
    option java_outer_classname = "data";

    message MixedData {
      string f1 = 1;   double f2 = 2;   sint64 f3 = 3;   sint32 f4 = 4;
      string f5 = 5;   double f6 = 6;   sint64 f7 = 7;   sint32 f8 = 8;
      sint32 f9 = 9;   sint64 f10 = 10; double f11 = 11; double f12 = 12;
      string f13 = 13; string f14 = 14; sint64 f15 = 15; sint32 f16 = 16;
      sint32 f17 = 17; string f18 = 18; string f19 = 19; string f20 = 20;
    }

    message OnlyStrings {
      string f1 = 1;   string f2 = 2;   string f3 = 3;   string f4 = 4;
      string f5 = 5;   string f6 = 6;   string f7 = 7;   string f8 = 8;
      string f9 = 9;   string f10 = 10; string f11 = 11; string f12 = 12;
      string f13 = 13; string f14 = 14; string f15 = 15; string f16 = 16;
      string f17 = 17; string f18 = 18; string f19 = 19; string f20 = 20;
    }

    message OnlyLongs {
      sint64 f1 = 1;   sint64 f2 = 2;   sint64 f3 = 3;   sint64 f4 = 4;
      sint64 f5 = 5;   sint64 f6 = 6;   sint64 f7 = 7;   sint64 f8 = 8;
      sint64 f9 = 9;   sint64 f10 = 10; sint64 f11 = 11; sint64 f12 = 12;
      sint64 f13 = 13; sint64 f14 = 14; sint64 f15 = 15; sint64 f16 = 16;
      sint64 f17 = 17; sint64 f18 = 18; sint64 f19 = 19; sint64 f20 = 20;
    }

To generate the protobuf3 classes, the ScalaPB plugin was used.
                      Mixed data   Only longs   Only strings
Serialization, ms        1169.40       865.06        1856.20
Deserialization, ms       113.56        77.38         256.02
Size, MB                      22           11             73

Thrift


Thrift schema

    namespace java thriftBenchmark.java
    #@namespace scala thriftBenchmark.scala

    typedef i32 int
    typedef i64 long

    struct MixedData {
      1: optional string f1,   2: optional double f2,   3: optional long f3,    4: optional int f4,
      5: optional string f5,   6: optional double f6,   7: optional long f7,    8: optional int f8,
      9: optional int f9,      10: optional long f10,   11: optional double f11, 12: optional double f12,
      13: optional string f13, 14: optional string f14, 15: optional long f15,  16: optional int f16,
      17: optional int f17,    18: optional string f18, 19: optional string f19, 20: optional string f20,
    }

    struct OnlyStrings {
      1: optional string f1,   2: optional string f2,   3: optional string f3,   4: optional string f4,
      5: optional string f5,   6: optional string f6,   7: optional string f7,   8: optional string f8,
      9: optional string f9,   10: optional string f10, 11: optional string f11, 12: optional string f12,
      13: optional string f13, 14: optional string f14, 15: optional string f15, 16: optional string f16,
      17: optional string f17, 18: optional string f18, 19: optional string f19, 20: optional string f20,
    }

    struct OnlyLongs {
      1: optional long f1,   2: optional long f2,   3: optional long f3,   4: optional long f4,
      5: optional long f5,   6: optional long f6,   7: optional long f7,   8: optional long f8,
      9: optional long f9,   10: optional long f10, 11: optional long f11, 12: optional long f12,
      13: optional long f13, 14: optional long f14, 15: optional long f15, 16: optional long f16,
      17: optional long f17, 18: optional long f18, 19: optional long f19, 20: optional long f20,
    }

To generate the Scala thrift classes, the Scrooge plugin was used.

Binary
                      Mixed data   Only longs   Only strings
Serialization, ms        1274.69       877.98        2168.27
Deserialization, ms       220.58       133.64         514.96
Size, MB                      37           16             98

Compact
                      Mixed data   Only longs   Only strings
Serialization, ms        1294.87       900.02        2199.94
Deserialization, ms       240.23       232.53         505.03
Size, MB                      31           14             98

Msgpack


                      Mixed data   Only longs   Only strings
Serialization, ms        1142.56       791.55        1974.73
Deserialization, ms       289.60        80.36         428.36
Size, MB                      21          9.6             73

Final comparison


(chart: serialization time)

(chart: deserialization time)

(chart: output size)

Accuracy of results
Important: the serialization and deserialization timings are not 100% accurate and carry a large error. Even though the tests were run many times with additional JVM warm-up, it is hard to call the results stable and precise. That is why I strongly recommend not drawing final conclusions about any particular serialization format from the timing charts alone.


Given that the results are not completely accurate, some observations can still be made from them:

  1. Once again I was convinced that Java serialization is slow and not the most economical in output size. One of the main reasons for the slowness is accessing object fields via reflection. By the way, fields are accessed and written not in the order in which they are declared in the class, but sorted in lexicographical order (just an interesting fact);
  2. JSON is the only text format in this comparison. The reason data serialized to JSON takes so much space is that every record is written together with its field names, effectively its schema. This also affects the speed of writing to the file: the more bytes to write, the longer it takes. Also, a JSON object is created for every record, which does not reduce the time either;
  3. When serializing an object, Avro analyzes the schema to decide how to handle each particular field. This is extra work that increases total serialization time;
  4. Thrift, compared with, say, protobuf and msgpack, needs more bytes to write a single field, since the field's meta information is stored along with its value. Also, looking at thrift's output files, you can see that the record start/end markers and the size of each record used as a separator take up a noticeable fraction of the total. All this only increases the time spent on packing;
  5. Protobuf, like thrift, packs meta information, but does it in a more optimized way. Differences in the packing and unpacking algorithms themselves also allow this format to work faster than the others in some cases;
  6. Msgpack works quite fast. One of the reasons is that no extra meta information is serialized at all. This is both good and bad: good because the data takes little disk space and no extra time to write; bad because nothing is known about the record structure, so deciding how to pack or unpack a value has to be done for every field of every record.

As for the sizes of the output files, the observations are quite unambiguous:

  1. The smallest file for the numeric set came from msgpack;
  2. The smallest file for the string set was the source CSV file itself :) Apart from the source file, avro won by a small margin over msgpack and protobuf;
  3. The smallest file for the mixed set again came from msgpack. However, the gap is not that noticeable, with avro and protobuf very close behind;
  4. The biggest files came from JSON. An important caveat, though: JSON is a text format, and comparing it with binary formats in terms of volume (and serialization speed) is not entirely fair;
  5. The largest file for the numeric set came from standard Java serialization;
  6. The largest file for the string set came from thrift binary;
  7. The largest file for the mixed set also came from thrift binary, followed by standard Java serialization.

Format analysis


Now let's try to make sense of the results using the example of serializing one 36-character string (a UUID), ignoring separators between records and the record start/end markers: only a single string field is recorded, but parameters such as the field's type and number are taken into account. This one case covers several aspects at once:

  1. Number serialization (in this case, the string length)
  2. String serialization

Let's start with Avro. Since all fields are of type `Option`, the schema for such fields is `union: ["null", "string"]`. Knowing this, we get the following result:
1 byte to indicate the branch of the union (null or string), 1 byte for the string length (1 byte because Avro uses variable-length encoding for integers), and 36 bytes for the string itself. Total: 38 bytes.

Now consider msgpack. Msgpack uses a similar variable-length approach to writing integers: spec. Let's calculate what it actually takes to write a string field: 2 bytes for the string length (since the string is longer than 31 bytes, a 2-byte header is needed) and 36 bytes for the data. Total: 38 bytes.
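Per the msgpack spec, the header size for a UTF-8 string depends on its byte length (fixstr up to 31 bytes, then str8, str16, str32). A sketch of that rule:

```scala
// Header size in bytes for a msgpack string of the given byte length:
// fixstr (<= 31 bytes): 1 header byte; str8 (<= 255): 2; str16 (<= 65535): 3; str32: 5.
def msgpackStrHeaderBytes(len: Int): Int =
  if (len <= 31) 1
  else if (len <= 255) 2
  else if (len <= 65535) 3
  else 5

// A 36-character UUID needs a 2-byte header + 36 data bytes = 38 bytes.
val uuidTotal = msgpackStrHeaderBytes(36) + 36
```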

Protobuf also uses variable-length encoding for numbers. However, in addition to the string length, protobuf adds one more byte carrying the field number and type. Total: 38 bytes.
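The variable-length (varint) encoding mentioned for Avro and protobuf stores small numbers in fewer bytes; for signed fields (Avro ints, protobuf `sint*`) a ZigZag step first maps small negative numbers to small unsigned ones. A sketch of both:

```scala
// ZigZag maps signed longs to unsigned so small magnitudes stay small:
// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
def zigZag(n: Long): Long = (n << 1) ^ (n >> 63)

// Number of bytes a varint needs: 7 payload bits per byte,
// so values up to 127 fit in a single byte (e.g. a string length of 36).
def varintBytes(u: Long): Int = {
  var v = u
  var count = 1
  while ((v >>> 7) != 0) { v = v >>> 7; count += 1 }
  count
}
```

This is why the length prefix of a 36-byte string costs only 1 byte in Avro, protobuf, and thrift compact.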

Thrift binary does not optimize the recording of a string's length, and instead of 1 byte for the field header thrift takes 3. The result: 1 byte for the field type, 2 bytes for the field number, 4 bytes for the string length, and 36 bytes for the string. Total: 43 bytes.

Thrift compact, unlike binary, uses the variable-length approach for integers and additionally uses an abbreviated field header where possible. This gives: 1 byte for the type and field number, 1 byte for the length, and 36 bytes for the data. Total: 38 bytes.
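The byte accounting above for a single 36-character string field can be summed up in a few lines (a simple check of the arithmetic, following the per-format breakdown in the text):

```scala
// Byte cost of one 36-byte string field under each format's framing,
// following the accounting in the text above.
val strLen = 36
val thriftBinary  = 1 + 2 + 4 + strLen // type byte, 2-byte field id, 4-byte length
val thriftCompact = 1 + 1 + strLen     // combined type+id byte, varint length
val protobuf      = 1 + 1 + strLen     // tag byte, varint length
val avro          = 1 + 1 + strLen     // union branch byte, varint length
```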

Java serialization took 45 bytes to write the string: 36 bytes for the string itself, 2 bytes for its length, and 7 bytes of additional information that I could not decipher.

That leaves avro, msgpack, protobuf, and thrift compact. Each of these formats needs 38 bytes to write a 36-character UTF-8 string. Why, then, did packing 100k string records produce the smallest file in avro, even though the (uncompressed) schema was written together with the data? Avro leads the other formats by a small margin, and the reason is the absence of an extra 4 bytes per record for the length of the whole record. Neither msgpack, nor protobuf, nor thrift has a special record separator, so to unpack the records correctly I had to know the exact size of each one. Were it not for that, the smallest file would most likely have come from msgpack.
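The 4-byte record-length prefix mentioned above, which this benchmark writes before every msgpack/protobuf/thrift record, adds up. A quick back-of-the-envelope sketch for a hypothetical single-string record:

```scala
// Per-record cost for one 36-byte string, with the benchmark's 4-byte
// length prefix added for formats that lack a record delimiter.
val perRecordAvro  = 38     // avro needs no extra framing here
val perRecordOther = 38 + 4 // msgpack/protobuf/thrift compact + length prefix
val records100k    = 100000
val extraBytes     = (perRecordOther - perRecordAvro) * records100k // ~400 KB
```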

For the numeric data set, the main reason for msgpack's win was the absence of schema information in the packed data plus the fact that the data was sparse. In thrift and protobuf, even empty values take more than 1 byte because the field's type and number must be packed. Avro and msgpack need exactly 1 byte to write an empty value, but avro, as already mentioned, stores the schema with the data.

Msgpack also produced the smallest file for the mixed set, which was sparse as well, for the same reasons.

So it turns out that data packed with msgpack takes the least space. This is quite a defensible claim: it is no accident that msgpack was chosen as the storage format for tarantool and aerospike.

Conclusion


After this testing, I can draw the following conclusions:

  1. Getting stable benchmark results is hard;
  2. Choosing a format is a trade-off between serialization speed and output size. At the same time, one should not forget such important criteria as ease of use and the possibility of schema evolution (these often play the dominant role).

Source code can be viewed here: github

Source: https://habr.com/ru/post/458026/

