
The dark side of protobuf

Among developers there is a common opinion that the protobuf serialization protocol and its implementation are a special, outstanding technology that solves all real and potential performance problems by the mere fact of being used in a project. This perception is perhaps helped by how easy the technology is to apply, and by the authority of Google itself.



Unfortunately, on one project I had to come face to face with some peculiarities that are not mentioned in the promotional documentation, but which strongly affect the technical characteristics of a project.



Everything discussed below applies only to the protobuf implementation for the Java platform. It mainly describes version 2.6.1, although I saw no fundamental changes in the already released version 3.0.0.



I should also point out that the article does not claim to be a complete review. You can read about the good sides of the technology (for example, its multi-language support and excellent documentation) on the official site. This article covers only the problems and will, hopefully, allow you to make a more balanced decision. One part of the problems relates to the format itself, the other part concerns the implementation. It should also be clarified that most of the problems mentioned here show up only under certain conditions.


A maven project with the dependencies already wired up, ready for independent experiments, can be found on github.



0. The need for preprocessing



This is the smallest problem; I did not even want to include it in the list, but let it be mentioned for completeness. To get the Java code, you have to run the protoc compiler. The trouble is that this compiler is a native application, and each platform needs its own executable, so the build cannot be handled by simply adding a maven plugin. At a minimum, an environment variable pointing to the executable is needed on developer machines and on the CI server, after which protoc can be invoked from a maven/ant script.



As an option, one could write a maven plugin that keeps all the binaries in its resources, unpacks the one for the current platform into a temporary folder and launches it from there. I do not know, perhaps someone has already done this.



All in all, a small sin, so it can be forgiven.



1. Impractical code



Unfortunately, for the Java platform the protoc generator produces very impractical code. Instead of generating clean anemic containers plus separate serializers for them, the generator stuffs everything into one big class with nested subclasses. The generated beans cannot be embedded into your own hierarchy, and they cannot even trivially implement the java.util.Serializable interface so that they could be pushed somewhere else. In general, they are only usable as highly specialized DTOs. If that suits you, this is not a problem at all; just do not look inside.



2. Excessive copying - poor performance



This is where the genuinely objective problems begin. For each described entity (let's call it Bean) the generated code creates two classes (and one interface, but it is not important in this context). The first class is the immutable Bean, a read-only snapshot of the data; the second is the mutable Bean.Builder, which can actually be edited and have values set.
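To make this concrete, here is a hand-written sketch that only mimics the shape of the generated code (the real protoc output is far larger, and the names here are mine):

    // A mimic of the protoc output shape: everything nested in one outer
    // class, an immutable message plus a mutable builder.
    public final class BeanOuterClass {

        public static final class Bean {
            private final long id;
            private final String name;

            private Bean(Builder b) {        // only the builder creates instances
                this.id = b.id;
                this.name = b.name;
            }

            public long getId() { return id; }
            public String getName() { return name; }

            public static Builder newBuilder() { return new Builder(); }

            public Builder toBuilder() {     // copies every field out again
                Builder b = new Builder();
                b.id = this.id;
                b.name = this.name;
                return b;
            }

            public static final class Builder {
                private long id;
                private String name = "";

                public Builder setId(long v) { this.id = v; return this; }
                public Builder setName(String v) { this.name = v; return this; }

                public Bean build() {        // copies every field into a new Bean
                    return new Bean(this);
                }
            }
        }
    }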



Why this was done remains a mystery. Some say the authors belong to the sect of immutability adherents; some claim they were trying to get rid of circular dependencies during serialization (how would that have helped them?); some say that the first version of protobuf worked only with mutable classes, and careless users kept shooting themselves in the foot.



One could say that architectural tastes differ, but with this design, in order to get a byte representation you have to create a Bean.Builder, fill it in, and then call the build() method. In order to change a bean, you have to create its builder via the toBuilder() method, change the value, and call build() again.



And that would be fine, except that on every call to build() and toBuilder() all the fields are copied from an instance of one class into an instance of the other. If all you need is to get a byte array for serialization, or to change a couple of fields, this copying is a serious nuisance. On top of that, this method seems (I am still verifying this) to have a long-standing bug which causes even those fields to be copied whose values were never set in the builder.
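Assuming a message Bean with fields id and name generated from a hypothetical bean.proto, the typical round trip with the standard generated API looks like this:

    // Build: setters fill the Builder, build() copies all fields
    // into a fresh immutable Bean.
    Bean bean = Bean.newBuilder()
            .setId(42L)
            .setName("example")
            .build();

    byte[] bytes = bean.toByteArray();       // serialize

    // Changing one field: toBuilder() copies all fields out,
    // build() copies them all back in.
    Bean renamed = bean.toBuilder()
            .setName("renamed")
            .build();

    // Deserialize (throws InvalidProtocolBufferException on bad input).
    Bean parsed = Bean.parseFrom(bytes);

Two full copies of every field, just to change a single value.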



You will hardly notice this if your beans are small, with a handful of fields. However, I inherited a whole library in which the number of fields in individual beans reached three hundred. Calling build() for such a bean takes about 50µs in my case, which allows processing no more than 20,000 beans per second.
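A crude way to reproduce the measurement (a sketch: buildFullyPopulatedBean() is a hypothetical helper that sets all the fields, and a serious benchmark should use JMH rather than a bare loop):

    // Naive timing of toBuilder()+build() alone: one copy out, one copy in.
    Bean bean = buildFullyPopulatedBean();   // hypothetical helper
    int iterations = 100_000;
    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
        bean = bean.toBuilder().build();
    }
    long perCall = (System.nanoTime() - start) / iterations;
    System.out.println("toBuilder().build(): " + perCall + " ns per call");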



The irony is that in my other tests, serializing a similar bean through Jackson/JSON turns out two to three times faster (provided that not all fields are initialized, so most of the fields can be skipped during serialization).
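For reference, the Jackson side of such a test is just a plain POJO; with @JsonInclude, unset fields are skipped entirely (a sketch of the setup, not my original benchmark code):

    import com.fasterxml.jackson.annotation.JsonInclude;
    import com.fasterxml.jackson.databind.ObjectMapper;

    @JsonInclude(JsonInclude.Include.NON_NULL)
    class BeanPojo {
        public Long id;
        public String name;                    // null fields are omitted
    }

    class JacksonDemo {
        public static void main(String[] args) throws Exception {
            BeanPojo pojo = new BeanPojo();
            pojo.id = 42L;                     // name stays null
            byte[] json = new ObjectMapper().writeValueAsBytes(pojo);
            System.out.println(new String(json));  // {"id":42}
        }
    }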



3. Loss of references



If you have a graph-like structure in which beans reference each other, I have bad news for you: protobuf is not suitable for serializing such structures. It saves beans by value, without tracking whether a given bean has already been serialized.



In other words, if bean1 and bean2 refer to each other, then after serialization and deserialization you will get a bean1 that refers to some bean3, and a bean2 that refers to some bean4.



I am sure that in the overwhelming majority of cases such functionality is not needed, and in simple DTOs it is even contraindicated. However, the problem also shows up in more natural cases. For example, if you add the same bean to a collection 100 times, it will be saved all 100 times, not once. Or suppose you serialize a list of lots (goods). Each lot is a small bean with a description (quantity, price, date), plus a reference to a hefty product description. If you serialize this head-on, the product description will be written out as many times as there are lots, even if all the lots point to the same product. The solution is to store the products separately, as a dictionary, but that means extra work both during serialization and during deserialization.
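A sketch of the lots example, assuming messages generated from a hypothetical proto file (Product with a heavy description, Lot with a quantity and an embedded Product, LotList with a repeated Lot field):

    // veryLongDescription: some big string defined elsewhere.
    Product product = Product.newBuilder()
            .setId(7L)
            .setDescription(veryLongDescription)  // the heavy part
            .build();

    LotList.Builder lots = LotList.newBuilder();
    for (int i = 0; i < 100; i++) {
        // In memory all 100 lots share the single Product instance...
        lots.addLot(Lot.newBuilder()
                .setQuantity(i)
                .setProduct(product)
                .build());
    }

    // ...but on the wire its bytes are embedded 100 times over.
    byte[] bytes = lots.build().toByteArray();

    // After parsing, the sharing is gone: every lot owns its own copy.
    LotList parsed = LotList.parseFrom(bytes);
    boolean shared =
            parsed.getLot(0).getProduct() == parsed.getLot(1).getProduct();  // false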



The described behavior is absolutely expected and natural for text formats such as JSON/XML. But from a binary format you expect something different, especially since the standard Java serialization handles this exactly as expected.



4. Compactness is questionable



It is claimed that protobuf is a super-compact format. In reality, the compactness of the serialization comes from just a few factors:

- integer fields are stored as variable-length varints, so small values take fewer bytes;
- fields are identified on the wire by short numeric tags rather than by their names;
- fields left at their default values are not written at all.

And all this is wonderful, but if we look at the byte representation of the average DTO (I will not speak for all of them) of a modern service, we will see that most of the space is occupied by strings, not primitives: logins, usernames, titles, descriptions, comments, resource URIs, often in several variants (image resolutions, for example). What does protobuf do with strings? Nothing special, really: it just writes them to the stream as UTF-8. And remember that national characters in UTF-8 take two or even three bytes each.
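The effect is easy to verify with the JDK alone, no protobuf involved:

    import java.nio.charset.StandardCharsets;

    class Utf8Size {
        public static void main(String[] args) {
            String latin = "protobuf";        // ASCII: 1 byte per character
            String cyrillic = "протобуф";     // Cyrillic: 2 bytes per character
            System.out.println(latin.getBytes(StandardCharsets.UTF_8).length);     // 8
            System.out.println(cyrillic.getBytes(StandardCharsets.UTF_8).length);  // 16
        }
    }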



Suppose the application produces data in which strings account for 75% of the bytes and primitives for 25%. In that case, even if our primitive-optimizing algorithm shrinks the space they require down to zero, we save only a quarter.



In some cases compact serialization is truly critical, for example for mobile applications under poor or expensive connectivity. In such cases additional compression on top of protobuf is unavoidable, otherwise we would be transferring redundant bytes in the strings. But then it suddenly turns out that the comparable combination of [JSON + GZIP] yields only a slightly larger size than [PROTOBUF + ZIP]. Of course, the [JSON + GZIP] option will also burn more CPU, but at the same time it is often more convenient too.
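Gzipping a JSON payload needs nothing beyond the JDK (a sketch; on a tiny payload like this the gzip header overhead dominates, the savings show up on realistically sized responses):

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    class GzipDemo {
        static byte[] gzip(byte[] input) throws Exception {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(input);              // closing the stream finishes the gzip frame
            }
            return out.toByteArray();
        }

        public static void main(String[] args) throws Exception {
            byte[] json = "{\"name\":\"протобуф\",\"description\":\"...\"}"
                    .getBytes(StandardCharsets.UTF_8);
            System.out.println(json.length + " -> " + gzip(json).length);
        }
    }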



protoc v3



The third version of protobuf introduces a new generation mode, “Java Nano”. It is not in the documentation yet, and the runtime for this mode is still in alpha, but you can already use it with the "--javanano_out" switch.



In this mode the generator produces anemic beans with public fields (no setters, no getters) and simple serialization methods. There is no redundant copying, so problem #2 is solved. The remaining problems persist; moreover, in the presence of cyclic references the serializer dies with a StackOverflowError.
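Roughly, a nano bean has the following shape (a hand-written approximation, not actual generator output):

    // Approximate shape of a "Java Nano" bean: public fields only,
    // no getters or setters; the nano runtime serializes straight
    // from the fields.
    public final class BeanNano /* extends com.google.protobuf.nano.MessageNano */ {
        public long id;
        public String name = "";
    }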



The decision whether to serialize each field is made from its current value rather than from a separate bitmask, which somewhat simplifies the beans themselves.



protostuff



An alternative implementation of the protobuf protocol. I have not battle-tested it, but at first glance it looks very solid. It does not require proto files (though it can work with them if needed), so problems #0, #1 and #2 are solved. In addition, it can write its own format as well as JSON, XML and YAML. Also interesting is the ability to stream data from one format directly into another, without full deserialization into an intermediate bean.



Unfortunately, if you serialize an ordinary POJO without a schema, annotations or proto files (this is possible too), protostuff will write out all of the object's fields in a row, regardless of whether they were initialized with a value or not, and this again badly hurts compactness when not all fields are filled. But as far as I can see, this behavior can be corrected if desired by overriding a couple of classes.
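Schema-less POJO serialization with protostuff looks roughly like this (a sketch assuming protostuff-core plus protostuff-runtime on the classpath; the API follows the upstream examples):

    import io.protostuff.LinkedBuffer;
    import io.protostuff.ProtostuffIOUtil;
    import io.protostuff.Schema;
    import io.protostuff.runtime.RuntimeSchema;

    class ProtostuffDemo {
        public static class BeanPojo {        // plain POJO, no proto file
            public long id;
            public String name;
        }

        public static void main(String[] args) {
            Schema<BeanPojo> schema = RuntimeSchema.getSchema(BeanPojo.class);

            BeanPojo bean = new BeanPojo();
            bean.id = 42L;                    // name is left null

            LinkedBuffer buffer = LinkedBuffer.allocate(512);
            byte[] bytes;
            try {
                bytes = ProtostuffIOUtil.toByteArray(bean, schema, buffer);
            } finally {
                buffer.clear();               // buffers are meant to be reused
            }

            BeanPojo copy = schema.newMessage();
            ProtostuffIOUtil.mergeFrom(bytes, copy, schema);
            System.out.println(copy.id + " " + copy.name);
        }
    }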

Source: https://habr.com/ru/post/310032/


