For much of my professional life, I have opposed the use of Protocol Buffers. They were clearly written by amateurs, are incredibly narrowly specialized, suffer from a variety of pitfalls, are difficult to compile, and solve a problem that nobody but Google actually has. If these problems stayed quarantined within the serialization abstraction, my complaints would end there. Unfortunately, the design of Protobuffers is so bad that these problems leak into your code as well.
Narrow specialization and development by amateurs
Stop it. Close your email client, where you have already drafted half an email telling me that “the best engineers in the world work at Google” and that “by definition, nothing they build can be amateurish.” I don't want to hear it.
Let's just not go down that road. Full disclosure: I worked at Google. It was the first (but, unfortunately, not the last) place where I ever used Protobuffers. All the problems I want to talk about exist in Google's own codebase; this is not just "misuse of protobuffers" or similar nonsense.
Of course, the biggest problem with Protobuffers is their terrible type system. Java fans should feel right at home here, but unfortunately, literally nobody considers Java to be a well-designed type system. Folks from the dynamic typing camp complain about its excessive restrictions, while those of us in the static typing camp complain about its excessive restrictions and the lack of everything you actually want from a type system. A loss on both counts.
Narrow specialization and development by amateurs go hand in hand. Much of the specification feels bolted on at the last moment - because it clearly was bolted on at the last moment. Some restrictions will make you stop, scratch your head and ask: "what the hell?" But these are merely symptoms of a deeper problem:
Obviously, protobuffers were built by amateurs, because they offer poor solutions to well-known and already-solved problems.
Lack of composition
Protobuffers offer several features that do not compose with one another. For example, here is a list of orthogonal yet restricted typing features that I found in the documentation:
- oneof fields cannot be repeated.
- map<k,v> fields have dedicated syntax for their keys and values, but it is not used for any other types.
- Although map fields can be parameterized, no user-defined types can be. This means you are stuck hand-rolling your own specializations of common data structures.
- map fields cannot be repeated.
- map keys can be string, but not bytes. Enums are also prohibited, even though enums are considered equivalent to integers everywhere else in the Protobuffers spec.
- map values cannot be other maps.
This insane list of restrictions is the result of unprincipled design choices and bolting on features at the last minute. For example, oneof fields cannot be repeated because, rather than producing a proper coproduct type, the code generator instead emits mutually exclusive optional fields. Such a transformation is only valid for a single field (and, as we will see later, it does not even work for that).
The restriction that map fields cannot be repeated comes from a similar place, but it reveals another limitation of the type system. Behind the scenes, map<k,v> is desugared into something like repeated Pair<k,v>. And because repeated is a magic keyword of the language rather than a real type, it does not compose with itself.
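If repeated were an ordinary type constructor rather than a keyword - say, List<T> in Java - this composition would be unremarkable. A minimal sketch of that idea (nothing here is a Protobuffers API, just plain Java):

import java.util.List;
import java.util.Map;

public class ComposeDemo {
    public static void main(String[] args) {
        // When "repeated" is a real type (List<T>), it composes with itself:
        List<List<Integer>> nested = List.of(List.of(1, 2), List.of(3));

        // And since a map desugars to a list of pairs, a "repeated map"
        // is just a list of maps - no magic keyword required:
        List<Map<String, Integer>> repeatedMaps = List.of(Map.of("a", 1));

        System.out.println(nested + " " + repeatedMaps);
    }
}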
Your guess as to what the problem is with enum is as good as mine.
What is so frustrating about all this is that even a shallow understanding of how modern type systems work would have dramatically simplified the Protobuffers specification and removed all of these arbitrary restrictions.
The solution is as follows:
- Make all fields in a message required. This turns every message into a product type.
- Promote oneof fields to standalone data types. These are coproduct types.
- Allow product and coproduct types to be parameterized by other types.
That's it! These three changes are all you need to be able to describe any possible data. With this simpler system you can rebuild the rest of the Protobuffers spec.
For example, you can rebuild optional fields:
product Unit {
  // no fields
}

coproduct Optional<t> {
  t value = 0;
  Unit unset = 1;
}
Making repeated fields is just as easy:
coproduct List<t> {
  Unit empty = 0;
  Pair<t, List<t>> cons = 1;
}
Of course, the actual serialization logic is free to do something smarter than pushing linked lists across the network - after all, implementation and semantics do not need to coincide.
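For comparison, here is a minimal sketch of these same primitives in a language with first-class algebraic data types - Java 17 sealed interfaces and records, with all names being illustrative:

// Product type: all fields are always present.
record Pair<A, B>(A first, B second) {}

// Coproduct types: exactly one case is present, and the compiler knows it.
sealed interface Opt<T> permits Unset, Value {}
record Unset<T>() implements Opt<T> {}
record Value<T>(T value) implements Opt<T> {}

sealed interface ConsList<T> permits Empty, Cons {}
record Empty<T>() implements ConsList<T> {}
record Cons<T>(T head, ConsList<T> tail) implements ConsList<T> {}

Pattern matching over a sealed interface forces you to handle every case, which is exactly the guarantee oneof fails to give.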
Questionable choices
In the vein of Java, Protobuffers distinguish between scalar types and message types. Scalars correspond more or less to machine primitives - things like int32, bool and string. Message types are everything else. All library-provided and user-defined types are messages. Of course, the two kinds of types have completely different semantics.
Fields with scalar types are always present, even if you never set them. Have I mentioned yet that (at least in proto3 [1]) all protobuffers are zero-initialized, even if they contain absolutely no data? Scalar fields get fake values: for example, a uint32 is initialized to 0, and a string is initialized to "".
As a result, it is impossible to distinguish a field that was missing from the protobuffer from one that was explicitly set to its default value. Presumably this decision was made as an optimization, to avoid sending scalar defaults over the wire. This is only a guess, though, because the optimization is not mentioned in the documentation, so your guess is as good as mine.
When we get to Protobuffers' claims of being the perfect solution for backward- and forward-compatible APIs, we will see that this inability to distinguish unset values from default values is a real nightmare. Especially if it really was a conscious decision to save one bit (set or not set) per field.
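To make the ambiguity concrete, here is a self-contained sketch of proto3 scalar-field semantics in plain Java - the names are made up, and this is not the real generated API:

// Models a proto3 message with a single scalar field:
// scalars carry no has-bit, so they are always zero-initialized.
final class Balance {
    private int amount = 0;

    void setAmount(int v) { amount = v; }
    int getAmount() { return amount; }
}

public class DefaultsDemo {
    public static void main(String[] args) {
        Balance explicitZero = new Balance();
        explicitZero.setAmount(0);         // the sender meant "zero"

        Balance unset = new Balance();     // the sender said nothing at all

        // Both read back as 0; the distinction is gone forever.
        System.out.println(explicitZero.getAmount()); // 0
        System.out.println(unset.getAmount());        // 0
    }
}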
Compare this behavior with that of message types. While scalar fields are merely dumb, the behavior of message fields is downright insane. Internally, a message field is either present or absent - but its behavior is crazy. A little pseudocode for its accessors is worth a thousand words. Imagine this in Java or something like it:
private Foo m_foo;

public Foo foo {
  // if `foo` is used as an expression
  get {
    if (m_foo != null)
      return m_foo;
    else
      return new Foo();
  }

  // instead, if `foo` is used as an lvalue
  mutable get {
    if (m_foo == null)
      m_foo = new Foo();
    return m_foo;
  }
}
The idea is that if the foo field is not set, you will see a default-initialized copy whenever you read it, but you will not actually modify its container. But if you modify foo, it will modify its parent as well! All of this just to avoid using a Maybe Foo type and the associated "headache" of deciding what an unset value should mean.
This behavior is especially egregious because it breaks a law! We expect the assignment msg.foo = msg.foo; to be a no-op. Instead, the implementation silently changes msg so that it contains a zero-initialized copy of foo if one was not there before.
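Spelled out as a self-contained Java model of these semantics (illustrative names, not the real generated API), the broken law looks like this:

final class Foo {}

final class Msg {
    private Foo m_foo;                      // null means "not set"

    boolean hasFoo()     { return m_foo != null; }
    Foo getFoo()         { return m_foo != null ? m_foo : new Foo(); }
    void setFoo(Foo foo) { m_foo = foo; }
}

public class LawDemo {
    public static void main(String[] args) {
        Msg msg = new Msg();
        System.out.println(msg.hasFoo());   // false
        msg.setFoo(msg.getFoo());           // supposedly an identity...
        System.out.println(msg.hasFoo());   // true - the "no-op" changed msg
    }
}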
Unlike scalar fields, here you can at least determine whether a message field is unset. Language bindings for Protobuffers offer something like a generated bool has_foo() method. To copy a message field from one protobuffer to another - a very frequent operation - you have to write the following:
if (src.has_foo()) {
  dst.set_foo(src.foo());
}
Note that, at least in statically typed languages, this pattern cannot be abstracted over, due to the merely nominal relationship between foo(), set_foo() and has_foo(). Since all of these are their own identifiers, we have no means of generating code over them, short of a preprocessor macro:
#define COPY_IFF_SET(src, dst, field) \
  if (src.has_##field()) { \
    dst.set_##field(src.field()); \
  }
(But preprocessor macros are banned by the Google style guide.)
If, instead, all optional fields were implemented as Maybe, you could abstract over these call sites trivially.
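For instance, here is a sketch assuming fields were exposed as java.util.Optional instead of foo()/has_foo()/set_foo() triples - the field names are illustrative:

import java.util.Optional;
import java.util.function.Consumer;

final class Copying {
    // One generic helper now covers every field of every message:
    static <T> void copyIfSet(Optional<T> src, Consumer<T> dst) {
        src.ifPresent(dst);
    }

    private Copying() {}
}

// Usage, for any field whatsoever:
//   Copying.copyIfSet(src.foo(), dst::setFoo);
//   Copying.copyIfSet(src.bar(), dst::setBar);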
To change the subject, let's talk about another dubious decision. Although you can define oneof fields in Protobuffers, their semantics do not match those of a coproduct type! Rookie mistake, guys! Instead, you get an optional field for each case, plus magic code in the setters that silently clears every other case whenever one is set.
At first glance this seems like it should be semantically equivalent to a proper union type. But instead it is a hideous, indescribable source of bugs! When this behavior is combined with the law-breaking implementation of msg.foo = msg.foo;, that seemingly innocuous assignment can silently delete arbitrary amounts of data!
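Here is a self-contained Java model of that clearing behavior (illustrative names, not the real generated API):

// Models a two-case oneof: every setter silently clears the other case.
final class Payload {
    private Integer number;   // case 1
    private String text;      // case 2

    void setNumber(int v)  { text = null; number = v; }
    void setText(String v) { number = null; text = v; }

    boolean hasText() { return text != null; }
}

public class OneofDemo {
    public static void main(String[] args) {
        Payload msg = new Payload();
        msg.setText("precious data");
        msg.setNumber(42);                 // silently destroys the text case
        System.out.println(msg.hasText()); // false - the data is gone
    }
}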
As a consequence, oneof fields do not form law-abiding Prisms, and messages do not form law-abiding Lenses. So good luck trying to write bug-free, non-trivial manipulations of protobuffers.
It is literally impossible to write generic, bug-free, polymorphic code over protobuffers.
This is not a pleasant thing to hear, especially for those of us who have come to love parametric polymorphism, which promises exactly the opposite.
The lie of backward and forward compatibility
One of the most frequently touted "killer features" of Protobuffers is their "hassle-free ability to write backward- and forward-compatible APIs." This claim is dangled in front of your eyes to obscure the truth.
What Protobuffers actually are is permissive. They manage to cope with messages from the past and the future because they make absolutely no promises about what your data will look like. Everything is optional! But should you need a field, Protobuffers will happily cook up and serve you something that type-checks, regardless of whether it makes any sense.
This means Protobuffers achieve their promised time-traveling by silently doing the wrong thing by default. Of course, a careful programmer can (and should) write code that validates received protobuffers. But if you have to write defensive validation checks at every call site, maybe that just means your deserialization step was too permissive. All you have managed to do is decentralize the validation logic away from a well-defined boundary and smear it across your entire codebase.
One possible counterargument is that protobuffers will retain any information in a message that they do not understand. In principle, this means a message can pass non-destructively through an intermediary that does not understand this version of its schema. A clear win, isn't it?
Sure, on paper this is a cool feature. But I have never seen an application where this property is actually preserved. With the exception of routing software, no program wants to inspect only a few bits of a message and then forward it along unchanged. The vast majority of programs that operate on protobuffers will decode a message, transform it into another one, and send it somewhere else. Alas, those transformations are bespoke and hand-coded. And hand-coded transformations from one protobuffer to another do not preserve unknown fields, which renders the feature literally meaningless.
This ubiquitous attitude of treating protobuffers as universally compatible also manifests itself in other ugly ways. Protobuffers style guides actively advocate against DRY and suggest inlining definitions whenever possible. The argument is that this lets you evolve individual messages separately should their definitions diverge in the future. Let me emphasize: they propose abandoning 60 years of good programming practice just in case you might suddenly need to change something in the future.
The root of the problem is that Google conflates the meaning of data with its physical representation. When you are at Google's scale, this makes sense. After all, they have internal tooling that weighs the hourly cost of a programmer against the cost of bandwidth, of storing X bytes, and so on. Unlike at most tech companies, programmer salaries are among Google's smallest expenses. Financially, it makes sense for them to spend programmer time in order to save a couple of bytes.
Apart from the top five tech companies, nobody else is within five orders of magnitude of Google's scale. Your startup cannot afford to spend engineering hours on saving bytes. But saving bytes while wasting programmer time in the process is exactly what Protobuffers are optimized for.
Let's face it: you are not at Google's scale, and you never will be. Stop using technologies just because "Google uses them" or because "these are industry best practices."
Protobuffers pollute codebases
If the use of Protobuffers could be confined to the network boundary, I would not speak so harshly about this technology. Unfortunately, although a few solutions are possible in principle, none of them is good enough to actually be used in real software.
Protobuffers correspond to the data you want to send over a communication channel. They often correspond to, but are not identical to, the actual data the application would like to work with. This puts us in the awkward position of having to choose between one of three bad options:
- Maintain a separate type that describes the data you actually want, and ensure that the two evolve in lockstep.
- Pack the rich data into the wire format, and use it both for transmission and inside the application.
- Extract the rich data from the wire format every time you need it.
Option 1 is clearly the "right" solution, but it does not work with Protobuffers. The language is not powerful enough to encode types that can do double duty as both wire and application formats. That means you would have to write a completely separate data type, evolve it in sync with the Protobuffers, and explicitly write serialization code between the two. But since most people seem to use Protobuffers precisely so that they do not have to write serialization code, this option obviously never happens.
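For what it's worth, a minimal sketch of option 1 in Java might look like this - UserWire stands in for a generated protobuffer class, and every name here is made up for illustration:

import java.time.Instant;
import java.util.Optional;

// Stand-in for the generated wire type: everything zero-defaulted.
record UserWire(String name, long lastSeenEpochMs) {}

// The type the application actually wants to work with.
record User(String name, Optional<Instant> lastSeen) {
    static User fromWire(UserWire w) {
        // Here, and only here, do we decide what the zero default "means".
        Optional<Instant> seen = w.lastSeenEpochMs() == 0
                ? Optional.empty()
                : Optional.of(Instant.ofEpochMilli(w.lastSeenEpochMs()));
        return new User(w.name(), seen);
    }

    UserWire toWire() {
        return new UserWire(name, lastSeen.map(Instant::toEpochMilli).orElse(0L));
    }
}

But as noted above, this conversion layer is exactly the code nobody wants to write.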
Instead, code that uses protobuffers lets them proliferate throughout the codebase. This is reality. My main project at Google was a compiler that took "programs" written in one variety of Protobuffers and produced equivalent "programs" in another. The input and output formats were different enough that correct parallel C++ versions of them never materialized. As a result, my code could not take advantage of any of the rich techniques for writing compilers, because protobuffer data (and the generated code) are too rigid to do anything interesting with.
The result was that 10,000 lines of ad-hoc buffer-shuffling were used where 50 lines of recursion schemes would have sufficed. The code I wanted to write was literally impossible in the presence of protobuffers.
Although this is only one case, it is by no means unique. By virtue of the heavy-handed nature of code generation, the manifestations of protobuffers in any given language will never be idiomatic, and they cannot be made so - short of rewriting the code generator.
But even then, you would still have the problem of embedding a crappy type system into the target language. Because most of Protobuffers' features are poorly thought out, these questionable properties leak into our codebases. It means we are forced not only to implement these bad ideas, but to use them in any project that hopes to interoperate with Protobuffers.
It is easy to implement meaningless things on top of a solid foundation, but going in the other direction is difficult at best - and at worst, an encounter with genuine eldritch horror.
In short: abandon all hope, ye who incorporate Protobuffers into your projects.
1. To this day, there is a heated debate inside Google over proto2 and whether fields should ever be marked required. Manifestos titled "optional considered harmful" and "required considered harmful" circulate simultaneously. Good luck with that, guys. ↑