Google Protocol Buffers: polymorphism and search

This article has two parts. The first is a free retelling of an article about polymorphism in protobuf. The second is devoted to ~~annoying ads~~ a homegrown tool (a reinvented "bicycle") for working with the framework.

Update: As was rightly noted in the comments, I lumped inheritance and polymorphism together. To remove the most glaring errors ~~and add new ones~~, I changed the text a little. So if a comment seems unrelated to the text, it most likely refers to the previous version. I apologize for the inconvenience.

NB: This article does not answer the question "What is Google Protocol Buffers?" and is not tied to any particular programming language.

So, the problem statement:

How do you implement polymorphism when working with protocol buffers?
How do you search a large file of messages and find the data you need?

NB: Below, a protobuf message is called either a "message" or a "structure", because in some sentences the word "message" sounds odd.

Part I: Polymorphism

Suppose our protocol contains three classes of objects: Square, Circle and Polygon. Suppose also that they all have a color field and an id field. That gives us good reason to derive them from a common ancestor, Shape, and get the ~~desired~~ polymorphism (or at least the ability to refer to an object through its base type). If we were writing in a language that supports inheritance, our code would probably look like this:

pseudocode

enum Color { RED, GREEN, BLUE }

struct Point {
    int x;
    int y;
}

struct Shape {
    int id;
    Color color;
}

struct Square extends Shape {
    Point corner;
    int width;
}

struct Circle extends Shape {
    Point center;
    int radius;
}

struct Polygon extends Shape {
    Point[] points;
}

Unfortunately, Google Protocol Buffers does not support type hierarchies. Jon Parise describes three ways around this limitation.

Using optional fields

With this approach, we create a separate structure for each subclass, and the Shape structure contains an optional field for each of them.
This approach has several serious drawbacks:

You cannot add a new subclass without changing the base class. If you are extending someone else's protocol, this can be a problem.
You can create an impossible "square-circle" by setting both the square and the circle fields.
To determine the actual type, you have to write code that checks which fields are present.

geom-1.proto

enum Color {
  RED = 1;
  GREEN = 2;
  BLUE = 3;
}

message Point {
  required fixed32 x = 1;
  required fixed32 y = 2;
}

message Square {
  required Point corner = 1;
  required fixed32 width = 2;
}

message Circle {
  required Point center = 1;
  required fixed32 radius = 2;
}

message Polygon {
  repeated Point points = 1;
}

message Shape {
  // No type field here: the TYPE enum is not defined in this file, and the
  // concrete type is inferred from which optional field below is set.
  required fixed32 id = 2;
  optional Color color = 3;
  // One optional field per subclass.
  optional Square square = 4;
  optional Circle circle = 5;
  optional Polygon polygon = 6;
}
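The third drawback, checking field presence by hand, can be sketched roughly as follows. This is a plain-Python model, not generated protobuf code: the dataclasses are trimmed-down hypothetical stand-ins for the messages above, and `concrete_kind` is an illustrative helper name.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-ins for the generated message classes (fields trimmed).
@dataclass
class Square:
    width: int


@dataclass
class Circle:
    radius: int


@dataclass
class Polygon:
    points: list


@dataclass
class Shape:
    id: int
    square: Optional[Square] = None
    circle: Optional[Circle] = None
    polygon: Optional[Polygon] = None


def concrete_kind(shape: Shape) -> str:
    """Infer the subclass from which optional field is set."""
    present = [name for name in ("square", "circle", "polygon")
               if getattr(shape, name) is not None]
    if len(present) != 1:
        # Zero fields: no subclass at all. Two or more: the "square-circle".
        raise ValueError(
            f"shape {shape.id}: expected exactly one subclass field, got {present}")
    return present[0]
```

Every consumer of Shape needs a check like this, and the tuple of field names has to be updated whenever a new subclass is added.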

Nested serialization

The second approach is to create a Shape structure with the fields common to all subclasses, plus one more field that holds the subclass's fields, already serialized.

This is not the best option either: the serializer will not unpack the contents of the subclass field automatically, so no integrity check is performed on it.
And it is simply not pretty.

geom-2.proto

enum TYPE {
  SQUARE = 1;
  CIRCLE = 2;
  POLYGON = 3;
}

enum Color {
  RED = 1;
  GREEN = 2;
  BLUE = 3;
}

message Point {
  required fixed32 x = 1;
  required fixed32 y = 2;
}

message Square {
  required Point corner = 1;
  required fixed32 width = 2;
}

message Circle {
  required Point center = 1;
  required fixed32 radius = 2;
}

message Polygon {
  repeated Point points = 1;
}

message Shape {
  required TYPE type = 1;
  required fixed32 id = 2;
  optional Color color = 3;
  // The serialized subclass message.
  required bytes subclass = 4;
}
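To see why the missing integrity check matters, here is a minimal sketch of the two-step decoding. It assumes a made-up fixed binary layout packed with Python's `struct` module in place of real protobuf encoding; `encode_shape` and `decode_subclass` are hypothetical helpers.

```python
import struct

# Hypothetical fixed layouts standing in for real protobuf encoding.
SQUARE_FMT = "<III"  # corner.x, corner.y, width
CIRCLE_FMT = "<III"  # center.x, center.y, radius


def encode_shape(kind: str, shape_id: int, subclass: bytes) -> dict:
    # The outer "message" stores the subclass as an opaque byte string.
    return {"type": kind, "id": shape_id, "subclass": subclass}


def decode_subclass(shape: dict) -> dict:
    """The second, manual decoding step that the framework does not do for us."""
    if shape["type"] == "SQUARE":
        x, y, width = struct.unpack(SQUARE_FMT, shape["subclass"])
        return {"corner": (x, y), "width": width}
    if shape["type"] == "CIRCLE":
        x, y, radius = struct.unpack(CIRCLE_FMT, shape["subclass"])
        return {"center": (x, y), "radius": radius}
    raise ValueError(f"unknown type {shape['type']!r}")


good = encode_shape("SQUARE", 1, struct.pack(SQUARE_FMT, 2, 3, 10))
# Nothing stops us from storing garbage: the outer message still looks valid,
# and the problem only surfaces when someone calls decode_subclass(bad).
bad = encode_shape("SQUARE", 2, b"\x00\x01")
```

The outer structure accepts any bytes, so a malformed subclass goes unnoticed until the manual decoding step runs.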

Nesting extensions

The third (recommended) approach is similar to the first, but uses nested extensions instead of optional fields. To combat the square-circle, a type field is introduced.

geom-final.proto

enum TYPE {
  SQUARE = 1;
  CIRCLE = 2;
  POLYGON = 3;
}

enum Color {
  RED = 1;
  GREEN = 2;
  BLUE = 3;
}

message Point {
  required fixed32 x = 1;
  required fixed32 y = 2;
}

message Shape {
  required TYPE type = 1;
  required fixed32 id = 2;
  optional Color color = 3;
  extensions 4 to max;
}

// Extension fields cannot be declared required, so they are optional here;
// the type field tells the reader which extension to expect.
message Square {
  extend Shape {
    optional Square shape = 5;
  }
  required Point corner = 1;
  required fixed32 width = 2;
}

message Circle {
  extend Shape {
    optional Circle shape = 6;
  }
  required Point center = 1;
  required fixed32 radius = 2;
}

message Polygon {
  extend Shape {
    optional Polygon shape = 7;
  }
  repeated Point points = 1;
}

Let's take a closer look at the benefits of this approach.

Using extensions instead of optional fields allows you to:
1. "Inherit" from other people's definitions (protocols).
2. Describe the subclasses in other files. When there are many such classes, this can be important.
Because the extensions are nested, the fields can share the same name, since they actually live in different scopes. In our example these are Square.shape, Circle.shape and Polygon.shape. This is very convenient in languages with template support.
Unlike the second approach, everything is deserialized by the framework.
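The idea that subclasses attach themselves to Shape, with the type field choosing which attachment to read, can be modeled in plain Python roughly as below. This is only an analogy, not the protobuf API: `EXTENSIONS`, `extends_shape` and `payload` are hypothetical names, and a shape is represented as a plain dict.

```python
# Hypothetical registry mirroring "extend Shape { ... }": each subclass
# registers itself under its own extension number, and Shape is never edited.
EXTENSIONS = {}


def extends_shape(type_name: str, field_number: int):
    def register(cls):
        if any(number == field_number for number, _ in EXTENSIONS.values()):
            raise ValueError(f"extension number {field_number} is already taken")
        EXTENSIONS[type_name] = (field_number, cls)
        return cls
    return register


@extends_shape("SQUARE", 5)
class Square:
    def __init__(self, width):
        self.width = width


@extends_shape("CIRCLE", 6)
class Circle:
    def __init__(self, radius):
        self.radius = radius


def payload(shape: dict):
    """Return the extension object selected by the shape's type field."""
    number, cls = EXTENSIONS[shape["type"]]
    value = shape["extensions"].get(number)
    if not isinstance(value, cls):
        raise ValueError(f"shape {shape['id']}: type says {shape['type']}, "
                         f"but extension {number} is missing or of the wrong class")
    return value


circle_shape = {"type": "CIRCLE", "id": 1, "extensions": {6: Circle(radius=3)}}
```

A new subclass such as Polygon could be registered from another module without touching the code above. With the real Python bindings the lookup would go through the message's `Extensions` mapping instead of a dict.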

Part II: Search

So, we have a file describing the message structure (geom.proto). Our program ran and produced a large file containing the messages themselves. We would like to find specific information in it, but a plain text search is not always possible.

For example:

Find the circles that intersect the coordinate axes.
Find polygons with an empty set of points.
Find polygons with more than 10 points.

You'll agree that plain grep won't help here.

Of course, it is not hard to write a small program for each such task that looks for shapes with the given properties in the file. However, since we have structured data and know its format, why not write our own query language?

Let's return to the example tasks:

type == "CIRCLE" && ((center.x - radius) < 0 || (center.y - radius) < 0) // Find the circles intersecting the coordinate axes.
type == "POLYGON" && #points == 0 // Find polygons with an empty set of points
type == "POLYGON" && #points > 10 // Find polygons with more than 10 points.
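To give a feel for how such queries could be evaluated, here is a toy sketch. It is not the tool's implementation: it assumes messages are already decoded into nested `SimpleNamespace` objects, and it rewrites the query into a Python expression. `compile_query` and `search` are hypothetical names.

```python
import re
from types import SimpleNamespace as NS


def compile_query(query: str):
    """Translate the query syntax shown above into a Python predicate (toy version)."""
    expr = re.sub(r"#([A-Za-z_][\w.]*)", r"len(\1)", query)  # #points -> len(points)
    expr = expr.replace("&&", " and ").replace("||", " or ")
    code = compile(expr, "<query>", "eval")
    # eval is acceptable for a sketch only; never run it on untrusted queries.
    return lambda record: bool(
        eval(code, {"__builtins__": {}, "len": len}, vars(record)))


def search(query: str, records):
    """Return the ids of the records matching the query."""
    matches = compile_query(query)
    return [record.id for record in records if matches(record)]


# Sample data, assumed to be already decoded from the message file.
shapes = [
    NS(type="CIRCLE", id=1, center=NS(x=1, y=5), radius=3),
    NS(type="CIRCLE", id=2, center=NS(x=10, y=10), radius=2),
    NS(type="POLYGON", id=3, points=[]),
    NS(type="POLYGON", id=4, points=[NS(x=i, y=i) for i in range(11)]),
]

print(search('type == "CIRCLE" && ((center.x - radius) < 0 || (center.y - radius) < 0)', shapes))
print(search('type == "POLYGON" && #points == 0', shapes))
print(search('type == "POLYGON" && #points > 10', shapes))
```

The sketch relies on short-circuit evaluation: the type check comes first, so fields that exist only on circles are never looked up on polygons. A real tool would parse the query properly instead of rewriting strings.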

All this, and much more, can be done by the homegrown bicycle.

It can also:

Convert files from binary to text format and back.
Cut out and print only certain fields of the messages.
Use several messages in one query. For example, find all the squares that come right after circles.
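The last item, a query spanning several messages, amounts to keeping a little state while streaming through the file. Here is a minimal sketch, assuming each decoded message is a dict with a "type" key; `squares_after_circles` is a hypothetical helper, not part of the tool.

```python
def squares_after_circles(shapes):
    """Yield each square whose immediate predecessor in the stream is a circle."""
    previous = None
    for shape in shapes:
        if (previous is not None
                and previous["type"] == "CIRCLE"
                and shape["type"] == "SQUARE"):
            yield shape
        previous = shape


stream = [
    {"type": "CIRCLE", "id": 1},
    {"type": "SQUARE", "id": 2},
    {"type": "SQUARE", "id": 3},
    {"type": "CIRCLE", "id": 4},
    {"type": "POLYGON", "id": 5},
    {"type": "CIRCLE", "id": 6},
    {"type": "SQUARE", "id": 7},
]
```

Because only the previous message is remembered, this works on a file of any size without loading it into memory.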

I hope the bicycle finds its users, and that with them comes a need for a detailed article about its syntax and capabilities.
Thank you.

Source: https://habr.com/ru/post/226225/
