Working with the Google Protocol Buffer in PHP

In the project I am currently developing, it became necessary to change the protocol used to exchange data between parts of the application. At the moment, internal services exchange serialized PHP arrays over TCP sockets. Since both sides are PHP applications, this causes no problems, and the packet format is standard, so there are no particular difficulties. However, I am often unhappy with the processing speed, and with the fact that we are tightly bound to one language and platform. If we have to integrate with another system or rewrite some part, there will be trouble: only PHP itself understands the serialized format, and I do not really want to write a parser. The initial choice was more than justified, because speed of development and debugging were the priorities; now there is a little time and a desire to look at the architecture from a higher and different angle.

It should be said that the data being transmitted is very simple: strings (of various lengths, in practice almost never more than a kilobyte or ten, usually hundreds of bytes), integers (including Unix timestamps), a set of constants, true/false flags, and in just one case floating-point values. In principle, it all comes down to three data types: string, integer and floating-point number. If you like, you can single out one more field, the command code, which can also be mapped to one of the listed types (the set of commands is limited and finite, although it grows along with the system). In serialized form such a packet takes up quite a lot of space, and although the data currently travels over sockets within a single machine, that is no excuse: the system is designed from the start to allow dynamic expansion across several cluster nodes.
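
To get a feel for the overhead of the native format, here is a minimal sketch that serializes a small packet and prints its size. The array and its field names are hypothetical and only resemble the kind of data described above; they are not taken from the real application.

```php
<?php
// Hypothetical packet: an id, a Unix timestamp and a short text.
$packet = array(
    'id'          => 1,
    'n_timestamp' => 1257000000,
    'news_msg'    => 'Test News',
);

// Native PHP serialization stores key names, type markers and lengths as text.
$serialized = serialize($packet);

echo $serialized, "\n";
echo strlen($serialized), " bytes\n";
```

The payload itself is a couple of integers and nine characters of text; the rest of the output is key names and type markers, which is exactly the part a more compact format can shrink.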

The first idea that comes to mind is to use JSON instead of a PHP array right away; this would solve the problem of other languages understanding the protocol. But there is a pitfall, as far as I understand, in how characters in strings are encoded: non-ASCII text, Cyrillic in particular, is converted to Unicode escapes of the form \uXXXX (for example \u043d), which significantly increases the amount of data transferred.
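
The effect is easy to measure. The sketch below encodes a short Cyrillic word with json_encode and compares sizes; the sample word is arbitrary, and the file is assumed to be saved as UTF-8. Note that the JSON_UNESCAPED_UNICODE flag used for comparison exists only in PHP 5.4 and later.

```php
<?php
// Seven Cyrillic letters, two bytes each in UTF-8.
$text = 'новость';

// Default behaviour: every non-ASCII character becomes a \uXXXX escape.
$escaped = json_encode($text);

// With JSON_UNESCAPED_UNICODE (PHP 5.4+) the UTF-8 bytes are kept as they are.
$raw = json_encode($text, JSON_UNESCAPED_UNICODE);

printf(
    "UTF-8: %d bytes, escaped JSON: %d bytes, unescaped JSON: %d bytes\n",
    strlen($text),
    strlen($escaped),
    strlen($raw)
);
```

Each two-byte letter turns into a six-byte escape, so the escaped form is roughly three times larger than the text it carries.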

So I began (yet again) to explore formats for interaction between application services that would also allow exchange between different platforms in the future, travel transparently over the network and be as compact as possible. I really liked the Hessian protocol (in tests, its second version performs just great), but it is too closed and has very little documentation. Therefore, I mainly considered Facebook/Apache Thrift and Google Protocol Buffer.

Thrift is now open source and has been handed over to the Apache Software Foundation, but its implementation is rather complicated and confusing, and the documentation and examples are minimal (not to say absent). I dropped it right away; besides, I did not manage to get the PHP version working.

Google Protocol Buffer (hereinafter PB for short), on the other hand, turned out to be very pleasant and interesting. The difficulties started right away, though, since my task was to work with it in PHP, not in Python or Java as its developers suggest. Since I found no materials on this topic, I decided to describe my steps in case anyone else has to do the same. I should say up front that I will not describe the protocol itself; if you are not familiar with it, there is a good introduction on the project site (for example, the Developer Guide).

So, the first pleasant thing is that working with PB requires no additional software and no compiled PHP extension, which means you can experiment both on your home machine and on shared hosting. So far, the only tool for PHP is the pb4php project, which is at an early stage of development: judging by the version number it is at 0.25, it is developed by one person and was started more than a year ago, and although commits appear from time to time, activity is very low. We will use it, but I advise you to assess your needs soberly: if the basic format features and the create/read/write/serialize operations are enough for you, all is well; more advanced operations are not supported.

As an example, let's imagine the simplest case: we have a service that collects news, say by pulling it from RSS feeds, and a server that sends the news out to subscribers. We want the two services to exchange data via Protocol Buffer, with one or both of them written in PHP. What do we do, and how?

First we need to agree on the message format. Let it be the simplest option:

message News {
  required int32 id = 1;
  optional string source = 2;
  optional string dsign = 3;
  optional string news_msg = 4;
  optional int32 n_timestamp = 5;
}

We have described the simplest message using only basic types (there are no enumerations or constants yet; that is up to you). Each field has a numeric tag that is used in the message encoding. The only required field, without which the message will not be accepted as valid, is the identifier; the other fields may be absent. Of course, in a real situation the format is more complicated and choosing field types is non-trivial, but we will leave that aside for now. While reading about the protocol, I came across the developers' advice that if the format changes often, it is better to declare all fields optional and to separate required from optional only in the final version; that is why my example has only one required field. A field that is not set is not included in the message, which saves on packet size.
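
To show how the numeric tags end up in the encoded message, here is a rough, self-contained sketch of the PB wire format in plain PHP. It is not pb4php code: the helper names pb_varint, pb_int_field and pb_string_field are made up for this illustration, and only non-negative integers and strings are handled.

```php
<?php
// Encode a non-negative integer as a base-128 varint.
function pb_varint($n)
{
    $out = '';
    while ($n > 0x7F) {
        $out .= chr(($n & 0x7F) | 0x80); // low 7 bits, continuation bit set
        $n >>= 7;
    }
    return $out . chr($n);
}

// Field key = (tag << 3) | wire type. Wire type 0 is a varint.
function pb_int_field($tag, $value)
{
    return pb_varint(($tag << 3) | 0) . pb_varint($value);
}

// Wire type 2 is length-delimited: key, byte length, then the raw bytes.
function pb_string_field($tag, $value)
{
    return pb_varint(($tag << 3) | 2) . pb_varint(strlen($value)) . $value;
}

// News with id = 1 and news_msg = "Test News".
// The optional fields that are not set are simply not written.
$bytes = pb_int_field(1, 1) . pb_string_field(4, 'Test News');

echo bin2hex($bytes), "\n";
echo strlen($bytes), " bytes\n";
```

The field names never appear in the output; only the tags 1 and 4 do, packed into a single key byte each, which is where most of the size saving comes from.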

The description is saved as a plain text file with the .proto extension. Next, the compiler (available for download for different platforms) turns the description into what is essentially a template for later use.

The standard compiler generates a message wrapper, objects and utility methods for the supported languages: Java, C++ and Python. We, however, are working with PHP, which is not on that list yet.

The pb4php package includes a script (example/protoc.php) that does the same thing for PHP, albeit rather clumsily: it loads the specified proto file and generates a PHP class for working with the message (yes, it generates PHP code). Note that this compiler works with the textual description of the message, the same .proto file you created above.

However, I preferred to write the wrapper class by hand, in particular because I found an error in the sample files themselves. The compiler generates an incorrect reference to the string wire type: the constant PBMessage::WIRED_STRING does not exist in the source, although it appears in all the examples, so you have to replace it with PBMessage::WIRED_LENGTH_DELIMITED.

The wrapper class inherits from the general PBMessage class, adds descriptions of the specific fields of your message and defines getters and setters for them. Judging by the code, the values are stored internally as an ordinary associative array and are encoded into the binary PB format only on serialization.

Our class is very simple and will be stored in the file pb_news_interface.php:

class News extends PBMessage
{
    var $wired_type = PBMessage::WIRED_LENGTH_DELIMITED;

    public function __construct($reader = null)
    {
        parent::__construct($reader);
        // For each field tag: the pb4php type class and the default value
        $this->fields["1"] = "PBInt";    // id
        $this->values["1"] = 0;
        $this->fields["2"] = "PBString"; // source
        $this->values["2"] = "";
        $this->fields["3"] = "PBString"; // dsign
        $this->values["3"] = "";
        $this->fields["4"] = "PBString"; // news_msg
        $this->values["4"] = "";
        $this->fields["5"] = "PBInt";    // n_timestamp
        $this->values["5"] = 0;
    }

    function id()
    {
        return $this->_get_value('1');
    }

    function set_id($value)
    {
        return $this->_set_value('1', $value);
    }

    function source()
    {
        return $this->_get_value('2');
    }

    function set_source($value)
    {
        return $this->_set_value('2', $value);
    }

    function dsign()
    {
        return $this->_get_value('3');
    }

    function set_dsign($value)
    {
        return $this->_set_value('3', $value);
    }

    function news_msg()
    {
        return $this->_get_value('4');
    }

    function set_news_msg($value)
    {
        return $this->_set_value('4', $value);
    }

    function n_timestamp()
    {
        return $this->_get_value('5');
    }

    function set_n_timestamp($value)
    {
        return $this->_set_value('5', $value);
    }
}

In the constructor we describe the data format using predefined type names; pb4php maps them to classes for the basic protocol data types: PBInt, PBBool (inherited from PBInt), PBSignedInt, PBEnum (enumerations), PBString and PBBytes. Such wrappers are needed because not every protocol data type maps directly to a built-in type of the language, while others are simply duplicates: for example, sint32 and int32 both map to the same int32 type in C++ (although the question of data types is not simple, and the mapping table in the documentation does not give an exhaustive answer). pb4php implements only a few basic types, so you have to describe your format with the most common ones.

By the way, the wrapper code itself is far from optimal. It could well be replaced with something simpler and shorter based on the __get/__set magic methods; apparently the code was written with compatibility with outdated PHP versions in mind, and the OOP capabilities are far from fully used.
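
As an illustration of what a shorter wrapper might look like, here is a minimal sketch built on __get/__set. The MagicNews class is hypothetical and standalone: it is not part of pb4php, does not extend PBMessage and does not serialize anything; it only shows how one name-to-tag map can replace ten hand-written accessors.

```php
<?php
// Hypothetical stand-in for a message class; NOT part of pb4php.
class MagicNews
{
    // Field name => numeric tag from the .proto description.
    private $tags = array(
        'id' => 1, 'source' => 2, 'dsign' => 3, 'news_msg' => 4, 'n_timestamp' => 5,
    );

    // Tag => current value; fields that were never set are absent.
    private $values = array();

    public function __get($name)
    {
        $tag = $this->tagFor($name);
        return isset($this->values[$tag]) ? $this->values[$tag] : null;
    }

    public function __set($name, $value)
    {
        $this->values[$this->tagFor($name)] = $value;
    }

    // Reject names that are not declared in the message description.
    private function tagFor($name)
    {
        if (!isset($this->tags[$name])) {
            throw new InvalidArgumentException("Unknown field: $name");
        }
        return $this->tags[$name];
    }
}

$news = new MagicNews();
$news->id = 1;
$news->news_msg = 'Test News';
echo $news->id, ' ', $news->news_msg, "\n";
```

The trade-off is that field names are no longer visible as methods to an IDE or to static analysis, which may be one reason generated code tends to prefer explicit accessors.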

OK, let's move on. To use the protocol in a program, you need to include two files: the main class for working with messages (/message/pb_message.php) and our message wrapper class (pb_news_interface.php). After that we work with it as with an ordinary PHP class.

// Create an instance of the class:
$new_news = new News();
// Fill in the fields:
$new_news->set_id(1); // set the id to 1
$new_news->set_n_timestamp(time()); // add a timestamp
$new_news->set_news_msg('Test News'); // body of the message
// the remaining fields are not filled in; they keep their default values

Next we need the serialized form of the message, for example to send it over the network. For this there is a built-in method, SerializeToString, which encodes the message into a string (in hex).

By the way, the author even provided a built-in way to send objects over the network: call the Send method with the URL and the message instance, and a POST request is sent via cURL with a message parameter containing the serialized PB message. In practice I would not use the built-in transfer mechanism for anything other than testing the capabilities.

There is also an interesting constant, MODUS, which controls the storage format: binary or string. Binary is more efficient for network transmission and saves traffic; the string form is convenient for testing and for reading by the developer.

// Send the message using the built-in method:
$new_news->Send('http://domain.com/pb-service/server.php');
// Get the serialized string
$res = $new_news->SerializeToString(); // the method also takes a parameter whose purpose I have not figured out yet
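
The difference between the two storage modes can be illustrated without the library. The sketch below assumes that the string mode is a hexadecimal dump of the binary message, as described above; the byte string is a hand-written PB encoding of a News message with id = 1 and news_msg = "Test News", not output captured from pb4php.

```php
<?php
// Thirteen raw bytes: key 0x08, value 1, key 0x22, length 9, then the text.
$binary = "\x08\x01\x22\x09Test News";

// The same bytes as a printable hexadecimal string.
$hex = bin2hex($binary);

printf("binary: %d bytes, hex: %d bytes\n", strlen($binary), strlen($hex));

// The textual form converts back to the original bytes without loss.
$restored = pack('H*', $hex);
var_dump($restored === $binary);
```

A hex dump is exactly twice the size of the binary message, so it is worth keeping for debugging only.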

As part of this study, I took a sample data set that actually circulates in my application and compared native PHP serialization with Protocol Buffer as implemented by pb4php. The saving in data size turned out to be about 30% (127 bytes versus 186 in the usual form). This is not meant to be a serious study; I was simply curious to compare the efficiency on my real data.

The resulting string can be written to a file or transferred in any other way; to restore the original object, it is enough to create an instance and call the ParseFromString method.

$tmp = $new_news->SerializeToString(); // serialize
$test = new News(); // create a message instance
$test->ParseFromString($tmp); // restore the message from the serialized string
// check
echo 'id:' . $test->id(); // prints: id:1
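
For the curious, this is roughly what restoring a message involves at the byte level. The sketch is plain PHP and independent of pb4php: the functions pb_read_varint and pb_decode are invented for this example, and only the two wire types used by the News message (varint and length-delimited) are handled.

```php
<?php
// Read one base-128 varint starting at $pos and advance $pos past it.
function pb_read_varint($bytes, &$pos)
{
    $result = 0;
    $shift = 0;
    do {
        $byte = ord($bytes[$pos++]);
        $result |= ($byte & 0x7F) << $shift; // append the next 7 bits
        $shift += 7;
    } while ($byte & 0x80);                  // high bit set: more bytes follow
    return $result;
}

// Decode a flat message into array(tag => value).
function pb_decode($bytes)
{
    $fields = array();
    $pos = 0;
    $len = strlen($bytes);
    while ($pos < $len) {
        $key = pb_read_varint($bytes, $pos);
        $tag = $key >> 3;        // field number from the .proto description
        $wireType = $key & 0x07; // how the value that follows is laid out
        if ($wireType === 0) {
            $fields[$tag] = pb_read_varint($bytes, $pos);
        } elseif ($wireType === 2) {
            $size = pb_read_varint($bytes, $pos);
            $fields[$tag] = substr($bytes, $pos, $size);
            $pos += $size;
        } else {
            throw new RuntimeException("Unsupported wire type: $wireType");
        }
    }
    return $fields;
}

// A hand-written encoding of News with id = 1 and news_msg = "Test News".
$fields = pb_decode("\x08\x01\x22\x09Test News");
print_r($fields);
```

The decoder only recovers tags and raw values; mapping tag 4 back to the name news_msg is the job of the wrapper class, which is why both sides need the same message description.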

As for the more advanced PB features supported by the library, look in the /examples/nested_mess/ directory: it contains an example of exactly this kind of RSS to Protocol Buffer conversion, using messages that contain nested message classes in addition to fields of ordinary types.

So, to sum up. With pb4php we can:

Generate a PHP wrapper class (albeit not an optimally structured one) from the original Protocol Buffer message specification;
Work with message classes in a PHP script as with regular PHP classes and fill their fields with data;
Serialize a message into a hexadecimal string for writing to a file or sending over the network (including via the built-in HTTP POST method). The resulting message is usually 10-30% smaller than a regular serialized array or JSON (I am not sure about JSON, since I have not tested it yet, but I suspect this is the case);
Restore a message into a PHP class from its serialized form;
Use basic data types as well as nested messages inside messages.

Of course, serialization consumes resources and usually takes longer than the native one, although on simple messages (with just a few fields) this is barely noticeable. The main gain comes from the smaller data size (especially critical for network transmission and heavy message traffic), as well as from the easy opportunity to build services that do not depend on the implementation language or platform. The only disappointing things are the almost frozen development of the project (meaning pb4php, not Protocol Buffer itself) and the fact that it is the only one of its kind, whereas many other languages have several implementations. It would also be interesting to have a C extension for PHP, which would speed up serialization and restore operations, as well as support from major frameworks such as Zend Framework.

But even in its current form the project is usable, and if you need to quickly learn how to process Google Protocol Buffer messages in PHP, use pb4php!

Source: https://habr.com/ru/post/74573/
