By reverse engineering, in this context, I mean recovering a message layout as close as possible to the one the developers originally used. There are several ways to do this. First, if we have access to the client application, and the developers neither stripped the debugging symbols nor linked against the LITE version of the protobuf library, then recovering the original .proto files is not difficult. Second, if the developers did use the LITE build of the library, this complicates the reverser's life but does not make the effort pointless: with some skill, .proto files quite close to the originals can still be recovered.
In this article I describe some techniques for reversing protobuf messages, which led to my protodec project. Everything said here applies to version 2 of the protobuf message encoding (version 3 is not yet supported, and neither are packed fields).
Preparation
To begin, I will create objects for research. We will need 2 files:
addressbook.proto

    package tutorial;
    option optimize_for = LITE_RUNTIME;

    message Person {
      required string name = 1;
      required int32 id = 2;
      optional string email = 3;

      enum PhoneType {
        MOBILE = 0;
        HOME = 1;
        WORK = 2;
      }

      message PhoneNumber {
        required string number = 1;
        optional PhoneType type = 2 [default = HOME];
      }

      repeated PhoneNumber phone = 4;
    }

    message AddressBook {
      repeated Person person = 1;
    }
tut.cpp

    #include <iostream>
    #include <cassert>
    #include <string>
    #include "addressbook.pb.h"

    int main() {
      GOOGLE_PROTOBUF_VERIFY_VERSION;

      tutorial::AddressBook book;
      tutorial::Person * person = book.add_person();
      person->set_id(1234);
      person->set_name("John Doe");
      person->set_email("jdoe@example.com");

      tutorial::Person_PhoneNumber * phone = person->add_phone();
      phone->set_number("555-4321");
      phone->set_type(tutorial::Person_PhoneType_HOME);

      std::string data = book.SerializeAsString();
      assert(!data.empty());
      std::cout.write(&data[0], data.size());

      google::protobuf::ShutdownProtobufLibrary();
    }
Save them and build everything. If you do not know what protoc is, read the introduction to the Protobuf library for your programming language first.
protoc --cpp_out=. addressbook.proto && g++ addressbook.pb.cc tut.cpp `pkg-config --cflags --libs protobuf` -s -o tut.lite.exe && ./tut.lite.exe > A
Delete or comment out the second line of addressbook.proto (option optimize_for = LITE_RUNTIME;) and run:
protoc --cpp_out=. addressbook.proto && g++ addressbook.pb.cc tut.cpp `pkg-config --cflags --libs protobuf` -o tut.exe && ./tut.exe > B
After running these commands, we have two executables, tut.lite.exe and tut.exe, built against the LITE and the full version of libprotobuf, respectively. Both programs do the same thing: they create a protobuf message and write it to std::cout. We also have two binary files named A and B. The first was produced by the lite build, the second by the full build. Their contents are identical. The screenshot below shows the binary representation of this message and its textual form:
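To make the binary form easier to follow, here is a minimal sketch of how a field ends up on the wire: each field starts with a key (the field number shifted left by three bits, combined with a wire type), and integers are written as base-128 varints. The helper names appendVarint, appendKey and appendString are illustrative assumptions, not part of libprotobuf or protodec.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Illustrative helpers, not part of libprotobuf or protodec.

// Appends `value` as a base-128 varint: 7 payload bits per byte,
// high bit set on every byte except the last.
void appendVarint(std::string& out, std::uint64_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<char>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<char>(value));
}

// Appends a field key: (field number << 3) | wire type.
void appendKey(std::string& out, std::uint32_t fieldNumber, std::uint32_t wireType) {
    assert(fieldNumber > 0 && wireType < 8);
    appendVarint(out, (static_cast<std::uint64_t>(fieldNumber) << 3) | wireType);
}

// Appends a length-delimited field (wire type 2): key, length, raw bytes.
void appendString(std::string& out, std::uint32_t fieldNumber, const std::string& s) {
    appendKey(out, fieldNumber, 2);
    appendVarint(out, s.size());
    out += s;
}
```

With these helpers, appendVarint(out, 1234) produces the two bytes D2 09, and encoding the sample PhoneNumber by hand (appendString for field 1 with "555-4321", then appendKey for field 2 with wire type 0 and appendVarint with 1 for HOME) gives 12 bytes that start with 0A 08 and end with 10 01.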

Now delete addressbook.proto and try to recover it.
Recovering the message schema from Descriptor data in the executable
Look at the contents of the addressbook.pb.cc file generated earlier by protoc. The function of interest is protobuf_AddDesc_addressbook_2eproto. One of the first things it does is call ::google::protobuf::DescriptorPool::InternalAddGeneratedFile, whose first argument is a serialized Descriptor protobuf message describing the structure of the original messages.
It contains information about enumerations, import lists, messages, the names and data types of their fields, and so on. The format is not a secret and ships with the source code; see google/protobuf/descriptor.proto. This data is used for reflection, for debug output of message contents, and so on.
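As an illustration of why this data is easy to find, here is a heuristic sketch of locating candidate descriptors in a binary image. It is not necessarily how protodec works; it relies on the assumption that a serialized file descriptor normally begins with field 1 (the file name), which on the wire is the byte 0x0A, a length, and a name ending in ".proto". The function name findDescriptorCandidates is hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Heuristic sketch, not protodec's actual algorithm.
// Returns (offset, file name) pairs for every place in `image` that looks
// like the start of a serialized file descriptor: 0x0A, length, "*.proto".
std::vector<std::pair<std::size_t, std::string>>
findDescriptorCandidates(const std::string& image) {
    static const std::string suffix = ".proto";
    std::vector<std::pair<std::size_t, std::string>> hits;
    for (std::size_t i = 0; i + 2 <= image.size(); ++i) {
        if (static_cast<unsigned char>(image[i]) != 0x0A) continue;
        const std::size_t len = static_cast<unsigned char>(image[i + 1]);
        // Only single-byte lengths (< 128) are handled in this sketch.
        if (len <= suffix.size() || len >= 128) continue;
        if (i + 2 + len > image.size()) continue;
        const std::string name = image.substr(i + 2, len);
        assert(name.size() == len);  // guaranteed by the bounds check above
        if (name.compare(len - suffix.size(), suffix.size(), suffix) != 0) continue;
        bool printable = true;
        for (unsigned char c : name) printable = printable && std::isprint(c) != 0;
        if (printable) hits.emplace_back(i, name);
    }
    return hits;
}
```

Each hit is only a candidate: a real tool would still have to parse the bytes that follow as a descriptor message and discard the offsets where parsing fails.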
The protodec utility searches a binary file for Descriptor data and can save the recovered .proto files. To do this, run the command:
protodec --grab tut.exe
In response, we will see something like this:

In other words, we end up with almost the original .proto file.
Recovering the schema from message bytes
If we have no access to the application (for example, it runs somewhere on a server), getting to the Descriptor data is problematic. The same applies if the application is compiled with LITE optimization: reflection is not used, so the Descriptor data for the .proto files is not generated at compile time, and the method described above cannot recover the original .proto files. In this case, you can try to analyze the contents of the protobuf messages themselves. Note that all of them must have exactly the same structure (the root message must be the same for all of them). We need as many messages as possible; the more data they contain, the better the final result.
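The starting point for this kind of analysis is splitting raw bytes into fields without knowing the schema. A minimal sketch of such a walker follows; WireField, readVarint, readFixed and walkFields are illustrative names, not protodec's API, and the deprecated group wire types (3 and 4) are deliberately left unsupported.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One decoded field. `payload` is filled for length-delimited fields,
// `value` for varint and fixed-width fields.
struct WireField {
    std::uint32_t number = 0;
    std::uint32_t wireType = 0;  // 0 varint, 1 fixed64, 2 length-delimited, 5 fixed32
    std::uint64_t value = 0;
    std::string payload;
};

// Reads one varint at `pos` and advances `pos`; false on truncated input.
bool readVarint(const std::string& buf, std::size_t& pos, std::uint64_t& out) {
    assert(pos <= buf.size());
    out = 0;
    for (int shift = 0; shift < 70 && pos < buf.size(); shift += 7) {
        const std::uint8_t byte = static_cast<std::uint8_t>(buf[pos++]);
        out |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) return true;  // high bit clear: last byte
    }
    return false;
}

// Reads a little-endian fixed-width integer of `width` bytes (4 or 8).
bool readFixed(const std::string& buf, std::size_t& pos, std::size_t width, std::uint64_t& out) {
    if (buf.size() - pos < width) return false;
    out = 0;
    for (std::size_t i = 0; i < width; ++i)
        out |= static_cast<std::uint64_t>(static_cast<std::uint8_t>(buf[pos + i])) << (8 * i);
    pos += width;
    return true;
}

// Splits `buf` into its top-level fields. Returns false if the bytes do not
// form a well-formed sequence of fields.
bool walkFields(const std::string& buf, std::vector<WireField>& fields) {
    fields.clear();
    std::size_t pos = 0;
    while (pos < buf.size()) {
        std::uint64_t key = 0;
        if (!readVarint(buf, pos, key)) return false;
        WireField f;
        f.number = static_cast<std::uint32_t>(key >> 3);
        f.wireType = static_cast<std::uint32_t>(key & 0x7);
        if (f.number == 0) return false;  // field number 0 is never valid
        switch (f.wireType) {
            case 0: if (!readVarint(buf, pos, f.value)) return false; break;
            case 1: if (!readFixed(buf, pos, 8, f.value)) return false; break;
            case 5: if (!readFixed(buf, pos, 4, f.value)) return false; break;
            case 2: {
                std::uint64_t len = 0;
                if (!readVarint(buf, pos, len) || len > buf.size() - pos) return false;
                f.payload = buf.substr(pos, static_cast<std::size_t>(len));
                pos += static_cast<std::size_t>(len);
                break;
            }
            default: return false;  // groups (3, 4) are not handled in this sketch
        }
        fields.push_back(f);
    }
    return true;
}
```

A length-delimited payload may be a string, raw bytes, or a nested message. One possible heuristic is to call walkFields on the payload again and treat it as a nested message only when it parses cleanly; this is a guess that can misfire on short strings, which is another reason to examine many samples.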
The protodec program can reconstruct the schema, including field types, of a protobuf message loaded from a file. To do this, run the command:
protodec --schema A

This output means that 3 message types were detected in this protobuf message (loaded from file A). Compared with the original addressbook.proto, the overall correspondence is easy to see: MSG1 is Person::PhoneNumber, MSG2 is Person, and MSG3 is AddressBook. The most noticeable discrepancies are:
- The MSG3.fld1 field should be repeated. The original message has only one element in AddressBook.person, and at the binary level a repeated field with a single element cannot be distinguished from a singular field. With at least 2 elements in AddressBook.person, it would have been detected correctly. That is why we need several messages of the same schema, filled in as completely as possible;
- Some required fields should be optional. This is also resolved by analyzing a large number of messages, which reveals which fields are always present and which are sometimes absent;
- The MSG2.fld2 field should be int32, but it is reported as int64. At the wire level, protobuf stores all integer types (int32, int64, uint32, uint64, sint32, sint64, bool, enum) as a Varint. Whether the values in a field are signed or unsigned has to be inferred from context; int64 is chosen so that the largest possible integer value fits in the programming language used.
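The first two points can be sketched as a simple counting rule over many samples of the same message type. This is an illustrative heuristic under my own assumptions, not protodec's actual logic; inferLabels is a hypothetical name. Each sample is the list of field numbers seen in one message instance.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Guesses a label for each field number:
//   "repeated" if some sample contains the field more than once,
//   "required" if every sample contains it exactly once,
//   "optional" otherwise.
std::map<int, std::string> inferLabels(const std::vector<std::vector<int>>& samples) {
    std::map<int, std::size_t> presentIn;  // number of samples containing the field
    std::map<int, std::size_t> maxCount;   // most occurrences within one sample
    for (const auto& sample : samples) {
        std::map<int, std::size_t> count;
        for (int number : sample) ++count[number];
        for (const auto& entry : count) {
            ++presentIn[entry.first];
            if (entry.second > maxCount[entry.first]) maxCount[entry.first] = entry.second;
        }
    }
    std::map<int, std::string> labels;
    for (const auto& entry : presentIn) {
        assert(entry.second <= samples.size());
        if (maxCount[entry.first] > 1) labels[entry.first] = "repeated";
        else if (entry.second == samples.size()) labels[entry.first] = "required";
        else labels[entry.first] = "optional";
    }
    return labels;
}
```

With a single sample, every field is seen exactly once and is therefore guessed as required, which matches the discrepancies listed above. A "required" verdict is only ever a guess: an optional field that happens to be set in every captured message looks the same.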
The names of both fields and messages are generated automatically; this metadata cannot be extracted from the body of a protobuf message because it is simply not there. You can rename messages and fields gradually, as their purpose becomes more or less clear from the context of the messages being examined. Sometimes this information can also be found in the application itself, in its export list. For that we need any tool that can show it, for example IDA. Here, for instance, we retrieved the names and order of the fields of the tutorial::Person message, which has 4 fields:
We do the same for the rest of the messages and as a result we get almost the original .proto file.
Verification
As a result, we have approximately this .proto file:
tut2.proto

    package ProtodecMessages;

    message PHONE {
      required string Number = 1;
      required int64 Type = 2;
    }

    message PERSON {
      required string Name = 1;
      required int64 Id = 2;
      required string Email = 3;
      required PHONE Phone = 4;
    }

    message ADDRESSBOOK {
      repeated PERSON Person = 1;
    }
Let us write a small program to check that the recovered schema can be used to parse and modify the original messages.
tut2.cpp

    #include <cassert>
    #include <cstdio>
    #include <iostream>
    #include <string>
    #include "tut2.pb.h"

    int main() {
      GOOGLE_PROTOBUF_VERIFY_VERSION;

      // Read the serialized protobuf message from std::cin.
      std::string data;
      ProtodecMessages::ADDRESSBOOK book;
      while (std::cin.peek() != EOF)
        data.push_back((char)std::cin.get());

      // Parse it with the recovered schema. The checks are explicit rather
      // than wrapped in assert(), so they still run when NDEBUG is defined.
      if (!book.ParseFromString(data) || book.person_size() == 0) {
        std::cerr << "could not parse the message with the recovered schema\n";
        return 1;
      }

      // Modify the first person.
      ProtodecMessages::PERSON * person = book.mutable_person(0);
      person->set_email("fake@name.com");
      person->set_id(4321);

      // Write the modified message to std::cout.
      data = book.SerializeAsString();
      assert(!data.empty());
      std::cout.write(&data[0], data.size());

      // Optional: Delete all global objects allocated by libprotobuf.
      google::protobuf::ShutdownProtobufLibrary();
    }
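Compile it (the resulting tut2.exe reads a message, such as file A, from standard input and writes the modified message to standard output):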
protoc --cpp_out=. tut2.proto && g++ tut2.pb.cc tut2.cpp `pkg-config --cflags --libs protobuf` -o tut2.exe