
Java serialization: how to look inside the black box

Java has had a wonderful serialization mechanism since time immemorial. With no special effort it lets you save arbitrarily complex object graphs as a sequence of bytes. The storage format is well documented, there are plenty of examples, serialized objects are reasonably compact and travel over the network easily, and there are many ways to customize the process. It all sounds great, but only until you are left alone with some multi-megabyte binary file that contains very valuable data you need right now.

How do you get into such a file with your bare hands and understand what is stored inside a huge serialized object graph when you do not have the source code? These and many other questions can be answered by Serialysis, a library that lets you analyze serialized Java objects in detail. ("Serialized form" is my rendering of the original term "serial form"; I decided not to stray far from it.) With it you can get information about an object that is not available through its public API. The library is also a useful tool for testing the serialization of your own classes.


From the translator:
Saturday. Evening. I had no plans to work that day, but I suddenly remembered that it would be nice to check how our job on the Hadoop cluster was doing, just to ease my conscience, because the problem had supposedly been solved ...
<Lyrical digression>
Over the previous few days, quite a lot of tasks on our production Hadoop cluster had started to fail with OutOfMemoryError. There was no room left to increase the allocated memory, and together with the IT department we spent a fair amount of time trying to find the cause. It ended when our American colleague looked thoughtfully at the configs, corrected a couple of lines, and said that the problem was solved.
And indeed, everything was fine on Friday, and we were once again glad to have a Cloudera Certified Developer on the team.
</Lyrical digression>

But no such luck!
Hadoop showed that not a single task had completed on that ill-fated Saturday.
The cause of the crashes was somewhat different from the previous one: the task tracker could not start a task because it did not have enough memory to load the task's XML file.

Of course, I immediately wondered what kind of monstrous configuration was stored there. Alas, most of it was a serialized blob of about fifty megabytes. The blob consisted of a graph of objects of a dozen different classes for which I, naturally, had no source code.

So what can be done with this multi-megabyte binary on a Saturday night, using only the tools at hand?

This is where my savior comes on the scene: Serialysis. A couple of lines of code, and I have a complete dump of the internals of the serialized object, with the names of classes and fields. With the full dump in hand, I find the problem, turn on gzip compression for the dictionary of strings, and patch the classes using JBE. Voila, the problem is solved!

This is a hack, of course, but sometimes you cannot get anywhere without hacks.

PS The library is old, but it turned out to be just what I needed. I must admit that some of the uses the author found for the library seem very strange to me. For heaven's sake, if the API does not let you find out the port, then you probably do not really need it! In my opinion, the best use of this technique is debugging and troubleshooting of all kinds; in that area it really has no equal.

Actually, the article:

When a public API is not enough

I wrote the Serialysis library because I ran into several tasks where I needed information about an object that I could not get through its public API, but that was available through its serialized form.

For example, you have a stub for a remote RMI object, and you want to know which address or port it will connect to, or which RMI socket factory (RMIClientSocketFactory) it will use. The standard RMI API does not provide a way to extract this information from a stub. For the stub to function after deserialization, however, this information must be present in its serialized form. Therefore, we could get the information we need if only we could somehow take the serialized stub apart.

The second example is taken from the JMX API. Queries to the MBean server are represented by the QueryExp interface. QueryExp instances are built using the methods of the Query class. If you have an object that implements QueryExp, how do you know which query it performs? The JMX API does not offer any way to find out. The information must be present in the serialized form, so that when a client sends a query to a remote server, the query can be reconstructed on the server. If we can see the serialized form, we can determine what the query was.

The second example prompted me to write this library. The existing standard JMX connectors are based on Java serialization, so they do not require any special handling of QueryExp objects. But the new Web Services connector introduced in JSR 262 uses XML for serialization. How can it analyze a QueryExp in order to convert it to XML? The answer is simple: the WS connector uses a version of this library to look inside the serialized QueryExp.

All these examples have one thing in common: they show gaps in the corresponding APIs. There ought to be methods to extract information from an RMI stub. Likewise, there ought to be a way to convert a QueryExp back to the original Query method that produced it. (Even a standardized toString() output that could be parsed would be enough.) But there are no such methods now, and if we want code that works with these APIs in their current form, we need a different approach.

Digging into the private fields of objects

If you have the source code of the classes you are interested in, there is a great temptation to just reach in and grab the data you want. In the RMI stub example, we can find out by experiment that the stub's getRef() method returns a sun.rmi.server.UnicastRef, and by examining the JDK sources we find that this class contains a field named ref of type sun.rmi.transport.LiveRef, which holds exactly the information we need. So we end up with something like this (but I will say in advance: do not do it):

    import sun.rmi.server.*;
    import sun.rmi.transport.*;
    import java.lang.reflect.Field;
    import java.rmi.*;
    import java.rmi.server.*;

    public class StubDigger {
        // Anti-example: depends on sun.* internals and on a private field.
        public static int getPort(RemoteStub stub) throws Exception {
            RemoteRef ref = stub.getRef();
            UnicastRef uref = (UnicastRef) ref;
            // Reach into the private "ref" field of UnicastRef via reflection.
            Field refField = UnicastRef.class.getDeclaredField("ref");
            refField.setAccessible(true);
            LiveRef lref = (LiveRef) refField.get(uref);
            return lref.getPort();
        }
    }

Perhaps the result will suit you perfectly, but, I repeat, I do not advise doing this: this code is no good. First, never depend on sun.* classes, because no one can guarantee that they will not change beyond recognition in any JDK update; moreover, your code will certainly not be portable to other JDK implementations. Second, when you see something like Field.setAccessible, you should treat it as a stop sign. It means that your code depends on undocumented fields, which may change from release to release or, worse, may remain but with different semantics.

(This code was written for JDK 5. As it turned out, LiveRef acquired a public getPort() method in JDK 6, so Field.setAccessible is no longer needed. But in any case, you should not depend on sun.* classes.)

Of course, sometimes there is simply no better solution. But if the classes you are interested in happen to be serializable, there is a good chance that there is one. The point is that the serialized form of a class is part of its contract. If the API is any good, its external contract will remain compatible with previous versions. This is a very important requirement, in particular for the JDK platform.

So if the information you need is not available through the public methods of a class, but is part of its documented serialized form, you can expect it to remain unchanged in that serialized form.

The description of the serialized form is linked from the "See Also" section of the Javadoc for each serializable class. You can also find the serialized forms of all public JDK classes collected on one huge page.

Hello, Serialysis!

My library for extracting metadata from serialized objects is called Serialysis, a blend of the words "serial" and "analysis".

Let me give you a simple example of how this works. This code ...

    SEntity sint = SerialScan.examine(new Integer(5));
    System.out.println(sint);

... prints this ...

    SObject(java.lang.Integer){
      value = SPrim(int){5}
    }

This tells us that the java.lang.Integer object we passed to SerialScan.examine is serialized as an object with a single field of type int inside. If we check the documented serialized form of java.lang.Integer, we will see that this is exactly what is expected.
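The same fact can be cross-checked without Serialysis, using only the JDK. The following is a minimal sketch, not part of the library: the class name RawIntegerPeek and its helper methods are made up for this illustration. It serializes an Integer, checks the stream's magic number, reads the last four bytes of the stream (which, for an Integer, hold the value of its single int field), and asks ObjectStreamClass which fields belong to the serialized form.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.ObjectStreamConstants;
import java.io.ObjectStreamField;
import java.nio.ByteBuffer;

// Hypothetical helper for illustration; it uses only standard JDK classes.
public class RawIntegerPeek {

    /** Serializes any object into a byte array with standard Java serialization. */
    public static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    /** Returns the two-byte magic number at the very start of the stream. */
    public static short magic(byte[] stream) {
        return ByteBuffer.wrap(stream).getShort(0);
    }

    /** Returns the last four bytes as a big-endian int (the field data of an Integer). */
    public static int trailingInt(byte[] stream) {
        return ByteBuffer.wrap(stream).getInt(stream.length - 4);
    }

    /** Lists the fields that make up the serialized form of a class, as "type name;". */
    public static String describeFields(Class<?> c) {
        StringBuilder sb = new StringBuilder();
        for (ObjectStreamField f : ObjectStreamClass.lookup(c).getFields()) {
            sb.append(f.getType().getName()).append(' ').append(f.getName()).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = serialize(Integer.valueOf(5));
        System.out.println(magic(bytes) == ObjectStreamConstants.STREAM_MAGIC);
        System.out.println(trailingInt(bytes));
        System.out.println(describeFields(Integer.class));
    }
}
```

Reading the trailing bytes only works here because Integer has exactly one primitive field and nothing is written after it; for anything more complex, a real parser such as Serialysis is the practical choice.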

If you look at the java.lang.Integer source code, you will see that the class itself also has a single value field of type int:

    /**
     * The value of the <code>Integer</code>.
     *
     * @serial
     */
    private final int value;

But private fields are implementation details. In some update the field could be renamed, or replaced by one inherited from the parent class java.lang.Number, or changed in some other way. There is no guarantee that this will not happen, but there is a guarantee that the serialized form will remain unchanged. Serialization provides a mechanism for keeping the serialized form as it was even if the fields of the class have changed.
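One such mechanism in the JDK is the serialPersistentFields declaration combined with ObjectOutputStream.putFields() and ObjectInputStream.readFields(). The sketch below is only an illustration under assumed names (StableForm, Counter, current are invented here): the class keeps its state in a field called current, yet its serialized form still exposes a single int field called value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.ObjectStreamField;
import java.io.Serializable;

// Hypothetical example classes; only standard JDK serialization APIs are used.
public class StableForm {

    public static class Counter implements Serializable {
        private static final long serialVersionUID = 1L;

        // The documented serialized form: one int field named "value".
        private static final ObjectStreamField[] serialPersistentFields = {
            new ObjectStreamField("value", int.class)
        };

        // The in-memory field has a different name than the serialized one.
        private int current;

        public Counter(int current) { this.current = current; }

        public int get() { return current; }

        private void writeObject(ObjectOutputStream out) throws IOException {
            ObjectOutputStream.PutField fields = out.putFields();
            fields.put("value", current);   // map the internal field to the serial name
            out.writeFields();
        }

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            ObjectInputStream.GetField fields = in.readFields();
            current = fields.get("value", 0);
        }
    }

    /** Serializes and deserializes a Counter in memory. */
    public static Counter roundTrip(Counter c) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(c);
        }
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            return (Counter) ois.readObject();
        }
    }

    /** Returns the name of the first field in the serialized form of Counter. */
    public static String firstSerialFieldName() {
        return ObjectStreamClass.lookup(Counter.class).getFields()[0].getName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstSerialFieldName());
        System.out.println(roundTrip(new Counter(7)).get());
    }
}
```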

Here is a more complex example. Suppose that for some reason we want to know how big the array inside an ArrayList is. The API does not give us this information, although it does let us force the array to be at least a given size (ensureCapacity).

If we look at the serialized form of ArrayList, we will see that it contains the information we are looking for. It lists a serialized field size, which is the number of items in the list, but that is not what we need. The data written by the writeObject method, however, contains exactly what we want:

Serial Data:
The length of the array backing the ArrayList instance is emitted (int), followed by all of its elements (each an Object) in the proper order.

If we run this code ...

    List<Integer> list = new ArrayList<Integer>();
    list.add(5);
    SObject slist = (SObject) SerialScan.examine(list);
    System.out.println(slist);

... then we get the following output ...

    SObject(java.util.ArrayList){
      size = SPrim(int){1}
      -- data written by class's writeObject:
      SBlockData(blockdata){4 bytes of binary data}
      SObject(java.lang.Integer){
        value = SPrim(int){5}
      }
    }

Here we enter the dark jungle of serialization. In addition to, or instead of, serializing the fields of an object, a class can have a writeObject(ObjectOutputStream) method that writes arbitrary data to the stream using methods such as ObjectOutputStream.writeInt. The class must also contain a matching readObject method that reads the same data, and it should document what writeObject writes using the @serialData tag, just as ArrayList does.

In Serialysis, the writeObject data can be obtained through the SObject.getAnnotations() method, which returns a List. Each object written with ObjectOutputStream.writeObject(Object) is represented in this list as an SObject. Each chunk of data written by one or more consecutive calls to the ObjectOutputStream methods inherited from DataOutput (writeInt, writeUTF, and so on) is represented as an SBlockData. The serialized stream does not make it possible to separate the individual items inside such a chunk; that layout is an agreement between the writer and the reader, documented in the @serialData tag.
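To make the writer/reader agreement concrete, here is a minimal sketch of a class with its own writeObject and readObject. The names BlockDataDemo and Tagged are invented for this illustration. By the description above, the writeInt and writeUTF calls would show up in Serialysis as a single SBlockData, and the payload written with writeObject as a separate entry after it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical example classes; only standard JDK serialization APIs are used.
public class BlockDataDemo {

    public static class Tagged implements Serializable {
        private static final long serialVersionUID = 1L;

        // Transient: none of these are serialized as ordinary fields.
        private transient int count;
        private transient String label;
        private transient Object payload;

        public Tagged(int count, String label, Object payload) {
            this.count = count;
            this.label = label;
            this.payload = payload;
        }

        public int getCount() { return count; }
        public String getLabel() { return label; }
        public Object getPayload() { return payload; }

        /**
         * @serialData an int count, then the label in modified UTF-8
         *             (together one run of block data), then the payload
         *             written as an object.
         */
        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            out.writeInt(count);       // block data
            out.writeUTF(label);       // same run of block data
            out.writeObject(payload);  // a separate object in the stream
        }

        // Must read exactly what writeObject wrote, in the same order.
        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            count = in.readInt();
            label = in.readUTF();
            payload = in.readObject();
        }
    }

    /** Serializes and deserializes a Tagged instance in memory. */
    public static Tagged roundTrip(Tagged t) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(t);
        }
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            return (Tagged) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Tagged copy = roundTrip(new Tagged(3, "demo", Integer.valueOf(5)));
        System.out.println(copy.getCount() + " " + copy.getLabel() + " " + copy.getPayload());
    }
}
```

Nothing in the stream marks where the int ends and the string begins, which is why a reader without the @serialData description has to guess.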

Based on the ArrayList documentation, we can get the size of the array in this way:

    SObject slist = (SObject) SerialScan.examine(list);
    List<SEntity> writeObjectData = slist.getAnnotations();
    // The first annotation is the block data holding the array length.
    SBlockData data = (SBlockData) writeObjectData.get(0);
    DataInputStream din = data.getDataInputStream();
    int alen = din.readInt();
    System.out.println("Array length: " + alen);

How Serialysis solves my example problems

Omitting the full source code, I will only sketch the solution to the QueryExp problem that I mentioned at the beginning. Suppose I have a QueryExp built like this:

    QueryExp query =
        Query.or(Query.gt(Query.attr("Version"), Query.value(5)),
                 Query.eq(Query.attr("SupportsSpume"), Query.value(true)));

This means: "give me the MBeans that have a Version attribute greater than 5 or a SupportsSpume attribute equal to true". The toString() output for this query in the JDK looks like this:

 ((Version) > (5)) or ((SupportsSpume) = (true)) 

And this is what the result of SerialScan.examine looks like:

    SObject(javax.management.OrQueryExp){
      exp1 = SObject(javax.management.BinaryRelQueryExp){
        relOp = SPrim(int){0}
        exp1 = SObject(javax.management.AttributeValueExp){
          attr = SString(String){"Version"}
        }
        exp2 = SObject(javax.management.NumericValueExp){
          val = SObject(java.lang.Long){
            value = SPrim(long){5}
          }
        }
      }
      exp2 = SObject(javax.management.BinaryRelQueryExp){
        relOp = SPrim(int){4}
        exp1 = SObject(javax.management.AttributeValueExp){
          attr = SString(String){"SupportsSpume"}
        }
        exp2 = SObject(javax.management.BooleanValueExp){
          val = SPrim(boolean){true}
        }
      }
    }

It is easy to imagine code that walks this structure and produces an XML equivalent. Every conforming implementation of the JMX API is required to produce exactly the same serialized form, so code that analyzes it is guaranteed to work anywhere.
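The top level of this structure can also be cross-checked with the JDK alone, since ObjectStreamClass reports which fields make up the serialized form of a class. This is only a rough sketch with an invented class name (QueryFormPeek); unlike Serialysis, it does not read any values and does not descend into the nested objects.

```java
import java.io.ObjectStreamClass;
import java.io.ObjectStreamField;
import javax.management.Query;
import javax.management.QueryExp;

// Hypothetical helper for illustration; it uses only standard JDK classes.
public class QueryFormPeek {

    /** Returns "className: field1 field2 ..." for the serialized form of o's class. */
    public static String serialFields(Object o) {
        ObjectStreamClass desc = ObjectStreamClass.lookup(o.getClass());
        StringBuilder sb = new StringBuilder(desc.getName()).append(':');
        for (ObjectStreamField f : desc.getFields()) {
            sb.append(' ').append(f.getName());
        }
        return sb.toString();
    }

    /** Builds the same query as in the article. */
    public static QueryExp sampleQuery() {
        return Query.or(
            Query.gt(Query.attr("Version"), Query.value(5)),
            Query.eq(Query.attr("SupportsSpume"), Query.value(true)));
    }

    public static void main(String[] args) {
        // Prints the class name of the top-level expression and its serial fields.
        System.out.println(serialFields(sampleQuery()));
    }
}
```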

Now the code that solves the problem of the port number in an RMI stub:

    public static int getPort(RemoteStub stub) throws IOException {
        SObject sstub = (SObject) SerialScan.examine(stub);
        List<SEntity> writeObjectData = sstub.getAnnotations();
        SBlockData sdata = (SBlockData) writeObjectData.get(0);
        DataInputStream din = sdata.getDataInputStream();
        // The block data starts with the name of the ref type.
        String type = din.readUTF();
        if (type.equals("UnicastRef"))
            return getPortUnicastRef(din);
        else if (type.equals("UnicastRef2"))
            return getPortUnicastRef2(din);
        else
            throw new IOException("Can't handle ref type " + type);
    }

    private static int getPortUnicastRef(DataInputStream din) throws IOException {
        String host = din.readUTF();   // skip the host to reach the port
        return din.readInt();
    }

    private static int getPortUnicastRef2(DataInputStream din) throws IOException {
        byte hasCSF = din.readByte();  // skip the client socket factory flag
        String host = din.readUTF();   // skip the host to reach the port
        return din.readInt();
    }

To understand it, take a look at the description of the serialized form of RemoteObject.

This code is admittedly intricate, but it is portable and future-proof. I do not think there is any need to explain how to extract all the other data from RMI stubs; the same approach applies.

Conclusion

Most likely, you will not want to dig into serialized forms unless there is a serious need. But if you cannot do without it, Serialysis can make your task considerably easier.

It is also a good way to check that your own classes are serialized exactly as you expect.

Download the Serialysis library from here: http://weblogs.java.net/blog/emcmanus/serialysis.zip .

Source: https://habr.com/ru/post/195736/

