
GSON: adding a little rigor, and solving the memory-overflow problem when processing large JSON files.

Many readers have probably come across Google's GSON library, which easily turns JSON into Java objects and back.

For those who haven't, there is a short description under the spoiler. Below I also describe GSON-based solutions to two problems I actually ran into at work (the solutions are not necessarily optimal or the best, but perhaps they will be useful to someone):

1) Checking that no fields from the JSON file were silently dropped, and that all the required fields of the Java class were filled (making GSON stricter);
2) Parsing manually with GSON when you have to handle a very large JSON file and want to avoid an out-of-memory error.

So, first, GSON in a nutshell:

(Those who already know GSON will most likely not find this interesting and can skip it.)

GSON lets you convert JSON into Java objects (and back) in literally two lines. It is very often used for integration between different platforms and systems, for serialization and deserialization, and for interaction between a JavaScript web front end and a Java backend.

Say we have the following JSON, received from another application:

```json
{
  "summary": {
    "test1_id": "1444415",
    "test2_id": "4444935"
  },
  "results": {
    "details": [
      { "test1_id": "1444415", "test2_id": "4444935" },
      { "test1_id": "1444415", "test2_id": "4444935" }
    ]
  }
}
```

We describe a similar structure in Java objects (getters, setters, etc. are omitted for simplicity):

```java
static class JsonContainer {
    DataContainer summary;
    ResultContainer results;
}

static class ResultContainer {
    List<DataContainer> details;
}

static class DataContainer {
    String test1_id;
    String test2_id;
}
```


And literally two lines convert one into the other:

```java
Gson gson = new GsonBuilder().create();
JsonContainer jsonContainer = gson.fromJson(json, JsonContainer.class); // JSON -> Java
// ... and back:
String json = gson.toJson(jsonContainer); // Java -> JSON
```


As you can see, everything is very simple. If we don't like field names that break Java naming conventions, we use the @SerializedName annotation, that is, we write:

```java
static class JsonContainer {
    DataContainer summary;
    ResultContainer results;
}

static class ResultContainer {
    List<DataContainer> details;
}

static class DataContainer {
    @SerializedName("test1_id")
    String test1Id;

    @SerializedName("test2_id")
    String test2Id;
}
```

Naturally, field types are not limited to String: primitives and their wrappers, enums, dates (with a configurable format), generic types, and much more are handled automatically. For enum values you can also use @SerializedName when the value in the JSON does not match the enum constant's name. And of course you can register your own handlers for individual classes, like this:

```java
Gson gson = new GsonBuilder()
        .registerTypeAdapter(DataContainer.class, new DataContainerDeserializer<DataContainer>())
        .create();

class DataContainerDeserializer<T> implements JsonDeserializer<T> {
    @Override
    public T deserialize(JsonElement json, Type type, JsonDeserializationContext context)
            throws JsonParseException {
        // ... inspect the JsonElement and build the object manually
        return /* the constructed Java object */;
    }
}
```

Serialization can be customized the same way with a JsonSerializer.
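To make the enum and date handling mentioned above concrete, here is a minimal sketch; the `Task` class, the `Status` enum, and the field names are invented for illustration, while `setDateFormat` and `@SerializedName` on enum constants are real GSON features:

```java
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.annotations.SerializedName;

import java.util.Date;

public class EnumDateExample {
    // Hypothetical enum whose JSON values differ from the constant names
    enum Status {
        @SerializedName("in_progress") IN_PROGRESS,
        @SerializedName("done") DONE
    }

    // Hypothetical container with an enum field and a date field
    static class Task {
        Status status;
        Date created;
    }

    public static void main(String[] args) {
        Gson gson = new GsonBuilder()
                .setDateFormat("yyyy-MM-dd") // format used to parse and print Date fields
                .create();
        Task task = gson.fromJson(
                "{\"status\": \"in_progress\", \"created\": \"2015-01-15\"}",
                Task.class);
        System.out.println(task.status);       // IN_PROGRESS
        System.out.println(gson.toJson(task)); // enum is written back as "in_progress"
    }
}
```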



Problem number 1. Unstable conversions


When converting from JSON to Java, GSON silently ignores all fields that are missing from the Java class and pays no attention to @NotNull annotations. This seems harmless enough; for many purposes (for example, the evolution of classes during serialization/deserialization) it is genuinely convenient. But imagine we have integrated with another company's system, and suddenly a field changes (by mistake, because the developers "on the other side" had a rule we missed in the design of their system, something like "after midnight the carriage turns into a pumpkin, that is, field1 becomes field2", or for a million other reasons). Or they added an important field but forgot to tell us. It is even worse if the integration works in both directions: system A sends us an object with extra fields (which we knew nothing about), GSON ignores them, we do something with the object and send it back to system A, and they happily write it to their database, deciding that we deleted the extra fields for reasons of our own. It is a game of broken telephone in action, and you are lucky if QA or analysts catch it on some side; they may not.

I could not find a decent way in GSON itself to make it stricter. Yes, one could bolt on separate validation with JSON schemas or do the validation manually, but it seemed much better to use GSON's own machinery, namely a JsonDeserializer turned into a validator (perhaps someone can suggest a better way):

Big source code
```java
package com.test;

import com.google.common.collect.ObjectArrays;
import com.google.gson.*;
import com.google.gson.annotations.SerializedName;
import gnu.trove.set.hash.THashSet;

import javax.validation.constraints.NotNull;
import java.lang.annotation.Annotation;
import java.lang.reflect.Field;
import java.lang.reflect.Type;
import java.util.*;

public class TestGson {
    private static String json = "{\n" +
            "  \"summary\": {\n" +
            "    \"test1_id\": \"1444415\",\n" +
            "    \"test2_id\": \"4444935\"\n" +
            "  },\n" +
            "  \"results\": {\n" +
            "    \"details\": [\n" +
            "      { \"test1_id\": \"1444415\", \"test2_id\": \"4444935\" },\n" +
            "      { \"test1_id\": \"1444415\", \"test2_id\": \"4444935\" }\n" +
            "    ]\n" +
            "  }\n" +
            "}";

    public static void main(String[] args) {
        Gson gson = new GsonBuilder()
                // validate every DataContainer during deserialization
                .registerTypeAdapter(DataContainer.class, new ValidateDeserializer<DataContainer>())
                .create();
        JsonContainer jsonContainer = gson.fromJson(json, JsonContainer.class);
    }

    static class JsonContainer {
        DataContainer summary;
        ResultContainer results;
    }

    static class ResultContainer {
        List<DataContainer> details;
    }

    static class DataContainer {
        @NotNull
        @SerializedName("test1_id")
        String test1Id;

        @SerializedName("test2_id")
        String test2Id;
    }

    static class ValidateDeserializer<T> implements JsonDeserializer<T> {
        private Set<String> fields = null;        // all field names of the class
        private Set<String> notNullFields = null; // field names annotated with @NotNull

        private void init(Type type) {
            Class cls = (Class) type;
            // collect both declared and public (possibly inherited) fields
            Field[] fieldsArray = ObjectArrays.concat(cls.getDeclaredFields(), cls.getFields(), Field.class);
            fields = new THashSet<String>(fieldsArray.length);
            notNullFields = new THashSet<String>(fieldsArray.length);
            for (Field field : fieldsArray) {
                String name = field.getName().toLowerCase(); // compare names case-insensitively
                Annotation[] annotations = field.getAnnotations();
                boolean isNotNull = false;
                for (Annotation annotation : annotations) {
                    if (annotation instanceof NotNull) {
                        isNotNull = true;
                    } else if (annotation instanceof SerializedName) {
                        // if there is a @SerializedName, its value is the key we expect in the JSON
                        name = ((SerializedName) annotation).value().toLowerCase();
                    }
                }
                fields.add(name);
                if (isNotNull) {
                    notNullFields.add(name);
                }
            }
        }

        @Override
        public T deserialize(JsonElement json, Type type, JsonDeserializationContext context)
                throws JsonParseException {
            if (fields == null) {
                init(type); // build the field sets once, on first use
            }
            Set<Map.Entry<String, JsonElement>> entries = json.getAsJsonObject().entrySet();
            Set<String> keys = new THashSet<String>(entries.size());
            for (Map.Entry<String, JsonElement> entry : entries) {
                if (!entry.getValue().isJsonNull()) { // skip JSON keys whose value is null
                    keys.add(entry.getKey().toLowerCase());
                }
            }
            if (!fields.containsAll(keys)) {
                // the JSON contains keys that the Java class does not declare
                throw new JsonParseException("Parse error! The JSON has keys that aren't found in the Java object: " + type);
            }
            if (!keys.containsAll(notNullFields)) {
                // a @NotNull field of the Java class is absent from the JSON
                throw new JsonParseException("Parse error! A NotNull field is absent in the JSON for object: " + type);
            }
            return new Gson().fromJson(json, type); // delegate the actual parsing to a plain Gson
        }
    }
}
```


What we actually do here is described in enough detail in the comments, but in short: we attach the JsonDeserializer to the class (or classes) we want to check. On the first call it uses reflection to collect the class's fields and their annotations (the results are cached, so we don't pay for reflection again). If we then find extra fields in the JSON, or if fields we marked @NotNull are missing, we fail with a JsonParseException. Naturally, in production you can fail more gently, writing the errors to a log or to a separate collection. Either way, we find out immediately that "these are the wrong bees and they make the wrong honey" and can change something before important data is lost. GSON now works strictly.
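The same strict-checking idea can be shown in a self-contained miniature. This is a sketch, not the listing above: the `Point` class, `StrictPointDeserializer`, and the hard-coded key set are invented for illustration, and it also demonstrates the "fail more gently" option of catching the JsonParseException instead of crashing:

```java
import com.google.gson.*;
import java.lang.reflect.Type;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SoftFailExample {
    static class Point { int x; int y; }

    // Minimal strict deserializer: rejects keys that Point does not declare.
    static class StrictPointDeserializer implements JsonDeserializer<Point> {
        private static final Set<String> KNOWN = new HashSet<>(Arrays.asList("x", "y"));

        @Override
        public Point deserialize(JsonElement json, Type type, JsonDeserializationContext ctx)
                throws JsonParseException {
            for (Map.Entry<String, JsonElement> entry : json.getAsJsonObject().entrySet()) {
                if (!KNOWN.contains(entry.getKey())) {
                    throw new JsonParseException("Unknown key: " + entry.getKey());
                }
            }
            // a fresh Gson without this adapter avoids infinite recursion
            return new Gson().fromJson(json, Point.class);
        }
    }

    public static void main(String[] args) {
        Gson gson = new GsonBuilder()
                .registerTypeAdapter(Point.class, new StrictPointDeserializer())
                .create();
        try {
            gson.fromJson("{\"x\": 1, \"y\": 2, \"z\": 3}", Point.class);
        } catch (JsonParseException e) {
            // In production: log the mismatch and alert, rather than crash.
            System.err.println("Strict parsing failed: " + e.getMessage());
        }
    }
}
```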

Problem number 2. Large files and memory overflow


As far as I know, GSON reads all data into memory at once: a fromJson call gives us a heavy object holding the entire JSON structure. While the JSON files are small this is not a problem, but if an array of a couple of million objects suddenly shows up, we risk an OutOfMemoryError. Of course, one could abandon GSON, but working with two different JSON parsing libraries in one project was undesirable for a number of reasons. Fortunately, there is com.google.gson.stream.JsonReader, which parses JSON token by token without loading everything into memory (so you can, say, spill intermediate results to disk in some format or periodically write them to a database). In fact, GSON itself is built on top of JsonReader. The general algorithm for working with JsonReader is also very simple (I will only sketch the essence, since the details depend on the structure of each particular JSON, and the JsonReader Javadoc has excellent usage examples):

```java
// reader can be any java.io.Reader, e.g. a FileReader over the large JSON file
JsonReader jsonReader = new JsonReader(reader);
```

JsonReader has the following methods:

- hasNext() — whether the current object or array has more elements;
- peek() — the type of the next token (a name, a string, the start of an object or array, etc.);
- skipValue() — skip the next value;
- beginObject(), beginArray() — enter the object/array at the current position;
- endObject(), endArray() — leave the current object/array;
- nextName(), nextString(), nextInt(), … — read the next name or value.


Note that hasNext() applies only to the current object/array, not to the whole file (this turned out to be unexpected for me), and that you should always carefully check the token type with peek(). Parsing large files this way is of course somewhat less convenient than a single fromJson() call, but for a simple JSON structure it can be written in a few hours. If you know a better way to make GSON work through a file piece by piece without building a heavy object in memory, please write it in the comments; I would be very grateful (the only alternative that occurred to me was to collect the parsed objects inside a JsonDeserializer and return null, but that looks much less elegant than honest token parsing). And to answer in advance: for a number of reasons I did not want to use other libraries in this case, but advice on which libraries solve these problems more easily would also be useful to me.
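Putting the methods above together, here is a self-contained sketch that streams the "details" array of the JSON from the beginning of the article one element at a time; the `countDetails` method is invented for illustration (a real handler would build and process each element instead of just counting), while all the JsonReader calls are real API:

```java
import com.google.gson.stream.JsonReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class StreamingExample {
    // Walks the token stream so that only one "details" element's worth of
    // tokens is ever in flight; here we merely count the elements.
    public static int countDetails(Reader source) throws IOException {
        int count = 0;
        JsonReader reader = new JsonReader(source);
        reader.beginObject();                        // outer { ... }
        while (reader.hasNext()) {                   // keys of the outer object only
            if ("results".equals(reader.nextName())) {
                reader.beginObject();                // "results": { ... }
                while (reader.hasNext()) {
                    if ("details".equals(reader.nextName())) {
                        reader.beginArray();         // "details": [ ... ]
                        while (reader.hasNext()) {
                            reader.skipValue();      // process one element here
                            count++;
                        }
                        reader.endArray();
                    } else {
                        reader.skipValue();
                    }
                }
                reader.endObject();
            } else {
                reader.skipValue();                  // e.g. the "summary" object
            }
        }
        reader.endObject();
        reader.close();
        return count;
    }

    public static void main(String[] args) throws IOException {
        String json = "{\"summary\": {\"test1_id\": \"1\"}, "
                + "\"results\": {\"details\": [{\"test1_id\": \"1\"}, {\"test1_id\": \"2\"}]}}";
        System.out.println(countDetails(new StringReader(json))); // 2
    }
}
```

In a real application, `skipValue()` would be replaced by reading the element's fields with nextName()/nextString() and writing the result to a database or disk before moving on.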

Thank you all for your attention.

**P.S.** I also advise you to look at my open-source project [useful-java-links](https://github.com/Vedenin/useful-java-links/tree/master/link-rus), perhaps the most complete collection of useful Java libraries, frameworks, and Russian-language instructional videos. There is also a similar [English version](https://github.com/Vedenin/useful-java-links/) of this project, and I have started the open-source subproject [Hello world](https://github.com/Vedenin/useful-java-links/tree/master/helloworlds) to prepare a collection of simple examples for different Java libraries in one Maven project (I will be grateful for any help).

Source: https://habr.com/ru/post/245263/

