Java String Internal Representation Changes

Introduction

This is my first translation of the article. I decided to dedicate it to Java, oddly enough. This article describes the changes to the string data type that occurred in Java.

IMPORTANT: In Java 7 update 17, as before, nothing has changed regarding the article.

Splitting char [] character array

In the present implementation of the String class, there are 4 instance fields: the char [] value character array contains the string characters, int offset is the index of the first character used in the value array, int count is the number of characters used, and int hash is the cached value of the calculated hash code for this string . As you can see, in most cases the string will have the values offset = 0 and count = value.length. The only exception to this rule is possibly when a string is created by calling the viaString.substring method or by any method that uses this method (for example, split).

String.substring creates a string that separates the internal char [] character array with the original string, which allows you to:

Save memory using it together;
Make the time to make the String.substring method constant (O (1)).

At the same time, with such an implementation a memory leak is possible: if you extract a small substring from a large string and you no longer need a large string, you will still have a link to a large array of characters of the original string. The only way to avoid this is to call the constructor new String (String), which will force you to create a copy of the character array, that is, detach your small string from the large “parent”.
')
In Java 1.7.0_06, the offset and count fields have been removed from the String class. This means that we can no longer share an array of characters. Now we can forget about the memory leaks described above and never use the new String (String) constructor again. As a drawback, you should always remember that the String.substring method now has linear complexity, unlike the constant that was before.

Changes in hashing logic

Another change introduced in the String class in the same update: a new hashing algorithm. Oracle says the new algorithm gives better distribution of hash codes, which should improve the performance of some hashing-based collections: HashMap, Hashtable, HashSet, LinkedHashSet, WeakHashMap, and ConcurentHashMap. Unlike the changes described at the beginning of the article, these changes are experimental and are turned off by default.

As you probably noticed, these changes are true only for the case when the keys of the collections are strings. If you want to enable these changes, you must set the value of the jdk.map.althashing.threshold system property to a non-negative value (by default, it is -1). This value is the threshold of the collection size, after which the new hashing algorithm will be used. Small correction: the hashing algorithm will be changed only when there is a shortage of memory. If the collection was redistributed with size = 160 and jdk.map.althashing.threshold = 200, then the hash method will be changed only when the collection size grows to 320.

The String class now has a hash32 () method, the result of which is cached in the int field hash32. The result of the hash32 () method for the same lines may be different for different JVM launches (in fact, it will be almost always different, because it uses System.currentTimeMillis () and two System.nanoTime calls for initial initialization). As a result, the iteration on the same collection may be different for different program launches. In fact, I was a little surprised by this method. Why do we need it if the real implementation of the hashCode method works well enough? I decided to check how many collisions will be obtained using the hash32 () method. The String.hash32 () method is not public, but I looked at the implementation of the HashMap class to determine how it calls it. Answer: sun.misc.Hashing.stringHash32 (String). On a test containing one million strings, the method yielded 304 collisions, compared to more than one collision using the String.hashCode method. I think we should wait for further improvements.

New hashing logic can seriously affect multi-threaded code

Oracle made a mistake when implementing hashing in the following classes: HashMap, Hashtable, HashSet, LinkedHashSet, WeakHashMap. Only ConcurentHashMap does not contain this error. The problem is that all non-parallel classes now have a field:

/** *            . */ transient final int hashSeed = sun.misc.Hashing.randomHashSeed(this);

This means that each time the above collections are created, the sun.misc.Hashing.randomHashSeed method will be called. In turn, the randomHashSeed method calls the java.util.Random.nextInt method. As you know, the Random class is not very friendly with multithreading: it contains the private final AtomicLong seed field. Atomics work well with average load, but bad with a large load.

As a result, many high-load web applications processing HTTP / JSON / XML can be affected by this error, because almost all known parsers (analyzers) use at least one of the affected classes. All these parsers can create nested map collections, which will increase the number of created map collections per second.

How to solve this problem?

1. ConcurrentHashMap path: call the randomHashSeed method only when the jdk.map.althashing.threshold system property is defined. However, this is only available for kernel JDK developers.

 /** *            . */ private transient final int hashSeed = randomHashSeed(this); private static int randomHashSeed(ConcurrentHashMap instance) { if (sun.misc.VM.isBooted() && Holder.ALTERNATIVE_HASHING) { return sun.misc.Hashing.randomHashSeed(instance); } return 0; }

2. Hacker path: change class sun.misc.Hashing. It is not recommended. If you do decide to do this, you can do this: the java.util.Random class is not final. You can extend it and override its nextInt method by returning, for example, a constant value. Then you need to change the sun.misc.Hashing.Holder.SEED_MAKER field by setting it to your class inherited from Random. Do not worry that this field is private, static and final - a reflection will help you.

Summary

Starting with Java 1.7.0_06, the String.substring method always creates a new array of char [] characters for each string it creates. This means that now this method has linear complexity in comparison with the constant one as in previous versions. As a result of this change, the string takes 4 bytes less than before, and memory leaks caused by the String.substring method are guaranteed to be eliminated.
Starting from the same update, the second method for hashing, called hash32, appeared in the String class. This method is not public and can be accessed without reflection only through a call to the sun.misc.Hashing.stringHash32 (String) method. This method is used in JDK 7 hashing-based collections if their size exceeds the value of the jdk.map.althashing.threshold system property. This is an experimental method, and I do not recommend using it in your code.
Starting with Java 1.7.0_06, all standard non-threaded collections have an undesirable side effect caused by the new hashing algorithm. This error manifests itself only in multi-threaded applications that create multiple map collections per second.

Push

The fact that strings in Java separated an array of characters indicates that they were persistent. As is well known, persistent structures are structures that retain all their previous versions when they change. This was done explicitly, as an optimization to reduce the amount of memory used + to make the substring method run time constant. However, the benefits of this optimization are not so great, which is why this idea was abandoned in version 7 of Java.

In .NET, the strings were initially not persistent. Here is a link to an interesting article by Eric Lippert in which he explained in detail why this is so.

Thanks for reading.
I hope the article was helpful.

Source: https://habr.com/ru/post/178409/

All Articles