
Trove 4.0? Primitives in the standard Java 8 collections framework

About a month ago there was an article on Habr about Trove, the library most frequently mentioned when people ask about data structures for primitives in Java. A couple of days before that, I had sat down to rewrite this library. My time for it has definitely run out, so I am sharing my work-in-progress with you, although I do not really hope that someone will carry this on.

At the moment there are 6 types of hash tables: sets of primitives, sets of objects, and all 4 variants of maps: primitive to primitive, primitive to object, object to primitive, and object to object, with a cloud of generalizing interfaces hanging above them.

I have always wondered why all such libraries create yet another type hierarchy instead of building on the long-established standard framework of Java collections. I do not see any fundamental problem with this. Therefore java.lang.Iterable , java.util.Collection and java.util.Map sit above my cloud of interfaces, like a pantheon. I deliberately linked to the Java 8 documentation: almost all methods from the upcoming interfaces are implemented, except for spliterator() . You can start getting used to them.
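
A sketch of what such integration can look like (the interface and method names here are my illustration, not actual Trove 4.0 declarations): a primitive specialization simply extends the standard interface and adds unboxed counterparts alongside the inherited boxed methods.

import java.util.Collection;

// Hypothetical shape of a primitive collection that stays inside the
// standard hierarchy: boxed methods from Collection<Character> remain
// available, unboxed counterparts are added next to them.
interface CharCollectionSketch extends Collection<Character> {
    boolean contains(char value);  // unboxed counterpart of contains(Object)
    boolean add(char value);       // unboxed counterpart of add(Character)
}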

Additional elements of the new Trove framework


mapIterator() is a hybrid of the Java 8 spliterator() and entrySet().iterator() .
interface TMapIterator<K, V> {
    boolean tryAdvance();
    K key();
    V value();
    ...
}

// usage:
for (IntKeyMapIterator<String> it = map.mapIterator(); it.tryAdvance(); ) {
    doSomething(it.intKey(), it.value());
}

This is hardly more cumbersome than entrySet().iterator() , but much more efficient.
All object hashes have overridable methods that control how objects are partitioned into equivalence classes:
class DHashSet<E> {
    ...
    protected boolean elementsEquals(@NotNull E a, @Nullable E b) {
        return a.equals(b);
    }

    protected int elementHashCode(@NotNull E element) {
        return element.hashCode();
    }
}
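
For example, a subclass can switch a set to identity semantics by overriding these two hooks (a minimal sketch on top of the DHashSet above; the nullability annotations are omitted):

// Sketch: an identity-based set, assuming the two hooks of DHashSet above.
// Elements are considered equal only if they are the same object.
class IdentityDHashSet<E> extends DHashSet<E> {
    @Override
    protected boolean elementsEquals(E a, E b) {
        return a == b;  // reference equality instead of equals()
    }

    @Override
    protected int elementHashCode(E element) {
        return System.identityHashCode(element);
    }
}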


“Reversed” methods over entire collections: addAllTo() , allContainingIn() , removeAllFrom() . They are meant to speed up bulk operations: in the standard framework, such a method is always called on the collection that will be changed, although it is the second collection that has to be iterated, and it is the latter that “knows” best how to traverse its own elements.

In some cases the reversed pass can be twice as fast as an external pass with a regular iterator, for example when dumping a set into a list.

Since the contracts of the corresponding methods in the standard framework have rather subtle nuances, it is not recommended to call the reversed methods manually, so as not to get something wrong. They are already called from inside the usual methods in the implementations.
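
A minimal sketch of this dispatch, with illustrative names rather than the actual Trove 4.0 API: the standard method checks whether its argument also belongs to the library and, if so, lets the argument drive the iteration.

import java.util.Collection;
import java.util.HashSet;

// Illustrative sketch of the "reversed" dispatch; not the real Trove code.
abstract class ReversedDispatchSketch<E> extends HashSet<E> {

    // hypothetical reversed counterpart: removes every element of this
    // collection from the given target, iterating this collection itself
    abstract boolean removeAllFrom(Collection<?> target);

    @Override
    public boolean removeAll(Collection<?> c) {
        if (c instanceof ReversedDispatchSketch) {
            // reversed pass: the argument knows how to walk its own table
            return ((ReversedDispatchSketch<?>) c).removeAllFrom(this);
        }
        return super.removeAll(c);  // fall back to the usual external pass
    }
}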

Implementation


The heart of Trove is open-addressing hash tables with double hashing for collision resolution. When you delete an item from such a table, you cannot simply mark the cell as free. Therefore, if elements are regularly removed from the table and added, it gets completely clogged with occupied and “deleted” cells, and the speed of operations drops. In the previous version of Trove, to prevent this, there was a deletion counter heuristically tied to the table's occupancy rate.

With occupancy coefficient k (from 0 to 1), it turns out that under constant deletions, just before the table is rebuilt there are k - k^2 filled cells and k^2 deleted ones; under alternating deletions and insertions of distinct random elements, k filled and (1 - k)^2 free (all values are fractions of 1).

Let us substitute numbers. With the default factor of 0.5, both cases give at most a quarter of the table in deleted cells (0.5^2 = 0.25 in the first scenario, and 1 - 0.5 - 0.25 = 0.25 in the second). Very beautiful. But with my favorite factor of 0.8, under a completely normal table usage pattern only (1 - 0.8)^2 = 4% of the cells may remain free, which is a crime.

Therefore I introduced a much simpler principle: the number of deleted cells at any moment must exceed neither the number of occupied cells (that is, the size of the table) nor the number of free ones; otherwise a rebuild occurs. In other words, removed may grow until it reaches min(size, free), and the rehash then compacts the table, dropping all the “deleted” marks:

class DoubleHashBase {
    int size, free, removed, modCount;
    ...
    final void postRemoveHook() {
        modCount++;
        // rebuild as soon as removed exceeds either size or free
        if ((--size < ++removed || removed > free) && autoRehashEnabled) {
            rehash();
        }
    }
}


The key method in a hash table implementation is element search. In a double-hashing table, the search jumps around an array, whose length is a prime number, in a near-random order, comparing the element at the current index with the sought one. An element is searched for either in order to insert it afterwards, or in order to take a value from a parallel array ( map.get() or set.contains() ). In the first case we can assume that the element is most likely not in the table yet; in the second, that it most likely is already there. In the previous implementation of Trove, both search variants first check whether the cell is empty, and only then whether it holds what we are looking for, even though in the “taking” search the very first check should be whether the current element equals the sought one (and it usually does).

This two-line change sped up IntHashSet.contains(int) by 5 ns.
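
A minimal sketch of the two probe orders, with illustrative names rather than the actual Trove source; the double-hash step is passed in ready-made, tombstone reuse on insertion is elided, and the table is assumed to always contain at least one FREE cell:

// Sketch of an int open-addressing table with FREE/REMOVED sentinels;
// REMOVED (tombstone) cells never terminate a probe chain.
class ProbeOrderSketch {
    static final int FREE = Integer.MIN_VALUE;
    static final int REMOVED = Integer.MIN_VALUE + 1;
    final int[] keys;  // length is a prime number

    ProbeOrderSketch(int[] keys) { this.keys = keys; }

    // Search "for insertion": the key is probably absent, so check FREE first.
    int indexForInsert(int key, int index, int step) {
        while (true) {
            int cur = keys[index];
            if (cur == FREE) return index;  // absent; this cell can take it
            if (cur == key) return -1;      // already present
            index = (index + step) % keys.length;  // REMOVED or other key
        }
    }

    // Search "for taking" (get/contains): the key is probably present,
    // so compare with the sought key first; this is the two-line change.
    int indexForGet(int key, int index, int step) {
        while (true) {
            int cur = keys[index];
            if (cur == key) return index;  // found: the common case
            if (cur == FREE) return -1;    // definitely absent
            index = (index + step) % keys.length;  // REMOVED or other key
        }
    }
}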

Another attempt to write benchmarks properly, with JMH


Iterating over a CharHashSet , fill factor 0.5; the table takes about 4 KB in memory. Details omitted:
@GenerateMicroBenchmark
public char charSet_simple(CharSetState state) {
    char sum = 0;
    TCharIterator it = state.iterator;
    while (it.hasNext()) {
        sum += it.nextChar();
    }
    return sum;
}

@GenerateMicroBenchmark
public char charSet_forEachLoop(CharSetState state) {
    char sum = 0;
    for (char v : state) {
        sum += v;
    }
    return sum;
}

public static class CharSum implements CharConsumer {
    public char sum;
    public void accept(char b) {
        sum += b;
    }
}

@GenerateMicroBenchmark
public char charSet_forEachFunction(CharSetState state) {
    CharSum procedure = new CharSum();
    state.set.forEach(procedure);
    return procedure.sum;
}

@GenerateMicroBenchmark
public char charSet_tryAdvance(CharSetState state) {
    char sum = 0;
    TCharIterator it = state.iterator;
    while (it.tryAdvance()) {
        sum += it.charValue();
    }
    return sum;
}

Result:
                          Mean   Mean error
charSet_forEachFunction    6.7   1.1          nsec/iter
charSet_forEachLoop       15.4   2.1          nsec/iter
charSet_simple             8.7   0.1          nsec/iter
charSet_tryAdvance         8.6   0.1          nsec/iter

As you can see, you still have to pay for the pleasure of using the shortest form, the for-each loop over java.lang.Iterable .

The “simple” method is about 30% faster than its counterpart in the previous version of Trove, where the next element was searched for twice on each iteration: once in hasNext() and again in next() . Now the index of the next element is stored in a field of the iterator class.
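
A minimal sketch of that iterator trick (illustrative, not the actual Trove iterator; 0 is used as the FREE sentinel for brevity): the scan for the next occupied cell runs once per element instead of once in hasNext() and again in next().

// Sketch: iterator over an open-addressing char table that caches the
// index of the next occupied cell in a field.
class CharTableIteratorSketch {
    static final char FREE = 0;  // simplification for the sketch
    final char[] table;
    int nextIndex = -1;

    CharTableIteratorSketch(char[] table) {
        this.table = table;
        advance();
    }

    // scan forward to the next occupied cell exactly once per element
    private void advance() {
        int i = nextIndex + 1;
        while (i < table.length && table[i] == FREE) {
            i++;
        }
        nextIndex = i;
    }

    public boolean hasNext() {
        return nextIndex < table.length;  // no scanning here
    }

    public char nextChar() {
        char value = table[nextIndex];
        advance();
        return value;
    }
}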

Why all this


As I said, I have absolutely no time to finish the library:
No lists ( CharArrayList and the like)
Serializable support is temporarily removed from all classes
There is almost no documentation, just a little copy-paste from the Java 8 docs.

But since the new framework is compatible with the standard one, it is very easy to try it out on real-world tasks, especially if you prudently wrote:
 Map<Integer, Long> map = new HashMap<>(); 
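
Swapping implementations then comes down to changing one line. The class name below is hypothetical, purely to show the shape of the change:

import java.util.HashMap;
import java.util.Map;

// before: boxed keys and values behind the standard interface
Map<Integer, Long> boxed = new HashMap<>();
// after: a primitive-specialized map that still implements java.util.Map
// (IntLongDHashMap is a hypothetical name used for illustration)
Map<Integer, Long> unboxed = new IntLongDHashMap();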

Although I have not tried to compile the new library with anything other than Java 7u25, it should be source-compatible with Java 6 (and buildable under it).

The repository includes a very detailed IntelliJ development setup: bitbucket.org/leventov/trove/src

I would be glad to hear your thoughts on the design of such libraries and on hash table implementation.

Follow the link to the original for the (expected) discussion in English.

Source: https://habr.com/ru/post/192186/

