Hello! In this post I will show how to speed up the creation of org.springframework.util.ConcurrentReferenceHashMap with very little effort.
Interested in improving performance? Welcome!
We begin, of course, with measurements, trying to understand exactly what we are going to improve. For this we take JMH 1.21, JDK 8 and JDK 11, as well as async-profiler.
To find out how long it takes to create an empty map, let's set up a simple experiment:
@Benchmark
public Object original() {
    return new ConcurrentReferenceHashMap();
}
The profile looks like this:
55.21%  2429743  o.s.u.ConcurrentReferenceHashMap.calculateShift
20.30%   891404  o.s.u.ConcurrentReferenceHashMap$Segment.<init>
 8.79%   387198  o.s.u.ConcurrentReferenceHashMap.<init>
 3.35%   147651  java.util.concurrent.locks.ReentrantLock.<init>
 2.34%   102804  java.lang.ref.ReferenceQueue.<init>
 1.61%    70748  o.s.u.ConcurrentReferenceHashMap.createReferenceManager
 1.53%    67265  o.s.u.ConcurrentReferenceHashMap$Segment.createReferenceArray
 0.78%    34493  java.lang.ref.ReferenceQueue$Lock.<init>
 0.76%    33546  o.s.u.ConcurrentReferenceHashMap$ReferenceManager.<init>
 0.36%    15948  o.s.u.Assert.isTrue
The direction is clear, so we can proceed. As the profile shows, the lion's share of the time is spent in the calculateShift method. Here it is:
protected static int calculateShift(int minimumValue, int maximumValue) {
    int shift = 0;
    int value = 1;
    while (value < minimumValue && value < maximumValue) {
        value <<= 1;
        shift++;
    }
    return shift;
}
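To get a feel for what the method computes: it returns the number of left shifts needed for 1 to reach minimumValue, capped by maximumValue; in other words, the exponent of the smallest power of two that is at least minimumValue. A minimal standalone sketch (the method body is copied verbatim, the demo class is mine):

```java
public class CalculateShiftDemo {
    // Verbatim copy of the method under discussion, for illustration
    static int calculateShift(int minimumValue, int maximumValue) {
        int shift = 0;
        int value = 1;
        while (value < minimumValue && value < maximumValue) {
            value <<= 1;
            shift++;
        }
        return shift;
    }

    public static void main(String[] args) {
        // 16 is already a power of two: 1 << 4 = 16
        System.out.println(calculateShift(16, 1 << 30)); // 4
        // 17 rounds up to the next power of two: 1 << 5 = 32
        System.out.println(calculateShift(17, 1 << 30)); // 5
        // the second argument caps the result: here shifting stops at 16
        System.out.println(calculateShift(1 << 20, 16)); // 4
    }
}
```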
It is hard to improve anything in the method itself, so let's look at how it is used instead:
public ConcurrentReferenceHashMap(/*...*/ int concurrencyLevel, /*...*/) {
    //...
    this.shift = calculateShift(concurrencyLevel, MAXIMUM_CONCURRENCY_LEVEL);
    //...
}

// ConcurrentReferenceHashMap$Segment
public Segment(int initialCapacity) {
    this.referenceManager = createReferenceManager();
    this.initialSize = 1 << calculateShift(initialCapacity, MAXIMUM_SEGMENT_SIZE);
    this.references = createReferenceArray(this.initialSize);
    this.resizeThreshold = (int) (this.references.length * getLoadFactor());
}
Note how the Segment constructor is invoked:
int roundedUpSegmentCapacity = (int) ((initialCapacity + size - 1L) / size);
//...
for (int i = 0; i < this.segments.length; i++) {
    this.segments[i] = new Segment(roundedUpSegmentCapacity);
}
The value of roundedUpSegmentCapacity is constant throughout the loop, so the expression 1 << calculateShift(initialCapacity, MAXIMUM_SEGMENT_SIZE) executed in the Segment constructor will also always yield the same value. We can therefore move it out of the constructor and the loop.
The same holds for the expression (int) (this.references.length * getLoadFactor()): the references array is created from the initialCapacity argument, so its length is the same for every segment. This expression, too, can be hoisted out of the constructor and the loop.
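Putting both observations together, the hoisting idea can be sketched as follows. This is a simplified illustration of the approach, not the actual Spring patch; the segmentParams helper and its int[] return value are hypothetical, introduced only to keep the demo self-contained:

```java
public class HoistDemo {
    static final int MAXIMUM_SEGMENT_SIZE = 1 << 30;

    // Verbatim copy of calculateShift, needed to make the sketch runnable
    static int calculateShift(int minimumValue, int maximumValue) {
        int shift = 0;
        int value = 1;
        while (value < minimumValue && value < maximumValue) {
            value <<= 1;
            shift++;
        }
        return shift;
    }

    // Returns {initialSize, resizeThreshold}, computed once: both values are
    // identical for every segment, so there is no need to recompute them in
    // each Segment constructor call inside the loop.
    static int[] segmentParams(int initialCapacity, int segmentCount, float loadFactor) {
        int roundedUpSegmentCapacity = (int) ((initialCapacity + segmentCount - 1L) / segmentCount);
        int initialSize = 1 << calculateShift(roundedUpSegmentCapacity, MAXIMUM_SEGMENT_SIZE);
        int resizeThreshold = (int) (initialSize * loadFactor);
        return new int[] {initialSize, resizeThreshold};
    }

    public static void main(String[] args) {
        // e.g. a capacity of 64 spread over 16 segments: 4 slots per segment,
        // resize threshold 3 with a load factor of 0.75
        int[] params = segmentParams(64, 16, 0.75f);
        System.out.println(params[0] + " " + params[1]); // 4 3
    }
}
```

With such values precomputed, the loop only passes ready-made constants into each Segment.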
Now consider the createReferenceArray method:
private Reference<K, V>[] createReferenceArray(int size) {
    return (Reference<K, V>[]) Array.newInstance(Reference.class, size);
}
Using Array::newInstance here is clearly redundant: nothing prevents us from creating the array with an ordinary array-creation expression:
private Reference<K, V>[] createReferenceArray(int size) {
    return new Reference[size];
}
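The two allocations are interchangeable in result; the difference lies only in how the array is created (a reflective call versus a plain bytecode-level allocation). A self-contained comparison, using String instead of the map's Reference type for brevity:

```java
import java.lang.reflect.Array;

public class ArrayCreationDemo {
    public static void main(String[] args) {
        // Reflective allocation, as in the original createReferenceArray
        String[] viaReflection = (String[]) Array.newInstance(String.class, 10);
        // Plain allocation, as in the patched version
        String[] viaConstructor = new String[10];
        // Identical length and identical runtime type either way
        System.out.println(viaReflection.length == viaConstructor.length);         // true
        System.out.println(viaReflection.getClass() == viaConstructor.getClass()); // true
    }
}
```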
The plain allocation is not inferior in performance to the Array::newInstance call at the C2 level, but significantly outperforms it for small arrays in C1 mode (-XX:TieredStopAtLevel=1) and in the interpreter (-Xint):
// C2
             length  Mode  Cnt   Score     Error  Units
constructor      10  avgt   50     5.6 ±    0.0   ns/op
constructor     100  avgt   50    29.7 ±    0.1   ns/op
constructor    1000  avgt   50   242.7 ±    1.3   ns/op
newInstance      10  avgt   50     5.5 ±    0.0   ns/op
newInstance     100  avgt   50    29.7 ±    0.1   ns/op
newInstance    1000  avgt   50   249.3 ±    9.6   ns/op

// C1
             length  Mode  Cnt   Score     Error  Units
constructor      10  avgt   50     6.8 ±    0.1   ns/op
constructor     100  avgt   50    36.3 ±    0.6   ns/op
constructor    1000  avgt   50   358.6 ±    6.4   ns/op
newInstance      10  avgt   50    91.0 ±    2.4   ns/op
newInstance     100  avgt   50   127.2 ±    1.8   ns/op
newInstance    1000  avgt   50   322.8 ±    7.2   ns/op

// -Xint
             length  Mode  Cnt   Score     Error  Units
constructor      10  avgt   50   126.3 ±    5.9   ns/op
constructor     100  avgt   50   154.7 ±    2.6   ns/op
constructor    1000  avgt   50   364.2 ±    6.2   ns/op
newInstance      10  avgt   50   251.2 ±   11.3   ns/op
newInstance     100  avgt   50   287.5 ±   11.4   ns/op
newInstance    1000  avgt   50   486.5 ±    8.5   ns/op
The replacement will not show up in our benchmark, but it speeds up the code at application startup, before C2 has kicked in. More about this mode at the end of the article.
Let's return to the ConcurrentReferenceHashMap constructor:
ConcurrentReferenceHashMap(/*...*/) {
    Assert.isTrue(initialCapacity >= 0, "Initial capacity must not be negative");
    Assert.isTrue(loadFactor > 0f, "Load factor must be positive");
    Assert.isTrue(concurrencyLevel > 0, "Concurrency level must be positive");
    Assert.notNull(referenceType, "Reference type must not be null");
    this.loadFactor = loadFactor;
    this.shift = calculateShift(concurrencyLevel, MAXIMUM_CONCURRENCY_LEVEL);
    int size = 1 << this.shift;
    this.referenceType = referenceType;
    int roundedUpSegmentCapacity = (int) ((initialCapacity + size - 1L) / size);
    this.segments = (Segment[]) Array.newInstance(Segment.class, size);
    for (int i = 0; i < this.segments.length; i++) {
        this.segments[i] = new Segment(roundedUpSegmentCapacity);
    }
}
What is interesting for us here: replacing Array.newInstance with a plain array creation leads to a compilation error (generic array creation), so we pass it by. But the loop is very curious, more precisely, the access to the segments field. To see how destructive (at times) such an access can be for performance, I recommend Nitsan Wakart's article The volatile read surprise. The case described in that article seems to match the code in question. Let's focus on the segments:
this.segments = (Segment[]) Array.newInstance(Segment.class, size);
for (int i = 0; i < this.segments.length; i++) {
    this.segments[i] = new Segment(roundedUpSegmentCapacity);
}
Immediately after creation, the array is written to the ConcurrentReferenceHashMap.segments field, and it is this field that the loop interacts with. Inside the Segment constructor there is a write to the volatile field references:
private volatile Reference<K, V>[] references;

public Segment(int initialCapacity) {
    //...
    this.references = createReferenceArray(this.initialSize);
    //...
}
This means the access to the segments field cannot be optimized away; in other words, its contents are re-read on every iteration of the loop. How can we verify this claim? The simplest way is to copy the code into a separate package and remove volatile from the declaration of the Segment.references field:
protected final class Segment extends ReentrantLock {
    // private volatile Reference<K, V>[] references;
    private Reference<K, V>[] references;
}
Then check whether anything has changed:
@Benchmark
public Object original() {
    return new tsypanov.map.original.ConcurrentReferenceHashMap();
}

@Benchmark
public Object nonVolatileSegmentReferences() {
    return new tsypanov.map.nonvolatile.ConcurrentReferenceHashMap();
}
We observe a significant performance gain (JDK 8):
Benchmark                     Mode  Cnt  Score    Error  Units
original                      avgt  100  732.1 ±  15.8  ns/op
nonVolatileSegmentReferences  avgt  100  610.6 ±  15.4  ns/op
On JDK 11 the absolute times decreased, but the relative gap stayed about the same:
Benchmark                     Mode  Cnt  Score    Error  Units
original                      avgt  100  473.8 ±  11.2  ns/op
nonVolatileSegmentReferences  avgt  100  401.9 ±  15.5  ns/op
Of course, volatile has to be put back, and another way must be found. The bottleneck has been identified: it is the field access inside the loop. And if so, we can create a local segments variable, fill the array, and only then write it to the field:
Segment[] segments = (Segment[]) Array.newInstance(Segment.class, size);
for (int i = 0; i < segments.length; i++) {
    segments[i] = new Segment(roundedUpSegmentCapacity);
}
this.segments = segments;
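The same "fill a local, publish once" idiom in a self-contained form. Holder here is a made-up stand-in for ConcurrentReferenceHashMap, introduced only for this sketch:

```java
public class PublishOnceDemo {
    static class Holder {
        final int[] data;

        Holder(int n) {
            // Fill a cheap-to-access local variable first...
            int[] data = new int[n];
            for (int i = 0; i < data.length; i++) {
                data[i] = i * i; // no field re-read on each iteration
            }
            // ...and publish it to the field with a single write
            this.data = data;
        }
    }

    public static void main(String[] args) {
        System.out.println(new Holder(4).data[3]); // 9
    }
}
```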
As a result, even such simple improvements give quite a good gain:
JDK 8
Benchmark                           Mode  Cnt  Score    Error  Units
originalConcurrentReferenceHashMap  avgt  100  712.1 ±   7.2  ns/op
patchedConcurrentReferenceHashMap   avgt  100  496.5 ±   4.6  ns/op
JDK 11
Benchmark                           Mode  Cnt  Score    Error  Units
originalConcurrentReferenceHashMap  avgt  100  536.0 ±   8.4  ns/op
patchedConcurrentReferenceHashMap   avgt  100  486.4 ±   9.3  ns/op
When launching Spring Boot applications from IDEA, developers often set the 'Enable launch optimizations' flag, which adds -XX:TieredStopAtLevel=1 -noverify to the VM arguments and speeds up startup by disabling profiling and C2. Let's measure with these arguments:
// JDK 8 -XX:TieredStopAtLevel=1 -noverify
Benchmark                           Mode  Cnt   Score    Error  Units
originalConcurrentReferenceHashMap  avgt  100  1920.9 ±  24.2  ns/op
patchedConcurrentReferenceHashMap   avgt  100   592.0 ±  25.4  ns/op

// JDK 11 -XX:TieredStopAtLevel=1 -noverify
Benchmark                           Mode  Cnt   Score    Error  Units
originalConcurrentReferenceHashMap  avgt  100  1838.9 ±   8.0  ns/op
patchedConcurrentReferenceHashMap   avgt  100   549.7 ±   6.7  ns/op
More than a 3-fold speed-up!
Among other things, this helps speed up queries that return projections in Spring Data JPA. A JMC profile shows that creating ConcurrentReferenceHashMap takes almost a fifth of the time needed to execute a query such as:
public interface SimpleEntityRepository extends JpaRepository<SimpleEntity, Long> {
    List<HasIdAndName> findAllByName(String name);
}
where HasIdAndName is a projection interface:
public interface HasIdAndName {
    int getId();
    String getName();
}
Besides, ConcurrentReferenceHashMap is used in dozens of places across the Spring codebase, so the improvement will certainly not be superfluous.
Changes:
https://github.com/spring-projects/spring-framework/pull/1873
https://github.com/spring-projects/spring-framework/pull/2051
Source: https://habr.com/ru/post/432824/