
Algorithms and JDK data structures

While periodically checking whether the JDK already contains an implementation of one standard algorithm or another, I got the idea of putting together a review like this. The reasons behind the presence or absence of many well-known data structures also seemed interesting.
The format of the review: only the key properties and features of the structures and algorithms in the JDK; the details are spelled out in the javadoc or are easy to find in the sources.
I am hoping for constructive criticism and collective intelligence wherever I have missed something.
Enough intro; let's see what the current JDK 7 includes, and why.

Structures


Stack

The JDK contains an old stack (Stack) that has existed since the first release of Java and is no longer recommended for use; it is awkward and strange: it inherits from Vector, which means it is built on a dynamically growing array and is synchronized.
Why an ordinary stack needs all of that, and why it is a class rather than an interface, is not entirely clear (it has been discussed more than once: 1, 2), but it looks like a design mistake, the same as Vector itself.
The javadoc itself also advises using Deque instead.

Deque is the interface (API) of a double-ended queue (LIFO + FIFO in O(1)), which includes the stack operations (push, pop, isEmpty, size). It has been available in the JDK for not too long (1.6+).
Of course, it would have been cleaner if these stack operations lived in a Stack interface which Deque, say, inherited, but since the Stack class already existed, and backward compatibility is sacred in Java, proper design had to be sacrificed instead.
The Deque implementations are ArrayDeque and LinkedList, which also implement the ordinary queue, so we will look at them below.
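As a quick illustration, using Deque as a replacement for the legacy Stack might look like this (a minimal sketch; the class and method names are mine):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class DequeStackDemo {
    // Pushes the given items and pops them all back, returning the pop order.
    public static String popOrder(int... items) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (int i : items) {
            stack.push(i);          // O(1), no synchronization overhead
        }
        StringBuilder sb = new StringBuilder();
        while (!stack.isEmpty()) {
            sb.append(stack.pop()); // LIFO: the last pushed comes out first
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(popOrder(1, 2, 3)); // 321
    }
}
```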
Queues

Next in line are the queues. Everything is fine here; the design is decent.
Queue is the interface (API) of a FIFO queue: elements are added at one end and removed from the other, both in O(1).

The main implementations are ArrayDeque, a circular buffer on top of a dynamically growing array (doubled when full), and LinkedList, a classic doubly linked list, unbounded in size. Oddly, the first does not support random access (add/remove/get by index) while the second does, although each indexed operation costs an O(n) walk through the linked list.
The same implementations also provide the double-ended queue mentioned above, which means removal from one end and insertion at the other are O(1) as well.
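For example, the basic FIFO contract (a trivial sketch; names are mine):

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class QueueDemo {
    // Offers items in order and polls them all back: FIFO order is preserved.
    public static String fifoOrder(String... items) {
        Queue<String> q = new ArrayDeque<>(); // circular buffer, grows on demand
        for (String s : items) {
            q.offer(s);          // enqueue at the tail, amortized O(1)
        }
        StringBuilder sb = new StringBuilder();
        while (!q.isEmpty()) {
            sb.append(q.poll()); // dequeue from the head, O(1)
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fifoOrder("a", "b", "c")); // abc
    }
}
```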

Then, in JDK 1.5+, PriorityQueue was added, which in fact violates the queue contract, because elements are taken not in insertion order but according to their priorities.
It is built on a growable binary heap with the minimum element (according to its comparator) at the top; when full, it grows by half. Accordingly, add/remove are O(log N), and peeking at the minimum (the head) is O(1).
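A small sketch of that behavior (class and method names are mine):

```java
import java.util.PriorityQueue;

public class PriorityQueueDemo {
    // Drains a PriorityQueue: elements come out in priority order,
    // not in insertion order (insertion was 5, 1, 3).
    public static String drain() {
        PriorityQueue<Integer> pq = new PriorityQueue<>();
        pq.offer(5);              // each offer is O(log N)
        pq.offer(1);
        pq.offer(3);
        StringBuilder sb = new StringBuilder();
        while (!pq.isEmpty()) {
            sb.append(pq.poll()); // always removes the current minimum
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(drain()); // 135
    }
}
```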

The remaining queue types are intended for multithreaded use: BlockingQueue, TransferQueue, ConcurrentLinkedQueue, and ConcurrentLinkedDeque.

The BlockingQueue implementations (ArrayBlockingQueue, LinkedBlockingQueue, PriorityBlockingQueue) are, in effect, synchronized versions of their originals, i.e. almost every operation is performed under a lock (and may block). DelayQueue belongs here as well: also synchronized, with a PriorityQueue inside.
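A minimal producer/consumer sketch with a bounded ArrayBlockingQueue (the class name and numbers are mine, purely for illustration):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BlockingQueueDemo {
    // One producer thread hands three numbers to the main (consumer) thread.
    public static int consumeSum() throws InterruptedException {
        final BlockingQueue<Integer> q = new ArrayBlockingQueue<>(2); // bounded
        Thread producer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 1; i <= 3; i++) {
                        q.put(i); // blocks while the queue is full
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        producer.start();
        int sum = 0;
        for (int i = 0; i < 3; i++) {
            sum += q.take(); // blocks while the queue is empty
        }
        producer.join();
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(consumeSum()); // 6
    }
}
```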

SynchronousQueue, TransferQueue (LinkedTransferQueue), ConcurrentLinkedQueue, and ConcurrentLinkedDeque, on the other hand, are based on a different approach: non-blocking queue algorithms over linked lists built on CAS instructions, which parallelize well on multiprocessor machines. A detailed description is in the sources.
This class of algorithms is a rather extensive and new body of material, not yet fully standardized and systematized, so it is beyond the scope of this review and more of a topic for a separate article.

Priority queues (heaps)

As already mentioned, since 1.5 there is the general-purpose PriorityQueue; besides it there is one more heap implementation in the JDK. It is the good old Timer with its internal TaskQueue, a classic binary heap (the task with the smallest delay sits on top). Naturally, it is a closed implementation and cannot be used outside the timer.

Lists

As is well known, lists come with sequential and/or random access.
In Java the list is List, with two main implementations. The first, ArrayList, supports random access and is based on a dynamically growing array (it grows by half when full), but it does not shrink on its own when elements are removed; you have to call the method provided for that (trimToSize).

The second is the already mentioned LinkedList, a doubly linked list with sequential access, limited in size only by free JVM memory. Random-access methods (by index) are present here as well, but, as already mentioned, they run in O(n).
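A sketch contrasting the two (both support get by index; only the cost differs — names are mine):

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ListDemo {
    // get(i) works on both implementations, but costs O(1) on ArrayList
    // and an O(n) traversal on LinkedList.
    public static int middleOf(List<Integer> list) {
        for (int i = 0; i < 5; i++) {
            list.add(i * 10);
        }
        return list.get(2); // random access by index
    }

    public static void main(String[] args) {
        System.out.println(middleOf(new ArrayList<Integer>()));  // 20, O(1)
        System.out.println(middleOf(new LinkedList<Integer>())); // 20, O(n)
    }
}
```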

For some reason, there is no plain singly linked list in the Java collections, although it would probably not hurt (half the per-node overhead for links), just as there is no simple stack.

For using lists in a multithreaded environment there are CopyOnWriteArrayList (mutating operations cost O(n)), a wrapper (Collections.synchronizedList), and also the outdated Vector.

Symbol tables

Represented by a binary tree and a hash table.

The tree is TreeMap (TreeSet), the implementation of SortedMap (SortedSet, respectively); it is based on the classic red-black tree, i.e. it is balanced, the basic operations are guaranteed O(log N), and the size is unbounded.
There are no other kinds of trees in the JDK.
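A small sketch of what the sorted order buys you (class and method names are mine):

```java
import java.util.TreeMap;

public class TreeMapDemo {
    // Keys in a TreeMap are kept sorted by the red-black tree,
    // regardless of insertion order.
    public static String keyOrder() {
        TreeMap<String, Integer> map = new TreeMap<>();
        map.put("pear", 2);   // each put is O(log N)
        map.put("apple", 1);
        map.put("quince", 3);
        return map.firstKey() + ".." + map.lastKey(); // both O(log N)
    }

    public static void main(String[] args) {
        System.out.println(keyOrder()); // apple..quince
    }
}
```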

The hash table, HashMap (HashSet), is probably the most used structure in Java. It is built on a dynamically growing hash table with separate chaining, with all the consequences: performance depends on the hash function, and the worst case is O(N). When the fill ratio reaches the specified loadFactor, the table doubles in size. It is also worth noting that, as protection against bad hash functions, a supplemental hash is applied: some clever bit arithmetic is performed on the result of the hashCode() call.
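For example (a trivial sketch; the capacity and load factor shown are simply the documented defaults):

```java
import java.util.HashMap;
import java.util.Map;

public class HashMapDemo {
    // Expected O(1) put/get; the initial capacity and load factor
    // control when the table doubles.
    public static Integer lookup() {
        Map<String, Integer> map = new HashMap<>(16, 0.75f); // the defaults
        map.put("one", 1);
        map.put("two", 2);
        map.put("one", 11); // same key: the value is replaced, size stays 2
        return map.get("one");
    }

    public static void main(String[] args) {
        System.out.println(lookup()); // 11
    }
}
```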

The JDK also has hash table implementations with open addressing (linear probing). One of them is IdentityHashMap, which additionally uses a classic linear-probing optimization: keys and values are stored next to each other in the same array, for better data caching (javadoc: "better locality for large tables" than a two-array layout).
The second implementation is very specific: it serves only to store ThreadLocal values (the internal, hidden ThreadLocalMap, again a classic linear-probing table) and is certainly not available for general use.

The multithreaded versions are ConcurrentHashMap, Collections.synchronizedMap, Hashtable, and ConcurrentSkipListMap. The wrapper is, naturally, just a blocking version of a regular HashMap, and Hashtable is the same; ConcurrentHashMap is a lock-striping version that shrinks the critical sections (better to read about this in JCiP; here is an excerpt).
ConcurrentSkipListMap is a non-blocking sorted map based on the structure of the same name, the skip list (for more details, see the source code).

Sets (collections that do not allow duplicates) are implemented on top of the corresponding maps, so everything said about the hash tables and trees holds for Set as well.

Graphs

Graph structures and algorithms are not represented in the JDK at all, so here only third-party libraries remain.

Strings

The standard implementation is based on an array of Unicode characters. It is worth recalling that since version 1.7.0_17 substring runs in O(n), because the substring's characters are now copied.

It is interesting that substring search uses the simple brute-force algorithm, O(N*M) in the worst case, rather than some efficient algorithm built on a state machine (Knuth-Morris-Pratt, etc.).
There are several reasons. First, once again, the large alphabet of UTF characters (~65K), which means a large overhead for storing the automaton, whereas the brute-force algorithm works in place (no additional memory is used).
Second, performance on average strings: here, apparently, the simple search loses little to the other algorithms.
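The brute-force scheme can be sketched like this (a simplified model for illustration, not the actual JDK code):

```java
public class NaiveSearch {
    // Returns the index of the first occurrence of pattern in text, or -1.
    // Worst case O(N*M), but no extra memory and fast on typical input.
    public static int indexOf(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m && text.charAt(i + j) == pattern.charAt(j)) {
                j++;
            }
            if (j == m) {
                return i; // full match starting at position i
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("abracadabra", "cad")); // 4
    }
}
```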

The same goes for sorting. There are efficient string sorts based on counting (LSD, MSD, etc.), but the JDK uses the standard object sort, giving O(M*N*log N) when most of the strings differ little (M is the average string length).
The reason is the same: counting-based sorts use auxiliary arrays the size of the UTF alphabet, which makes them inefficient on average input.

Algorithms


Sorting

In JDK 7 there were many changes around the sorting options; the topic has been discussed repeatedly, there is no shortage of information and articles about it, and for more detail I can recommend this review.
In short, the current list of sorting implementations in the JDK: TimSort, the default sort for objects; merge sort, also for objects, the old version (enabled through a system property); Dual-Pivot Quicksort for primitives; counting sort for byte/char arrays; and, in all cases, insertion sort for small arrays.
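For illustration, both entry points (a minimal sketch; the helper names are mine):

```java
import java.util.Arrays;

public class SortDemo {
    // Primitives go through Dual-Pivot Quicksort.
    public static int[] sortedPrimitives() {
        int[] a = {5, 1, 4, 2, 3};
        Arrays.sort(a);
        return a;
    }

    // Objects go through TimSort (a stable sort).
    public static String[] sortedObjects() {
        String[] a = {"pear", "apple", "quince"};
        Arrays.sort(a);
        return a;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortedPrimitives())); // [1, 2, 3, 4, 5]
        System.out.println(Arrays.toString(sortedObjects()));    // [apple, pear, quince]
    }
}
```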

Sorting collections, Collections.sort(List...), is still done by copying into an array, sorting it, and then rewriting the collection. So there is no way around these standard costs, although an in-place merge sort of linked lists would probably not hurt.

The object version is also used for sorting strings, for the reasons mentioned above.

Search

Traditional binary search is present for all arrays of primitives and objects as well as lists supporting random access.

Moreover, Collections also has a version for linked lists. It turns out that sorting linked lists was not considered worthwhile, yet binary search was, even though it makes even less sense there: with sequential access, its O(log N) performance is nowhere to be found.
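A quick reminder of the binarySearch contract, including the encoded insertion point for a missing key:

```java
import java.util.Arrays;

public class SearchDemo {
    public static void main(String[] args) {
        int[] sorted = {10, 20, 30, 40, 50};
        // Found: returns the index of the key.
        System.out.println(Arrays.binarySearch(sorted, 30)); // 2
        // Not found: returns -(insertionPoint + 1); 35 would go at index 3.
        System.out.println(Arrays.binarySearch(sorted, 35)); // -4
    }
}
```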

Regular expressions

The traditional implementation based on a non-deterministic finite automaton (NFA) with backtracking is used, i.e. exponential complexity, O(m^N), in the worst case on degenerate input, for example:

"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa".matches("(a|aa)*b") 


It also has so-called "ordered alternation": the search stops right after the first of several alternatives matches, not after the most specific (longest) one.
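A small demonstration (my own example): with the pattern a|ab the engine commits to the first alternative that matches, even though a longer one is possible:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationDemo {
    // With ordered alternation, "a|ab" matches the FIRST alternative
    // that succeeds, not the longest one.
    public static String firstMatch() {
        Matcher m = Pattern.compile("a|ab").matcher("ab");
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch()); // a, although ab also matches
    }
}
```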

Hash functions, checksums

There are as many as six hashCode computation algorithms in the JDK; by default the Park-Miller random number generator is used; for more detail there is a more recent article on Habr.

There are also the standard industrial hashing algorithms (SHA-*, MD5 and variations): MessageDigest.
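For example, hashing a string with MessageDigest (a minimal sketch; the hex-encoding helper is mine):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestDemo {
    // Computes the MD5 digest of a string as a lowercase hex string.
    public static String md5Hex(String input) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(input.getBytes());
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b)); // two hex digits per byte
        }
        return hex.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(md5Hex("abc")); // 900150983cd24fb0d6963f7d28e17f72
    }
}
```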

For computing checksums there are also implementations of the Adler-32 (javadoc) and CRC32 (javadoc) algorithms.
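For example, both checksums over the classic test input "123456789" (class and method names are mine; 0xCBF43926 is the well-known CRC-32 check value for this input):

```java
import java.util.zip.Adler32;
import java.util.zip.CRC32;

public class ChecksumDemo {
    public static long crc32Of(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static long adler32Of(byte[] data) {
        Adler32 adler = new Adler32();
        adler.update(data);
        return adler.getValue();
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes();
        System.out.println(Long.toHexString(crc32Of(data)));   // cbf43926
        System.out.println(Long.toHexString(adler32Of(data))); // 91e01de
    }
}
```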

Compression

The JDK has an implementation of the standard compression algorithm deflate (Deflater) and, based on it, zip/gzip. All of this lives in java.util.zip.
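A minimal compress/decompress round trip with Deflater/Inflater (the buffer size and names are mine):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflateDemo {
    // Compresses a byte array with deflate and inflates it back.
    public static byte[] roundTrip(byte[] input) throws Exception {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            compressed.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();

        Inflater inflater = new Inflater();
        inflater.setInput(compressed.toByteArray());
        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        while (!inflater.finished()) {
            restored.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return restored.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String text = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa compresses well";
        byte[] back = roundTrip(text.getBytes());
        System.out.println(new String(back).equals(text)); // true
    }
}
```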

Conclusion


As you can see, the classic data structures are far from fully represented in Java, but at the same time there are several thread-safe variants of almost every one of them.
What exactly is missing is an open question. One can argue, for example, about whether some Union-Find belongs in the JDK, but the absence of graph structures and algorithms, in the age of social networks, in a language that claims to be the most universal, is surprising and condemns us to bugs and reinvented wheels.

Source: https://habr.com/ru/post/182776/

