
Sorting... with a hash table (and also counting with a tree and a HashMap)

Three days ago I was thinking about combining counting sort with a tree. After discussing it with a colleague, we came up with the following solution: instead of a TreeSet, use a HashMap (what a TreeSet has to do with it will become clear below). But even that did not seem enough to me, so I decided to implement my own hash table and see what would come of it. The results looked quite interesting to me.



All three of these sorts are a great fit when the cardinality of the set of unique elements in the array is relatively small (what exactly "relatively small" means will become clear once we look at the test results).



Counting sort with a tree



We build a tree of pairs (Key, Count), where Key is an element of the array and Count is the number of times that element repeats. The tree is naturally balanced: a red-black tree, for example.



The rest is straightforward. For each element of the array we search the tree for its pair (to avoid creating an object per lookup, we reuse one pre-built pair and simply overwrite its Key; the Count does not matter here, since pairs are matched by Key alone). If the Key is already in the tree, we increment its Count; otherwise we add a new pair (Key, 1).


Then we rewrite the array: each time we remove the smallest node and write its Key out as many times as its Count.



To avoid implementing the tree itself, I used a TreeSet for this algorithm, which is backed by a red-black tree.
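
For reference, the same idea fits in a few lines with a TreeMap (also a red-black tree underneath). This is only a sketch of the approach just described, not my final version: usatovSort at the end of the article avoids creating an object per lookup by reusing a single pair.

import java.util.Map;
import java.util.TreeMap;

static void treeCountingSort(Integer[] arr) {
    // Key = array element, value = how many times it occurs.
    TreeMap<Integer, Integer> counts = new TreeMap<>();
    for (Integer i : arr)
        counts.merge(i, 1, Integer::sum); // insert (Key, 1) or bump the count
    int ptr = 0;
    // entrySet() iterates the keys in ascending order.
    for (Map.Entry<Integer, Integer> e : counts.entrySet())
        for (int j = 0; j < e.getValue(); j++)
            arr[ptr++] = e.getKey();
}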



Using a HashMap



We store the array element as the key, and the number of its repetitions as the value.



Using my own hash table



I went with open addressing. The hash table stores objects of the class MyPair, where first is the key and second is the number of repetitions. Naturally, I added a constructor that takes the initial table size and the load factor. The hash function is the simplest possible: the key modulo the table size, and on a collision we step forward by one. A step of one suits us because the resulting array of keys comes out almost ordered, and at the end we have to sort all the keys with the standard quicksort anyway, since the table may leave them out of order. So you could say this is a union of hashing and one more sort.
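
To see why the keys come out only almost ordered, consider wrap-around: a key whose probe sequence runs past the end of the array lands at the front. A tiny standalone illustration (mine, with the same index arithmetic as the MyHashTable code at the end):

public class WrapAroundDemo {
    public static void main(String[] args) {
        int m = 8;                        // table length
        Integer[] table = new Integer[m];
        for (int key : new int[]{7, 15}) {
            int idx = key % m;            // simplest hash: key mod table size
            while (table[idx] != null)    // linear probing with step 1
                idx = (idx + 1) % m;      // 15 collides with 7 and wraps to index 0
            table[idx] = key;
        }
        // A left-to-right scan yields 15 before 7: almost ordered, not ordered,
        // which is why the extracted keys still need one final sort.
        for (Integer k : table)
            if (k != null)
                System.out.println(k);
    }
}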



Collisions



Obviously, collisions can occur, and they hurt both the lookup time and the speed of sorting the final array of keys (ordered or nearly ordered data sorts faster). But here is the thing: our hash function changes every time the table grows, so deliberately crafting data against it becomes much harder. Moreover, you can randomize (within some range, of course) the load factor and the growth factor, which makes it impossible to pre-select values that lead to collisions. And if some data caused collisions in one table, then after rehashing the number of collisions can drop significantly (several times over). Factorials would be a weak point, but there are remarkably few of them (below 2^31, for example, there are only 11, not counting 0).
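
A minimal sketch of that randomization, using the MyHashTable(capacity, loadFactor, expansivity) constructor from the implementation at the end; the specific ranges here are my choice for illustration:

import java.util.concurrent.ThreadLocalRandom;

// Draw the load factor and the growth factor from small random ranges,
// so collision-heavy inputs cannot be precomputed for a known table size.
static MyHashTable randomizedTable() {
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    double loadFactor  = rnd.nextDouble(0.6, 0.8);  // range is an assumption
    double expansivity = rnd.nextDouble(1.8, 2.5);  // range is an assumption
    return new MyHashTable(20, loadFactor, expansivity);
}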



We will not go for a cryptographic hash function: with one of those, we could forget about an almost ordered array of hashes (ordered by key, that is).



Running time



Counting sort with a tree: O(n log n) in the worst case, linear time in the best.

Sorting with the hash table: hashing time plus the complexity of sorting the array of keys (it varies with the chosen algorithm and implementation, so I will not pin down a figure). The complexity is hard to estimate because of the hash function and the different possible approaches to sorting the keys; this question needs further study. It is clear, though, that in the worst case (input data deliberately crafted against the specific hash function at a specific stage of execution) the running time degrades to O(n^2).



What about memory?



Counting sort with a tree requires O(distinct(n)) additional memory.

The hash table needs O(distinct(n)) memory, plus the memory required to sort the keys (depending on the chosen algorithm).



Here are the results in milliseconds that I got on random numbers when sorting an array of 10 million Integer objects with values in the range [0; x]:



In the tests, my hash table uses a load factor of 0.75 and an initial capacity of 20, and the table doubles each time it grows.



When x = 10:

2044 - built-in

285 - counting tree (Usatov sort)

276 - HashMap (Usatov-Prokurat sort)

140 - my hash table (Usatov-Prokurat sort using MyHashTable)



x = 100:

2406 - built-in

455 - counting-tree

283 - HashMap

134 - hash table



x = 1_000:

2171 - built-in

930 - counting-tree

380 - HashMap

209 - hash table



x = 10_000

2879 - built-in

1666 - counting-tree

634 - HashMap

326 - hash table



x = 100_000

4045 - built-in

2899 - counting-tree

866 - HashMap

762 - hash table



x = 1_000_000

4997 - built-in

5762 - counting-tree

2505 - HashMap

1294 - hash table



x = 10_000_000

5083 - built-in

11480 - counting-tree

5099 - HashMap

3240 - hash table
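
The benchmark itself is not included in the post, so here is a minimal sketch of the kind of harness that can produce such numbers. The single-run timing without JVM warm-up is my simplification, and it assumes the sort methods listed at the end of the article sit in the same class.

import java.util.Arrays;
import java.util.Random;

public class Bench {
    public static void main(String[] args) {
        int n = 10_000_000, x = 10;       // array size and value range [0; x]
        Random rnd = new Random();
        Integer[] data = new Integer[n];
        for (int i = 0; i < n; i++)
            data[i] = rnd.nextInt(x + 1);

        Integer[] copy = data.clone();    // every sort gets an identical input
        long t0 = System.nanoTime();
        Arrays.sort(copy);                // built-in sort
        System.out.println("built-in: " + (System.nanoTime() - t0) / 1_000_000 + " ms");

        copy = data.clone();
        t0 = System.nanoTime();
        usatovProkuratSortUsingMyHashTable(copy);
        System.out.println("my hash table: " + (System.nanoTime() - t0) / 1_000_000 + " ms");
    }
}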



As I said at the beginning, these sorts are best suited to cases where the cardinality of the set of array elements is small relative to the size of the array.



It is also worth noting that the built-in sort is much faster on ordered (or almost ordered) data, and on arrays in reverse order, while my method wins on random numbers.



Counting sort with a tree:



import java.util.TreeSet;

static void usatovSort(Integer[] arr) {
    TreeSet<MyPair> tree = new TreeSet<>();
    MyPair temp;
    MyPair mp = new MyPair();            // reusable search key, to avoid creating objects
    for (int i : arr) {
        temp = mp;
        temp.first = i;
        temp = tree.ceiling(temp);       // the smallest pair whose key is >= i
        if (temp != null && temp.first == i)
            temp.second++;               // the key is already in the tree: bump its count
        else
            tree.add(new MyPair(i, 1));
    }
    int ptr = 0;
    while (!tree.isEmpty()) {
        temp = tree.pollFirst();         // remove the smallest key
        for (int i = 0; i < temp.second; i++)
            arr[ptr++] = temp.first;     // write it out as many times as it occurred
    }
}


Sorting with a HashMap:



import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;

static void usatovProkuratSortUsingHashMap(Integer[] arr) {
    HashMap<Integer, Integer> hm = new HashMap<>();
    Integer temp;
    // Count the repetitions of every key.
    for (Integer i : arr) {
        temp = hm.get(i);
        if (temp == null)
            hm.put(i, 1);
        else
            hm.put(i, temp + 1);
    }
    // A HashMap keeps no order, so collect the keys and sort them.
    ArrayList<Integer> keys = new ArrayList<>(hm.keySet().size());
    keys.addAll(hm.keySet());
    keys.sort(Comparator.naturalOrder());
    int ptr = 0;
    for (Integer i : keys)
        for (int j = 0; j < hm.get(i); j++)
            arr[ptr++] = i;
}


Sorting with my hash table:



import java.util.Arrays;
import java.util.Comparator;

static void usatovProkuratSortUsingMyHashTable(Integer[] arr) {
    MyHashTable mht = new MyHashTable();
    for (Integer i : arr)
        mht.add(i);                      // count the repetitions of every key
    MyPair[] res = mht.getPairs();
    int ptr = 0;
    // The table may leave the keys out of order, so sort the pairs by key.
    Arrays.sort(res, Comparator.comparingInt(o -> o.first));
    for (MyPair mp : res)
        for (int i = 0; i < mp.second; i++)
            arr[ptr++] = mp.first;
}


Hash table implementation:



public class MyHashTable {
    private MyPair[] hashArr;
    private double count = 0;            // number of occupied slots
    private double loadFactor = 0.5;
    private double expansivity = 2;      // growth factor on rehash
    private static final int DEFAULT_CAPACITY = 20;

    public MyHashTable() {
        hashArr = new MyPair[DEFAULT_CAPACITY];
    }

    public MyHashTable(double loadFactor) {
        hashArr = new MyPair[DEFAULT_CAPACITY];
        this.loadFactor = loadFactor;
    }

    public MyHashTable(int capacity) {
        hashArr = new MyPair[capacity];
    }

    public MyHashTable(int capacity, double loadFactor) {
        hashArr = new MyPair[capacity];
        this.loadFactor = loadFactor;
    }

    public MyHashTable(int capacity, double loadFactor, double expansivity) {
        hashArr = new MyPair[capacity];
        this.loadFactor = loadFactor;
        this.expansivity = expansivity;
    }

    // Collect all stored pairs into a dense array.
    public MyPair[] getPairs() {
        MyPair[] pairs = new MyPair[(int) count];
        int ptr = 0;
        for (MyPair i : hashArr)
            if (i != null)
                pairs[ptr++] = i;
        return pairs;
    }

    // Linear probing: hash = key % length, step = 1.
    // (The original version dereferenced empty slots; the null check fixes that.)
    public MyPair get(int key) {
        int add = 0;
        while (add <= hashArr.length) {
            MyPair p = hashArr[(key + add) % hashArr.length];
            if (p == null)               // empty slot: the key is absent
                return null;
            if (p.first == key)
                return p;
            add++;
        }
        return null;
    }

    public void add(int key) {
        if (count / hashArr.length >= loadFactor)
            grow();
        int add = 0;
        while (true) {
            int idx = (key + add) % hashArr.length;
            if (hashArr[idx] == null) {  // free slot: insert a new pair
                hashArr[idx] = new MyPair(key, 1);
                count++;
                return;
            }
            if (hashArr[idx].first == key) { // key already present: bump the count
                hashArr[idx].second++;
                return;
            }
            add++;                       // collision: probe the next slot
        }
    }

    public void add(MyPair newMP) {
        if (count / hashArr.length >= loadFactor)
            grow();
        int add = 0;
        while (true) {
            int idx = (newMP.first + add) % hashArr.length;
            if (hashArr[idx] == null) {
                hashArr[idx] = newMP;
                count++;
                return;
            }
            if (hashArr[idx].first == newMP.first) {
                hashArr[idx].second += newMP.second;
                return;
            }
            add++;
        }
    }

    // Allocate a larger array and rehash every stored pair into it.
    private void grow() {
        MyPair[] oldHash = hashArr;
        hashArr = new MyPair[(int) (expansivity * hashArr.length)];
        for (MyPair i : oldHash)
            if (i != null)
                innerAdd(i);
    }

    // add() without the load-factor check and without touching count,
    // used only for reinserting existing pairs during a rehash.
    private void innerAdd(MyPair mp) {
        int add = 0;
        while (true) {
            int idx = (mp.first + add) % hashArr.length;
            if (hashArr[idx] == null) {
                hashArr[idx] = mp;
                return;
            }
            if (hashArr[idx].first == mp.first) {
                hashArr[idx].second += mp.second;
                return;
            }
            add++;
        }
    }
}


The MyPair class:



public class MyPair implements Comparable<MyPair> {
    public int first;    // the key: an element of the array
    public int second;   // the number of its repetitions

    public MyPair() {
    }

    public MyPair(int first, int second) {
        this.first = first;
        this.second = second;
    }

    // Pairs are ordered and compared by key only.
    @Override
    public int compareTo(MyPair o) {
        return Integer.compare(first, o.first); // avoids the overflow of first - o.first
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        MyPair myPair = (MyPair) o;
        return first == myPair.first;
    }

    @Override
    public int hashCode() {
        return first;
    }
}
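
A quick sanity check (my sketch, assuming the three sort methods above sit in the same class) that compares all of them against the built-in sort:

import java.util.Arrays;
import java.util.Random;

public static void main(String[] args) {
    Random rnd = new Random();
    Integer[] a = new Integer[100_000];
    for (int i = 0; i < a.length; i++)
        a[i] = rnd.nextInt(1_000);       // small value range, many repetitions

    Integer[] expected = a.clone();
    Arrays.sort(expected);               // reference result

    Integer[] b = a.clone(), c = a.clone(), d = a.clone();
    usatovSort(b);
    usatovProkuratSortUsingHashMap(c);
    usatovProkuratSortUsingMyHashTable(d);

    // All three should reproduce the built-in result exactly.
    System.out.println(Arrays.equals(expected, b)
            && Arrays.equals(expected, c)
            && Arrays.equals(expected, d));
}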

Source: https://habr.com/ru/post/418355/


