What hash-set is for

The hash-set data structure (and set in general) is such a rare structure that it is not present in all standard libraries. In the .NET Framework, for example, it appeared only from version 3.5. However, in some cases a hash-set can be quite useful.

In this article I will show when it makes sense to use a hash-set and what you will have to pay for.

')
First, what set is. This is such a collection, elements of which are disordered and unique (that is, they are present in the collection no more than once). Set supports basic operations:

Add item.
Delete item.
Check if item exists.
Enumerate all items.

Implementations of set differ only in the data structure used for the search — it can be an array, a list, a tree, a hash table, or whatever. In practice, a hash table is most often used, since it gives maximum performance with acceptable overhead.

Now consider an example. Suppose we need to implement the grouping of a large number of elements according to the equality of the list of tags, and ignore list is defined for tags - that is, tags that should not participate in the comparison. Tags, in this case, are short strings (mostly words). Suppose also that ignore list contains about 100 tags.

On Ruby, the algorithm is as follows:

def group_elements(elements, ignore_list) groups = [] for element in elements assigned = false for group in groups #   ,     true; #  ,  false. if group.accept(element, ignore_list) assigned = true break end end #      , #     . if !assigned groups << Group.new(element) end end return groups end

Here is how the Group :: accept () method is implemented:

 class Group ... def accept(element, ignore_list) for my_element in @elements if my_element.match(element) @elements << element return true end end return false end ... end

And finally, the Element :: match () method:

 class Element ... def match(element, ignore_list) #  . my_tags_filtered = @tags.select { |tag| !ignore_list.include?(tag) } its_tags_filtered = element.tags.select { |tag| !ignore_list.include?(tag) } #  . intersection = my_tags_filtered & its_tags_filtered #        #      . return intersection.length > 0 && intersection.length == [my_tags_filtered.length, its_tags_filtered.length].max end ... end

Here, the implementation is shown in the forehead, where both the tag lists themselves and ignore list are normal dynamic arrays. That is, the operation ignore_list.include? has linear complexity. In addition, we recall that tags are strings with an average length of 5-10 characters, and the comparison of strings itself has linear complexity. Considering that grouping has O (n ² ) complexity, optimization of basic comparison operations can give a significant performance boost.

The hash-set helps to make the implementation more optimal. If we implement ignore list with it, we will get two serious advantages. First, we will get rid of the linear search in the list, replacing it with the search for a hash-key; secondly, we get rid of string comparisons. There is a true one nuance - in Ruby there is no hash-set data structure. However, this is not a problem, because a hash-set can replace a regular hash table in which elements are used as keys (you can use any type as small as possible as values). This strange technique applies, for example, in NHibernate, and works well if there is no better.

I compared two implementations of the above Ruby algorithm with the number of elements in the order of 2000-3000, the tag size of 2-15 UTF-8 characters and the number of tags per element, on average, 5. Replacing the hash-set array produced a performance gain of about 10 time.

In general, hash-sets allow for more optimal implementation of operations on sets - intersection, union and subtraction. However, do not forget about their redundancy. Hash-sets reserve significantly more memory than they need to store their elements, so it makes sense to use them only for medium-sized sets (100-10000 elements). In fact, you spend memory to get faster calculations. For large sets, trees should be used.

Source: https://habr.com/ru/post/126169/

All Articles

What hash-set is for

More articles: