📜 ⬆️ ⬇️

Hash tables in Go. Implementation details

image


We speculate about the implementation of the map in a language without generics, consider what a hash table is, how it is arranged in Go, what are the pros and cons of this implementation and what you should pay attention to when using this structure.

Details under the cut.



Attention! Who is already familiar with the hash tables in Go, I advise you to skip the basics and go here , otherwise there is a risk of getting tired to the most interesting point.
')

What is a hash table


To begin with, let me remind you what a hash table is. This is a data structure that allows you to store key-value pairs, and, as a rule, has functions:


Hash table in go


The hash table in the go language is represented by the map keyword and can be declared in one of the ways below (more about them later):

  m := make(map[key_type]value_type) m := new(map[key_type]value_type) var m map[key_type]value_type m := map[key_type]value_type{key1: val1, key2: val2} 

Basic operations are performed as follows:


Go to table go


Consider the following program:

 package main import "fmt" func main() { m := map[int]bool{} for i := 0; i < 5; i++ { m[i] = ((i % 2) == 0) } for k, v := range m { fmt.Printf("key: %d, value: %t\n", k, v) } } 

Run 1:

 key: 3, value: false key: 4, value: true key: 0, value: true key: 1, value: false key: 2, value: true 

Run 2:

 key: 4, value: true key: 0, value: true key: 1, value: false key: 2, value: true key: 3, value: false 

As you can see, the output varies from run to run. And all because the map in Go unordered, that is, not ordered. This means that it is not necessary to rely on the order when going round. The reason can be found in the source code of the runtime language:

 // mapiterinit initializes the hiter struct used for ranging over maps. func mapiterinit(t *maptype, h *hmap, it *hiter) {... // decide where to start r := uintptr(fastrand()) ... it.startBucket = r & bucketMask(hB)...} 

Search location is determined randomly , remember this! Rumor has it that so developers runtime make users not to rely on order.

Go table lookup


Consider again a piece of code. Suppose we want to create a pair of "number" - "the number multiplied by 10":

 package main import ( "fmt" ) func main() { m := map[int]int{0: 0, 1: 10} fmt.Println(m, m[0], m[1], m[2]) } 

Run:

 map[0:0 1:10] 0 10 0 

And we see that when we tried to get the value of the two (which we forgot to put) we got 0. We find in the documentation the lines explaining this behavior: the type of the entries in the map. ”, and translated into Russian, this means that when we try to get the value from the map, but it is not there, we get the“ zero value of the type ”, which is the case of the number 0. What to do, if we want to distinguish between cases 0 and absence 2? To do this, they invented a special form “multiple assignment” - a form when, instead of the usual one value, the map returns a pair: the value itself and another boolean that answers the question of whether the requested key is present in the map or not ”

Correctly the previous piece of code will look like this:

 package main import ( "fmt" ) func main() { m := map[int]int{0: 0, 1: 10} m2, ok := m[2] if !ok { // somehow process this case m2 = 20 } fmt.Println(m, m[0], m[1], m2) } 

And when you start we get:

map[0:0 1:10] 0 10 20

Create a table in Go.


Suppose we want to count the number of occurrences of each word in a line, we will create a dictionary for this and go through it.

 package main func main() { var m map[string]int for _, word := range []string{"hello", "world", "from", "the", "best", "language", "in", "the", "world"} { m[word]++ } for k, v := range m { println(k, v) } } 

Do you see a gopher ? - And he is!

image

When trying to launch such a program, we will get a panic and the message “assignment to entry in nil map”. And all because the map is a reference type and it is not enough to declare a variable, it is necessary to initialize it:

 m := make(map[string]int) 

A little lower it will be clear why this works that way. In the beginning, there were already 4 ways to create a map, two of them we considered - this declaration as a variable and creation through make. You can also create using " Composite literals " designs

  map[key_type]value_type{} 

and if desired, even immediately initialize the values, this method is also valid. As for creating with new, in my opinion, it does not make sense, because this function allocates memory for a variable and returns a pointer to it filled with a zero value of the type, which in the case of map will be nil (we will get the same result as in var, more precisely a pointer to it).

How is the map passed to the function?


Suppose we have a function that is trying to change the number that was passed to it. Let's see what happens before and after the call:

 package main func foo(n int) { n = 10 } func main() { n := 15 println("n before foo =", n) foo(n) println("n after foo =", n) } 

An example, I think, is quite obvious, but I will nevertheless include the conclusion:

 n before foo = 15 n after foo = 15 

As you probably guessed, it came to the function n by value, and not by reference, so the original variable did not change.

Let's do a similar trick with mapy:

 package main func foo(m map[int]int) { m[10] = 10 } func main() { m := make(map[int]int) m[10] = 15 println("m[10] before foo =", m[10]) foo(m) println("m[10] after foo =", m[10]) } 

And lo and behold:

 m[10] before foo = 15 m[10] after foo = 10 

The value has changed. “Well, is the map transmitted by reference?”, You ask. Not. There are no links in Go. It is impossible to create 2 variables with 1 address, as in C ++ for example. But you can create 2 variables pointing to one address (but these are already pointers, and they are in Go).

Suppose we have a function fn, which initializes the map m. In the main function, we simply declare the variable, send it to the initialization and see what happened after.

 package main import "fmt" func fn(m map[int]int) { m = make(map[int]int) fmt.Println("m == nil in fn?:", m == nil) } func main() { var m map[int]int fn(m) fmt.Println("m == nil in main?:", m == nil) } 

Conclusion:

m == nil in fn?: false
m == nil in main?: true

So, the variable m was passed by value , so, as in the case of passing an ordinary int to a function, it did not change (the local copy of the value in fn changed). Then why does the value in m itself change? To answer this question, consider the code from the language runtime:

 // A header for a Go map. type hmap struct { // Note: the format of the hmap is also encoded in cmd/compile/internal/gc/reflect.go. // Make sure this stays in sync with the compiler's definition. count int // # live cells == size of map. Must be first (used by len() builtin) flags uint8 B uint8 // log_2 of # of buckets (can hold up to loadFactor * 2^B items) noverflow uint16 // approximate number of overflow buckets; see incrnoverflow for details hash0 uint32 // hash seed buckets unsafe.Pointer // array of 2^B Buckets. may be nil if count==0. oldbuckets unsafe.Pointer // previous bucket array of half the size, non-nil only when growing nevacuate uintptr // progress counter for evacuation (buckets less than this have been evacuated) extra *mapextra // optional fields } 

A map in Go is just a pointer to a hmap structure. This is the answer to the question of why, while the map is passed to a function by value, the values ​​themselves that are contained in it change - the whole point is in the pointer. The hmap structure also contains the following: the number of elements, the number of “buckets” (represented as a logarithm to speed up calculations), seed for randomization of hashes (to make it harder to add, try to find keys so that there are continuous collisions), all sorts of service fields and the main pointer to the buckets where the values ​​are stored. Let's look at the drawing:

image

In the picture there is a schematic representation of the structure in memory - there is a hmap header, a pointer to which is a map in Go (it is created when declared with var, but is not initialized, which causes the program to crash when trying to insert). The field buckets is a repository of key-value pairs, there are several such “buckets”, each one contains 8 pairs. First, in the “bucket” there are slots for extra bits of hashes (e0..e7 is called e - because extra hash bits). Next are the keys and values ​​as first a list of all keys, then a list of all values.

By the hash function is determined in which "bucket" we put the value, inside each "bucket" can lie up to 8 collisions, at the end of each "bucket" there is a pointer to an additional one, if the previous one suddenly overflowed.

How does the map grow?


In the source code you can find the line:

  // Maximum average load of a bucket that triggers growth is 6.5. 

that is, if in each “bucket” an average of more than 6.5 elements, an increase in the array of buckets occurs. In this case, the array is allocated 2 times larger, and the old data is copied into it in small portions every insertion or deletion, so as not to create very large delays. Therefore, all operations will be slightly slower in the process of data evacuation (when searching, too, we have to look in two places). After a successful evacuation, new data begins to be used.

image

Taking the address of the map element.


Another interesting point is that at the very beginning of using the language I wanted to do this:

 package main import ( "fmt" ) func main() { m := make(map[int]int) m[1] = 10 a := &m[1] fmt.Println(m[1], *a) } 

But Go says: “cannot take the address of m [1]”. The explanation of why it is impossible to take the address of the value lies in the procedure of data evacuation. Imagine that we took the address of the value, and then the map grew, a new memory was allocated, the data was evacuated, the old ones were deleted, the pointer became incorrect, therefore such operations are prohibited.

How is a map implemented without generics?


Neither an empty interface, nor code generation has anything to do with it, the whole thing is about replacing it at compile time. Consider what functions familiar to us turn into from Go:

 v := m["k"] → func mapaccess1(t *maptype, h *hmap, key unsafe.Pointer) unsafe.Pointer v, ok := m["k"] → func mapaccess2(t *maptype, h *hmap, key unsafe.Pointer) (unsafe.Pointer, bool) m["k"] = 9001 → func mapassign(t *maptype, h *hmap, key unsafe.Pointer) unsafe.Pointer delete(m, "k") → func mapdelete(t *maptype, h *hmap, key unsafe.Pointer) 

We see that for accesses there are mapaccess functions, for recording and deleting mapassign and mapdelete, respectively. All operations use unsafe.Pointer, which does not care what type it points to, and information about each value is described by a type descriptor .

 type mapType struct { key *_type elem *_type ...} type _type struct { size uintptr alg *typeAlg ...} type typeAlg struct { hash func(unsafe.Pointer, uintptr) uintptr equal func(unsafe.Pointer, unsafe.Pointer) bool...} 

MapType stores descriptors (_type) of the key and value. For the type descriptor, operations (typeAlg) for comparison, hash taking, size and so on are defined, so we always know how to produce them.

A little more detail about how it works. When we write v = m [k] (trying to get the value of v by the key k), the compiler generates something like the following:

 kPointer := unsafe.Pointer(&k) vPointer := mapaccess1(typeOf(m), m, kPointer) v = *(*typeOfvalue)vPointer 

That is, we take a pointer to the key, the mapType structure, from which we learn what descriptors the key and values ​​are, the pointer to hmap itself (that is, the map) and pass it all to mapaccess1, which will return a pointer to the value. We bring the pointer to the necessary type, dereference and get the value.

Now let's look at the search code from runtime (which is a bit adapted for reading):

 func lookup(t *mapType, m *mapHeader, key unsafe.Pointer) unsafe.Pointer { 

The function looks for the key in the map and returns a pointer to the corresponding value, the arguments are already familiar to us - this is mapType, which stores the key type and value descriptors, the map itself (mapHeader) and a pointer to the memory that holds the key. Return a pointer to the memory storing the value.

  if m == nil || m.count == 0 { return zero } 

Next, we check if the pointer to the map mapper is not null, if there are 0 elements there, and return zero if it is.

  hash := t.key.hash(key, m.seed) // hash := hashfn(key) bucket := hash & (1<<m.logB-1) // bucket := hash % nbuckets extra := byte(hash >> 56) // extra := top 8 bits of hash b := (*bucket)(add(m.buckets, bucket*t.bucketsize)) // b := &m.buckets[bucket] 

We calculate the hash of the key (we know how to calculate for a given key from a type descriptor). Then we try to understand what kind of “bucket” we need to go and see (the remainder of the division by the number of “buckets”, the calculations are just slightly accelerated). Then we calculate the additional hash (we take 8 high bits of the hash) and determine the position of the “bucket” in the memory (address arithmetic).

  for { for i := 0; i < 8; i++ { if b.extra[i] != extra { // check 8 extra hash bits continue } k := add(b, dataOffset+i*t.key.size) // pointer to ki in bucket if t.key.equal(key, k) { // return pointer to vi return add(b, dataOffset+8*t.key.size+i*t.value.size) } } b = b.overflow if b == nil { return zero } } 

The search, if you look at it, is not so difficult: we go through the chains of “buckets”, moving on to the next, if this is not found. The search in the “bucket” begins with a quick comparison of the additional hash (this is why the e0 ... e7 at the beginning of each is the “mini” hash of the pair for a quick comparison). If it doesn’t match, we go further, if it does, then we check more carefully - we determine where the key lies in the memory, the suspect is as desired, we compare if it is equal to what was requested. If equal, determine the position of the value in memory and return. As you can see, nothing supernatural.

Conclusion


Use maps, but know and understand how they work! You can avoid the rake by understanding some subtleties - why you can’t take the address of a value, why everything drops when you declare without initialization, why it is better to allocate memory in advance, if the number of elements is known (let's avoid evacuations) and much more.



Bibliography:

"Go maps in action", Andrew Gerrand
"How the go runtime implements maps efficiently", Dave Cheney
"Understanding type in go", William Kennedy
Inside map implementation, Keith Randall
map source code, go runtime
golang spec
effective go
pictures with gophers

Source: https://habr.com/ru/post/457728/


All Articles