
Python dictionary implementation

Hello everyone. The “Algorithms for Developers” course starts at OTUS on April 30, and today's material is timed to coincide with it. Let's get started.



In this article, you will learn how dictionaries are implemented in Python.
Dictionaries are indexed by keys and can be viewed as associative arrays. Let's add three key/value pairs to a dictionary:

    >>> d = {'a': 1, 'b': 2}
    >>> d['c'] = 3
    >>> d
    {'a': 1, 'b': 2, 'c': 3}

Values can be accessed as follows:

    >>> d['a']
    1
    >>> d['b']
    2
    >>> d['c']
    3
    >>> d['d']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: 'd'

The key 'd' does not exist, so a KeyError is raised.

Hash tables

Dictionaries in Python are implemented using hash tables: arrays whose indices are computed by applying a hash function to the keys. The purpose of a hash function is to distribute the keys evenly across the array. A good hash function minimizes the number of collisions, i.e. cases where different keys have the same hash. Python's hash functions are not of this kind: the most important ones (for strings and integers) produce very regular values in common cases:

    >>> map(hash, (0, 1, 2, 3))
    [0, 1, 2, 3]
    >>> map(hash, ("namea", "nameb", "namec", "named"))
    [-1658398457, -1658398460, -1658398459, -1658398462]

For the rest of this article, we will assume that strings are used as keys. The Python hash function for strings is defined as follows:

    arguments: string object
    returns: hash
    function string_hash:
        if hash cached:
            return it
        set len to string's length
        initialize var p pointing to 1st char of string object
        set x to value pointed by p left shifted by 7 bits
        while len > 0:
            set var x to (1000003 * x) xor value pointed by p
            increment pointer p
            decrement len
        set x to x xor length of string object
        cache x as the hash so we don't need to calculate it again
        return x as the hash

If you run hash('a') in Python, string_hash() is executed and returns 12416037344. Here we assume a 64-bit machine.
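As a rough illustration, the pseudocode above can be ported to Python. This is a sketch, not CPython's code: the name string_hash is borrowed from the pseudocode, while the bits parameter, the latin-1 encoding step and the explicit masking that emulates C integer wraparound are assumptions made for this example. It models only the Python 2 algorithm for byte strings.

```python
def string_hash(s, bits=64):
    """Sketch of the Python 2 string hash; `bits` is the assumed width of a C long."""
    data = s.encode("latin-1")  # work on byte values, like a Python 2 str
    if not data:
        return 0  # the loop never runs for an empty string
    mask = (1 << bits) - 1
    x = (data[0] << 7) & mask
    for byte in data:
        x = ((1000003 * x) ^ byte) & mask  # masking emulates C overflow
    x ^= len(data)
    if x >= 1 << (bits - 1):  # reinterpret the bit pattern as a signed C long
        x -= 1 << bits
    return -2 if x == -1 else x  # -1 is reserved as an error code in C


print(string_hash("a"))  # 12416037344 with the assumed 64-bit width
```

Python 3 uses a different, randomized string hash, so the built-in hash('a') will not match this value there.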

If an array of size x is used to store the key/value pairs, a mask equal to x - 1 is used to compute the cell index of a pair in the array. This makes computing cell indices fast. The probability of finding an empty cell is quite high thanks to the resizing mechanism described below, so this simple computation is sufficient in most cases. With an array of size 8, the index for 'a' is hash('a') & 7 = 0. The index for 'b' is 3, the index for 'c' is 2, and the index for 'z' is 3, the same as for 'b': here we have a collision.
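The mask arithmetic is easy to check by hand. The sketch below hard-codes 64-bit hash values for four one-character keys and applies a mask of 7 for a table of size 8. Only the value for 'a' is quoted in the text; the other three are assumed values obtained by applying the same string hash, and the names HASHES and slot_index are made up for this example.

```python
TABLE_SIZE = 8
MASK = TABLE_SIZE - 1  # 0b111

# 64-bit Python 2 hashes of four one-character keys
# (assumed values; only the one for 'a' is quoted in the article).
HASHES = {
    "a": 12416037344,
    "b": 12544037731,
    "c": 12672038114,
    "z": 15616046971,
}


def slot_index(hash_value, mask=MASK):
    """Keep only the low bits of the hash that the mask selects."""
    return hash_value & mask


for key in sorted(HASHES):
    print(key, slot_index(HASHES[key]))
# a 0
# b 3
# c 2
# z 3
```

Because the table size is a power of two, hash & mask gives the same result as hash % size but avoids a division.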



As we can see, the Python hash function does a good job when the keys are consecutive, which is useful because such data is quite common. However, as soon as we add the key 'z', a collision occurs because it is not consecutive with the previous keys.

We could use a linked list to store the pairs that have the same hash, but that would increase the lookup time, which would no longer be O(1) on average. The following section describes the collision resolution method used for dictionaries in Python.

Open addressing

Open addressing is a collision resolution method that uses probing. In the case of 'z', cell 3 is already in use, so we need to find another index that is not yet used. Adding a key/value pair takes O(1) on average, as does a lookup.

To search for free cells, a quadratic probing sequence is used. It is implemented as follows:

    j = (5*j) + 1 + perturb;
    perturb >>= PERTURB_SHIFT;
    use j % 2**i as the next table index;

Recurring on (5 * j) + 1 quickly magnifies small differences in the bits that did not affect the initial index. The variable perturb brings the other bits of the hash code into play.
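The recurrence can be tried out with a small generator. This is a sketch under stated assumptions: probe_sequence is a hypothetical helper, the hash is treated as an unsigned 64-bit value, and PERTURB_SHIFT is set to 5 as in CPython's dictobject.c. The example hash value 3 is arbitrary.

```python
from itertools import islice

PERTURB_SHIFT = 5  # value used in CPython's dictobject.c
U64 = 0xFFFFFFFFFFFFFFFF


def probe_sequence(hash_value, table_size):
    """Yield cell indices in the order the probing loop would try them."""
    mask = table_size - 1
    perturb = hash_value & U64  # treat the hash as an unsigned 64-bit value
    j = hash_value & mask
    yield j
    while True:
        j = (5 * j + 1 + perturb) & U64
        perturb >>= PERTURB_SHIFT
        yield j & mask  # same as j % 2**i for a table of size 2**i


print(list(islice(probe_sequence(3, 32), 5)))  # [3, 19, 0, 1, 6]
```

Once perturb has been shifted down to zero, the recurrence j = 5*j + 1 modulo a power of two cycles through every index, so a free cell is eventually found as long as one exists.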

Out of curiosity, let's look at the probing sequence for a table size of 32 and j = 3:

3 -> 11 -> 19 -> 29 -> 5 -> 6 -> 16 -> 31 -> 28 -> 13 -> 2 ...

You can learn more about this probing sequence by reading the source code of dictobject.c. A detailed explanation of the probing mechanism can be found at the top of the file.



With this example in mind, let's look at the Python source code.

Dictionary C structures

The following C structure is used to store a dictionary entry, i.e. a key/value pair. It holds the hash, the key and the value. PyObject is the base type of all Python objects.

    typedef struct {
        Py_ssize_t me_hash;
        PyObject *me_key;
        PyObject *me_value;
    } PyDictEntry;

The following structure represents a dictionary. ma_fill is the total number of used and inactive cells. A cell becomes inactive when its key/value pair is removed. ma_used is the number of used (active) cells. ma_mask equals the array size minus 1 and is used to compute the cell index. ma_table is the array, and ma_smalltable is the initial array of size 8.

    typedef struct _dictobject PyDictObject;
    struct _dictobject {
        PyObject_HEAD
        Py_ssize_t ma_fill;
        Py_ssize_t ma_used;
        Py_ssize_t ma_mask;
        PyDictEntry *ma_table;
        PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
        PyDictEntry ma_smalltable[PyDict_MINSIZE];
    };

Dictionary initialization

When a dictionary is first created, the PyDict_New() function is called. I removed some lines and converted the C code to pseudocode to focus on the key concepts.

PyDict_New() function:
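A minimal Python sketch of the state a freshly created dictionary starts with, based only on the field descriptions given for the C structure. The class name MiniDict is hypothetical, and using None to mark an unused cell is an assumption of this example rather than what the C code does.

```python
PYDICT_MINSIZE = 8  # size of the initial small table


class MiniDict:
    """Toy model of the fields a new dictionary object starts with."""

    def __init__(self):
        self.ma_fill = 0  # used + inactive cells
        self.ma_used = 0  # active cells only
        self.ma_mask = PYDICT_MINSIZE - 1  # array size minus 1
        self.ma_table = [None] * PYDICT_MINSIZE  # None marks an unused cell


d = MiniDict()
print(d.ma_fill, d.ma_used, d.ma_mask, len(d.ma_table))  # 0 0 7 8
```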


Adding items

When a new key/value pair is added, PyDict_SetItem() is called. This function takes a pointer to the dictionary object and the key/value pair. It checks whether the key is a string and computes its hash, or reuses the cached hash if there is one. insertdict() is then called to add the new pair, and the dictionary is resized if the number of used and inactive cells exceeds 2/3 of the array size.

Why exactly 2/3? This is necessary to make sure that the probing sequence can find free cells quickly enough. Later we will consider the function for resizing.

    arguments: dictionary, key, value
    returns: 0 if OK or -1
    function PyDict_SetItem:
        if key's hash cached:
            use hash
        else:
            calculate hash
        call insertdict with dictionary object, key, hash and value
        if key/value pair added successfully and capacity over 2/3:
            call dictresize to resize dictionary's table

insertdict() uses the lookup function lookdict_string() to find a free cell. The same function is used to look up a key.

lookdict_string() computes the cell index from the hash and the mask: cell index = hash & mask. If it cannot find the key at that index, it starts probing with the loop described above until it finds a free cell. While probing, when it reaches a cell whose key is null, it returns the first inactive (previously deleted) cell it passed along the way, if there was one, and the empty cell otherwise. This gives priority to reusing previously deleted cells.

We want to add the following key/value pairs: {'a': 1, 'b': 2, 'z': 26, 'y': 25, 'c': 3, 'x': 24}. Here is what happens:

The dictionary structure is allocated with a table size of 8.


PyDict_SetItem: key = 'c', value = 3

PyDict_SetItem: key = 'x', value = 24

Here is what we get:
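The layout can be reproduced with a small simulation of the insertions. This is a sketch, not CPython's code: string_hash is a compact port of the string hash that assumes a 64-bit build and non-empty keys, find_slot and set_item are hypothetical helpers, resizing is left out, and 'c' is mapped to 3 as in the PyDict_SetItem step above.

```python
PERTURB_SHIFT = 5
U64 = 0xFFFFFFFFFFFFFFFF


def string_hash(s):
    """Python 2 style string hash, 64-bit, non-empty keys only (sketch)."""
    x = (ord(s[0]) << 7) & U64
    for ch in s:
        x = ((1000003 * x) ^ ord(ch)) & U64
    return x ^ len(s)


def find_slot(table, key, h):
    """Return the index holding `key`, or the first unused cell on its probe path."""
    mask = len(table) - 1
    j, perturb = h & mask, h
    while True:
        entry = table[j & mask]
        if entry is None or entry[1] == key:
            return j & mask
        j = (5 * j + 1 + perturb) & U64  # probing recurrence
        perturb >>= PERTURB_SHIFT


def set_item(table, key, value):
    h = string_hash(key)
    table[find_slot(table, key, h)] = (h, key, value)


table = [None] * 8
for key, value in [("a", 1), ("b", 2), ("z", 26), ("y", 25), ("c", 3), ("x", 24)]:
    set_item(table, key, value)

print([entry[1] if entry else None for entry in table])
# ['a', 'y', 'c', 'b', None, 'z', None, 'x']
```

In this run 'z' collides with 'b' at index 3 and ends up at index 5, 'y' collides with 'a' at index 0 and moves to index 1, and 'x' is pushed from index 1 to index 7.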



Now 6 cells out of 8 are used, more than 2/3 of the array capacity is occupied. dictresize() is called to allocate a larger array. This function also deals with copying records from the old table to the new one.

dictresize() is called with minused = 24 in our case, which is 4 * ma_used. 2 * ma_used is used instead when the number of used cells is very large (more than 50,000). Why 4 times the number of used cells? This reduces the number of resize steps and increases sparseness.

The new table size must be greater than 24. It is computed by shifting the current size 1 bit to the left until it exceeds 24. The result is 32: 8 -> 16 -> 32.
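The size computation fits in a few lines of Python. This is a sketch that follows the description above; new_table_size is a hypothetical name, and the starting size of 8 is the initial table size mentioned earlier.

```python
PYDICT_MINSIZE = 8


def new_table_size(ma_used):
    """Smallest power-of-two size, starting from 8, that exceeds minused."""
    minused = (2 if ma_used > 50000 else 4) * ma_used
    newsize = PYDICT_MINSIZE
    while newsize <= minused:
        newsize <<= 1  # 8 -> 16 -> 32 -> ...
    return newsize


print(new_table_size(6))  # 32
```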

Here is what happens to our table during resizing: a new table with a size of 32 is allocated. Old table entries are inserted into a new table using a new mask value of 31. As a result, the following is obtained:
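The re-insertion can be sketched as follows. The hash values are hard-coded and assume the 64-bit Python 2 string hash (only the one for 'a' is quoted in the text), 'c' is mapped to 3 as in the PyDict_SetItem step, and resize is a hypothetical helper that takes the new size as an argument instead of computing it.

```python
PERTURB_SHIFT = 5

# (hash, key, value) entries taken from the old table;
# the hashes are assumed 64-bit Python 2 string hashes.
OLD_ENTRIES = [
    (12416037344, "a", 1),
    (15488046584, "y", 25),
    (12672038114, "c", 3),
    (12544037731, "b", 2),
    (15616046971, "z", 26),
    (15360046201, "x", 24),
]


def resize(entries, new_size):
    """Insert every old entry into a fresh table using the new mask."""
    table = [None] * new_size
    mask = new_size - 1
    for h, key, value in entries:
        j, perturb = h & mask, h
        while table[j & mask] is not None:  # probe until a free cell is found
            j = 5 * j + 1 + perturb
            perturb >>= PERTURB_SHIFT
        table[j & mask] = (h, key, value)
    return table


new_table = resize(OLD_ENTRIES, 32)
print({i: e[1] for i, e in enumerate(new_table) if e})
# {0: 'a', 2: 'c', 3: 'b', 24: 'y', 25: 'x', 27: 'z'}
```

With a mask of 31 two more bits of each hash take part in the index, so in this run none of the six keys collide any more.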



Deleting items

PyDict_DelItem() is called to delete an entry. The hash of the key is computed, and the lookup function is called to return the entry. The entry's cell is then marked as inactive (a dummy cell) rather than truly empty.

We want to remove the "c" key from our dictionary. As a result, we get the following array:



Note that deleting an item does not trigger a resize of the array, even if the number of used cells is much smaller than the total number of cells. However, when a key/value pair is added, the need to resize is based on the number of used plus inactive cells, so an insertion can also shrink the array.
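Deletion can be sketched in the same style. Here DUMMY is a hypothetical marker object for an inactive cell, delete is a made-up helper, and the two hard-coded hashes assume the 64-bit Python 2 string hash. The counters are recomputed from the table only to show how the used and used-plus-inactive counts diverge after a deletion.

```python
PERTURB_SHIFT = 5
DUMMY = object()  # marker for a cell whose key/value pair was deleted


def delete(table, key, h):
    """Find `key` by probing and turn its cell into an inactive one."""
    mask = len(table) - 1
    j, perturb = h & mask, h
    while True:
        entry = table[j & mask]
        if entry is None:
            raise KeyError(key)  # reached an unused cell: the key is absent
        if entry is not DUMMY and entry[1] == key:
            table[j & mask] = DUMMY
            return
        j = 5 * j + 1 + perturb
        perturb >>= PERTURB_SHIFT


table = [None] * 32
table[2] = (12672038114, "c", 3)  # assumed 64-bit Python 2 hash of 'c'
table[3] = (12544037731, "b", 2)  # assumed 64-bit Python 2 hash of 'b'

delete(table, "c", 12672038114)
used = sum(1 for e in table if e is not None and e is not DUMMY)
fill = sum(1 for e in table if e is not None)
print(used, fill, len(table))  # 1 2 32
```

The marker is needed because a lookup stops at the first truly unused cell; clearing the cell completely could cut the probe path of another key that was stored after a collision.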

This publication has come to an end. As usual, we welcome your comments and invite everyone to the open lesson, which will be held on April 18th.

Source: https://habr.com/ru/post/448350/

