Hello everyone! On April 30 the course “Algorithms for Developers” starts at OTUS, and today's publication is timed to coincide with it. Let's start.

In this article, you will learn how dictionaries are implemented in Python.
Dictionaries are indexed by keys and can be viewed as associative arrays. Let's build a dictionary with 3 key/value pairs:
>>> d = {'a': 1, 'b': 2}
>>> d['c'] = 3
>>> d
{'a': 1, 'b': 2, 'c': 3}
Values can be accessed as follows:
>>> d['a']
1
>>> d['b']
2
>>> d['c']
3
>>> d['d']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'd'
The key 'd' does not exist, so a KeyError is raised.
Hash tables
Dictionaries in Python are implemented using hash tables. A hash table is an array whose indices are computed with a hash function. The purpose of the hash function is to distribute the keys evenly across the array. A good hash function minimizes the number of collisions, i.e. cases where different keys have the same hash. Python doesn't have this kind of hash function. Its most important hash functions (for strings and integers) produce very regular values in common cases:
>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]
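The session above comes from Python 2, where map returns a list. As a small sketch for readers on Python 3 (an assumption, since the article itself describes the Python 2 implementation): map has to be wrapped in list, small integers still hash to themselves in CPython, and string hashes are randomized per process unless PYTHONHASHSEED is fixed, so no string values are shown here.

```python
# Python 3 sketch: small non-negative integers hash to themselves in CPython.
int_hashes = list(map(hash, (0, 1, 2, 3)))
print(int_hashes)

# String hashes differ between runs (hash randomization), so we only check
# that equal strings hash equally within a single process.
same_within_process = hash("namea") == hash("namea")
print(same_within_process)
```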
For the rest of this article we will assume that strings are used as keys. The hash function in Python for strings is defined as follows:
arguments: string object
returns: hash
function string_hash:
    if hash cached:
        return it
    set len to string's length
    initialize var p pointing to 1st char of string object
    set x to value pointed by p left shifted by 7 bits
    while len > 0:
        set var x to (1000003 * x) xor value pointed by p
        increment pointer p
        decrement len
    set x to x xor length of string object
    cache x as the hash so we don't need to calculate it again
    return x as the hash
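The pseudocode above can be mirrored in a few lines of Python. This is only a sketch under assumptions not stated in the article: the string is treated as a sequence of bytes (as Python 2 str objects are), C long arithmetic is emulated by wrapping at 64 bits, caching is omitted, and the bits parameter is a hypothetical knob added for illustration. The final substitution of -2 for -1 follows CPython, which reserves -1 as an error value.

```python
def string_hash(s, bits=64):
    """Sketch of the Python 2 string hash for a `bits`-wide C long."""
    data = s.encode("utf-8")       # Python 2 str objects are byte strings
    if not data:
        return 0                   # empty string: x starts at 0 and stays 0
    mask = (1 << bits) - 1
    x = (data[0] << 7) & mask      # first character shifted left by 7 bits
    for byte in data:              # one multiply-and-xor step per character
        x = ((1000003 * x) ^ byte) & mask
    x ^= len(data)                 # mix in the length
    if x >= 1 << (bits - 1):       # reinterpret as a signed C long
        x -= 1 << bits
    return -2 if x == -1 else x    # -1 is reserved as an error marker


print(string_hash('a'))  # 12416037344
print(string_hash('b'))  # 12544037731
```

With the default of 64 bits, the results match the values quoted in this article for single-character keys.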
If you run hash('a') in Python, string_hash() is executed and returns 12416037344. Here we assume a 64-bit machine.
If an array of a given size is used to store the key/value pairs, a mask equal to the array size minus 1 is used to calculate the cell index of a pair in the array. This approach makes computing cell indices fast. The probability of finding an empty cell is quite high thanks to the resizing mechanism described below, so a simple computation makes sense in most cases. If the array size is 8, the index for 'a' is hash('a') & 7 = 0. The index for 'b' is 3, the index for 'c' is 2, and the index for 'z' is 3, the same as for 'b': this is a collision.
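The index arithmetic can be checked directly. The sketch below hard-codes the 64-bit hash values quoted in this article; slot_index is a hypothetical helper name, not a CPython function.

```python
# 64-bit string hashes quoted in the article.
HASHES = {
    'a': 12416037344,
    'b': 12544037731,
    'c': 12672038114,
    'z': 15616046971,
}


def slot_index(h, table_size):
    """Cell index for hash h in a table whose size is a power of two."""
    return h & (table_size - 1)


indices = {key: slot_index(h, 8) for key, h in HASHES.items()}
print(indices)  # {'a': 0, 'b': 3, 'c': 2, 'z': 3}
```

Because the table size is a power of two, the mask keeps only the low bits of the hash, which is why 'b' and 'z' land in the same cell.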

As we can see, the Python hash function does a good job when the keys are consecutive, which is good because such data is common. However, as soon as we add the key 'z', a collision occurs because it is not consecutive with the previous keys.
We could use a linked list to store the pairs that have the same hash, but that would increase the lookup time, which would no longer be O(1) on average. The following section describes the collision resolution method used for dictionaries in Python.
Open addressing
Open addressing is a collision resolution method that uses probing. In the case of 'z', cell 3 is already used in the array, so we need to find another index that is not yet in use. Adding a key/value pair takes O(1) on average, as does a lookup.
To search for free cells, a quadratic probing sequence is used. It is implemented as follows:
j = (5*j) + 1 + perturb;
perturb >>= PERTURB_SHIFT;
use j % 2**i as the next table index;
Recurring on 5*j + 1 quickly magnifies small differences in the bits that did not affect the initial index. The variable "perturb" brings the other bits of the hash code into play.
Out of curiosity, let's see what happens if we have a probing sequence with a table size of 32 and j = 3.
3 -> 11 -> 19 -> 29 -> 5 -> 6 -> 16 -> 31 -> 28 -> 13 -> 2 ...
You can learn more about this probing sequence in the source code of dictobject.c. A detailed explanation of the probing mechanism can be found at the top of the file.
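The recurrence can be tried out with a small generator. This is a sketch, not the CPython code: probe_sequence is a hypothetical name, the hash is assumed to be non-negative, and PERTURB_SHIFT is set to 5, the value used in dictobject.c. The index is masked at every step, which gives the same result as taking j % 2**i.

```python
from itertools import islice

PERTURB_SHIFT = 5


def probe_sequence(h, table_bits):
    """Yield the cell indices probed for hash h in a table of 2**table_bits cells."""
    mask = (1 << table_bits) - 1
    j = h & mask          # initial index
    perturb = h           # remaining hash bits, consumed 5 at a time
    while True:
        yield j
        j = (5 * j + 1 + perturb) & mask
        perturb >>= PERTURB_SHIFT


# Hash of 'z' quoted in the article, table of size 8.
print(list(islice(probe_sequence(15616046971, 3), 4)))  # [3, 3, 3, 5]
```

The sequence may revisit an index while perturb is still non-zero. Once perturb reaches 0, the recurrence 5*j + 1 cycles through every cell of the table, so a free cell is always found as long as one exists.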

Let's look at the Python source code with this example.
Dictionary C structures
The following C structure is used to store a dictionary entry: a key/value pair. It stores the hash, the key and the value. PyObject is the base class for objects in Python.
typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;
The following structure represents a dictionary. ma_fill is the total number of used and inactive cells. A cell becomes inactive when its key/value pair is removed. ma_used is the number of used (active) cells. ma_mask equals the array size minus 1 and is used to calculate the cell index. ma_table is the array, and ma_smalltable is the initial array of size 8.
typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;
    Py_ssize_t ma_used;
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};
Dictionary initialization
When you first create a dictionary, the PyDict_New() function is called. I deleted some lines and converted the C code to pseudocode to focus on key concepts.
The PyDict_New() function returns a dictionary object and performs the following steps:
- Allocates a new dictionary object;
- Clears the dictionary table;
- Sets the number of used and inactive cells (ma_fill) to 0;
- Sets the number of active cells (ma_used) to 0;
- Sets the dictionary mask (ma_mask) to the table size minus 1, i.e. 7;
- Sets the dictionary lookup function to lookdict_string;
- Returns the allocated dictionary object.
Adding items
When a new key/value pair is added, PyDict_SetItem() is called. This function takes a pointer to the dictionary object and the key/value pair as input. It checks whether the key is a string and calculates the hash, or reuses the cached one if it exists. insertdict() is then called to add the new key/value pair, and the dictionary is resized if the number of used and inactive cells exceeds 2/3 of the array size.
Why exactly 2/3? This is necessary to make sure that the probing sequence can find free cells quickly enough. Later we will consider the function for resizing.
arguments: dictionary, key, value
returns: 0 if OK or -1
function PyDict_SetItem:
    if key's hash cached:
        use hash
    else:
        calculate hash
    call insertdict with dictionary object, key, hash and value
    if key/value pair added successfully and capacity over 2/3:
        call dictresize to resize dictionary's table
insertdict() uses the lookup function lookdict_string() to find a free cell. The same function is used to find a key. lookdict_string() calculates the cell index using the hash and mask values. If it cannot find the key at cell index = hash & mask, it starts probing with the loop described above until it finds a free cell. During the first probe, if the key is null, it returns the inactive cell if one was found during the first lookup. This gives priority to reusing previously deleted cells.
We want to add the following key/value pairs: {'a': 1, 'b': 2, 'z': 26, 'y': 25, 'c': 3, 'x': 24}. Here is what will happen:
The dictionary structure is allocated with a table size of 8.
- PyDict_SetItem: key = 'a', value = 1
- hash = hash ('a') = 12416037344
- insertdict
- lookdict_string
- slot index = hash & mask = 12416037344 & 7 = 0
- slot 0 not used, return this cell
- initialization of entry at index 0 with key, value and hash
- ma_used = 1, ma_fill = 1
- PyDict_SetItem: key = 'b', value = 2
- hash = hash ('b') = 12544037731
- insertdict
- lookdict_string
- slot index = hash & mask = 12544037731 & 7 = 3
- slot 3 not used, return this cell
- initialization of entry at index 3 with key, value and hash
- ma_used = 2, ma_fill = 2
- PyDict_SetItem: key = 'z', value = 26
- hash = hash ('z') = 15616046971
- insertdict
- lookdict_string
- slot index = hash & mask = 15616046971 & 7 = 3
- slot 3 is in use, try another slot: 5 is free
- initialization of entry at index 5 with key, value and hash
- ma_used = 3, ma_fill = 3
- PyDict_SetItem: key = 'y', value = 25
- hash = hash ('y') = 15488046584
- insertdict
- lookdict_string
- slot index = hash & mask = 15488046584 & 7 = 0
- slot 0 is used, try another cell: 1 is free
- initialization of entry at index 1 with key, value and hash
- ma_used = 4, ma_fill = 4
- PyDict_SetItem: key = 'c', value = 3
- hash = hash ('c') = 12672038114
- insertdict
- lookdict_string
- slot index = hash & mask = 12672038114 & 7 = 2
- slot 2 not used, return this cell
- initialization of entry at index 2 with key, value and hash
- ma_used = 5, ma_fill = 5
- PyDict_SetItem: key = 'x', value = 24
- hash = hash ('x') = 15360046201
- insertdict
- lookdict_string
- slot index = hash & mask = 15360046201 & 7 = 1
- slot 1 is used, try another cell: 7 is free
- initialization of entry at index 7 with key, value and hash
- ma_used = 6, ma_fill = 6
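The walk-through above can be replayed with a short simulation. It is a sketch under simplifying assumptions: the hash values are the ones quoted in this article, find_cell is a hypothetical stand-in for lookdict_string that only looks for the first free cell (all keys here are distinct and nothing has been deleted), PERTURB_SHIFT is assumed to be 5, and no resizing is performed.

```python
PERTURB_SHIFT = 5

# 64-bit string hashes quoted in the article, in insertion order.
HASHES = {
    'a': 12416037344, 'b': 12544037731, 'z': 15616046971,
    'y': 15488046584, 'c': 12672038114, 'x': 15360046201,
}


def find_cell(table, h):
    """Return the index of the first free cell on the probe path of hash h."""
    mask = len(table) - 1
    j, perturb = h & mask, h
    while table[j] is not None:            # cell in use: probe further
        j = (5 * j + 1 + perturb) & mask
        perturb >>= PERTURB_SHIFT
    return j


table = [None] * 8                         # initial table of size 8
ma_used = ma_fill = 0
placement = []
for key, h in HASHES.items():
    j = find_cell(table, h)
    table[j] = (h, key)                    # store the hash next to the key
    ma_used += 1
    ma_fill += 1
    placement.append((key, j))

print(placement)
# [('a', 0), ('b', 3), ('z', 5), ('y', 1), ('c', 2), ('x', 7)]
print(ma_used, ma_fill)  # 6 6
```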
Here is what we get:

Now 6 cells out of 8 are used, so more than 2/3 of the array capacity is occupied. dictresize() is called to allocate a larger array. This function also copies the entries from the old table to the new one.
dictresize() is called with minused = 24 in our case, which is 4 * ma_used. 2 * ma_used is used when the number of used cells is very large (more than 50,000). Why 4 times the number of cells? This reduces the number of resize steps and increases sparseness.
The new table size must be greater than 24. It is calculated by shifting the current size 1 bit to the left until it is larger than 24. The result is 32: 8 -> 16 -> 32.
Here is what happens to our table during resizing: a new table with a size of 32 is allocated. Old table entries are inserted into a new table using a new mask value of 31. As a result, the following is obtained:
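The resize arithmetic and the new cell positions can be recomputed from the hash values quoted in this article. The helper names below (needs_resize, requested_minused, new_table_size) are hypothetical and only mirror the description in the text; with a mask of 31 the six keys happen to map to six different cells, so no probing is needed in this sketch.

```python
# 64-bit string hashes quoted in the article.
HASHES = {
    'a': 12416037344, 'b': 12544037731, 'z': 15616046971,
    'y': 15488046584, 'c': 12672038114, 'x': 15360046201,
}


def needs_resize(ma_fill, table_size):
    """True when more than 2/3 of the cells are used or inactive."""
    return ma_fill * 3 > table_size * 2


def requested_minused(ma_used):
    """4 * ma_used, or 2 * ma_used for very large dictionaries."""
    return (2 if ma_used > 50000 else 4) * ma_used


def new_table_size(minused, start_size=8):
    """Shift the size left by 1 bit until it is larger than minused."""
    size = start_size
    while size <= minused:
        size <<= 1
    return size


size = new_table_size(requested_minused(6))
new_cells = {key: h & (size - 1) for key, h in HASHES.items()}
print(needs_resize(6, 8), size)  # True 32
print(new_cells)
# {'a': 0, 'b': 3, 'z': 27, 'y': 24, 'c': 2, 'x': 25}
```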
Deleting items
PyDict_DelItem() is called to delete an entry. The hash of the key is calculated, then the lookup function is called to return the entry. The cell is then marked as inactive.
We want to remove the "c" key from our dictionary. As a result, we get the following array:

Note that deleting an item does not trigger a resize of the array, even if the number of used cells is much smaller than the total number of cells. However, when a key/value pair is added, the need for resizing is based on the number of used and inactive cells, so an insertion can also shrink the array.
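To illustrate why deleted cells are marked as inactive rather than emptied, here is a toy table with deletion. It is a sketch, not the CPython implementation: ToyDict and DUMMY are hypothetical names, the hash values are the ones quoted in this article, PERTURB_SHIFT is assumed to be 5, and resizing is left out, so the table must always keep at least one never-used cell.

```python
PERTURB_SHIFT = 5
DUMMY = object()  # hypothetical marker for an inactive (deleted) cell

# 64-bit string hashes quoted in the article.
HASHES = {
    'a': 12416037344, 'b': 12544037731, 'z': 15616046971,
    'y': 15488046584, 'c': 12672038114, 'x': 15360046201,
}


class ToyDict:
    def __init__(self, size=8):
        self.mask = size - 1
        self.keys = [None] * size    # None = never used, DUMMY = deleted
        self.values = [None] * size
        self.used = 0                # active cells (ma_used)
        self.fill = 0                # active + inactive cells (ma_fill)

    def _lookup(self, key):
        """Return the cell holding key, else the first reusable cell."""
        h = HASHES[key]
        j, perturb, freeslot = h & self.mask, h, None
        while True:
            if self.keys[j] is None:             # never used: search ends
                return j if freeslot is None else freeslot
            if self.keys[j] is DUMMY:            # remember it, keep probing
                if freeslot is None:
                    freeslot = j
            elif self.keys[j] == key:
                return j
            j = (5 * j + 1 + perturb) & self.mask
            perturb >>= PERTURB_SHIFT

    def insert(self, key, value):
        j = self._lookup(key)
        if self.keys[j] is None:
            self.fill += 1                       # a fresh cell gets consumed
        if self.keys[j] is None or self.keys[j] is DUMMY:
            self.used += 1
        self.keys[j], self.values[j] = key, value

    def delete(self, key):
        j = self._lookup(key)
        if self.keys[j] != key:
            raise KeyError(key)
        self.keys[j], self.values[j] = DUMMY, None
        self.used -= 1                           # fill stays unchanged


d = ToyDict()
for k, v in [('a', 1), ('b', 2), ('z', 26), ('y', 25), ('c', 3), ('x', 24)]:
    d.insert(k, v)
d.delete('c')
print(d.used, d.fill)  # 5 6
```

If the cell of 'b' were simply emptied on deletion, a later lookup of 'z' would stop at cell 3 and miss the entry stored in cell 5. The marker keeps the probe path intact.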
This publication has come to an end. As usual, we welcome your comments and invite everyone to an open lesson, which will be held on April 18th.