📜 ⬆️ ⬇️

How are arrays in PHP

In the last article I talked about variables, now we will talk about arrays.

What are PHP-level arrays?


At the PHP level, an array is an ordered list crossed with a map. Roughly speaking, PHP mixes these two concepts, as a result, it turns out, on the one hand, a very flexible data structure, on the other hand, perhaps not the most optimal, more precisely, as Anthony Ferrara put it:
PHP arrays are a great one-size-fits-all data structure. But like all one-size-fits-all <anything>, jack-of-all-trades usually means master of none.

Link to the letter

')

(the picture shows HashTable with Buckets, V. Vasnetsov)

Let's start with the following: try to measure the memory and time eaten for each inserted value. We do this using such scripts:

// time.php $ar = array(); $t = 0; for ($i = 0; $i < 9000; ++$i) { $t = microtime(1); $ar[] = 1; echo $i . ":" . (int)((microtime(1) - $t) * 100000000) . "\n"; } // memory.php $ar = array(); $m = 0; for ($i = 0; $i < 9000; ++$i) { $m = memory_get_usage(); $ar[] = 1; echo $i . ":" . (memory_get_usage() - $m) . "\n"; } 


We run these php time.php > time and php memory.php > memory scripts, and draw graphs on them, because there is a lot of data and look at the numbers for a long time (data on time has been collected many times and aligned to avoid extra jumps on the graph).


(X-axis - the number of e-tov in the array)

As you can see, both charts have jumps both in terms of consumed memory and used time, and these jumps occur at the same moments.
The fact is that at the C level (and indeed at the system level), there are no arrays with non-fixed size. Every time you create an array in C, you must specify its size so that the system knows how much memory it needs to allocate.
Then how is this implemented in PHP and how to pull up those jumps on the chart?
When you create an empty array, PHP creates it with a specific size. If you fill an array and at some point reach and exceed this size, then a new array is created with twice the size, all elements are copied into it and the old array is destroyed. In general, this is a standard approach.

And how is this implemented?


In fact, to implement arrays in PHP, we use quite a standard Hash Table data structure, the details of which we will discuss.

Hash Table stores a pointer to the very first and last value (needed for ordering arrays), a pointer to the current value (used for iterating over an array, this is what current() returns), the number of elements represented in the array, an array pointers to Buckets (about them further), and one more thing.

What for
us
need buckets
and where
us
put them down


In the Hash Table there are two main entities, the first is the Hash Table itself, and the second is the Bucket (hereinafter the bucket, so as not to get bored).

The values ​​themselves are stored in the buckets, that is, each value has its own bucket. But in addition to this, the original key is stored in the bucket, pointers to the next and previous buckets (they are needed to organize the array, because in PHP the keys can be in any order you want), and, again, something else.

Thus, when you add a new element to the array, if such a key is not there yet, then a new bucket is created under it and added to the Hash Table.

But what is most interesting is how these buckets are stored in the Hash Table.

As mentioned above, HT has a certain array of pointers to buckets, while the buckets are available in this array at a certain index, and this index can be calculated knowing the key of the bucket. It sounds a little difficult, but in fact, everything is much simpler than it seems. Let's try to make out how the item is received by key from HT:


About the mask: let's say an array of pointers contains 4 elements, so the mask is three, that is, 11 in binary.
Now if we get a number like 123 (1111011) as a key, then after applying the mask we get 3 (3 & 123), this number can already be used as an index.

After that, we will try to add in the Hash Table, with mask 3, items with the keys 54 and 90. And after applying the mask, both of these keys will be equal to two.

What to do with collisions?


The buckets are still a couple of cards in the sleeve. Each bucket also has a pointer to the previous and next bucket, in which the indices (hashes from the keys) are equal.
Thus, in addition to the main doubly linked list, which passes between all the buckets, there are also small doubly linked lists between those buckets whose indices are equal. That is, it turns out about the following picture:



Back to our case with keys 54 and 90, and mask 3. After you add 54, the HT structure will look something like this:

 { nTableSize: 4, //      nTableMask: 3, //  nNumOfElements: 1, //    HT nNextFreeElement: 55, //    ,         ( ) ..., arBuckets: [ NULL, NULL, 0xff0, // ,     Bucket,      NULL ] } 0xff0: { h: 54, //   nKeyLength: 0, //   ,      pData: 0xff1, //   ,      zval ... } 


Now add an element with the key 90, now everything will look like this:

 { nTableSize: 4, nTableMask: 3, nNumOfElements: 2, nNextFreeElement: 91, ..., arBuckets: [ NULL, NULL, 0xff0, //       Bucket NULL ] } 0xff0: { h: 54, pListNext: 0xff2, //      Bucket   ( ) pNext: 0xff2, //     Bucket     ... } 0xff2: { h: 90, pListLast: 0xff0, //    pLast: 0xff0, //       ... } 


Now let's add some elements before the overflow of nTableSize (I remind you that there will be overflow only when nNumOfElements> nTableSize).
Add items with the keys 0, 1, 3 (those that have not yet existed, and after applying the masks, they will remain the same), this is what will happen:

 { nTableSize: 8, //     (    ) nTableMask: 7, nNumOfElements: 5, nNextFreeElement: 91, //     ..., arBuckets: [ //       ,         0xff3, // key = 0 0xff4, // key = 1 0xff2, // key = 90 -     90 (  90 & 7 = 2),       6-  0xff5, // key = 3 NULL, NULL, 0xff0, // key = 54 NULL ] } 0xff0: { h: 54, pListNext: 0xff2, //    ,     pNext: NULL, //     ... } 0xff2: { h: 90, pListNext: 0xff3, pListLast: 0xff0, pLast: NULL, ... } 0xff3: { h: 0, pListNext: 0xff4, pListLast: 0xff2, ... } 0xff4: { h: 1, pListNext: 0xff5, pListLast: 0xff3, ... } 0xff5: { h: 3, pListNext: NULL, //     pListLast: 0xff4, ... } 


What happens after an array overflow is called rehash. In essence, this is an iteration over all existing buckets (via pListNext), the assignment of their neighbors (collisions) and the addition of references to them in arBuckets.

But what really gets the item on the key? To the previous algorithm one more thing will be added:

So until either the buckets run out in pNext (then Notice is thrown out), or until a match is found.

It is worth noting that in PHP almost everything is built on this HashTable structure alone: ​​all variables in any scope are actually in HT, all methods of classes, all fields of classes, even the class definitions themselves are in HT, It is actually a very flexible structure. Among other things, HT provides almost the same sampling / insertion / deletion speed and the complexity of all three is O (1), but with a reservation to a small overhead in case of collisions.

By the way, here I implemented the Hash Table in PHP itself. Well, that is, implemented PHP-shnye arrays in PHP =)

Source: https://habr.com/ru/post/162685/


All Articles