In the
previous article , we considered the criteria for choosing between a GUID and an auto increment as the primary key. The main idea was that if for some basic criteria (replication, requirements for uniqueness, etc.) there is a need to use a GUID, then you need to take into account the nuances associated with its performance. Tests of insertion of records showed that sequential GUIDs generated on the client are the fastest option, and the insertion in this case works even faster than when using autoincrement. But the article did not consider the loss in sampling performance when using the GUID as a key. In this article I will try to close this space.
We will test the following scenarios:
- Search for a separate entry by primary key with reading one child collection
- Search for a separate entry by the primary key with reading several collections
- Selection of several records with reading one child collection for each record
- Selection of several records with reading of several child collections for each record
- Selection of all records from the table with reading one child collection for each record
- Selection of all records from the table with reading of all child collections
Separation into a scenario with one child collection and several is needed to understand how much the number of joints (JOIN) affects the key in one query for the final result. And how much performance decrease depends on the type of key.
Using these scripts, we compare three key options:
- Auto increment
- Sequential guid
- Inconsistent GUID
To do this, we need three sets of tables, each of which consists of a main table and 5 tables for the child records.
Testing will be conducted, as in the previous article, using the Entity Framework 6.1.3. Database - Microsoft SQL Server 2014 Developer Edition (64 bit). Add 10,000 records to each of the main tables, where each record will have 5 child collections containing 10 elements each. Thus, the tables in the child collections will contain 100,000 rows each.
')
The source code of the test program can be found
here .
Below are charts of the results. For ease of comparison, percentages are indicated on them, where the minimum result of all the options is taken as 100%.
The results show that the sampling operations operating with integer keys are, of course, faster than in the case of the GUID. But the difference in performance between a sequential GUID and an auto-increment is not so great as to tell a significant loss when choosing a GUID.
It should also be noted that the percentage difference in the execution time is preserved with an increase in the number of connections in a single request. What, in fact, was to be expected, but it was necessary to make sure.
Inconsistent GUID showed in some cases noticeably worse results, which, again, is natural. In practice, the difference between it and the other two options may turn out to be even greater - in cases where the database cannot be fully cached in RAM and a large amount of disk readings are required to obtain all the neighboring child records (a similar test was described in a previous article to insert records). Accordingly, to use such a variant of the primary key in practice does not make sense.
Despite the fact that this test showed a slight difference between auto-increment and sequential GUID, I would not consider the latter as an option that can be thoughtlessly used in all situations. The GUID takes up more space, especially in the text view. If there is a transformation of objects into text format (JSON, XML) in the system, and the portion of objects that is converted at one time contains a large number of identifiers - the difference in volume compared to integer keys can be significant. The inverse transformation (deserialization) for GUID is much slower than for a number, but, in my opinion, this difference can be neglected. This time is still very short - on my machine it takes 60 milliseconds to parse a hundred thousand GUIDs, compared to 12 milliseconds for integers. A more tangible problem with the use of unique identifiers may be that they are much more difficult to operate with support and debugging than with integers.
I would formulate the general conclusion on the basis of both articles as follows: if for any reasons not related to performance, there is a need to use the GUID as the primary key, then from the possible implementations, you should choose a sequential one generated on the client. At the same time, in most cases, it is possible not to worry about the performance loss of read operations. The main GUID problem in terms of sampling performance is not that the processor needs to do more actions for comparison (this is most likely optimized, and there is little comparison when searching by key, if it is not a scan), but that it takes 4 times the usual int, respectively, fewer pages of indexes and data can be cached in the same amount of RAM. If in your database the indexes of primary and foreign keys occupy tens of percent of the total base volume and the amount of actively used data cannot be fully cached, and there is no possibility to increase the amount of RAM, it makes sense to think about lighter keys. But this situation seems to me quite rare. You should also pay attention to the potential problems with serialization / deserialization mentioned above. However, the corresponding case that I saw in practice was not so much a problem of the GUID itself as a problem of incorrect API design - the entire collection was read at once without restrictions, paging, etc.