
In case you are sharing data across multiple physical databases,
support for globally unique identifiers is not such a trivial task.
I tried to put together possible options and consider their pros and cons.
What requirements (except for uniqueness) can be placed on keys?
The key should monotonously increaseWhy do you need it:
- Automatic natural sorting (including chronological)
- Faster insertion than with random keys
The above is true only for indexes using trees.
')
The key must be generated on the client sideWhy do you need it:
- An application can immediately access the desired shard (cluster node), since the unique key is known even before the object is saved
- An application can save several related objects at once (in a single transaction or batch job)
Limit key size (for example, only 64 bits)Why do you need it:
- In some cases, 128 or 160 bit identifiers can be a problem. For example, Sphinx search supports only integer identifiers of documents (64 bits in the 64-bit version).
- An index with 32-bit or 64-bit keys should theoretically work faster than with keys of arbitrary length.
- An index with 32-bit or 64-bit keys is more compact and, as a result, fits entirely into memory (important for large amounts of data).
What are the ways to generate unique keys?
ID generation on the application side (UUID, etc.)With
UUID, everything is relatively simple - we take the library and generate keys on the application side. UUID is widely used in many systems. The value is formed in such a way that collisions are impossible.
Minuses:• The length of the standard UUID is 128 bits (which, in fact, you often want to store directly as a string of hexadecimal digits, which is already 32 bytes)
• As a rule, the UUID does not provide a natural key sorting.
Separate service for key generationThis option is used, for example,
Flick and
Twitter .
Minuses:• complication of the system due to the introduction of additional components
• a potentially single point of failure, or additional efforts are required to ensure high availability of the key generator
Auto-increment on database side by value rangeThe whole range of values is divided into subranges and each shard (cluster node) is responsible for its own range, for example:
1 - 0 .. 3FFFFFFF,
2 - 40000000 .. 7FFFFFFF,
3 - 80000000 .. BFFFFFFF,
4 - C0000000 .. FFFFFFFF
Minuses:• DB must support auto-increment (not applicable to many NoSQL-storages)
natural sorting is impossible
• The client "does not know" the key before inserting the object. To find out the key you need to make a separate request.
• It is almost impossible to increase the number of cluster nodes.
Instagram autoincrementThis decision is
described in the technical blog of the
Instgram team (photo sharing application).
They use a 64-bit key that is generated on the DB (PostgreSQL) side and consists of bit fields.
63 31 0 |======|======|======|======|======|======|======|======| | CUSTOM TIMESTAMP | SHARD ID | AUTO |
Where
-
CUSTOM TIMESTAMP
(41 bits) - time in milliseconds from 2011-01-01 00:00:00
-
SHARD ID
(13 bits) - logical partition identifier (the number of shards is greater than the number of physical nodes)
-
AUTO
(10 bits) - sequence (sequence), unique within the logical partition (shard)
Pluses (compared to autoincrement by ranges):• Automatic chronological sorting
And what options do you know and use?