Each table in SQLite contains, by default, a primary key based on an automatically generated 64-bit integer. It is effective and convenient in most situations. The inconvenience begins, perhaps, only in a couple of cases, one of which is distributed data.
It may seem that such a task should not arise in combination with SQLite, but distributed does not always mean something like BigData. A typical example (and the reason I personally needed to dig into this topic) is an application that synchronizes data between devices. It can be something as small as a notebook, or somewhat more loaded, like browser history. The problem here is not so much the amount of data as the merging of several databases. Obviously, integer record counters starting from 1 will inevitably produce conflicting sequences, so they can no longer serve as unique record identifiers across several devices. You could fiddle with splitting the key space into sub-ranges or with "shifting" the IDs before transferring records, but these are all crooked and fragile crutches. Nobody does that, of course. Instead, each device assigns something like a GUID to its records: simple and safe.
A GUID is a random "number" 128 bits long. In the database that means 16 bytes as a BLOB, or at least 32 bytes as a string. A certain overhead (especially if the remaining columns are small) compared to the default key, which is stored very efficiently: usually it takes not 8 bytes, but only as many as the key's value actually requires. We are ready to pay this overhead to solve the problem, but we do not want to make it worse, so naturally we prefer binary storage to text strings.
Well, declaring a blob column is easy; let's create a primitive table:
CREATE TABLE records (id BLOB PRIMARY KEY, data CHARACTER);
You can also add WITHOUT ROWID as a table qualifier for optimization, so that SQLite does not create and maintain an implicit rowid column.
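For our table that looks like this:

CREATE TABLE records (id BLOB PRIMARY KEY, data CHARACTER) WITHOUT ROWID;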
The topic could be closed here, if not for the desire to make the database generate the IDs on its own, just as it does with the default integer key. Well, if there is no fundamental requirement to have real GUIDs (which are not just random numbers but have a few predefined bits), then this, too, is easy:
CREATE TABLE records (id BLOB PRIMARY KEY DEFAULT (randomblob(16)), data CHARACTER);
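A quick check that the default actually kicks in (using the table just defined):

INSERT INTO records (data) VALUES ('hello');
SELECT typeof(id), length(id), hex(id) FROM records;
-- blob | 16 | 32 random hex characters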
We could stop here once again, were it not for perfectionism. Looking ahead, I will say that even with a severe form of perfectionism it is worth stopping here, unless you are going to insert millions of records into the database. But if you might, note that a long random number is not well suited as a key, because it is more expensive to index: you cannot simply append records one after another and still expect fast lookups.
Fortunately, this problem was solved long ago and successfully, and the solution has been adapted to many databases. In short, it is simple: the first 6 bytes of the identifier are replaced by a timestamp. As a result, records are created already (partially) ordered, which greatly simplifies indexing them. The probability of collisions increases, but only slightly. And again, the story would have ended right here if the Android SQLiteDatabase API allowed registering custom functions to generate a GUID-like BLOB. You can, of course, generate the values in Java code and bind them to every insert statement, but that is unsportsmanlike. Besides, there may be other reasons not to do it, for example the need to keep "global" identifiers separate from "local" ones, generating the former only as needed, with a trigger.
Well, squeezing 6 bytes out of the unix timestamp can be done, if somewhat clumsily:
SELECT round((julianday('now') - 2440587.5) * 86400000) & 0xFFFFFFFFFFFF AS ts;
The result is a number, for example 1489877740453 at the time of this writing. The good news is that it will usually be non-decreasing, and it can be considered a product of the database itself. But here the difficulties begin. The fact is that SQLite has a very limited set of functions for working with BLOBs: only cutting (substr()) and gluing (||). And how to make it interpret a number as a string of bytes is not obvious. You can, of course, write CAST(... AS BLOB), but that is not it: it converts the number to its decimal string and then takes the bytes of that string, turning 6 bytes into 13. Even if you pre-format the number as hexadecimal, it is still too many: 12. That won't do.
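The problem is easy to see directly; the cast goes through the decimal text representation:

SELECT typeof(CAST(1489877740453 AS BLOB)),
       length(CAST(1489877740453 AS BLOB)),
       hex(CAST(1489877740453 AS BLOB));
-- blob | 13 | 31343839383737373430343533   (the ASCII codes of the digits)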
Turning a number into a byte string in SQLite is impossible: that is what Google and Stack Overflow will tell you. Which is, of course, basically true, but if you really want to, you actually can. I could not find anything on the Internet and had to invent it myself. Fair warning: it is going to be dirty :)
So, we have gluing (||), which means that, given two byte strings, the timestamp and the random part, we could get Jimmy Nilsson's COMB:
SELECT ts_bytes || randomblob(10);
The required ts_bytes is just a string of 6 bytes representing an integer. Let's look at our example, 1489877740453, or 0x015AE3A22BA5. If we could take each byte individually as a BLOB, we could glue them together, even by hand: it is always exactly 6 concatenations. Well, let's try to split the number into bytes. Their numeric values can be obtained with a bit of arithmetic:
ts >> 40         == 1    (0x01)
(ts >> 32) % 256 == 90   (0x5A)
(ts >> 24) % 256 == 227  (0xE3)
...and so on, down to ts % 256. But, again, these are not bytes yet. The SQLite interpreter considers them plain numbers:
SELECT typeof( (1489877740453 >> 24) % 256 );
-- integer
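For completeness, all six byte values of our example timestamp, still as plain integers:

SELECT 1489877740453 >> 40,
       (1489877740453 >> 32) % 256,
       (1489877740453 >> 24) % 256,
       (1489877740453 >> 16) % 256,
       (1489877740453 >> 8) % 256,
       1489877740453 % 256;
-- 1 | 90 | 227 | 162 | 43 | 165   (0x01 5A E3 A2 2B A5)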
And we need a BLOB: a BLOB of one byte representing the resulting number. Obviously, we cannot construct it directly, but what if we had something like a lookup table? There are only 256 possible byte values, after all. Here we recall the second operation available in SQLite that returns bytes, substr(), which returns a substring of characters or bytes by index. Bingo! Let's hardcode all the byte values into one string, where each byte sits at the position equal to its value (plus one, since substr() is 1-based). Fortunately, a binary literal can be written using syntax of the form x'DEADBEEF':
SELECT X' 000102030405060708090A0B0C0D0E0F 101112131415161718191A1B1C1D1E1F 202122232425262728292A2B2C2D2E2F 303132333435363738393A3B3C3D3E3F 404142434445464748494A4B4C4D4E4F 505152535455565758595A5B5C5D5E5F 606162636465666768696A6B6C6D6E6F 707172737475767778797A7B7C7D7E7F 808182838485868788898A8B8C8D8E8F 909192939495969798999A9B9C9D9E9F A0A1A2A3A4A5A6A7A8A9AAABACADAEAF B0B1B2B3B4B5B6B7B8B9BABBBCBDBEBF C0C1C2C3C4C5C6C7C8C9CACBCCCDCECF D0D1D2D3D4D5D6D7D8D9DADBDCDDDEDF E0E1E2E3E4E5E6E7E8E9EAEBECEDEEEF F0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF' as b;
This is slightly pseudo-code, because the literal cannot actually be wrapped across lines like this, but it is easier to read; in real code just format it as one line. Now all that is left is to "cut" the desired byte out of this table and glue it to the others. Six times:
SELECT substr(b, (ts >> 40) + 1, 1) ||
       substr(b, (ts >> 32) % 256 + 1, 1) ||
       substr(b, (ts >> 24) % 256 + 1, 1) ||
       substr(b, (ts >> 16) % 256 + 1, 1) ||
       substr(b, (ts >> 8) % 256 + 1, 1) ||
       substr(b, ts % 256 + 1, 1) ||
       randomblob(10);
The pseudo-GUID in binary form is ready! Strictly speaking, SQLite will still treat the resulting byte string as "text", but CAST(... AS BLOB) puts everything right. In fact, the cast is even mandatory, because otherwise reading from this column returns not the expected 16 bytes but 17, with a terminating zero appended to the string. It remains to plug this expression in as the column's default value.
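As a side note, the lookup trick itself is easy to verify in isolation; here is an illustrative check with a truncated six-byte table:

SELECT typeof(substr(x'000102030405', 3 + 1, 1)),
       hex(substr(x'000102030405', 3 + 1, 1));
-- blob | 03

The byte with value 3 sits at position 3 + 1, exactly as in the big expression above. Now, back to that default value.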
Simply stuffing this whole train into DEFAULT(...) in the column definition is not possible, because only "simple" expressions are allowed there, while we need nested SELECTs to avoid copy-paste and recomputing the same values.
Fortunately, SQLite has triggers, which can modify rows on the fly during insertion. Unfortunately, neither the BEFORE INSERT nor the AFTER INSERT phase suits a PRIMARY KEY, since to satisfy the implicit NOT NULL constraint the column value must already be present in the original query. Besides, the UPDATE statement in such triggers again allows only primitive expressions. But there is the INSTEAD OF INSERT trigger type, which can simply create a new record from the passed values with the generated blob added. It has just one peculiarity that is easy to miss in the documentation: an INSTEAD OF INSERT trigger cannot be created on a table, only on a VIEW.
As a result, the scheme is built as follows:
CREATE TABLE records (id BLOB PRIMARY KEY, data CHARACTER) WITHOUT ROWID;

CREATE VIEW fake AS SELECT NULL as ts, NULL as data;

CREATE TRIGGER auto_guids INSTEAD OF INSERT ON fake
BEGIN
  INSERT INTO records(id, data)
  SELECT CAST(new_guid AS BLOB), NEW.data
  FROM (
    SELECT substr(b, (ts >> 40) + 1, 1) ||
           substr(b, (ts >> 32) % 256 + 1, 1) ||
           substr(b, (ts >> 24) % 256 + 1, 1) ||
           substr(b, (ts >> 16) % 256 + 1, 1) ||
           substr(b, (ts >> 8) % 256 + 1, 1) ||
           substr(b, ts % 256 + 1, 1) ||
           randomblob(10) AS new_guid
    FROM (SELECT round((julianday('now') - 2440587.5) * 86400000) & 0xFFFFFFFFFFFF as ts,
                 x'000102030405060708090A0B0C0D0E0F...' as b)
  );
END;
We read as usual:
SELECT * FROM records;
And we write it this way:
INSERT INTO fake (data) VALUES ('Hello COMBs!');
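To peek at the resulting keys, dump them as hex (the concrete values will differ: the first 12 hex characters encode the insertion time, the remaining 20 are random):

SELECT hex(id), data FROM records;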
The byte table could have been moved into a separate VIEW for readability, but that hurts performance somewhat. It would also have been possible to keep the primary key on the integer counter, make the guid column simply UNIQUE, and write an AFTER INSERT trigger that fills in the new guid on the freshly inserted row, but, running a little ahead, that turns out to be about 30% slower. Speaking of which, it is time to look at performance.
Obviously, gluing the bytes together by hand is slower than the built-in randomblob() alone. The payoff should show up on a large number of inserts. Let's measure. We will compare the "regular" integer ROWID, a key based on randomblob(16), and our partially ordered blobs (COMBs, as they are called in the article referenced above).
The test script measures the insertion time both for each series as a whole and within every 20% of the records. Tests were run in the Android 6.0 emulator (SQLite 3.8.10). Sources here.
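In essence, each measured run boils down to something like the following (a simplified SQL-only sketch; the real test drives these statements through the Android SQLiteDatabase API, and the table names here are illustrative):

-- the three contenders
CREATE TABLE t_rowid  (id INTEGER PRIMARY KEY, data CHARACTER);
CREATE TABLE t_random (id BLOB PRIMARY KEY DEFAULT (randomblob(16)), data CHARACTER);
-- t_comb: the records/fake/auto_guids trio from the listing above

-- one measured portion: a batch of inserts inside a single transaction,
-- with wall-clock time taken around BEGIN/COMMIT by the test harness
BEGIN TRANSACTION;
INSERT INTO t_rowid (data) VALUES ('payload');
-- ... the same statement repeated for the whole portion (a prepared statement in a loop) ...
COMMIT;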
On the chart: the insertion time of each successive portion of 200 thousand records. The performance baseline is, of course, the default integer key (blue line). Its speed does not depend on the number of inserts already made. The yellow line (COMBs) is our patient. Its speed is also almost constant, although 55-59% lower. And the red line is the table with the primary key on randomblob(16). You can see that, while it starts out only 11% slower than INTEGER PRIMARY KEY, somewhere after the first million inserts its index-maintenance overhead overtakes that of the partially ordered sequences and keeps growing, reaching a 75% slowdown by the end of the third million.
In fact, COMB can be made even faster. The current problem is that, with millisecond timestamp precision, adjacent rows come in clusters of 18-20 records whose first 6 bytes (the timestamp) are identical, which partially brings back the problem of arranging random bytes. If you somehow mix the sequence number of the inserted record into the timestamp (at least within a transaction), the overhead drops to 29-34% relative to "INT" and the gain over randomblob(16) appears after 500 thousand records.
The drawback is that, in the simplest case, the sequence number has to be passed in from the controlling code, which I do not want to do under the conditions of the problem. Besides, it is already possible to draw conclusions.
SQLite itself copes quite well with indexing even something GUID-like.
If the expected amount of data does not exceed roughly 500 thousand records, pure randomblob() has perfectly acceptable performance. That is probably what I will choose in my current project.
Even if there are many records but they are inserted rarely, or especially one at a time, the primary key type plays no role in performance at all. A single transaction commit (in Android, with default database settings) alone takes about 20-50 milliseconds, and many times more if the I/O subsystem is loaded. Against that, an insert inside a bulk transaction takes microseconds, a negligible time in any case.
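In other words, if you do insert in bulk, wrap the batch in a single transaction; the per-row cost of key generation then disappears into the noise (a generic pattern, using the fake view from above):

BEGIN TRANSACTION;
INSERT INTO fake (data) VALUES ('row 1');
INSERT INTO fake (data) VALUES ('row 2');
-- ... many more rows ...
COMMIT;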
So yes, in SQLite you can turn numbers into BLOBs; all it takes is the desire :)
Source: https://habr.com/ru/post/323794/