SQLite and UNICODE

The first part is introductory .
The second part is a quick start .
The third part - features .

Despite the fact that this topic was touched on Habré before, some key things did not sound. This article attempts to "close the topic." Addition / correction notes are welcome.

Format strings in the database

The SQLite database can store text (string values) in UTF-8 format or in UTF-16 format in a database. In this case, the byte order in 16-bit UTF-16 characters can be either little-endian or big-endian.
')
That is, in reality there are three different SQLite database formats: UTF-8, UTF-16le, UTF-16be .

Any of these formats can be used on any platform (i.e., nothing prevents you from creating a UTF-16be base on x86, although this is unwise, see below).

The row format is a database setting that is set before the database is created. After creating the database, you cannot change the format; attempts to do this are silently ignored by the SQLite kernel.

So,
SQLite database string format can be one of:
- UTF-8 (used by default);
- UTF-16le (native to x86);
- UTF-16be
and you cannot change it after creating a database.

Notes

1. Both UTF-8 and UTF-16 (see “surrogate pairs”) use a variable (not fixed) number of bytes to store one character.

2. ATTACH to the database, you can only base with the same row format, otherwise there will be an error.

3. From your version of SQLite, support for UTF-16 may have been “removed” during the build (see sqlite3.c for SQLITE_OMIT_UTF16 ).

Passing Strings to API Calls

SQLite API calls (in C language) are divided into two types: receiving strings in UTF-16 format (byte order "native" for the platform) and receiving strings in UTF-8 format. For example:

int sqlite3_prepare_v2( sqlite3 *db, /* Database handle */ const char *zSql, /* SQL statement, UTF-8 encoded */ int nByte, /* Maximum length of zSql in bytes. */ sqlite3_stmt **ppStmt, /* OUT: Statement handle */ const char **pzTail /* OUT: Pointer to unused portion of zSql */ ); int sqlite3_prepare16_v2( sqlite3 *db, /* Database handle */ const void *zSql, /* SQL statement, UTF-16 encoded */ int nByte, /* Maximum length of zSql in bytes. */ sqlite3_stmt **ppStmt, /* OUT: Statement handle */ const void **pzTail /* OUT: Pointer to unused portion of zSql */ );

If the format of the base string does not match the format of the string that was passed to the API, then the transferred string will be encoded on the fly into the base format. This process is bad because it takes resources and, of course, it should be avoided. However, he is not only bad because of this, see below.

Collation: a way to compare strings

The next topic is related to the ordering of the lines relative to each other (ascending or descending) and the answer to the question "are these two lines equal". There is no “natural” way to compare two strings. At a minimum, the question arises: case-insensitive or case-insensitive comparison? Moreover, in the same database, both such comparisons can be used for different fields.

In SQLite (and in any database), the notion of collation is introduced: a way to compare two strings to each other. There are standard (built-in) collation and you can create your own in any quantity.

Collation is, in essence, a method that receives strings A and B and returns one of three results:
"String A is less than string B",
"Lines A and B are equal",
"String A is greater than string B".

But that's not all. String comparisons must be transitive , otherwise it will “break” the search mechanisms that come from certain assumptions. More strictly: for all lines A, B and C it must be guaranteed that:
1. If A == B, then B == A.
2. If A == B and B == C, then A == C.
3. If A <B then B> A.
4 If A <B and B <C, then A <C.
If this is not the case (that is, you created a collation that violates some of the rules and use it), then “the behavior of SQLite is undefined” (“SQLite behavior is not defined”).

Usually collation is set for a table column and determines what type of comparison is used for the values of this column.

And when do you even need to use string comparison?
1. to create and update an index by string values;
2. in the process of calculating SQL expressions with string values with operations "=", "<", ">"
( ... WHERE name = 'Alice') .

So,
when SQLite needs to compare two strings, this comparison is always based on some collation.

If the collation of the compared strings do not match, then a clever mechanism of preference of one collation to another is used, which is better not to run up.

Standard (built-in) collation

Full support for comparing any UNICODE characters (case insensitive) requires quite a lot of additional data (tables). The SQLite developers did not “inflate” the kernel and built in the simplest methods of comparison.

There are three built-in collations:
BINARY : the usual byte comparison of two blocks of memory: good old memcmp ()
(used by default if no other collation is specified);
RTRIM : the same as BINARY, but ignores trailing spaces ('abc' = 'abc');
NOCASE : the same as BINARY, but ignores case for 26 letters of the Latin alphabet (and only for them).

Implementation of standard collation

Let's look “inside” SQLite and look at the implementation of comparison functions for different collation.

The function binCollFunc () . If padFlag <> 0 , it makes a comparison RTRIM , otherwise BINARY :

 /* ** Return true if the buffer z[0..n-1] contains all spaces. */ static int allSpaces(const char *z, int n){ while( n>0 && z[n-1]==' ' ){ n--; } return n==0; } /* ** This is the default collating function named "BINARY" which is always ** available. ** ** If the padFlag argument is not NULL then space padding at the end ** of strings is ignored. This implements the RTRIM collation. */ static int binCollFunc( void *padFlag, int nKey1, const void *pKey1, int nKey2, const void *pKey2 ){ int rc, n; n = nKey1<nKey2 ? nKey1 : nKey2; rc = memcmp(pKey1, pKey2, n); if( rc==0 ){ if( padFlag && allSpaces(((char*)pKey1)+n, nKey1-n) && allSpaces(((char*)pKey2)+n, nKey2-n) ){ /* Leave rc unchanged at 0 */ }else{ rc = nKey1 - nKey2; } } return rc; }

And here is the string comparison function for collation NOCASE :

 /* ** ** IMPLEMENTATION-OF: R-30243-02494 The sqlite3_stricmp() and ** sqlite3_strnicmp() APIs allow applications and extensions to compare ** the contents of two buffers containing UTF-8 strings in a ** case-independent fashion, using the same definition of "case ** independence" that SQLite uses internally when comparing identifiers. */ SQLITE_API int sqlite3_stricmp(const char *zLeft, const char *zRight){ register unsigned char *a, *b; a = (unsigned char *)zLeft; b = (unsigned char *)zRight; while( *a!=0 && UpperToLower[*a]==UpperToLower[*b]){ a++; b++; } return UpperToLower[*a] - UpperToLower[*b]; } SQLITE_API int sqlite3_strnicmp(const char *zLeft, const char *zRight, int N){ register unsigned char *a, *b; a = (unsigned char *)zLeft; b = (unsigned char *)zRight; while( N-- > 0 && *a!=0 && UpperToLower[*a]==UpperToLower[*b]){ a++; b++; } return N<0 ? 0 : UpperToLower[*a] - UpperToLower[*b]; } // UpperToLower -  ,    " "     'A'..'Z' (   'a'..'z')

What attracts attention? That string is supposedly in UTF-8 format, but all the characters in the string are processed in a row: there is no extraction and transcoding of characters in UTF-32.

If you meditate on this code for a long time, you can understand that, oddly enough, it works correctly on both UTF-8 strings and on any other single-character encoding (for example, windows-1251 ). Revelations in the understanding of "why so" can be divided in the comments :).

This smoothly brings us to the understanding of the important thesis that

SQLite is not interested in the actual format of the UTF-8 string until the time when the transcoding of the string into UTF-16 is required .

Of course, ordering real UTF-8 strings with standard collation will give rather strange results. But equality will work correctly, and the comparison will be transitive and will not disrupt SQLite.

It turns out, indeed, in the SQLite database you can store strings encoded in, say, windows-1251 , provided that you are not using UTF-16 anywhere. This also applies to string literals inside SQL, as well as strings passed through parameter bindings.

Arguments in favor of the format UTF-8 as the format of "default"

Let's take a look at the code of the sqlite3Prepare16 () API function that performs the parsing and compilation of the SQL statement. We are interested in the comment at the beginning of the body function.

 /* ** Compile the UTF-16 encoded SQL statement zSql into a statement handle. */ static int sqlite3Prepare16( sqlite3 *db, /* Database handle. */ const void *zSql, /* UTF-16 encoded SQL statement. */ int nBytes, /* Length of zSql in bytes. */ int saveSqlFlag, /* True to save SQL text into the sqlite3_stmt */ sqlite3_stmt **ppStmt, /* OUT: A pointer to the prepared statement */ const void **pzTail /* OUT: End of parsed string */ ){ /* This function currently works by first transforming the UTF-16 ** encoded string to UTF-8, then invoking sqlite3_prepare(). The ** tricky bit is figuring out the pointer to return in *pzTail. */ ... //  }

I.e,

A parser of SQL statements in UTF-16 format does not currently exist in the SQLite kernel (which does not exclude its appearance in the future) .

So, if a string with a SQL statement is transmitted in UTF-16 format, it is always first converted to UTF-8 format.

Thus, in favor of the UTF-8 format:
- there are no unnecessary conversions when compiling SQL;
- data (as a rule) occupy less disk space;
- it is possible to store data in any “byte-by-character” encoding if UTF-16 is not used anywhere (and written collations take into account the new string format).

How to create and use your own collation

Use the API function sqlite3_create_collation_v2 () :

  int sqlite3_create_collation_v2( sqlite3*, //    const char *zName, //  int eTextRep, //      void *pArg, // custom- int(*xCompare)(void*,int,const void*,int,const void*), //    void(*xDestroy)(void*) );

In the eTextRep parameter, you need to specify in which format the lines are expected:

  SQLITE_UTF8 = 1; SQLITE_UTF16 = 2; SQLITE_UTF16BE = 3; SQLITE_UTF16LE = 4; SQLITE_ANY = 5;

You can set several functions for the same collation, but accepting different formats.

SQLite tries to choose a method for the format, otherwise (if the formats of the transmitted string and the registered function are different), the encoding is performed again on the fly. The comparison function should return a negative number if the first line is less than the second; zero - if the lines are equal; positive number if the first line is greater than the second. The comparison, as already mentioned, must be transitive.

Create a collation (named "RU"), which will give us:
- case-insensitive comparison for Cyrillic and Latin
- the usual comparison for all other characters;
- the correct position in the alphabet of the letter "" (more precisely, "" is considered to be equal to "e").

This, so far, is not a full-fledged support for UNICODE, but this is a simple solution that suits 95% of cases.

Examples will be on Delphi, do not be alarmed.

 unit UnicodeUnit; interface //        UTF-8    UTF-32   -1,     function GetCharUTF8(var P: PByte; var L: integer): integer; //    UTF-32    (    !) function LowerUTF32(ACode: integer): integer; //   UTF-8  (case-insensitive    ) function CompareUTF8IgnoreCase(L1: integer; P1: PByte; L2: integer; P2: PByte): integer; implementation uses SysUtils; //    UTF-8 type TParseItem = record Mask: integer; Count: integer; end; var //    UTF-8 ParseTable: array[0..255] of TParseItem; //      LowerTable: array[0..65535] of integer; function CompareUTF8IgnoreCase(L1: integer; P1: PByte; L2: integer; P2: PByte): integer; var C1, C2: integer; begin repeat if (L1 = 0) and (L2 = 0) then begin result := 0; exit; end; //     C1 := GetCharUTF8(P1, L1); if C1 < 0 then begin result := -1; exit; end; C2 := GetCharUTF8(P2, L2); if C2 < 0 then begin result := +1; exit; end; //         if C1 < 65536 then C1 := LowerTable[C1]; if C2 < 65536 then C2 := LowerTable[C2]; if C1 < C2 then begin result := -1; exit; end else if C1 > C2 then begin result := +1; exit; end; until false; end; function LowerUTF32(ACode: integer): integer; begin case ACode of 1105, 1025: // ,  result := 1077; //  1040..1071: result := ACode + 32; //  Ord('A')..Ord('Z'): result := ACode + 32; //  else result := ACode; end; end; function GetCharUTF8(var P: PByte; var L: integer): integer; var B: byte; I: integer; begin if L > 0 then begin B := P^; Inc(P); Dec(L); with ParseTable[B] do if Count > 0 then begin //      result := B and Mask; //      for I := 1 to Count - 1 do if L > 0 then begin result := (result shl 6) or (P^ and $3F); Inc(P); Dec(L); end else begin result := -1; exit; end; exit; end; end; result := -1; end; var I: integer; initialization //    UTF-8 for I := 0 to 255 do with ParseTable[I] do if I <= 127 then begin Mask := $7F; Count := 1; end else if I >= 248 then begin Mask := 0; Count := 0; end else if I >= 240 then begin Mask := 7; Count := 4; end else if I >= 224 then begin Mask := 15; Count := 3; end else if I >= 192 then begin Mask := $1F; Count := 2; end else begin Mask := 0; Count := 0; end; //  Lower for I := 0 to 65535 do LowerTable[I] := LowerUTF32(I); end.

The code, in general, is simple and probably does not need additional explanations.

How to use:

 //       function RU_CollationCompare(UserData: pointer; l1: integer; p1: pointer; l2: integer; p2: pointer): integer; cdecl; begin result := CompareUTF8IgnoreCase(L1, P1, L2, P2); end; ... //     collation RU sqlite3_create_collation(Fdb, 'RU', SQLITE_UTF8, nil, RU_CollationCompare);

When creating a table, set the comparison type for the name column:

 CREATE TABLE foo( name TEXT COLLATE RU, ... )

Important! Registration of collation is performed in conjunction with the database (not in the database itself). This is, in essence, attaching your code to the SQLite code. Another connection that did not specify the same collation will not be able to use this base.

LIKE, upper (), lower (), etc.

The collation code “RU” is, of course, easy to modify to support other alphabets. And you can make a (almost) full-fledged UNICODE comparison if you use the Windows API calls to populate the LowerTable table.

Why "almost"? There is such a thing as “normalization” of UNICODE, that is, a reduction to an unambiguous representation. Google to the rescue!

However, UNICODE support is not only in collation alone. There is still:
- operator LIKE (pattern matching);
- SQL functions lower () and upper () (translate characters of the string to lower / upper case).

And some other string manipulation functions: fold (), title () , etc.

These are separate mechanisms that do not use collation.

SQLite, however, allows you to override these functions (LIKE is also a function) with your own.

To do this, use the sqlite3_create_function_v2 () API call.

UNICODE almost full of little blood

In previous articles, ICU: International Components for Unicode libraries were mentioned. This is full UNICODE support. The trouble, however, is that it drags a huge amount of code and data, which in 95% of cases is not needed. If your SQLite already has a built-in ICU, then you can not read further.

So, there was one clever guy who sawed out of this code is not necessary and created a kind of "residue" from the ICU.

His original message apparently is this .

This is about that. Based on the ICU code, he created the file sqlite3_unicode.c , which is compiled into a DLL (usually it is sqlite3u.dll ). The resulting DLL exports the sqlite3_unicode_init () function:

 function sqlite3_unicode_init(db: TSQLiteDB): Integer; cdecl; external 'sqlite3u.dll' name 'sqlite3_unicode_init';

If you call this function for a connection, then it:
- register almost full-fledged UNICODE-versions of the functions lower, upper, fold, title, unaccent;
- will introduce almost full UNICODE case-insensitive LIKE.

The size of this DLL is only 80 Kb and it works quickly (tables are used). The reservation “almost” is important - this is not a full-fledged UNICODE, but in 95% of cases this library will suffice.

Notes.

1. if LIKE is overridden, then SQLite will not be able to optimize it with an index (A LIKE 'XXX%' will not use an index with A, if it exists).

2. functions lower (), upper (), etc., generally speaking, are not required to be in the base engine, it is possible to perform these transformations in the application code.

Yuz zis laybrari et er oyun risk , that is, the author of this article is never responsible for anything.

Source: https://habr.com/ru/post/150543/

All Articles