📜 ⬆️ ⬇️

Another implementation of case-insensitive search for Cyrillic characters in SQLite

Good day, Habrovchane!

The problem of searching for Russian characters in SQLite has long been a talk of the town, the reasons for its appearance are described in some detail here . However, there are quite common solutions to this problem, the most popular of which is to connect ICU, a library, with which you can implement a full Unicode search. But I wanted a shorter solution code in terms of the result that the search was:


As a result, the idea came to dig into the source code of SQLite and fix it a little. The toolkit was as follows: the System.Data.Sqlite assembly in which the SQLite Amalgamation source code is included (where all the kernel source code is stored in one file), Visual Studio 2017 and the test database.
')
To compare strings in the database core there is a function patternCompare. As you know, it implements the nocase comparison only for the first 128 UTF-8 characters, for which there are corresponding checks in the code. The idea was to write another small block of code that checks whether the character is Cyrillic. If you look at the UTF-8 table, then for familiar characters to the heart, positions from 0x400 to 0x45F are highlighted. Now we will create a test for the Cyrillic character and upper and lower case translation tables. Some code.

#define IsCyrillic(A)(A >= 0x400 && A <= 0x45F) static const unsigned short CyrillicUpper[128] = { 1024, 1045, 1026, 1027, 1028, 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036, 1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071, 1104, 1045, 1106, 1107, 1108, 1109, 1110, 1111, 1112, 1113, 1114, 1115, 1116, 1117, 1118, 1119, 1120, 1121, 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130, 1131, 1132, 1133, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145, 1146, 1147, 1148, 1149, 1150, 1151 }; static const unsigned short CyrillicLower[128] = { 1024, 1177, 1026, 1027, 1028, 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036, 1037, 1038, 1039, 1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087, 1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103, 1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087, 1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103, 1104, 1177, 1106, 1107, 1108, 1109, 1110, 1111, 1112, 1113, 1114, 1115, 1116, 1117, 1118, 1119, 1120, 1121, 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130, 1131, 1132, 1133, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145, 1146, 1147, 1148, 1149, 1150, 1151 }; #define CyrillicToUpper(ch)(CyrillicUpper[(unsigned char)(ch&~0x400)]) #define CyrillicToLower(ch)(CyrillicLower[(unsigned char)(ch&~0x400)]) 

Things are easy, look at all the checks that the character is less than 0x80 and add another check for Cyrillicity. This happens in two places, when checking the first character and all the rest. I will not give the whole function of patternCompare, only these two places (already carefully corrected).

The first block of code:

  if( c<=0x80 ){ u32 cx; int bMatch; if( noCase ){ cx = sqlite3Toupper(c); c = sqlite3Tolower(c); }else{ cx = c; } while( (c2 = *(zString++))!=0 ){ if( c2!=c && c2!=cx ) continue; bMatch = patternCompare(zPattern,zString,pInfo,matchOther); if( bMatch!=SQLITE_NOMATCH ) return bMatch; } } else if (noCase && IsCyrillic(c)) { u32 cx; int bMatch; if (noCase) { cx = CyrillicToUpper(c); c = CyrillicToLower(c); } else { cx = c; } while ((c2 = Utf8Read(zString)) != 0) { if (c2 != c && c2 != cx) continue; bMatch = patternCompare(zPattern, zString, pInfo, matchOther); if (bMatch != SQLITE_NOMATCH) return bMatch; } } 

The second block of code:

  if( noCase && sqlite3Tolower(c)==sqlite3Tolower(c2) && c<0x80 && c2<0x80 ){ continue; } if (noCase && CyrillicToLower(c) == CyrillicToLower(c2) && IsCyrillic(c) && IsCyrillic(c2)) { continue; } 


The case-insensitive search for Russian and English text now works properly for the WHERE LIKE '% xxx%' Pattern but will not work for WHERE LIKE 'xxx%' if the field is indexed by the standard NOCASE function. So if you want the indexed search to work too, you have to tinker a little more. Let's look deeper into the library code and see that the sqlite3_strnicmp function is responsible for creating the index with c collation NOCASE. In general, there are two of them, one with the parameter responsible for checking the length of the text, the other without. We need the first. But only the text in it, unlike patternCompare, is not decoded from UTF-8, so we realize one more test that the first byte is the highest for the Cyrillic character encoded in UTF-8, and also we realize decoding of the Cyrillic character from UTF- 8 with converting it to lower case.

 static unsigned char IsCyrillcByte(unsigned char b) { return !((b >> 1) ^ 0x68); } static unsigned short ReadLowerCyrillic(const unsigned char* in) { unsigned char b1, b2; b1 = *in; in++; b2 = *in; return CyrillicLower[(unsigned char)(b1 << 6) | (b2 &~ 0x80)]; } 


Unfortunately, the sqlite3_strnicmp function had to be rewritten, since it was largely optimized for ASCII, now it looks like this:

 SQLITE_API int sqlite3_strnicmp(const char *zLeft, const char *zRight, int N){ register unsigned char *a, *b; unsigned short cLeft, cRight; if( zLeft==0 ){ return zRight ? -1 : 0; }else if( zRight==0 ){ return 1; } a = (unsigned char *)zLeft; b = (unsigned char *)zRight; for(;N > 0; N--) { if (*a <= 0x80 && *b <= 0x80) { if (*a != 0 && UpperToLower[*a] == UpperToLower[*b]) { a++; b++; } else { return UpperToLower[*a] - UpperToLower[*b]; } } else if (IsCyrillcByte(*a) && IsCyrillcByte(*b)) { cLeft = ReadLowerCyrillic(a); cRight = ReadLowerCyrillic(b); if (cLeft == cRight) { a += 2; b += 2; } else { return cLeft - cRight; } } else if (*a == *b) { a++; b++; } else { return *a - *b; } } return 0; } 


Everything, it is possible to water with sauce and to compile. It is also necessary at the end to perform a simple query to the database REINDEX NOCASE; to re-create all case-insensitive indexes.

Source: https://habr.com/ru/post/344478/


All Articles