And what difference does Collation choose?

This article was prepared for students of the course "MS SQL Server Developer"
I want to share a story from one of the previous projects, which illustrates that Collation must be chosen very thoughtfully. And about what happens if this parameter is nevertheless chosen incorrectly, and what options exist for solving the problem.

First, a small introduction on what Collation is. In SQL Server, the Collation parameter tells the server how to sort and compare rows. For example, the lines “Apple” and “apple”. Are they different or not? It depends on the specified Collation. If the register is becoming less or less clear, then what to do with the example of “Christmas tree” and “Christmas tree”? Count them as the same or as different? This is all in Collation too.

The story happened in a project whose functionality is very similar to DropBox or Google Drive. It provides the ability to manage their synchronized folders and files on different machines, as well as the ability for other users to have access to this synchronized folder.

So, the story began with the fact that on Prod servers there were 75-90% of errors in the logs (see the screenshot below), and it is not clear where they came from, and what was their reason. The error was: “ReadWrtLst is not complete”. Next came the user's details and their folders.

The code quickly found a place that generated an error, but we couldn’t understand why it occurred and how to reproduce it. It was only clear that the error was somehow related to the fact that the user somehow managed to make another folder with the same name in your OS.

We have collected information on users for whom this error is issued. And here the first surprise was waiting for us: out of millions of users of the system, only 50 had this error. And these 50 users generate 90% of the error logs. Since the situation could not be reproduced, we decided to contact one of the users and find out why one of the folders was not synchronized with it. The folder looked the same to us as the others, the only difference was that it was called in the user's language using hieroglyphs. And the user was Japanese. By the way, among these 50 users, the Japanese were the majority.

Thanks to one of the developers of the team, we were able to reproduce the error. The error was that the operating system considered the folder names different, and SQL Server considered them the same because of the selected Collation.

Collation used in the project:
SQL_Latin1_General_CP1_CI_AS

A small digression on how to read Collation. (If you are familiar with it, feel free to skip it.)

So, there are several parts to Collation:

SQL - sorting options by SQL Server (SQL at the beginning of Collation) or Windows (then it would be just Latin1_ ...);
Latin1_General - locale or language used;
CP1 - code page - code page;
CI - Case Insensitive - case insensitive;
AS - Accent Sensitive - taking into account axons or diacritics, in other words, 'a' is not considered equal to 'ấ'.

This Collation was once the default Collation when installing SQL Server.

What options are there?

_KS - taking into account the Japanese characters of hiragana and katakana, if the option is not selected, then SQL Server will interpret the hieroglyphs of hiragana and katakana as the same.
_WS - taking into account the width of characters, if the parameter is not selected, then “Text” and “T ext” are considered the same lines.
_VSS - taking into account the characters for choosing the spelling option in Japanese, appeared from version 2017.
_UTF8 - allows you to store data in UTF8.

All text fields in the database used type NVARCHAR.
It turns out that, since the current Collation ignored the difference in the spelling of Japanese characters and the difference in character widths, SQL Server did not compare strings in the same way as the operating system did, which caused the problem, i.e. the user could create folders, could not add them to the system for synchronization. The same thing would happen later when comparing file names.

We began to think about how to solve this problem and change Collation.

Collation can be set at several levels:

SQL Server Instance
Database
Table
Field

At the same time, it is not recommended to have different Collation inside the database, because each time when comparing rows with different Collation, you will need to do the conversion using COLLATE, indicating to the server which comparison order it should use.

What options are there in a situation where it is clear that Collation is not selected very well?

Change Collation at DB level;
Change Collation at the field level (in our case there was no point in changing for the entire table);
Add the Varbinary field, into which to write a duplicate from the field with the name of the folder, and use it for comparison;
Tell users that there are restrictions on the support of characters in directory names.

The first option - changing Collation at the database level - is the most difficult. In the case of the database, it would be necessary to recreate the database and reload the data there. Since the system worked 24/7, this option was rejected immediately.

The second option about changing the field: the easiest option to implement it is to add a field with the desired Collation and transfer the data there. But then it will be necessary to change the code in the database that works with this field, and there was a lot of code in the database.

We liked the third option the most, since in theory it made the least changes, since the main field would continue to exist with the current Collation, and we would not have problems with its conversion, while all the necessary functionality in the form of accounting for the Japanese alphabet or wide characters would work. The downside is that it was necessary to make changes to the software part, but since this server part, this could be done.

The fourth option was the simplest in this case, because the total number of users was several million, and only 50 had a problem. However, if the application was actively used in Japan, this solution would be of little use.

After presenting the data to the management, it was decided to notify users that the software does not support a number of characters, and when used in the name of synchronized files and folders, the software may not work correctly. This is a temporary solution, because with further distribution, the number of users facing a similar problem will increase, and it will be necessary to change something using the first three options.

The best option for choosing Collation is based on your application requirements. If you want SQL Server to compare strings in the same way as the OS, then Collation is definitely incorrect by default. Unfortunately, such nuances are rarely visible at the start of a project when designing a system, but I hope that after reading the article you will remember the situation described and do not step on such a rake yourself.

Useful Collation Resources:
https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017
https://www.red-gate.com/simple-talk/sql/sql-development/questions-sql-server-collations-shy-ask/
https://www.virtual-dba.com/sql-server-collation/

Source: https://habr.com/ru/post/461231/

All Articles

And what difference does Collation choose?

More articles: