Uppercase and lowercase letters

I have collected here some not-so-obvious facts about uppercase and lowercase letters that a programmer may run into at work. Many of you have converted strings to all capitals (uppercase), all small letters (lowercase), or first letter capital and the rest small (titlecase). Case-insensitive comparison is even more common. Across the world's languages, these operations can be quite non-trivial. The post is organized as a collection of misconceptions, each followed by a counterexample.

1. If I convert a string to uppercase or lowercase, the number of Unicode characters will not change.

No. The text may contain lowercase ligatures that have no single-character uppercase counterpart. For example, when converting to uppercase: ﬁ (U+FB01) -> FI (U+0046, U+0049).
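A minimal Python sketch of this counterexample, assuming an interpreter whose `str.upper()` applies Unicode full case mappings (CPython 3 does); the variable names are only illustrative:

```python
ligature = "\ufb01"           # ﬁ, LATIN SMALL LIGATURE FI
shouted = ligature.upper()    # full case mapping expands the ligature

print(len(ligature), len(shouted))              # 1 2
print([f"U+{ord(ch):04X}" for ch in shouted])   # ['U+0046', 'U+0049']
```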

2. Ligatures are a perversion, nobody uses them. Leave them out and I am right.

No. Some letters with diacritics have no exact counterpart in the other case, so a sequence of characters has to be used instead. For example, Afrikaans has the letter ŉ (U+0149). Its uppercase form is a sequence of two characters: ʼN (U+02BC, U+004E). In transliterated Arabic text you may encounter ẖ (U+1E96), which also has no single-character uppercase form, so it has to be replaced with H̱ (U+0048, U+0331). The Wakhi language has the letter ǰ (U+01F0), with a similar problem. You may argue that this is exotic, but Wikipedia has 23,000 articles in Afrikaans.

3. Fine, but let's count a combined character (a base code point plus modifier or combining code points) as one character. Then the length is still preserved.

No. German, for example, has the letter eszett, ß (U+00DF). Converted to uppercase, it turns into two characters, SS (U+0053, U+0053).
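To make the difference concrete, here is a rough sketch that counts only code points that are not combining marks. The helper `count_base_characters` is a name I made up for illustration, and it is a crude stand-in for "one symbol", not real grapheme cluster segmentation:

```python
import unicodedata

def count_base_characters(text: str) -> int:
    """Count code points whose canonical combining class is 0."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

h_line_below = "\u1e96"   # ẖ, uppercases to H + COMBINING MACRON BELOW
eszett = "\u00df"         # ß, uppercases to SS

# The combining mark is not counted, so ẖ keeps its "length".
print(count_base_characters(h_line_below), count_base_characters(h_line_below.upper()))  # 1 1
# Both letters of SS are base characters, so ß still grows.
print(count_base_characters(eszett), count_base_characters(eszett.upper()))              # 1 2
```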

4. Okay, okay, got it. Let's assume the number of Unicode characters can grow, but by no more than a factor of two.

No. There are specific Greek letters, for example ΐ (U+0390), that turn into three Unicode characters: Ϊ́ (U+0399, U+0308, U+0301).
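One way to check how far a single character can grow is to brute-force every code point. This is only a sketch: it relies on Python's `str.upper()` applying full case mappings, and the answer reflects the Unicode data bundled with the interpreter you run it on. The name `longest_expansion` is mine.

```python
import sys

# Uppercase every code point and record the longest result.
longest_expansion = max(
    len(chr(code_point).upper()) for code_point in range(sys.maxunicode + 1)
)

iota = "\u0390"   # ΐ, GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
print([f"U+{ord(ch):04X}" for ch in iota.upper()])   # ['U+0399', 'U+0308', 'U+0301']
print(longest_expansion)
```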

5. Let's talk about titlecase. Everything is simple here: take the first character of the word and convert it to uppercase, then take all the following characters and convert them to lowercase.

No. Remember those ligatures. If a lowercase word begins with ﬂ (U+FB02), then in uppercase the ligature turns into FL (U+0046, U+004C), but in titlecase it turns into Fl (U+0046, U+006C). The same goes for ß, although in theory no word can begin with it.

6. Those ligatures again! Fine: take the first character of the word and convert it to uppercase; if that yields more than one character, keep the first one and convert the rest back to lowercase. Then it will definitely work.

It will not. There is, for example, the digraph ǳ (U+01F3), which can occur in Polish, Slovak, Macedonian or Hungarian text. Its uppercase form is the digraph Ǳ (U+01F1), and its titlecase form is the digraph ǲ (U+01F2). There are other such digraphs. Greek will delight you with the quirks of the ypogegrammeni and prosgegrammeni (fortunately, these are rare in modern texts). In general, the uppercase and titlecase forms of a character can differ, and the Unicode standard has separate entries for them.
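The two naive recipes from misconceptions 5 and 6 can be sketched and compared with Python's built-in `str.title()`, which uses the separate titlecase mappings. The function names `naive_title` and `patched_title` are hypothetical, written only to mirror the wording above:

```python
def naive_title(word: str) -> str:
    """Misconception 5: uppercase the first character, lowercase the rest."""
    return word[:1].upper() + word[1:].lower()

def patched_title(word: str) -> str:
    """Misconception 6: if the first character expands, re-lowercase its tail."""
    head = word[:1].upper()
    return head[:1] + head[1:].lower() + word[1:].lower()

ligature_word = "\ufb02ower"   # starts with the ligature ﬂ (U+FB02)
digraph_word = "\u01f3en"      # starts with the digraph ǳ (U+01F3)

print(naive_title(ligature_word), ligature_word.title())    # FLower Flower
print(patched_title(ligature_word), ligature_word.title())  # Flower Flower
# The patched recipe yields U+01F1 (Ǳ), while title() yields U+01F2 (ǲ).
print(patched_title(digraph_word) == digraph_word.title())  # False
```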

7. Fine, but at least the result of converting a character to uppercase or lowercase does not depend on its position in the word.

No. For example, the Greek capital sigma Σ (U+03A3) becomes the lowercase ς (U+03C2) at the end of a word and σ (U+03C3) in the middle.
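A short sketch of the sigma rule, assuming CPython 3, whose `str.lower()` looks at the surrounding letters when it meets Σ. Lowercasing character by character loses that context:

```python
word = "\u03a3\u039f\u03a6\u039f\u03a3"   # ΣΟΦΟΣ

whole = word.lower()                           # sees each sigma in context
piecewise = "".join(ch.lower() for ch in word) # each sigma lowered in isolation

print(whole)      # σοφος  (ends with final sigma, U+03C2)
print(piecewise)  # σοφοσ  (ends with ordinary sigma, U+03C3)
```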

8. Oh, okay, we will special-case the Greek sigma. But in any case, the same character in the same position in the text is always converted the same way.

No. For example, in most languages written in the Latin script the lowercase form of I (U+0049) is i (U+0069), but not in Turkish and Azeri. There the lowercase form of I is ı (U+0131), and the uppercase form of i is İ (U+0130). Because of this, all sorts of software in Turkey occasionally exhibits spectacular bugs. And if you get Lithuanian text with accent marks, you may meet, for example, the capital Ì (U+00CC), which turns not into ì (U+00EC) but into i̇̀ (U+0069, U+0307, U+0300). In general, the result of the conversion also depends on the language. Most of the complex cases are described in the Unicode standard.
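Python's string methods are language-independent, so the Turkish rule has to be applied by hand. The sketch below uses two hypothetical helpers, `turkish_lower` and `turkish_upper`, that remap the dotted and dotless i before calling the default conversion. It is an assumption-laden illustration that covers only these two letter pairs (it ignores, for instance, I followed by a combining dot above); real code would more likely rely on a locale-aware library such as ICU.

```python
# Pre-translation tables for the Turkish/Azeri dotted and dotless i.
_TR_LOWER = {ord("I"): "\u0131", ord("\u0130"): "i"}   # I -> ı, İ -> i
_TR_UPPER = {ord("i"): "\u0130", ord("\u0131"): "I"}   # i -> İ, ı -> I

def turkish_lower(text: str) -> str:
    """Lowercase with the Turkish i rule applied first (illustrative only)."""
    return text.translate(_TR_LOWER).lower()

def turkish_upper(text: str) -> str:
    """Uppercase with the Turkish i rule applied first (illustrative only)."""
    return text.translate(_TR_UPPER).upper()

print("ISPARTA".lower(), turkish_lower("ISPARTA"))   # isparta ısparta
print("izmir".upper(), turkish_upper("izmir"))       # IZMIR İZMİR
```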

9. What a mess! Well, suppose we can now convert to uppercase and lowercase correctly. Comparing two words case-insensitively is then no problem: convert both to lowercase and compare.

Here, too, there are plenty of pitfalls that follow from the above. For example, it will not work for the German straße and STRASSE (the first stays unchanged, the second becomes strasse). Many of the other letters described above cause problems as well.

10. Hmmm... Maybe convert everything to uppercase instead?

That does not always work either (though it works more often). Say you come across the spelling STRAẞE (yes, there is a capital eszett, ẞ (U+1E9E), both in German and in Unicode): it will not match straße. For comparisons, letters are converted using a special Unicode table, CaseFolding, under which both ß and SS turn into ss.
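In Python the CaseFolding table is exposed as `str.casefold()`. Here is a small sketch comparing the three spellings; the helper `same_ignoring_case` is a name of my own, and it does no Unicode normalization, which a real comparison would probably also want:

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings via Unicode case folding (no normalization)."""
    return a.casefold() == b.casefold()

small = "stra\u00dfe"      # straße
capitals = "STRASSE"
capital_eszett = "STRA\u1e9eE"   # STRAẞE, with LATIN CAPITAL LETTER SHARP S

print(small.lower() == capitals.lower())             # False: straße vs strasse
print(small.upper() == capital_eszett.upper())       # False: STRASSE vs STRAẞE
print(same_ignoring_case(small, capitals))           # True
print(same_ignoring_case(small, capital_eszett))     # True
```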

11. Aaargh, this is a complete nightmare!

Here I agree.

If any characters do not display for you, send me a private message and I will replace them with a picture.

Source: https://habr.com/ru/post/147387/
