UTF-8: Encoding and Decoding

I had to figure out how UTF-8 works and what Unicode is because VBScript has no built-in functions for working with UTF-8. Since I could not find a working implementation, I had to write (or finish) one myself. The experience is, in my opinion, useful in any case. To make things clearer, I'll start with the theory.

About Unicode

Before the advent of Unicode, 8-bit encodings were widely used. Their main disadvantages are obvious:

Only 256 characters, and some of those are not even printable;
A document can be opened in an encoding other than the one it was created in;
Fonts must be created for each encoding.

So it was decided to create a single “wide” encoding standard that would include all characters (at first only commonly used characters were meant to be included, but that changed and exotic ones were added as well). Unicode has 1,112,064 code positions, which is more than fits in 16 bits. The beginning duplicates ASCII, followed by the rest of the Latin alphabet, Cyrillic, and other European and Asian characters. Characters are denoted in hexadecimal notation of the form “U+xxxx” for the first 65k, and with more digits for the rest.
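As a quick check of these figures, here is a minimal Python sketch (Python is used for the added examples only because it is easy to run; the helper name `u_notation` is made up for illustration). It assumes the count of 1,112,064 comes from the 0x110000 positions up to U+10FFFF minus the 2,048 positions reserved for UTF-16 surrogates, and it formats a few code points in U+ notation.

```python
# The highest code point is U+10FFFF, so there are 0x110000 positions in total.
TOTAL_POSITIONS = 0x10FFFF + 1
# U+D800..U+DFFF are reserved for UTF-16 surrogates and never stand for characters.
SURROGATES = 0xDFFF - 0xD800 + 1


def u_notation(ch):
    """Hypothetical helper: format a character as U+xxxx (at least 4 hex digits)."""
    return "U+%04X" % ord(ch)


if __name__ == "__main__":
    print(TOTAL_POSITIONS - SURROGATES)  # 1112064
    # Latin A, Cyrillic Zhe, euro sign, and an emoji outside the first 65k.
    for ch in ("A", "\u0416", "\u20AC", "\U0001F600"):
        print(u_notation(ch))
```

The last character needs five hexadecimal digits, which is the "more digits for the rest" case.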

About UTF-8

I once thought that Unicode was one thing and UTF-8 was another. Later I found out that I was wrong.
UTF-8 is just a representation of Unicode in 8-bit form. Characters with codes below 128 are represented by one byte, and since Unicode repeats ASCII in that range, text written with only these characters is also ASCII text. Characters with codes from 128 are encoded with 2 bytes, from 2048 with 3, and from 65536 with 4. The same scheme could be extended up to 6 bytes, but nothing is encoded that way.

 0x00000000 - 0x0000007F: 0xxxxxxx
 0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx
 0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
 0x00010000 - 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
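The ranges in the table translate directly into a length check. The sketch below is in Python, and `utf8_length` is an illustrative name, not part of the article's code. It returns how many bytes a code point needs, and the loop compares the result with Python's built-in encoder for a few sample values.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 needs for a code point, following the table's ranges."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


# Cross-check against the built-in encoder: A, Cyrillic Zhe, euro sign, an emoji.
for cp in (0x41, 0x416, 0x20AC, 0x1F600):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```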

Encoding to UTF-8

The procedure is as follows:

Convert each character to its Unicode code point.
Determine which range the code point falls into.
If the code is less than 128, append it to the result unchanged.
If the code is less than 2048, take its low 6 bits and the remaining high 5 bits. Adding 0xC0 to the high 5 bits gives the first byte of the sequence; adding 0x80 to the low 6 bits gives the second. Concatenate them and append to the result.
Larger codes are handled the same way, but for characters beyond U+FFFF we would have to deal with UTF-16 surrogates.
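To make the bit arithmetic concrete, here is a small Python sketch of the two-byte and three-byte steps (the function names are illustrative). For the Cyrillic letter U+0416 the low 6 bits are 0x16 and the high 5 bits are 0x10, so the bytes are 0xC0 + 0x10 = 0xD0 and 0x80 + 0x16 = 0x96.

```python
def encode_two_bytes(code_point):
    """Encode a code point in the range 128..2047 as two UTF-8 bytes."""
    low6 = code_point & 0x3F   # low 6 bits go into the second byte
    high5 = code_point >> 6    # remaining high 5 bits go into the first byte
    return bytes([0xC0 + high5, 0x80 + low6])


def encode_three_bytes(code_point):
    """Encode a code point in the range 2048..65535 as three UTF-8 bytes."""
    low6 = code_point & 0x3F
    mid6 = (code_point >> 6) & 0x3F
    high4 = code_point >> 12
    return bytes([0xE0 + high4, 0x80 + mid6, 0x80 + low6])


if __name__ == "__main__":
    print(encode_two_bytes(0x416).hex())     # d096
    print(encode_three_bytes(0x20AC).hex())  # e282ac
```

The three-byte helper does not reject the surrogate range U+D800..U+DFFF; a real encoder has to treat those codes separately.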

    Function EncodeUTF8(s)
        Dim i, c, utfc, b1, b2, b3
        For i = 1 To Len(s)
            ' AscW returns a signed Integer, so convert it to a non-negative Long
            c = ToLong(AscW(Mid(s, i, 1)))
            If c < 128 Then
                ' ASCII range: one byte, unchanged
                utfc = Chr(c)
            ElseIf c < 2048 Then
                ' Two bytes: 110xxxxx 10xxxxxx
                b1 = c Mod &h40
                b2 = (c - b1) / &h40
                utfc = Chr(&hC0 + b2) & Chr(&h80 + b1)
            ElseIf c < 65536 And (c < 55296 Or c > 57343) Then
                ' Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
                b1 = c Mod &h40
                b2 = ((c - b1) / &h40) Mod &h40
                b3 = (c - b1 - (&h40 * b2)) / &h1000
                utfc = Chr(&hE0 + b3) & Chr(&h80 + b2) & Chr(&h80 + b1)
            Else
                ' High or low UTF-16 surrogate: emit U+FFFD (EF BF BD)
                utfc = Chr(&hEF) & Chr(&hBF) & Chr(&hBD)
            End If
            EncodeUTF8 = EncodeUTF8 & utfc
        Next
    End Function

    ' AscW returns negative values for codes above 32767; map them to 32768..65535
    Function ToLong(intVal)
        If intVal < 0 Then
            ToLong = CLng(intVal) + &H10000
        Else
            ToLong = CLng(intVal)
        End If
    End Function
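The function above replaces every surrogate with U+FFFD instead of encoding the character. As a hedged sketch of what supporting characters beyond U+FFFF could look like, the Python code below combines a high and a low surrogate into one code point using the standard UTF-16 formula, then applies the four-byte pattern from the table. The function names are illustrative and this is not part of the original VBScript.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def encode_four_bytes(code_point):
    """Encode a code point in the range 0x10000..0x10FFFF as four UTF-8 bytes."""
    return bytes([
        0xF0 + (code_point >> 18),
        0x80 + ((code_point >> 12) & 0x3F),
        0x80 + ((code_point >> 6) & 0x3F),
        0x80 + (code_point & 0x3F),
    ])


if __name__ == "__main__":
    cp = combine_surrogates(0xD83D, 0xDE00)
    print(hex(cp))                      # 0x1f600
    print(encode_four_bytes(cp).hex())  # f09f9880
```

In VBScript this would mean reading two characters at a time whenever the first one falls in the range 0xD800..0xDBFF.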

Decoding UTF-8

Find the first byte of the form 11xxxxxx.
Count all the bytes of the form 10xxxxxx that follow it.
If the sequence is two bytes long and the first byte has the form 110xxxxx, strip the prefixes and combine the remaining bits: multiply the first byte's bits by 0x40 and add the second byte's bits.
Longer sequences are handled the same way.
Replace the whole sequence with the resulting Unicode character.

    Function DecodeUTF8(s)
        Dim i, c, n, b1, b2, b3
        i = 1
        Do While i <= Len(s)
            c = Asc(Mid(s, i, 1))
            If (c And &hC0) = &hC0 Then
                ' Lead byte: count it plus the continuation bytes that follow
                n = 1
                Do While i + n <= Len(s)
                    If (Asc(Mid(s, i + n, 1)) And &hC0) <> &h80 Then
                        Exit Do
                    End If
                    n = n + 1
                Loop
                If n = 2 And ((c And &hE0) = &hC0) Then
                    ' Two bytes: 110xxxxx 10xxxxxx
                    b1 = Asc(Mid(s, i + 1, 1)) And &h3F
                    b2 = c And &h1F
                    c = b1 + b2 * &h40
                ElseIf n = 3 And ((c And &hF0) = &hE0) Then
                    ' Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
                    b1 = Asc(Mid(s, i + 2, 1)) And &h3F
                    b2 = Asc(Mid(s, i + 1, 1)) And &h3F
                    b3 = c And &h0F
                    c = b3 * &H1000 + b2 * &H40 + b1
                Else
                    ' Character above U+FFFF or an invalid sequence
                    c = &hFFFD
                End If
                ' Replace the n-byte sequence with a single character
                s = Left(s, i - 1) + ChrW(c) + Mid(s, i + n)
            ElseIf (c And &hC0) = &h80 Then
                ' Unexpected continuation byte
                s = Left(s, i - 1) + ChrW(&hFFFD) + Mid(s, i + 1)
            End If
            i = i + 1
        Loop
        DecodeUTF8 = s
    End Function
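For comparison, here is a rough Python sketch of the same decoding logic, working on a `bytes` object instead of a string of single-byte characters. The name `decode_utf8_bmp` is made up for illustration. Like the VBScript function, it replaces four-byte sequences and malformed input with U+FFFD, and it makes no attempt to reject overlong encodings or encoded surrogates.

```python
REPLACEMENT = "\uFFFD"


def decode_utf8_bmp(data):
    """Decode UTF-8 bytes, supporting only one- to three-byte sequences."""
    out = []
    i = 0
    while i < len(data):
        c = data[i]
        if (c & 0xC0) == 0xC0:
            # Lead byte: count it plus the continuation bytes that follow.
            n = 1
            while i + n < len(data) and (data[i + n] & 0xC0) == 0x80:
                n += 1
            if n == 2 and (c & 0xE0) == 0xC0:
                code = ((c & 0x1F) << 6) | (data[i + 1] & 0x3F)
                out.append(chr(code))
            elif n == 3 and (c & 0xF0) == 0xE0:
                code = (((c & 0x0F) << 12)
                        | ((data[i + 1] & 0x3F) << 6)
                        | (data[i + 2] & 0x3F))
                out.append(chr(code))
            else:
                # Character above U+FFFF or an invalid sequence.
                out.append(REPLACEMENT)
            i += n
        elif (c & 0xC0) == 0x80:
            # Unexpected continuation byte.
            out.append(REPLACEMENT)
            i += 1
        else:
            # Plain ASCII byte.
            out.append(chr(c))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(decode_utf8_bmp(b"\xd0\x96") == "\u0416")  # True
```

Building the result in a separate list avoids rewriting the input string on every replacement, which is the main structural difference from the VBScript version.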

UPD: Added handling of invalid sequences and fixed an error with the Integer type returned by AscW.

Source: https://habr.com/ru/post/138173/
