
UTF-8: Encoding and Decoding

I had to work out how UTF-8 works and what Unicode actually is because VBScript has no built-in functions for working with UTF-8. Since I could not find a working implementation, I had to write one myself. The experience is, in my opinion, useful in any case. For better understanding, I'll start with the theory.

About Unicode


Before the advent of Unicode, 8-bit encodings were widely used, the main disadvantages of which are obvious:

So it was decided to create a single standard for a “wide” encoding that would include all characters (at first only commonly used characters were to be included, but the plan later changed and exotic ones were added as well). Unicode uses 1,112,064 code positions, which is more than fits in 16 bits. The range begins with a copy of ASCII, followed by the rest of the Latin alphabet, Cyrillic, and other European and Asian characters. Characters are denoted in hexadecimal as “U+xxxx” for the first 65,536 positions, and with more digits for the rest.

About UTF-8


I once thought that Unicode was one thing and UTF-8 was another. Later I found out that I was wrong.
UTF-8 is just a representation of Unicode in 8-bit form. Characters with codes below 128 are represented by a single byte, and because these codes coincide with ASCII, text written using only these characters is also ASCII text. Characters with codes from 128 are encoded with 2 bytes, from 2048 with 3 bytes, and from 65536 with 4 bytes. The same scheme could be extended up to 6 bytes, but nothing is encoded that way.
 0x00000000 - 0x0000007F: 0xxxxxxx
 0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx
 0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
 0x00010000 - 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
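
To make the layout above concrete, here is a minimal sketch that splits a code point into bytes according to the table and returns them as a hex string. The names Utf8Hex and HexByte are made up for this illustration, and the function assumes it is given a valid code point (it does not reject surrogates).

```vbscript
' Sketch: bytes of one code point per the layout table, as a hex string.
' Utf8Hex and HexByte are illustrative names, not part of the article's code.
Function Utf8Hex(cp)
    Dim n, lead, i, rest, result
    If cp < 128 Then
        ' 0xxxxxxx: a single byte, same as ASCII
        Utf8Hex = HexByte(cp)
        Exit Function
    ElseIf cp < 2048 Then
        n = 1
        lead = &hC0   ' 110xxxxx
    ElseIf cp < 65536 Then
        n = 2
        lead = &hE0   ' 1110xxxx
    Else
        n = 3
        lead = &hF0   ' 11110xxx
    End If
    rest = CLng(cp)
    result = ""
    ' Peel off 6 bits at a time, lowest first; each becomes 10xxxxxx
    For i = 1 To n
        result = " " & HexByte(&h80 + (rest Mod &h40)) & result
        rest = rest \ &h40
    Next
    ' The remaining high bits go into the lead byte
    Utf8Hex = HexByte(lead + rest) & result
End Function

' Two-digit uppercase hex for a value in 0..255
Function HexByte(b)
    HexByte = Right("0" & Hex(b), 2)
End Function
```

For example, Utf8Hex(1046) (U+0416, the Cyrillic letter Ж) gives "D0 96", and Utf8Hex(8364) (U+20AC, the euro sign) gives "E2 82 AC".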


Encode in UTF-8


The procedure is as follows:

 Function EncodeUTF8(s)
     Dim i, c, utfc, b1, b2, b3
     For i = 1 To Len(s)
         c = ToLong(AscW(Mid(s, i, 1)))
         If c < 128 Then
             ' One byte, same as ASCII
             utfc = Chr(c)
         ElseIf c < 2048 Then
             ' Two bytes: 110xxxxx 10xxxxxx
             b1 = c Mod &h40
             b2 = (c - b1) / &h40
             utfc = Chr(&hC0 + b2) & Chr(&h80 + b1)
         ElseIf c < 65536 And (c < 55296 Or c > 57343) Then
             ' Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
             b1 = c Mod &h40
             b2 = ((c - b1) / &h40) Mod &h40
             b3 = (c - b1 - (&h40 * b2)) / &h1000
             utfc = Chr(&hE0 + b3) & Chr(&h80 + b2) & Chr(&h80 + b1)
         Else
             ' UTF-16 surrogate: emit U+FFFD (replacement character)
             utfc = Chr(&hEF) & Chr(&hBF) & Chr(&hBD)
         End If
         EncodeUTF8 = EncodeUTF8 + utfc
     Next
 End Function

 ' AscW returns a signed Integer; map negative values to 32768..65535
 Function ToLong(intVal)
     If intVal < 0 Then
         ToLong = CLng(intVal) + &H10000
     Else
         ToLong = CLng(intVal)
     End If
 End Function
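
A quick way to check the encoder is to dump its output as hex. The helper below is only a sketch and not part of the original code: BytesToHex is a hypothetical name, and it assumes a single-byte ANSI code page, where Asc returns the same value that was passed to Chr.

```vbscript
' Sketch: hex dump of a string that holds one byte per character.
' BytesToHex is an illustrative name; assumes a single-byte ANSI code page.
Function BytesToHex(s)
    Dim i, result
    result = ""
    For i = 1 To Len(s)
        If i > 1 Then result = result & " "
        ' Asc gives the ANSI byte value; pad it to two hex digits
        result = result & Right("0" & Hex(Asc(Mid(s, i, 1))), 2)
    Next
    BytesToHex = result
End Function

' With EncodeUTF8 from the listing above in the same script,
' the bytes of a single character can be inspected like this:
'   WScript.Echo BytesToHex(EncodeUTF8(ChrW(1046)))
```

The result can then be compared by hand against the byte layout table.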


Decode UTF-8



 Function DecodeUTF8(s)
     Dim i, c, n, b1, b2, b3
     i = 1
     Do While i <= Len(s)
         c = Asc(Mid(s, i, 1))
         If (c And &hC0) = &hC0 Then
             ' Lead byte: count it plus the continuation bytes that follow
             n = 1
             Do While i + n <= Len(s)
                 If (Asc(Mid(s, i + n, 1)) And &hC0) <> &h80 Then
                     Exit Do
                 End If
                 n = n + 1
             Loop
             If n = 2 And ((c And &hE0) = &hC0) Then
                 b1 = Asc(Mid(s, i + 1, 1)) And &h3F
                 b2 = c And &h1F
                 c = b1 + b2 * &h40
             ElseIf n = 3 And ((c And &hF0) = &hE0) Then
                 b1 = Asc(Mid(s, i + 2, 1)) And &h3F
                 b2 = Asc(Mid(s, i + 1, 1)) And &h3F
                 b3 = c And &h0F
                 c = b3 * &H1000 + b2 * &H40 + b1
             Else
                 ' Character above U+FFFF or malformed sequence
                 c = &hFFFD
             End If
             ' Replace the whole byte sequence with the decoded character
             s = Left(s, i - 1) + ChrW(c) + Mid(s, i + n)
         ElseIf (c And &hC0) = &h80 Then
             ' Unexpected continuation byte
             s = Left(s, i - 1) + ChrW(&hFFFD) + Mid(s, i + 1)
         End If
         i = i + 1
     Loop
     DecodeUTF8 = s
 End Function
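
The bit arithmetic in the decoder can also be tried in isolation on plain byte values, without depending on how Asc and ChrW treat characters. The following is a sketch under assumptions: Utf8CodePoint is a hypothetical helper, it takes a zero-based array holding exactly one sequence, it returns -1 for a malformed one, and, like the decoder above, it does not check for overlong forms. Because it returns a number rather than a character, it can also handle the 4-byte form.

```vbscript
' Sketch: decode one UTF-8 sequence given as an array of byte values.
' Utf8CodePoint is an illustrative name; returns -1 if the sequence is malformed.
Function Utf8CodePoint(bytes)
    Dim lead, n, cp, i
    Utf8CodePoint = -1
    lead = bytes(0)
    If lead < &h80 Then
        n = 0                      ' 0xxxxxxx
        cp = CLng(lead)
    ElseIf (lead And &hE0) = &hC0 Then
        n = 1                      ' 110xxxxx
        cp = CLng(lead And &h1F)
    ElseIf (lead And &hF0) = &hE0 Then
        n = 2                      ' 1110xxxx
        cp = CLng(lead And &h0F)
    ElseIf (lead And &hF8) = &hF0 Then
        n = 3                      ' 11110xxx
        cp = CLng(lead And &h07)
    Else
        Exit Function              ' continuation byte or invalid lead byte
    End If
    ' The array must hold the lead byte plus exactly n continuation bytes
    If UBound(bytes) <> n Then Exit Function
    For i = 1 To n
        ' Every continuation byte must look like 10xxxxxx
        If (bytes(i) And &hC0) <> &h80 Then Exit Function
        cp = cp * &h40 + (bytes(i) And &h3F)
    Next
    Utf8CodePoint = cp
End Function
```

For example, Utf8CodePoint(Array(&hE2, &h82, &hAC)) returns 8364, which is U+20AC, while a lone continuation byte such as Array(&h82) returns -1.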


Links


Unicode on Wikipedia
Source code for ASP + VBScript

UPD: Added handling of erroneous sequences and fixed a bug with the Integer type returned by AscW.

Source: https://habr.com/ru/post/138173/

