Problem
To work with the MSSQL Server 2005 database in UTF-16 (UCS2) encoding, I use a script written in python. This script uses the following set of tools for working with the database:
- unixODBC
- FreeTDS
- pyodbc
- sqlachemy
And then there was a difficulty: when retrieving string data from the database (nvarchar, ntext fields), Unicode is incorrectly processed.
As it turned out, the python installed by me was compiled with UCS4 unicode. Methods for obtaining a unicode type in a python assembly are well described in this
question on stackoverflow . Ie, if you run the following line in the terminal:
python -c "import sys;print 'UCS4' if sys.maxunicode > 65536 else 'UCS2'"
then we get the unicode build version for python. In my case it was
UCS4 . What is behind it pulls:
- unixODBC calling the appropriate database functions with an appendix W (for example, SQLExecDirectW () ), gets the results. in which one character of the text occupies 2 bytes (UCS2)
- pyodbc receives the results from the ODBC driver, and in turn saves the results to a variable of type unicode.
- Thus 1 result symbol, according to pyodbc, is 4 bytes (UCS4). That is how the result is saved. derived from the ODBC driver.
The driver returns data in which the character takes 2 bytes, and pyodbc rewrites this data so that the character takes 4 bytes. It would be nice if there was any conversion, but the data is simply stored as an array of bytes into a variable with a
unicode type, which has unpleasant consequences: the result symbol essentially contains 2 characters of the result that the ODBC driver returned.
Not terrible consequences, I decided, and independently transformed the result, dividing it by symbols:
def odbcUCS4toUCS2(ustr): u = u"" for i in range(0, len(ustr)): u32 = ord(ustr[i]) u16 = [(u32 & 0xFFFF0000) >> 16, (u32 & 0x0000FFFF)] u += unichr(u16[1]) u += unichr(u16[0]) return u
This was not enough. Another unpleasant consequence has appeared:
if the length in the characters of the result is odd, then the last character of the result is truncated . Those. the string
'this word' will be received in my script as
'these are words' . I could not solve this problem with transformations: the last two bytes of the result are actually absent due to incorrect saving of unicode. Then I came to the decision to rebuild python with unicode UCS2 support.
Decision
We collect, not forgetting to specify in options
- enable-unicode = ucs2 . When building a new python, you need to remember to put a bag of zlib1g-dev, otherwise there may be difficulties with installing packages with the help of pip.
Install and configure virtualenv:
$ sudo apt-get install virtualenv $ virtualenv ~/ucs2env -p [ ucs2 ]
Well, add to aliases:
echo "alias ucs2env='source ~/ucs2env/bin/activate'">>~/.bashrc
Well that's all. Now the data is obtained in the same way as it is stored in the database:
$ ucs2env (ucs2env)$ python -c "import sys;print 'UCS4' if sys.maxunicode > 65536 else 'UCS2'" UCS2