“Better to lose a day and then fly there in an hour” © Wings, Legs and Tails
Not long ago I was “lucky” enough to convert a medium-sized website from one encoding to another, namely from windows-1251 to UTF-8. Then came a second, larger one. On the third I gave in and, following the sound principle quoted above, spent a fair amount of time writing a script to automate the process. After that, the job really did fly by in an hour.
The script's configurable parameters are as follows.
Source parameters:
SDIR="/usr/local/apache2/htdocs/site.ru/" - source site directory, with a trailing slash /
SCP="CP1251" - source (from) codepage for iconv
EXT=".*\.(htm[l]*|php[3]*|js|css)$" - regular expression for the files to transcode (here it covers .htm, .html, .php, .php3, .js, .css)
FCS="windows-1251" - name of the source codepage, used when replacing meta charset= in files
Target parameters:
DROP_STRUCT=true - either false or true; controls whether the target directory is emptied at the start
DDIR="/usr/local/apache2/htdocs/new.site.ru/" - target site directory, with a trailing slash / (must exist)
DCP="UTF-8" - target (to) codepage for iconv
TCS="UTF-8" - name of the target codepage, used when replacing meta charset= in files
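To get a feel for which files the EXT pattern selects, it can be run against a few names by hand. This is only a small sketch: the sample paths are made up, and the helper name filter_ext is hypothetical; it simply wraps the same grep -E filter the script uses.

```shell
#!/bin/bash
EXT=".*\.(htm[l]*|php[3]*|js|css)$"

# filter_ext: keep only the lines on stdin that match EXT,
# the same way the script's "grep -E" stage does.
filter_ext() {
    grep -E "$EXT"
}

# Sample paths, made up for this illustration.
printf '%s\n' index.html old/page.htm lib/db.php3 main.js style.css \
    logo.png backup.php.bak | filter_ext
```

Only the first five names pass the filter; logo.png and backup.php.bak are left alone, so they would be copied to the target unchanged.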
And here is the script itself:
#!/bin/bash
# --- CONFIG SECTION ---
# Source directory parameters
SDIR="/usr/local/apache2/htdocs/site.ru/" # with a trailing slash '/'
SCP="CP1251" # source codepage for 'iconv'
EXT=".*\.(htm[l]*|php[3]*|js|css)$" # regex for the file names to transcode
FCS="windows-1251" # charset name to replace in meta tags
# Destination directory parameters
DROP_STRUCT=true # false or true
DDIR="/usr/local/apache2/htdocs/new.site.ru/" # with a trailing slash '/'; must exist
DCP="UTF-8" # target codepage for 'iconv'
TCS="UTF-8" # new charset name for meta tags
# --- END CONFIG SECTION ---
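# Empty the destination directory if requested
#
if [ "$DROP_STRUCT" = true ]
then
    # ${DDIR:?} aborts if DDIR is empty, so this can never expand to "rm -rf /*"
    rm -rf -- "${DDIR:?}"*
fi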
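# Make a fresh copy, preserving modes, owners and timestamps
# ("$SDIR". also picks up top-level dotfiles such as .htaccess, which "$SDIR"* would skip)
cp -a "$SDIR". "$DDIR"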
# Remove the copied files that will be re-created in the new encoding
#
find "$DDIR" -type f | grep -E "$EXT" | xargs -r -d '\n' rm -f --
# Convert from the source codepage to the destination codepage
#
find "$SDIR" -type f | grep -E "$EXT" | sed "s#^$SDIR##" |
while IFS= read -r f
do
    # -c silently drops characters that cannot be represented in the target codepage
    iconv -c -f "$SCP" -t "$DCP" -o "$DDIR$f" "$SDIR$f"
    # Restore mode and owner from the source file (GNU coreutils)
    chmod --reference="$SDIR$f" "$DDIR$f"
    chown --reference="$SDIR$f" "$DDIR$f"
    # Replace the charset in the meta tag (case-insensitive)
    perl -pi -e "s#content\s*\=\s*[\"'].*?charset\s*=\s*$FCS.*?[\"']#content=\"text/html; charset=$TCS\"#gi" "$DDIR$f"
done
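After a run it can be worth confirming that every converted file really is valid UTF-8. The following is a minimal sketch, assuming the same GNU tools the script already relies on; the helper name check_utf8 is made up for this illustration. It uses the fact that iconv exits with a non-zero status when its input contains byte sequences that are invalid in the declared source encoding.

```shell
#!/bin/bash
# check_utf8 DIR REGEX: print every file under DIR whose path matches REGEX
# and whose contents are NOT valid UTF-8.
# (Hypothetical helper, not part of the original script.)
check_utf8() {
    local dir=$1 regex=$2 f
    find "$dir" -type f | grep -E "$regex" |
    while IFS= read -r f
    do
        # Converting UTF-8 to UTF-8 fails if the input has invalid sequences.
        if ! iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1
        then
            printf '%s\n' "$f"
        fi
    done
}

# Possible usage with the script's own variables:
# check_utf8 "$DDIR" "$EXT"
```

An empty result means every matching file decoded cleanly. It does not prove the text is readable, only that no broken byte sequences remain.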
And now a few more useful points.
1. Even after transcoding to UTF-8 and replacing the meta charset with UTF-8, you may still see gibberish or otherwise unexpected output. The reason is that PHP's own default_charset setting must also be changed for the new UTF-8 site, because the global configuration explicitly sets it to another code page (windows-1251). I do this in the virtual host settings (httpd.conf) with:
php_admin_value default_charset UTF-8
2. As a rule, any site today also relies on a database, which needs to be converted to UTF-8 as well. This takes little effort if you have phpMyAdmin or mysqldump at hand. In the worst case, for giant databases, you will probably have to write a conversion script and temporarily suspend the service. The idea is simple: dump the database, re-encode the dump with the same iconv, replace every code page reference in it with the desired value, and load the result into a new database.
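The dump-and-reload idea can be sketched roughly as follows. This is an illustration under stated assumptions, not a tested migration tool: the helper name convert_dump and the database names in the usage comments are hypothetical, the dump is assumed to be in CP1251, and only the most common charset declarations (plus the cp1251_general_ci collation) are rewritten. Other collations would need to be mapped by hand.

```shell
#!/bin/bash
# convert_dump SRC DST: re-encode an SQL dump from CP1251 to UTF-8 and
# rewrite the charset names declared inside it.
# (Hypothetical helper; adjust the patterns to what your dump contains.)
convert_dump() {
    local src=$1 dst=$2
    iconv -f CP1251 -t UTF-8 "$src" |
    sed -E \
        -e 's/(CHARSET=|CHARACTER SET |SET NAMES )cp1251/\1utf8/g' \
        -e 's/cp1251_general_ci/utf8_general_ci/g' \
        > "$dst"
}

# Possible usage (olddb and newdb are placeholder database names):
# mysqldump --default-character-set=cp1251 olddb > dump.sql
# convert_dump dump.sql dump.utf8.sql
# mysql --default-character-set=utf8 newdb < dump.utf8.sql
```

The sed patterns are deliberately narrow so that the string cp1251 inside ordinary row data is less likely to be touched. Check the converted dump with grep before loading it.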
An even better option, suggested by 4m@t!C, is to run the conversion inside MySQL itself, on a test copy of the database, using ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name;
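For a database with many tables, the ALTER TABLE statements can be generated rather than typed. A minimal sketch, assuming table names arrive one per line on standard input; the helper name gen_alter and the database name testdb are made up for this illustration.

```shell
#!/bin/bash
# gen_alter CHARSET: read table names from stdin, one per line, and print an
# ALTER TABLE ... CONVERT TO CHARACTER SET statement for each of them.
# (Hypothetical helper.)
gen_alter() {
    local charset=$1 t
    while IFS= read -r t
    do
        [ -n "$t" ] || continue   # skip blank lines
        printf 'ALTER TABLE `%s` CONVERT TO CHARACTER SET %s;\n' "$t" "$charset"
    done
}

# Possible usage against a test copy of the database (testdb is a placeholder):
# mysql -N -B -e 'SHOW TABLES' testdb | gen_alter utf8 > alter.sql
# mysql testdb < alter.sql
```

Writing the statements to a file first makes it possible to review them before anything is changed.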
Some gibberish can also come out of the database itself, showing up as incorrect display of certain Russian letters such as "ш". Here the default code page of the MySQL connection comes into play. To fix this, add the following line to your site code right after connecting to the database:
mysql_query("SET NAMES 'utf8'");
Or change default-character-set and default-collation in the MySQL configuration, if you are allowed to.
Remember: take such conversions seriously. Perform them on a parallel copy of the site first, and test, test, test.
Happy converting!
Source:
Notes on hand