Trying to go utf-8 all the way

moorezilla · Post by **moorezilla** » Mon Oct 01, 2007 1:26 pm

I'm writing this with the following assumptions (any of which may be wrong!):

1. utf-8 is the best encoding choice for web documents.
2. cms made simple is designed to work with utf-8 encoding.
3. databases will be the most flexible if they have utf-8 encoding for default text and collation.

When I got started with cms made simple, I simply created a new mysql database and left all the defaults in place. Consequently, upon returning to look at the database structure later, I noticed that the default text encoding was latin1 and the collation was something along the lines of latin1_swedish. Apparently these are both the defaults for mysql 5.x.

Since I will eventually need to translate a site into several languages, and because the w3c has guilted me into believing that not going utf-8 will have consequences for end users, I wanted my entire database to use utf-8 encoding.

In phpmyadmin, you can change the encoding. Unfortunately, it doesn't seem to work. It'll change some, but not all, and it basically hoses a bunch of special characters. To get around this, I did the following:

1. I dumped an existing cmsms database to an sql file.
2. I searched and replaced all latin1 to be utf8.
3. I searched and replaced on a number of special characters (like smart quotes) to lessen potential problems.
4. I saved it as utf-8 encoding.
5. I made a new database in mysql with utf8 as the default text encoding and utf8_unicode_ci as the default collation.
6. I tried to import the cleaned sql file into the new database.
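
For anyone repeating steps 2 and 4, here is a minimal Python sketch of a narrower search and replace. It only rewrites the charset and collation declarations that a mysqldump file typically contains, instead of every occurrence of the word latin1. The function name `retarget_dump`, the patterns, and the default collation are assumptions for illustration, not anything taken from cmsms or MySQL, so check them against your own dump first.

```python
import re

# Declarations a mysqldump file typically contains (an assumption;
# inspect your own dump before relying on these patterns).
_CHARSET_RE = re.compile(r"\b(CHARSET=|CHARACTER SET |SET NAMES )latin1\b")
_COLLATE_RE = re.compile(r"\b(COLLATE[ =])latin1_\w+")


def retarget_dump(sql: str, collation: str = "utf8_unicode_ci") -> str:
    """Rewrite latin1 charset/collation declarations to utf8.

    Other occurrences of the word latin1 (for example inside page
    content) are left alone.
    """
    sql = _CHARSET_RE.sub(r"\1utf8", sql)
    return _COLLATE_RE.sub(lambda m: m.group(1) + collation, sql)


sample = (
    "CREATE TABLE t (name varchar(200) CHARACTER SET latin1 "
    "COLLATE latin1_swedish_ci) ENGINE=MyISAM DEFAULT CHARSET=latin1;\n"
    "INSERT INTO t VALUES ('latin1 is mentioned here');"
)
print(retarget_dump(sample))
```

To cover step 4 as well, read the dump using the encoding it was really written in and write the result back with `encoding="utf-8"`.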

It doesn't work. MySQL fires an error about exceeding a 1000-byte limit on a key. To fix this, I changed the cms_module_template table to use 165 for both varchars, instead of 200. This may come back to bite me later if someone makes a module with a really long name or a module template with a really long name, but for now it seems like a pretty harmless compromise.

After making this change, the sql file imports and all seems well. I still run into some special characters (copyright, trademark, em dashes, accented vowels, etc.) that I need to change as I come across them. But at least the database is no longer a hodgepodge of latin1 and utf-8.

I don't know if such a process is advisable, but it seemed to me like the thing to do to make the database as flexible as possible moving forward, so I'm leaving this here in case anyone else feels like they want to do it. Also, if anyone has a better way, or a reason why this is a bad idea, please let me know.

Pierre M. · Post by **Pierre M.** » Tue Oct 02, 2007 6:15 pm

Hello,

I don't know much about encoding but I have the same assumptions as you. Maybe we are wrong together !-)

moorezilla wrote: ... I wanted my entire database to use utf-8 encoding.
In phpmyadmin, you can change the encoding.

I'm no DBA but I think the encoding in the database engine/instance has to be set deep down before anything else. I'm not surprised you can't change such a setting with a few clicks once data is already stored.

Pierre M.

moorezilla · Post by **moorezilla** » Tue Oct 02, 2007 6:31 pm

Yep... as it was explained to me, setting the default encoding is just that... default. So data can really be stored in any encoding regardless of the default or collation settings. This is why I dumped first, cleaned the data, and saved it with an encoding of utf-8 before I put it back into a new database that was set up with default utf-8 encoding and collation.

Now... and this is pure speculation... but I'm thinking that since all of the php in cms made simple has a utf-8 encoding set in config.php, everything will remain utf-8 from now on. The last thing I did was to add utf-8 to my templates, i.e.:

Code: Select all

<?xml version="1.0" encoding="utf-8"?>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

I played around with the idea of setting my php.ini to only use utf-8, but I think this is overkill.

Additional disclaimer: I think this is good policy, but it might not be, so take all of it with a grain or two of salt.

alby · Post by **alby** » Tue Oct 02, 2007 7:14 pm

moorezilla wrote: To fix this, I changed the cms_module_template table to use 165 for both varchars, instead of 200. This may come back to bite me later if someone makes a module with a really long name or a module template with a really long name, but for now it seems like a pretty harmless compromise.

I don't understand. Why did you change the varchars from 200 to 165?

Alby

moorezilla · Post by **moorezilla** » Tue Nov 13, 2007 11:14 am

utf-8 requires more space per character than does latin-1, so mysql will fire an error unless you lower the varchar from 200 to 165. I forget the exact error, but if you put the error code into google with mysql, there are tons of reports about it on the web.

Peciura · Post by **Peciura** » Mon May 18, 2009 1:44 pm

utf-8 in MySQL takes up to 3 bytes for one symbol, and a key in MySQL cannot be longer than 1000 bytes (so MySQL swears about it: 'Specified key was too long; max key length is 1000 bytes'). If a key consists of 2 columns, their combined length shouldn't exceed 1000/3 = 333 symbols.
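
The arithmetic behind the 200 to 165 change can be sketched in a few lines of Python. This assumes, as described in this thread, that the key covers two varchar columns and that MySQL reserves 3 bytes per utf8 character; the function names are made up for illustration.

```python
MAX_KEY_BYTES = 1000   # limit quoted in the MySQL error message
BYTES_PER_CHAR = 3     # bytes reserved per character for utf8 (1 for latin1)


def key_bytes(*column_lengths: int, bytes_per_char: int = BYTES_PER_CHAR) -> int:
    """Bytes needed for a key made of varchar columns with these lengths."""
    return sum(column_lengths) * bytes_per_char


def max_equal_length(columns: int, bytes_per_char: int = BYTES_PER_CHAR) -> int:
    """Longest varchar length that fits when every key column is the same size."""
    return MAX_KEY_BYTES // (bytes_per_char * columns)


print(key_bytes(200, 200, bytes_per_char=1))  # 400, fine under latin1
print(key_bytes(200, 200))                    # 1200, over the limit under utf8
print(key_bytes(165, 165))                    # 990, fits
print(max_equal_length(2))                    # 166
```

So 165 is just under the largest value that works for a two-column key, which matches what moorezilla ended up with.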

If you have a working database and want to convert it to utf-8, you can find good instructions here: http://www.haidongji.com/2008/11/11/convert-character-set-to-utf8-in-mysql/. Thanks to Haidong Ji ;D ;D.

To avoid problems in the future, edit the MySQL config file (my.ini on Windows, my.cnf on Linux) as follows:

[client]
default-character-set=utf8

[mysqld]
skip-character-set-client-handshake
character-set-server = utf8
collation-server = utf8_general_ci
#init-connect = SET NAMES utf8

'skip-character-set-client-handshake' is self-explanatory, but be sure all input and output is utf-8.
'init-connect = SET NAMES utf8' is commented out in my case. It would run SET NAMES utf8 at the start of every connection. The new config variable ( $config['set_names'] = true; ) works in a similar way.

And last but not least, in my case I had databases with 2 years of old data, all in latin1. National symbols looked like this:
ū Å«
ė Ä—
ž Å¾
ų Å³
š Å¡
Ž Å½
į Ä¯
ą Ä…
ę Ä™
Ę Ä˜
Į Ä®
So I had to fix the MySQL backup file with find-and-replace in a text editor. I tried Programmer's Notepad, PSPad, Notepad++... but none of them did a good job. I finally found UltraEdit, a utf-8/binary editor, which did the job. If you save a list of the symbols you have dealt with so you can reuse it on other databases, some patterns have to be found again. My guess is that the correct byte sequence of a strange pattern (Å) is only kept in the clipboard as long as you haven't closed the document. And don't forget to look for symbols (in my case Ā and Â) that simply need to be deleted.
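
The pairs listed above look like utf-8 bytes that were read back as Windows-1252 (which is what MySQL calls latin1). If that is what happened to your data too, the damage can be reversed by encoding the garbled text back to its bytes and decoding those bytes as utf-8. Here is a minimal Python sketch of that idea; `fix_mojibake` is a made-up name, and the cp1252 assumption may not hold for every database.

```python
def fix_mojibake(text: str) -> str:
    """Undo utf-8 text that was mistakenly decoded as Windows-1252.

    Returns the input unchanged if it does not round-trip, for example
    because it was never garbled in the first place.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(fix_mojibake("Å«"))  # ū
print(fix_mojibake("Å¾"))  # ž
print(fix_mojibake("Å¡"))  # š
```

Because one bad character makes the whole string fall back to the original, it is safer to apply this per field or per line than to a whole dump at once.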

EDIT: Since v1.6, a new variable in '/config.php' forces the DB to work with utf-8:

$config['set_names'] = true;

If you don't have access to the MySQL config, try to uncomment

Code: Select all

//    $cmsdb->Execute('set names utf8'); // database connection with utf-8

on line 143 in "/include.php" of cmsms-1.6.6.

EDIT: Actually, if your encoding is Latin1 you can still safely write all characters by using their html codes.
Here is a big (0.5 MB) list of characters and their html codes:
http://www.columbia.edu/kermit/utf8-t1.html
http://www.columbia.edu/kermit/csettables.html
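
As a rough illustration of the html-codes approach, Python can produce those codes itself: encoding to latin-1 with the standard `xmlcharrefreplace` error handler writes every character that latin-1 lacks as a numeric `&#NNN;` reference. The function name below is made up for the example, and Python's latin-1 codec is strict ISO-8859-1, so a few characters that MySQL's latin1 could store directly are also written as codes.

```python
def latin1_with_html_codes(text: str) -> bytes:
    """Encode text as latin-1, writing anything latin-1 lacks as &#NNN;."""
    return text.encode("latin-1", errors="xmlcharrefreplace")


print(latin1_with_html_codes("ąžuolas"))  # b'&#261;&#382;uolas'
print(latin1_with_html_codes("café"))     # b'caf\xe9'
```

A browser renders the numeric references as the right characters regardless of the page or database encoding, which is why this is safe even when the data stays in Latin1.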

CMS Made Simple Forums
