Trying to go utf-8 all the way
Posted: Mon Oct 01, 2007 1:26 pm
I'm writing this with the following assumptions (any of which may be wrong!):
1. utf-8 is the best encoding choice for web documents.
2. cms made simple is designed to work with utf-8 encoding.
3. databases will be the most flexible if they have utf-8 encoding for default text and collation.
When I got started with cms made simple, I simply created a new mysql database and left all the defaults in place. Consequently, upon returning to look at the database structure later, I noticed that the default text encoding was latin1 and the collation was something along the lines of latin1_swedish. Apparently these are both the defaults for mysql 5.x.
Since I will eventually need to translate a site into several languages, and because the w3c has guilted me into believing that not going utf-8 will have consequences for end users, I wanted my entire database to use utf-8 encoding.
In phpmyadmin, you can change the encoding. Unfortunately, that doesn't seem to work reliably. It changes some of the settings but not all of them, and it basically hoses a bunch of special characters. To get around this, I did the following:
1. I dumped an existing cmsms database to an sql file.
2. I searched for every occurrence of latin1 in the file and replaced it with utf8.
3. I searched for a number of special characters (like smart quotes) and replaced them to lessen potential problems.
4. I saved the file with utf-8 encoding.
5. I made a new database in mysql with utf8 as the default text encoding and utf8_unicode_ci as the default collation.
6. I tried to import the cleaned sql file into the new database.
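For anyone who would rather script steps 2 through 4 than do them by hand in an editor, here is a minimal sketch in Python. The function names and the replacement table are my own illustration, not anything from cmsms or mysql. It also assumes the dump file is in cp1252 (which is what mysql's latin1 roughly corresponds to), and it swaps smart quotes for html entities, which is just one possible choice.

```python
# Hypothetical helpers for cleaning a dump file; all names here are illustrative.

# Smart quotes mapped to html entities (an assumption, not the only option).
# A bare ' or " could terminate a quoted string inside the dump and break the import.
SMART_PUNCTUATION = {
    "\u2018": "&lsquo;",  # left single quotation mark
    "\u2019": "&rsquo;",  # right single quotation mark
    "\u201c": "&ldquo;",  # left double quotation mark
    "\u201d": "&rdquo;",  # right double quotation mark
}


def clean_dump(sql_text):
    """Replace every 'latin1' with 'utf8' and swap smart quotes for entities."""
    cleaned = sql_text.replace("latin1", "utf8")
    for fancy, entity in SMART_PUNCTUATION.items():
        cleaned = cleaned.replace(fancy, entity)
    return cleaned


def convert_dump_file(src_path, dst_path, src_encoding="cp1252"):
    """Read a dump in src_encoding, clean it, and write it back out as utf-8."""
    with open(src_path, encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(clean_dump(text))
```

Note that the blanket replace also touches any page content that happens to contain the word latin1, and that reading as cp1252 raises an error on the few byte values cp1252 leaves undefined, so check the result before importing it.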
It doesn't work. The import fails with a mysql error about a key exceeding the 1000-byte maximum key length. This is presumably because utf8 reserves up to 3 bytes per character in a key where latin1 needs only 1, so a key over two varchar(200) columns grows from 400 bytes to 1200. To fix this, I changed the cms_module_template table to use 165 for both varchars, instead of 200 (2 x 165 x 3 = 990 bytes). This may come back to bite me later if someone makes a module with a really long name or a module template with a really long name, but for now it seems like a pretty harmless compromise.
After making this change, the sql file imports and all seems well. I still run into some special characters (copyright and trademark symbols, em dashes, accented vowels, etc.) that I need to change as I come across them. But at least the database is no longer a hodgepodge of latin1 and utf-8.
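The key length problem comes down to simple arithmetic, which can be sketched like this. It assumes the failing key covers both varchar columns, that mysql's utf8 reserves 3 bytes per character in a key, and it ignores any per-column overhead. The function names are made up for illustration.

```python
# Rough model of the key length check; constants and names are assumptions.
MAX_KEY_BYTES = 1000       # the limit named in the mysql error
UTF8_BYTES_PER_CHAR = 3    # mysql's utf8 reserves up to 3 bytes per character
LATIN1_BYTES_PER_CHAR = 1  # latin1 is a single-byte encoding


def key_bytes(column_lengths, bytes_per_char):
    """Bytes needed for a key over varchar columns of the given lengths."""
    return sum(column_lengths) * bytes_per_char


def fits(column_lengths, bytes_per_char, limit=MAX_KEY_BYTES):
    """True if a key over these columns stays within the limit."""
    return key_bytes(column_lengths, bytes_per_char) <= limit


def max_equal_length(n_columns, bytes_per_char, limit=MAX_KEY_BYTES):
    """Largest length that n equally sized varchar columns can share."""
    return limit // (n_columns * bytes_per_char)


print(key_bytes([200, 200], LATIN1_BYTES_PER_CHAR))  # 400, fine under latin1
print(key_bytes([200, 200], UTF8_BYTES_PER_CHAR))    # 1200, over the limit
print(key_bytes([165, 165], UTF8_BYTES_PER_CHAR))    # 990, fits
print(max_equal_length(2, UTF8_BYTES_PER_CHAR))      # 166
```

Under this simplified model 166 would be the largest length that still fits, so 165 leaves a small margin.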
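Rather than waiting to stumble over the leftover special characters, it might be easier to list them all up front. This is only a sketch with a made-up function name. It assumes you already have the text of the dump (or of a page) as a decoded string.

```python
import unicodedata
from collections import Counter


def non_ascii_report(text):
    """List each non-ASCII character with its Unicode name and how often it occurs."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]


sample = "Caf\u00e9 \u00a9 2007 \u2014 caf\u00e9"
for char, name, count in non_ascii_report(sample):
    print(repr(char), name, count)
```

The most frequent characters come first, so you can decide which ones are worth a search and replace and which ones are fine to leave as real utf-8.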
I don't know if such a process is advisable, but it seemed to me like the thing to do to make the database as flexible as possible moving forward, so I'm leaving this here in case anyone else feels like they want to do it. Also, if anyone has a better way, or a reason why this is a bad idea, please let me know.