Ted wrote:
I'm concerned that the set names command could screw up existing installations, so I'd like to figure out a way to solve it for all new installs instead.
Definitely, it can happen. The "set names" can screw up the installations where the character set it not set properly from the very beginning. It will impact databases that are already "damaged" because of false set name command. Let's omit collation now and deal only with character set since collation is only needed for comparison:
Code: Select all
CREATE DATABASE mydb
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_general_ci;
This means that tables created in the database will use utf8 and utf8_general_ci by default for any character columns.
But it does NOT mean that the data sent to it is in utf-8! Applications that use the database should configure their connection to the server each time they connect or the default it used. In our case PHP has to inform the server that the data coming is UTF-8 data. It can be done with the command "SET names utf8".
If it is done, we are happy and can be sure that everything will be OK. But what if not? Some servers have another character set defined for connection, eg. latin1 and if no other command is given the connection will be established with this value. Now, what happens in this case step by step:
1. We declare UTF-8 for our page with meta tag encoding and type in text area some special characters: „üüü“.
2. The MySQL client knows nothing about UTF-8 at that time and sends „üüü“ as „üüü“ (latin1 representaion of „üüü“) to the database with the comment that latin1 data comes.
3. The MySQL server knows that the table uses UTF-8 as default character_set and converts the received data from latin1 to utf-8. But it saves literally the string „üüü“ as UTF-8 data and not "ü".
4. Now we want to query the table. Since our connection is still latin1 the query will demand latin1 data.
5. The server delivers the data, but before delivering it converts the data from UTF-8 to latin1, just because the client wanted it so.
6. The client gets „üüü“ as latin1 and outputs it on the display as „üüü“.
In these 6 steps you can see how corrupted data can be saved and queried "without any problems" except that it is impossible to read the data in PHPMyAdmin, sort it and that the data is not saved properly.
The connection is established with utf-8 by default only if the MySQL server configured accordingly. See in my.cnf:
Code: Select all
default-character-set=utf8
character-set-server=utf8
If default-character-set is set to latin1 (and in my case it is so!) I need to advise the mysql client each time before sending and quering data to use utf-8 connection.
Ted wrote:
so tell me if I'm totally off-base.
I tried
To solve the problem for new installation, please see the post of vilkis:
it could be implemented an additional parameter of CMSMS: "forced ecncoding for mysql". During installation of CMSMS, if selected mysql db, some non latin characters can be written to db and read in two ways: with setting names to utf8 and without doing it. If users can see different answers then he can be invited to turn on a parameter "forced ecncoding for mysql".
For me, it it the best way to do it for new installation. However it will not suit existing "damaged" environments. Converting database is not so trivial. ZYV knows how, but he does not tell!
Hope, could help a little.