Multilanguage support in CMS: UTF8 Problem

intersol

Multilanguage support in CMS: UTF8 Problem

Post by intersol »

I am opening this topic because it seems that UTF-8 support in CMS is not quite simple. I have fairly good experience with multilanguage / localization (l10n) / internationalization (i18n) problems, both from my own software and from other open source projects, and here are my conclusions:
- The easy way (and a good one, from my point of view) is to use UTF-8 as the DEFAULT encoding, because plain English (ASCII) text is byte-for-byte the same in UTF-8 as in ISO-8859-1 (latin1), and you can write text in ANY language on the same page.
- TODO: change the default header to include the UTF-8 directive.
- Modern browsers have no problem with UTF-8.
- At the moment it seems that UTF-8 support is partial: I have big problems inserting UTF-8 characters into titles / aliases!
- I think that the SVN version should use UTF-8 as the default encoding.

Database support:
- An important task is to create the MySQL tables with UTF-8! At the moment they are created using the default code page, and this could create problems in the future, when the most-wanted feature, "search on site", appears. Searching MUST be done by the database engine, because database engines can ignore the presence of diacritics like grave, acute, cedilla, etc. YOU DO NOT WANT TO IMPLEMENT THIS YOURSELF, TRUST ME!
- MySQL 4.1 adds full support for UTF-8, so there is still hope for those who use an older version.
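
A minimal sketch of what creating a table with UTF-8 could look like, assuming MySQL 4.1 or later. The function name, table name and columns are hypothetical and are not taken from the CMSMS installer; the code only builds and prints the SQL, it does not connect to a database.

```php
<?php
// Sketch only: builds the SQL a setup script could run so that a table
// stores its text as UTF-8. Table and column names are hypothetical.
function utf8_table_sql(string $table): array
{
    return [
        // Ask the server to treat this connection's traffic as UTF-8.
        "SET NAMES utf8",
        // Declare the character set and collation on the table itself
        // instead of inheriting the server default.
        "CREATE TABLE `$table` (
            id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(255) NOT NULL
        ) DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci",
    ];
}

foreach (utf8_table_sql('example_content') as $statement) {
    echo $statement, ";\n";
}
```

With a collation such as utf8_general_ci, comparisons are meant to be handled by the database engine, which is the point made above about not implementing accent handling yourself.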

Other remarks:
- Why aren't we using gettext to translate the interface? In my experience it's the best way.
Ted
Power Poster
Posts: 3329
Joined: Fri Jun 11, 2004 6:58 pm
Location: Fairless Hills, Pa USA

Re: Multilanguage support in CMS: UTF8 Problem

Post by Ted »

CMSMS is UTF-8 by default.  That's what it sends in the header.  I just did a fresh install of the svn version, looked at the front page, did a Page Info in Firefox and it was UTF-8.  If no encoding override is put anywhere (config.php or in the template), it's UTF-8.  Obviously, we don't do anything with meta tags because people make their own templates, but the headers are what really count on any newer browser anyway.
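
As a rough illustration of "UTF-8 unless overridden", here is a minimal sketch of sending the encoding in the HTTP header. The function name and the $override parameter (standing in for a value from config.php or a template) are assumptions for the example, not the actual CMSMS code.

```php
<?php
// Sketch only: $override stands in for an encoding set in config.php or a
// template; the function name is hypothetical.
function content_type_header(?string $override = null): string
{
    $charset = ($override !== null && $override !== '') ? $override : 'UTF-8';
    return 'Content-Type: text/html; charset=' . $charset;
}

// Headers must be sent before any page output.
if (!headers_sent()) {
    header(content_type_header());
}
echo "<p>šáé</p>\n";
```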

As for UTF-8 characters in aliases: an alias HAS to conform to the characters supported in a URI/URL, which definitely do not include raw UTF-8.  Aliases are used in URLs only...
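
One way to picture the constraint: a sketch that reduces a UTF-8 title to an ASCII-only alias. The function name and the deliberately tiny character map are hypothetical; this is not how CMSMS generates aliases, and a real map would need many more entries.

```php
<?php
// Sketch only: the function name and the small character map are
// hypothetical. This file must itself be saved as UTF-8.
function make_url_alias(string $title): string
{
    $map = ['š' => 's', 'Š' => 'S', 'č' => 'c', 'á' => 'a', 'é' => 'e',
            'æ' => 'ae', 'ø' => 'o', 'å' => 'a'];
    // Replace known accented characters, then lowercase the ASCII letters.
    $ascii = strtolower(strtr($title, $map));
    // Collapse every run of anything else (spaces, unmapped bytes) into "-".
    $alias = preg_replace('/[^a-z0-9]+/', '-', $ascii);
    return trim($alias, '-');
}

echo make_url_alias('Šálek čaju'), "\n"; // prints "salek-caju"
```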

As for titles and such, I'm not aware of a problem.  I've never heard any complaints, though most people using European languages change the encoding to something else.  Can you give me an example of a title that's UTF-8 encoded, so I can see what the problem is?

As far as MySQL being set to UTF-8 by default, that's a good idea.  I'll add that to my TODO list, though I have no idea how that's going to affect people using other encodings (what if people search on their ISO-8859-15 site?).

As for gettext:  CMSMS used it originally, but there are two problems.  1. Gettext is not enabled in PHP by default in many distributions, and surprisingly many people don't have the ability to enable it (notice that most big PHP apps roll their own translation code; believe me, I researched this thoroughly).  2. There is a PEAR wrapper that does gettext in pure PHP, but PEAR is VERY inconsistent across default PHP installs as well.  It caused a WHOLE lot of hassles.  I was pretty much forced to write something on my own, though I agree, gettext syntax is better (though it does have a much steeper learning curve for newbie translators, particularly those in the Windows world).
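
For readers wondering what "rolling your own" can mean, here is a minimal sketch of a PHP-only translation lookup that needs neither the gettext extension nor PEAR. It is not the actual CMSMS translation code; the function name, the array layout and the sample strings are all assumptions.

```php
<?php
// Sketch only: not the actual CMSMS translation code. The function name,
// the array layout and the sample strings are hypothetical.
function translate(string $key, string $locale, array $catalog): string
{
    // Fall back to the key itself when no translation exists.
    return $catalog[$locale][$key] ?? $key;
}

$catalog = [
    'da_DK' => ['save' => 'Gem', 'cancel' => 'Annuller'],
];

echo translate('save', 'da_DK', $catalog), "\n";   // prints "Gem"
echo translate('delete', 'da_DK', $catalog), "\n"; // prints "delete"
```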

Hope that helps...
intersol

Re: Multilanguage support in CMS: UTF8 Problem

Post by intersol »

Found the bug:
In the admin section, when you edit a page, it doesn't load the Title and Menu text as UTF-8.
Remark: If you write UTF-8 it will be saved correctly and displayed in the app, but loading the form does not work. So if I re-edit the page it will scramble the text.

... waiting for the SVN update :)
Ted

Re: Multilanguage support in CMS: UTF8 Problem

Post by Ted »

What could possibly be causing that?  Are the title and menu text showing up properly in the front end?
intersol

Re: Multilanguage support in CMS: UTF8 Problem

Post by intersol »

Nope, in the admin interface they are loaded as single-byte characters even though those characters are double-byte.
PS. In the HTML visual editor, however, I do not get this problem. So since they look good when I use them on the site, the problem is not with the database but with the form.
PS. I tested with Firefox too.
intersol

Re: Multilanguage support in CMS: UTF8 Problem

Post by intersol »

New info: I've looked at the generated page and it contains something very strange: the strange characters are encoded as HTML escape sequences. It seems that at some point you encode the text as HTML, but with a function that doesn't recognize UTF-8.

PS. By default, PHP functions do not use UTF-8! ... now I'm looking for the buggy piece of code ...
intersol

Re: Multilanguage support in CMS: UTF8 Problem

Post by intersol »

FOUND IT!
Take a look at: http://ro.php.net/htmlentities

WRONG:

htmlentities($this->mName)

RIGHT:

htmlentities($this->mName, ENT_NOQUOTES, 'UTF-8')

I think there are a lot of places that need to be repaired.
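
To make the difference visible, here is a small self-contained sketch. It passes 'ISO-8859-1' explicitly to reproduce what the one-argument call did on PHP versions of that era; the variable names are only for illustration.

```php
<?php
// This file must be saved as UTF-8 so the literal below is UTF-8 bytes.
$title = 'šáé';

// Treating the bytes as ISO-8859-1 turns each byte of every two-byte
// character into its own entity. (Older PHP versions did this by default.)
$wrong = htmlentities($title, ENT_NOQUOTES, 'ISO-8859-1');

// Telling htmlentities() the real encoding keeps each character whole.
$right = htmlentities($title, ENT_NOQUOTES, 'UTF-8');

// Decode both the way a browser showing a UTF-8 page would.
$wrongBack = html_entity_decode($wrong, ENT_NOQUOTES, 'UTF-8');
$rightBack = html_entity_decode($right, ENT_NOQUOTES, 'UTF-8');

var_dump($rightBack === $title); // bool(true)
var_dump($wrongBack === $title); // bool(false)
```

This matches the symptom described earlier in the thread: the text is saved correctly, but it comes back scrambled when the edit form is loaded again.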
Ted

Re: Multilanguage support in CMS: UTF8 Problem

Post by Ted »

Fixed in svn.  Content types were using the wrong htmlentities function, one that is not UTF-8 friendly.  I'm surprised it hasn't been brought up before.  Thanks!
Ted

Re: Multilanguage support in CMS: UTF8 Problem

Post by Ted »

I posted that before I read your response...  I have a cms_htmlentities function that doesn't do anything with htmlentities.  It's some custom code I dug off of a forum one time and works like a charm!
roman
Forum Members
Posts: 77
Joined: Thu May 12, 2005 9:38 am
Location: slovakia

Re: Multilanguage support in CMS: UTF8 Problem

Post by roman »

I had this problem: when I use characters like šáé... in the menu text or title of a document and then edit the document a second time, these characters come out wrong... and the homepage is wrong too.  Changing the encoding changed nothing. I cut the htmlentities() calls out of the code in 2 files in the lib/contenttypes/ directory: Link.inc.php, lines 121, 122, 123, and Content.inc.php, lines 164, 165.  {old code: value="'.htmlentities($this->mMenuText).'"    new code: value="'.($this->mMenuText).'"    } Now it looks good, but I don't know whether this introduces security issues. :)
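
On the security question: printing the raw value into an attribute does leave quotes and angle brackets unescaped. A minimal sketch of one alternative, assuming the stored text is UTF-8, is to escape only the HTML-special characters. The function name is hypothetical and this is not the CMSMS code.

```php
<?php
// Sketch only: the function name is hypothetical, not CMSMS code.
function value_attribute(string $text): string
{
    // Escapes only & < > " ' and leaves multi-byte UTF-8 characters as-is.
    return 'value="' . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . '"';
}

echo value_attribute('šáé "menu" <b>'), "\n";
// prints: value="šáé &quot;menu&quot; &lt;b&gt;"
```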
roman

Re: Multilanguage support in CMS: UTF8 Problem

Post by roman »

I'm sorry, something in this is wrong: the CMS crashed after this change.
henkkag

Re: Multilanguage support in CMS: UTF8 Problem

Post by henkkag »

intersol wrote: New info: I've looked at the generated page and it contains something very strange: the strange characters are encoded as HTML escape sequences. It seems that at some point you encode the text as HTML, but with a function that doesn't recognize UTF-8.

PS. By default, PHP functions do not use UTF-8! ... now I'm looking for the buggy piece of code ...
My question is whether it would make sense to change page alias handling to md5 sums. It should not change too much; however, pages could no longer be addressed with the old syntax ...index.php?page="Home", though the cms_selflink function could be upgraded to take this into account. On the positive side, this would mean no hassle with UTF-8 encoding and better security for pages.
Ted

Re: Multilanguage support in CMS: UTF8 Problem

Post by Ted »

Well, the whole purpose of the alias is to make a URL that people remember.  You can always set auto_alias_creation to false in your config.php and just type it in manually.
mbvdk
Forum Members
Posts: 43
Joined: Wed Jun 08, 2005 3:30 pm

Re: Multilanguage support in CMS: UTF8 Problem

Post by mbvdk »

I have a related problem: if I switch language in the admin interface, I also change the character encoding, at least for menus, titles, headers, etc. in the interface, and thus I get trouble with elements on the front end.
An example: with the English admin interface I use UTF-8, but when I switch to Danish I suddenly use ISO-8859-1. In both cases the interface lets me type non-English characters, but they do not display the same on the front end.

I think the correct way to fix this is to have one global character encoding for both front end and admin. Unless you have a multilanguage page I don't see any reason to have separate character encodings for the two, and even then I think it is more trouble than it's worth.
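
A minimal sketch of the "one global encoding" idea: whatever encoding the admin form was served in, the submitted text is converted to a single site encoding before it is stored. The function and parameter names are hypothetical, and the sketch assumes PHP's iconv extension is available.

```php
<?php
// Sketch only: function and parameter names are hypothetical. Requires
// PHP's iconv extension.
function to_site_encoding(string $text, string $adminEncoding, string $siteEncoding = 'UTF-8'): string
{
    if (strcasecmp($adminEncoding, $siteEncoding) === 0) {
        return $text; // already in the site encoding
    }
    $converted = iconv($adminEncoding, $siteEncoding, $text);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $adminEncoding to $siteEncoding");
    }
    return $converted;
}

// "\xE6" is the single ISO-8859-1 byte for "æ".
$stored = to_site_encoding("bl\xE6k", 'ISO-8859-1');
echo bin2hex($stored), "\n"; // prints "626cc3a66b"
```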
Ted

Re: Multilanguage support in CMS: UTF8 Problem

Post by Ted »

I would love to use UTF-8 across the board, and while it wouldn't cause much of an issue with the Western European languages, it's going to cause problems with others.  Until operating systems start defaulting to using UTF-8 in their input systems, it's going to cause issues.

If there is an all-around better approach to handling this, I'd love to know.  I could be missing something.