alias in unicode

Posted: Sat Sep 27, 2008 8:42 am
by confiq
Hi guys
I've noticed that with a small modification to CMSMS it's possible to have aliases in Unicode, either by editing the DB manually or by editing the PHP files.
In the old version 1.3.1 I could easily use a Unicode alias and it worked in every browser with Apache. In 1.4.1, however, every time I try to enter a Unicode alias I get a "Maximum execution time" error. If I change content_alias directly in the cms_content table in MySQL, it works fine.
Again, in 1.3.1 it works without errors...

So the questions are:
1) Why can't an alias be in Unicode? Even Wikipedia uses Unicode in URLs, and it's important for search engines to have URLs in your native language.
2) Why does it work OK in 1.3.1 while in 1.4.1 I get an error?

best regards

Re: alias in unicode

Posted: Sat Sep 27, 2008 4:47 pm
by confiq
OK, I fixed it...

diff -bupr 1.4.1/lib/classes/class.contentoperations.inc.php XXX/lib/classes/class.contentoperations.inc.php
--- 1.4.1/lib/classes/class.contentoperations.inc.php	2008-08-08 20:15:53.000000000 +0300
+++ XXX/lib/classes/class.contentoperations.inc.php	2008-09-27 19:33:20.000000000 +0300
@@ -833,7 +833,8 @@ class ContentOperations
 		{
 			$error = lang('aliasnotaninteger');
 		}
-		else if (!preg_match('/^[\-\_\w]+$/', $alias))
+#		else if (!preg_match('/^[\-\_\w]+$/', $alias)) #original, to change to unicode
+		else if (!preg_match('/^[\-\_\w{YOUR_NATIVE_LANG_PREG}]+$/', $alias))
 		{
 			$error = lang('aliasmustbelettersandnumbers');
 		}
diff -bupr 1.4.1/lib/misc.functions.php XXX/lib/misc.functions.php
--- 1.4.1/lib/misc.functions.php	2008-07-08 18:27:05.000000000 +0300
+++ XXX/lib/misc.functions.php	2008-09-27 19:33:10.000000000 +0300
@@ -1066,7 +1066,7 @@ function munge_string_to_url($alias, $to
 		$alias = strtolower($alias);
 	}
 		
-	$alias = preg_replace("/[^\w-]+/", "-", $alias);
+	$alias = preg_replace('/[^\-\_\w{YOUR_NATIVE_LANG_PREG}]+/', '-', $alias);
 	$alias = trim($alias, '-');
 
 	return $alias;
It's ugly code... you can replace {YOUR_NATIVE_LANG_PREG} with the regex character range for your language.
Well, it works in all browsers and on my CentOS server, so I guess I'll keep it this way.
Pity PHP5 doesn't support Unicode completely, but there are rumors it will in PHP6...
I'd really like to know why the CMSMS developers decided to allow only ASCII in aliases...
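
One possible way to fill in the {YOUR_NATIVE_LANG_PREG} placeholder without hard-coding a single language is to use PCRE Unicode property classes together with the /u modifier. This is only a sketch: it assumes the alias is a UTF-8 string and that PHP's PCRE was built with Unicode property support. The function names are hypothetical and are not part of CMSMS.

```php
<?php
// Hypothetical helpers for illustration only; they are not CMSMS functions.
// Assumption: aliases are UTF-8 strings and PCRE has Unicode property support.

// Returns true when the alias consists only of letters, combining marks,
// digits, '-' or '_' from any script.
function is_valid_unicode_alias($alias)
{
    // \p{L} = letters, \p{M} = combining marks, \p{N} = numbers.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    return preg_match('/^[\-_\p{L}\p{M}\p{N}]+$/u', $alias) === 1;
}

// Replaces every run of other characters with a single '-' and trims
// leading and trailing dashes, similar in spirit to munge_string_to_url().
function munge_unicode_alias($alias)
{
    $alias = preg_replace('/[^\-_\p{L}\p{M}\p{N}]+/u', '-', $alias);
    if ($alias === null) {
        // preg_replace() returns null on error, e.g. invalid UTF-8 input.
        return '';
    }
    return trim($alias, '-');
}
```

Note that the strtolower() call in the original munge_string_to_url() works byte by byte, so lowercasing non-ASCII letters would need a multibyte-aware alternative.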

Re: alias in unicode

Posted: Thu Oct 02, 2008 5:49 pm
by confiq
Anyone? Why shouldn't this work in the next version?

Re: alias in unicode

Posted: Sat Dec 06, 2008 12:10 pm
by confiq
up...

Anyone? There is a new version, but I didn't see this patch applied...

Re: alias in unicode

Posted: Tue Jul 28, 2009 8:59 pm
by confiq
up!
We have been working this way for 1+ year and it works like a charm (except for redirection, where the alias must be in ASCII).

Any reason why we don't see this in the next version of CMSMS?

Re: alias in unicode

Posted: Wed Jul 29, 2009 10:57 am
by Sonya
1. Please submit your suggestion to the feature request list http://dev.cmsmadesimple.org/feature_request/list/6 if you would like to see it implemented.

2. It is important to note that only a very restricted subset of the US-ASCII printing characters is permitted in URLs. Other characters might work with some browser/server combinations (and might even deliver the expected value), but they should never be used. It is a common fallacy that international characters can be reliably transmitted this way. This is wrong.
Consider official documentation as well: http://www.w3.org/TR/html40/appendix/no ... scii-chars

Re: alias in unicode

Posted: Wed Jul 29, 2009 5:36 pm
by confiq
Hi Sonya
Thank you for the reply :)
I thought nobody liked my patch :)

Anyway, you are somewhat right. I also prefer to have ASCII in my URIs, but I can't say this to my clients. There are thousands of sites that support Unicode in URIs; one of them is the famous Wikipedia.

If you continue reading the HTML4 spec, you will find this quote:
We recommend that user agents adopt the following convention for handling non-ASCII characters in such cases:

  1. Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.
  2. Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).
Well yeah... Most sites today use UTF-8 :)
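
The two steps quoted from the HTML4 spec can be sketched in PHP roughly as follows. This is an illustration only: percent_encode_utf8() is a hypothetical name, the input is assumed to be UTF-8 already, and the set of characters left unescaped (letters, digits, '-', '_', '.', '~') is the one PHP's built-in rawurlencode() uses.

```php
<?php
// Hypothetical helper for illustration; PHP's built-in rawurlencode()
// gives the same result for these inputs.
// Assumption: $text is already UTF-8, so step 1 of the quote is done.
function percent_encode_utf8($text)
{
    $out = '';
    $len = strlen($text); // strlen() counts bytes, not characters

    for ($i = 0; $i < $len; $i++) {
        $byte = $text[$i];
        if (preg_match('/^[A-Za-z0-9\-_.~]$/D', $byte)) {
            // Unreserved ASCII characters are passed through unchanged.
            $out .= $byte;
        } else {
            // Step 2: escape every other byte as %HH (uppercase hex).
            $out .= sprintf('%%%02X', ord($byte));
        }
    }
    return $out;
}

// The Cyrillic letter "п" (U+043F) written as its two UTF-8 bytes.
$encoded = percent_encode_utf8("\xD0\xBF");
```

Here $encoded is '%D0%BF', and rawurldecode() on the receiving side turns it back into the same two UTF-8 bytes.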
I'm also surprised that nobody reports using this patch...

regards

Re: alias in unicode

Posted: Wed Jul 29, 2009 6:28 pm
by Sonya
confiq wrote: Anyway, you are somewhat right. I also prefer to have ASCII in my URIs, but I can't say this to my clients. There are thousands of sites that support Unicode in URIs; one of them is the famous Wikipedia.
But that does not mean we have to do the same illegal stuff just because thousands of websites do it :)
confiq wrote: Well yeah... Most sites today use UTF-8 :)
The URI has nothing to do with the encoding of the site. The problem is that there is no provision in the HTTP protocol for the browser to tell the web server what encoding has been used in the URI, and none of the specifications related to HTTP/HTML define a default encoding.
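
The ambiguity described above can be illustrated with a small PHP sketch. The path /caf... and the helper looks_like_utf8() are made up for the example; the point is only that the same character can arrive as different escaped bytes, and the server has to guess which encoding was meant.

```php
<?php
// The letter "é" as raw bytes in two encodings. Byte escapes are used so
// the example does not depend on the encoding of this source file.
$utf8   = "\xC3\xA9"; // UTF-8
$latin1 = "\xE9";     // ISO-8859-1

// Two different URLs for what a visitor sees as the same alias.
$utf8_url   = '/caf' . rawurlencode($utf8);
$latin1_url = '/caf' . rawurlencode($latin1);

// Hypothetical server-side guess: does the decoded path form valid UTF-8?
// An empty pattern with /u fails to match when the subject is not valid UTF-8.
function looks_like_utf8($bytes)
{
    return preg_match('//u', $bytes) === 1;
}
```

With these inputs $utf8_url is '/caf%C3%A9' and $latin1_url is '/caf%E9'. Nothing in the request itself says which of the two encodings the client used; a check like looks_like_utf8() is a heuristic, not a guarantee.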
confiq wrote: I'm also surprised that nobody reports using this patch...
I would not implement this patch for my customer, simply because if he ever runs into difficulties I would have to explain why I used illegal code in his solution.