Pretty urls and duplicate content

alastair_scs
Forum Members
Posts: 23
Joined: Sun Aug 17, 2008 6:41 pm

Pretty urls and duplicate content

Post by alastair_scs »

Hi

I have been using CMSMS for some time now and love it.  Thank you very much to those who have been involved in developing it.

I can't program to save my life and so can't contribute that way.  I do know a little about SEO though and am hoping that in finding a solution to this problem on my site, some other users may benefit.

I recently used pretty urls on an install for the first time, having previously used the internal version.  I have a difficulty with the amount of duplicate content that gets thrown up.  For example:

This page: http://www.domain.com/how-cmsms-works

can be accessed at the following urls:

http://www.domain.com/how-cmsms-works
http://www.domain.com/how-cmsms-works/
http://www.domain.com/index.php/how-cmsms-works
http://www.domain.com/index.php/how-cmsms-works/
http://www.domain.com/index.php?page=how-cmsms-works
http://www.domain.com/index.php?page=how-cmsms-works/

It is quite possible that the page is also accessible directly via its page ID.  I had assumed this was the case but can't get it to happen, so maybe not.

If we ignore the trailing slash issue on the assumption that visitors won't notice or care and search engines might be smart enough to figure it out for themselves, there are still 3 separate urls at which the same content can be accessed.

In the documentation, it says something to the effect of, "if you are putting pretty urls into an existing site don't worry about your previously advertised urls because they will continue to work".  This is exactly what I want to avoid. 

I can understand why someone would have a difficulty with this and would want old style urls to still function for visitors but it strikes me as a really bad idea.  From an SEO point of view the general principle would be that each page should have one and only one url.  Systematically creating duplicate content is a bad idea. 

Outside a CMS, if I had to change the url of a page and didn't want to just let the old page die naturally, the correct way to deal with it is to 301 redirect from the old url to the new.  The reason for this is to avoid committing search engine suicide with duplicate content.

I am hoping someone can help me with this.  Ideally I need to set this up so that it works the following way:

I choose one of:

http://www.domain.com/how-cmsms-works
http://www.domain.com/how-cmsms-works/

and the other either returns a 404 or redirects to the canonical form.

These url forms:

http://www.domain.com/index.php/how-cmsms-works
http://www.domain.com/index.php/how-cmsms-works/

should both give 404s.  This bit is probably the most important.

For those with existing urls, the ideal solution would be different and would involve 301 redirecting these url forms to the new url format. 

I am guessing that the index.php?page= format has to stay.  The index.php script has to be called somewhere (am I right?) but I can change "page" to something else in the config file.  I am hoping that when I change "page" to "cccyy", old urls of the form www.domain.com/index.php?page=page-alias will stop working and give a 404 error.

Basically I need all of the old url forms to 404 so that the search engines remove them from their index and I am left without any duplicate content problems.
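The behaviour asked for above can be written down as a small decision function. This is only a sketch of the policy, not CMSMS code: `decide_response()` and `legacy_response()` are hypothetical names, the canonical form is assumed to be the slash-less `/page-alias`, and the 301 branch assumes a flat site where the new URL is simply `/` plus the alias.

```php
<?php
// Hypothetical sketch of the URL policy described in the post; not part of CMSMS.
// $requestUri   : path plus optional query string, e.g. "/index.php/how-cmsms-works/"
// $legacyStatus : 404 to let old URL forms die, 301 to forward them (existing sites)
function decide_response($requestUri, $legacyStatus = 404)
{
    $path  = (string) parse_url($requestUri, PHP_URL_PATH);
    $query = (string) parse_url($requestUri, PHP_URL_QUERY);

    // Old query-string form: /index.php?page=alias
    if ($path === '/index.php' && $query !== '') {
        parse_str($query, $args);
        $alias = (isset($args['page']) && is_string($args['page']))
            ? trim($args['page'], '/') : '';
        return legacy_response($alias, $legacyStatus);
    }

    // Internal pretty URL form: /index.php/alias
    if (strpos($path, '/index.php/') === 0) {
        $alias = trim(substr($path, strlen('/index.php/')), '/');
        return legacy_response($alias, $legacyStatus);
    }

    // Trailing-slash variant: redirect to the chosen canonical form
    if ($path !== '/' && substr($path, -1) === '/') {
        return array('status' => 301, 'location' => rtrim($path, '/'));
    }

    // Anything else is treated as the canonical URL itself
    return array('status' => 200, 'location' => null);
}

// Old URL forms either disappear (404) or forward to "/alias" (301).
function legacy_response($alias, $legacyStatus)
{
    if ($legacyStatus === 301 && $alias !== '') {
        return array('status' => 301, 'location' => '/' . $alias);
    }
    return array('status' => 404, 'location' => null);
}
```

Calling `decide_response('/index.php?page=how-cmsms-works/', 301)` would forward to `/how-cmsms-works`, while the default call answers 404 for the same URL. Whether this logic lives in the webserver configuration or in PHP is a separate choice.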

Does anyone have any idea how I would achieve this?

On a separate but related issue, /index.php?page followed by anything is showing the homepage, presumably with a 200 OK server response, which is another duplicate content issue.

I can't overstate how bad an idea it is to have a system that cheerfully churns out duplicate content and I am really hoping that this is a result of some kind of setup error on my part.

I would really, really appreciate any help anyone can give in solving this problem.
Pierre M.

Re: Pretty urls and duplicate content

Post by Pierre M. »

Hello Alastair,
alastair_scs wrote: I can't overstate how bad an idea it is to have a system that cheerfully churns out duplicate content and I am really hoping that this is a result of some kind of setup error on my part.
I agree a website should be a nice, simple URL-to-content mapping. I agree HTTP has everything needed to avoid duplicate content: 302, 301, 401.

A pretty URL enabled CMSms website should not output a duplicate URL. I mean : the pretty URL switch is site wide. The bots should find and crawl either /some/pretty/url/to/page.html or /index.php?page=page but not both. I don't think they try /index.php?page=page when they find /some/pretty/url/to/page.html so they shouldn't see any duplicate content but the wanted nice simple mapping.
alastair_scs wrote: can be accessed at the following urls:
(...)
If we ignore the trailing slash issue on the assumption that visitors won't notice or care and search engines might be smart enough to figure it out for themselves, there are still 3 separate urls at which the same content can be accessed.
I think adding the trailing slash is bad. This is my opinion. Search my previous posts about "trailing slash[es]".
No ugly slash and no ugly index, so only 1 pretty URL per content with CMSms core (modules may be a problem).
Otherwise, yes, I agree: there is "some kind of setup error" on the part of the URL conscious webmaster.
alastair_scs wrote: In the documentation, it says something to the effect of, "if you are putting pretty urls into an existing site don't worry about your previously advertised urls because they will continue to work".  This is exactly what I want to avoid. 
(...)the correct way to deal with it is to 301 redirect from the old url to the new.  The reason for this is to avoid committing search engine suicide with duplicate content.
(...)
Basically I need all of the old url forms to 404 so that the search engines remove them from their index and I am left without any duplicate content problems.
The doc isn't perfect (but improving) and hopefully the software is better. The URL conscious webmaster is a power user who understands the trailing slash issue and tweaks the webserver accordingly. BTW, feel free to improve the doc as a way to contribute ;)
I agree with you about "301 from old to new". This is webserver setting documentation language. (btw : there used to be a 'moved page' module, I don't know if it is maintained today).
I beg to differ on 404 for "dead" pages : The webserver doc suggests 401 (gone) instead.

See: all my talk is about tuning the webserver (once the CMSms core pretty URLs switch is on), not about CMSms.
alastair_scs wrote: On a separate but related issue, /index.php?page followed by anything is showing the homepage, presumably with a 200 OK server response, which is another duplicate content issue.
No 404 for this ? But these bad index.php URLs are in my above list of "shouldn't be found because not exposed".
Some information that should warm your heart :
-there are people working on improving pretty URL support in modules.
-there are people working on improving pretty URLs in the core.
-hence there are URL conscious people in the community.
-up to 1.2 (maybe 1.3) the sample htaccess did not have the trailing slash issue.

/hope/it/helps.html ;)

Pierre M.

Re: Pretty urls and duplicate content

Post by alastair_scs »

Thanks for the reply.

If I get the gist of your post correctly, you are suggesting that most of these problems are solvable with server configurations?
Pierre M. wrote: A pretty URL enabled CMSms website should not output a duplicate URL. I mean : the pretty URL switch is site wide. The bots should find and crawl either /some/pretty/url/to/page.html or /index.php?page=page but not both. I don't think they try /index.php?page=page when they find /some/pretty/url/to/page.html so they shouldn't see any duplicate content but the wanted nice simple mapping.
You are of course correct in this.  The problem arises if the search engine already knows about the old format urls.  If the search engine has /index.php?page=page in its index and you switch to /category/page.html then you have duplicate content.  If there are external links to /index.php?page=page then the situation is even worse.

Consider this example though.  I install a site.  I start with normal urls because I don't know what I am doing.  I discover pretty urls and switch on the internal ones because I haven't got the confidence to go into .htaccess stuff.  I like the idea of including hierarchies so I select that option.  My urls change to /index.php/category/page-alias

I'm happy with that but Google and the other search engines have come to crawl my site before I am ready, so they already have /index.php?page=page in their systems.  My new url is the same page with a different address.  At best it's confusing (for the poor little search engine).

To make matters worse, I use the "insert link to cmsms page" thingy in my tinyMCE editor to create a link.  Instead of pointing at /index.php/category/page-alias like I would assume it would though, this link points to /index.php/page-alias

Modern search engines don't wait for or pay any attention to submissions; they find content by following links.  If they follow the link I just inadvertently created they end up with 3 pages with the same content.  All these urls return content as if the spider had made a perfectly valid and successful request.

Now, I finally switch to using mod_rewrite and SE friendly, user friendly pretty urls and I have got 4 different urls pointing at the same content.  Google has indexed each and every one of them.  If there are external or internal links pointing to any of the old versions Google will continue to come across this duplicate content time and again and it will always return 200 OK.

So, while it is true that there is only one version used at one time, there is plenty of room for situations where duplicate content exists and this is not at all good.

Pierre M. wrote: The doc isn't perfect (but improving) and hopefully the software is better. The URL conscious webmaster is a power user who understands the trailing slash issue and tweaks the webserver accordingly. BTW, feel free to improve the doc as a way to contribute
From the sounds of things, you are suggesting that I can solve some or all of these problems with creative url rewriting.  Thank you for the suggestions.  I will give this a go.  Unfortunately, my url consciousness comes from 10 years in SEO.  It comes without any technical knowledge of url rewriting.  To make matters worse, I am on lighttpd and the rewriting engine seems to be different.

If anyone has been through this and found a solution, I would really appreciate the help.

Regardless of whether I can find a workaround with the server settings, it seems to me that the internal pretty urls should cease to work when they are turned off.  And the non hierarchical url /index.php/page-alias should never have worked (since I have the use hierarchy switched on).  To find that my content can be accessed through 6 or more different urls, regardless of the settings I choose is disturbing.
duclet
Forum Members
Posts: 187
Joined: Fri Jun 23, 2006 12:55 pm

Re: Pretty urls and duplicate content

Post by duclet »

The URLs are the way they are for a variety of reasons. The first is that some modules are able to set URLs that are specific to the module, as if it were a regular page on the site. Also, the alias of the page is retrieved just by looking at the last item in the URL, and pretty much everything before that is ignored. That is why you can't have both a URL /one/index/ and a URL /two/index/. Maybe the use of the hierarchy path instead might be better but I can already see problems with those modules' URLs.
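To make the "last item in the URL" behaviour concrete, here is a simplified illustration. `alias_from_path()` is a hypothetical helper written for this example and is not the actual CMSMS lookup code; it only shows why two different hierarchy paths collapse onto the same alias.

```php
<?php
// Simplified illustration of "take the last item in the URL as the page alias".
// alias_from_path() is a hypothetical helper, not the CMSMS implementation.
function alias_from_path($path)
{
    // Split on "/" and drop empty pieces caused by leading/trailing slashes
    $segments = array_values(array_filter(explode('/', $path), 'strlen'));

    // Everything before the last segment is ignored
    return $segments ? end($segments) : '';
}
```

With this rule `/one/index/` and `/two/index/` both resolve to the alias `index`, and `/index.php/category/page-alias` and `/index.php/page-alias` both resolve to `page-alias`, which matches the duplicate URLs reported earlier in the thread.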
Pierre M.

Re: Pretty urls and duplicate content

Post by Pierre M. »

Hello,
duclet wrote: Maybe the use of the hierarchy path instead might be better
I agree. CMSms is not designed this way today.
alastair_scs wrote: If I get the gist of your post correctly, you are suggesting that most of these problems are solvable with server configurations?
Yes. More precisely I mean a well made website can be made (from scratch) with CMSms core and a good rewriting webserver. I mean a website which exposes only pretty URLs to the web and 301 rewrites the others.
alastair_scs wrote: The problem arises if the search engine already knows about the old format urls.
I agree. But old URL => 301 (or 401) in webserver. This is basic webmastering and not CMSms specific.
alastair_scs wrote: To make matters worse, I use the "insert link to cmsms page" thingy in my tinyMCE editor to create a link.  Instead of pointing at /index.php/category/page-alias like I would assume it would though, this link points to /index.php/page-alias
(...)
So, while it is true that there is only one version used at one time, there is plenty of room for situations where duplicate content exists and this is not at all good.
Yes, again I agree there is plenty of room to improve CMSms' URL handling.
Yes, Tiny and other modules may generate bad links.

Meanwhile, my point is that right now every webmaster can extract (via SQL) or crawl the list of aliases of his or her site and feed it to a generator of 301 rewrite rules:
for each alias xyz in the list, the generator outputs these 301 rules:
-/index.php?page=xyz 301 to /hierarchy/xyz
-/index.php/blah/xyz 301 to /hierarchy/xyz
-/index.php/xyz 301 to /hierarchy/xyz
-/xyz 301 to /hierarchy/xyz (unless it is a root URL with no hierarchy of course)
At the end the search engine "301 knows" the only valid and stable URL is /hierarchy/xyz
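A generator along these lines could be sketched as follows. Everything here is an assumption made for illustration: the function names are hypothetical, the input is a plain array mapping each alias to its hierarchy URL (however you obtain it), and the output imitates the `"regex" => "target"` entries of lighttpd's mod_redirect `url.redirect` option, which as far as I know answers with a 301 by default. Check the result against your own server's documentation before using it.

```php
<?php
// Hypothetical generator: $pages maps each page alias to its hierarchy URL,
// e.g. array('xyz' => '/hierarchy/xyz'). Returns regex => target pairs.
function build_redirect_rules(array $pages)
{
    $rules = array();
    foreach ($pages as $alias => $target) {
        $a = preg_quote($alias, '#');

        // /index.php?page=xyz  ->  /hierarchy/xyz
        $rules['^/index\.php\?page=' . $a . '/?$'] = $target;

        // /index.php/xyz and /index.php/blah/xyz  ->  /hierarchy/xyz
        $rules['^/index\.php/(?:.+/)?' . $a . '/?$'] = $target;

        // /xyz  ->  /hierarchy/xyz, unless /xyz already is the final URL
        if ('/' . $alias !== $target) {
            $rules['^/' . $a . '/?$'] = $target;
        }
    }
    return $rules;
}

// Renders the pairs in the style of a lighttpd url.redirect block (assumed format).
function render_lighttpd_block(array $rules)
{
    $entries = array();
    foreach ($rules as $regex => $target) {
        $entries[] = '  "' . $regex . '" => "' . $target . '"';
    }
    return "url.redirect = (\n" . implode(",\n", $entries) . "\n)\n";
}
```

For example, `echo render_lighttpd_block(build_redirect_rules(array('xyz' => '/hierarchy/xyz')));` prints three entries, one per old URL form. Module-generated URLs are not covered by this sketch.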

BTW, I'm on lighttpd too :-)
alastair_scs wrote: it seems to me that the internal pretty urls should cease to work when they are turned off. 
I agree. But remember with 301 redirects at the webserver level you can still expose a nice site to the web even when some things are suboptimal in CMSms.
alastair_scs wrote: And the non hierarchical url /index.php/page-alias should never have worked (since I have the use hierarchy switched on).
Yes. This is history. CMSms once had no pretty URL feature at all. The alias feature (as its name implies) was a shortcut.
alastair_scs wrote: To find that my content can be accessed through 6 or more different urls, regardless of the settings I choose is disturbing.
Again, I agree with you. I hope interested people will step in and :
-provide some listing generator (2 columns : naughty alias and nice hierarchy URL)
-provide some 301 redirect rules generator from previous listing.
-provide code and (module) testing to make my above workarounds unnecessary.

Have fun with CMSms, wish all happy lighty tuning

Pierre M.

Re: Pretty urls and duplicate content

Post by alastair_scs »

Thanks Pierre

I haven't had a chance to get back to this yet but those suggestions look really useful.  If I can find anything that amounts to a solution, I will let you know.
Pierre M.

Re: Pretty urls and duplicate content

Post by Pierre M. »

Update : There are typos in my posts : http "gone" is 410, not 401.

Pierre M.
nhaack

Re: Pretty urls and duplicate content

Post by nhaack »

I was just thinking... what would be the best way for such a rewrite list generating tool to write the rules... directly write to the .htaccess file (is this possible from php) or can I hook up a list with rewrite entries to the .htaccess (sorry if this is a basic question)?

Are there generally accepted maximum reasonable .htaccess file sizes?

I haven't had the need for something like that as such URLs never appear on my site... though I actually use the query parameter version for ajax calls to pages... I didn't have any indexing problems due to that as these URLs are in the javascript.

However, in the mentioned case, that you switch your URL structure, such a tool might come in handy... thus the questions ;D

Best
Nils
Pierre M.

Re: Pretty urls and duplicate content

Post by Pierre M. »

Hello,

the admin Content->Pages page generates the content hierarchy. Hence it could be stripped to make the generator and (why not, as you say) the .htaccess file itself directly. Not difficult for a PHP programmer.

Or one could use wget.

I agree a .htaccess file should not have too many conditions and rules. It depends on the hosting context.
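On the question of writing the file from PHP: yes, PHP can write to any file the webserver user is allowed to modify. A minimal sketch, assuming the generated rules should live between two marker comments so that hand-written rules are left alone; `write_generated_block()` and the marker texts are hypothetical names chosen for this example.

```php
<?php
// Hypothetical helper: replaces (or appends) a marked block inside a config
// file such as .htaccess, leaving everything outside the markers untouched.
function write_generated_block(
    $file,
    array $lines,
    $begin = '# BEGIN generated redirects',
    $end = '# END generated redirects'
) {
    $existing = is_file($file) ? (string) file_get_contents($file) : '';
    $block = $begin . "\n" . implode("\n", $lines) . "\n" . $end . "\n";

    // Match from the begin marker line to the end marker line, inclusive
    $pattern = '/^' . preg_quote($begin, '/') . '\n.*?^' . preg_quote($end, '/') . '\n?/ms';

    if (preg_match($pattern, $existing)) {
        // A callback avoids "$1"-style sequences in rules being treated as backreferences
        $updated = preg_replace_callback($pattern, function () use ($block) {
            return $block;
        }, $existing, 1);
    } else {
        if ($existing !== '' && substr($existing, -1) !== "\n") {
            $existing .= "\n";
        }
        $updated = $existing . $block;
    }

    // LOCK_EX guards against two writers interleaving their output
    return file_put_contents($file, $updated, LOCK_EX) !== false;
}
```

Running the helper a second time replaces the previous block instead of appending another one, so the file does not grow with every regeneration. Keeping a backup of the file before the first run is sensible.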

Pierre M.
escape2mtns
New Member
Posts: 2
Joined: Mon Feb 16, 2009 10:34 pm

Re: Pretty urls and duplicate content

Post by escape2mtns »

I had the same problem and solved it by creating a plugin that checks the current page URL. If it doesn't match the 'CMSMS' URL for the current page, the plugin sends a redirect header (and/or outputs the new 'canonical' tag in the page head).

I used the function.breadcrumbs.php as a base and modified for the following function.checkpageurl.php (to be placed in cmsms/plugins):

Code: Select all

function smarty_cms_function_checkpageurl($params, &$smarty)
{
    global $gCms;
    $redirect = false;

    $manager = &$gCms->GetHierarchyManager();
    $thispage = $gCms->variables['content_id'];

    $trail = "";

#Check if user has specified a redirect to proper url, otherwise use default
    if (isset($params['redirect'])) {
        $redirect = $params['redirect'];
    }   else {
        $redirect = false;
    }

    $endNode = &$manager->sureGetNodeById($thispage);

    if (isset($endNode))
    {
        $content =& $endNode->getContent();
        $contentURL = $content->getURL();
        $currentURL = curPageURL();
         if ($contentURL != "") {
            if($redirect && $contentURL != $currentURL) {
                # output header redirect
                header('Location: '.$contentURL,TRUE,301);
            }

            $trail = '<link rel="canonical" href="'.$contentURL.'" />';
         }
    }

    return $trail;
}

function smarty_cms_help_function_checkpageurl() {
  echo "Func to output current page URL and redirect if not correct (needs redirect=true parameter)";
}

function smarty_cms_about_function_checkpageurl() {
  echo "Func to output current page URL and redirect if not correct (needs redirect=true parameter)";
}

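function curPageURL() {
 # Treat the request as HTTPS only when the server says so explicitly;
 # $_SERVER["HTTPS"] is often unset on plain HTTP.
 $isHttps = !empty($_SERVER["HTTPS"]) && $_SERVER["HTTPS"] != "off";
 $pageURL = ($isHttps ? 'https://' : 'http://').$_SERVER["SERVER_NAME"];
 # Append the port only when it is not the default for the scheme,
 # otherwise HTTPS on port 443 would never match the CMSMS URL.
 $defaultPort = $isHttps ? "443" : "80";
 if ($_SERVER["SERVER_PORT"] != $defaultPort) {
  $pageURL .= ":".$_SERVER["SERVER_PORT"];
 }
 return $pageURL.$_SERVER["REQUEST_URI"];
}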

That way it will dynamically check as the page loads to see if the current URL matches the current 'Pretty URL' and if not it will redirect using a 301 redirect. 

Any reason why that won't do the trick?  BTW, thanks for cmsms - I'm a brand new user and so far it looks great!

Thanks,

Jay
CWebguy
Forum Members
Posts: 139
Joined: Thu Jul 24, 2008 3:31 am

Re: Pretty urls and duplicate content

Post by CWebguy »

Google has a new canonical url tag that should fix this.

http://lancepat.wordpress.com/2009/02/1 ... l-url-tag/

Re: Pretty urls and duplicate content

Post by escape2mtns »

The above plugin will dynamically either generate the google canonical tag or do the 301 rewrite using php headers. 

After implementing it yesterday, I noticed that the plugin was redirecting the definitions of the "glossary" module back to the main listing page.  I assume this is because the CMSMS getURL function returns the main CMS page URL while the glossary module on that page is generating a URL to pass parameters.  This version fixes that issue.

Code: Select all

# {checkpageurl} should be in <head> of template to generate canonical tag
#
# use {checkpageurl redirect=true} for 301 redirect
#

function smarty_cms_function_checkpageurl($params, &$smarty)
{

    global $gCms;
    $redirect = false;
    $stripExt = '';
    $ignoreModSubPage = false;

    $config = &$gCms->config;
    if(isset($config["page_extension"]) && !empty($config["page_extension"])) {
        $stripExt = $config["page_extension"];
        }

    $manager = &$gCms->GetHierarchyManager();

    $thispage = $gCms->variables['content_id'];

    $trail = "";

    #Check if user has specified a redirect to proper url, otherwise use default
    if (isset($params['redirect'])) {
        $redirect = $params['redirect'];
    }   else {
        $redirect = false;
    }

    $endNode = &$manager->sureGetNodeById($thispage);

    if (isset($endNode))
    {
        $content =& $endNode->getContent();
        $contentURL = $content->getURL();
        $currentURL = curPageURL();
         if ($contentURL != "") {
            if($redirect && $contentURL != $currentURL && strpos($currentURL,$contentURL)===false) {
                # 3rd check strpos - prevents redirect if parameters on url

                // There is one last check we should do... let's strip the config-mandated extension
                // from the URL and see if the current page is a sub-page that's module-generated (ex: glossary module definitions)
                if(!empty($stripExt)) {
                    $cleanedURL = str_replace($stripExt, '', $contentURL).'/';

                    if(!empty($cleanedURL) && $cleanedURL != $currentURL && strpos($currentURL, $cleanedURL) !== false) {
                        // this page is likely a module generated page
                        $ignoreModSubPage = true;
                    }
                }

                if(!$ignoreModSubPage) {
                    # output header redirect
                    header('Location: '.$contentURL,TRUE,301);
                }
            }


            if(!$ignoreModSubPage) {
                $trail = '<link rel="canonical" href="'.$contentURL.'" />';
            }
         }
    }
    return $trail;
}

function smarty_cms_help_function_checkpageurl() {
  echo "Func to output current page URL and redirect if not correct (needs redirect=true parameter)";
}

function smarty_cms_about_function_checkpageurl() {
  echo "Func to output current page URL and redirect if not correct (needs redirect=true parameter)";
}

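function curPageURL() {
 # Treat the request as HTTPS only when the server says so explicitly;
 # $_SERVER["HTTPS"] is often unset on plain HTTP.
 $isHttps = !empty($_SERVER["HTTPS"]) && $_SERVER["HTTPS"] != "off";
 $pageURL = ($isHttps ? 'https://' : 'http://').$_SERVER["SERVER_NAME"];
 # Append the port only when it is not the default for the scheme,
 # otherwise HTTPS on port 443 would never match the CMSMS URL.
 $defaultPort = $isHttps ? "443" : "80";
 if ($_SERVER["SERVER_PORT"] != $defaultPort) {
  $pageURL .= ":".$_SERVER["SERVER_PORT"];
 }
 return $pageURL.$_SERVER["REQUEST_URI"];
}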

calguy1000
Support Guru
Posts: 8169
Joined: Tue Oct 19, 2004 6:44 pm

Re: Pretty urls and duplicate content

Post by calguy1000 »

in 1.5.3 the

Code: Select all

<link rel="canonical" href="{$content_obj->GetURL()}" />
code will be in the default templates, for new installs.

But as noted... this won't work properly for some pages dynamically generated by modules (particularly detail reports for News, etc).
The proper solution for this is to include the proper URL inside the various module templates, and to do some smarty magic to set the canonical link tag properly based on that URL that's supplied in the module templates.

Unfortunately, this requires a code modification to each and every module.
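The idea of letting a module-supplied URL override the page URL can be sketched in plain PHP. This is only an illustration of the fallback logic: `canonical_link_tag()` and both of its parameters are hypothetical and do not correspond to any CMSMS or module API.

```php
<?php
// Hypothetical helper: prefer a canonical URL supplied by a module template
// (e.g. a News detail view) and fall back to the page's own URL otherwise.
function canonical_link_tag($pageUrl, $moduleUrl = null)
{
    $url = ($moduleUrl !== null && $moduleUrl !== '') ? $moduleUrl : $pageUrl;

    // Escape the URL so characters such as & or " cannot break the attribute
    return '<link rel="canonical" href="'
        . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '" />';
}
```

A page without module output would call it with one argument and get the same tag as the template snippet above, while a module detail view would pass its own URL as the second argument.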

Re: Pretty urls and duplicate content

Post by CWebguy »

um, that stinks
Pierre M.

Re: Pretty urls and duplicate content

Post by Pierre M. »

Hello all, thank you Calguy

In the sample templates in 1.5.3, will this "{$content_obj->GetURL()}" generate the full http://w3.domain.tld/animal/bird/eagle.html ? I mean : does it generate the link from what the pretty URLs configuration is already (with hierarchy and suffix) ? I mean : there is nothing more to do than adding this in the templates and CMSms sites are all bleeding-edge Google compliant ? I'm not dreaming ?

Pierre M.