Some URLs redirect to home page instead of 404

Posted: Sun Nov 05, 2006 3:06 pm
by Grantovich
A long time ago, my site used a different CMS, and its URLs looked like this:

http://site.com/?q=node/32

Now I have CMSMS, and I would like all those old URLs to go to a proper 404 error page so the search engines will remove them from their indexes. The problem is, CMSMS keeps redirecting them to the home page, causing the search bots to think they are still active, and my home page is continuously re-indexed under 10 or 20 different (meaningless) URLs. All other nonexistent pages go to a 404 message, but for some reason, URLs in the form "site.com/?q=anything" are always redirected to the home page.

Here's my .htaccess (it makes things come out as "site.com/parent/child"):

Code:

Options -Indexes
Options +FollowSymLinks
RewriteEngine on
RewriteBase /

RewriteCond %{HTTP_HOST} ^www\.(.*)$
RewriteRule ^(.*)$ http://%1/$1 [R,L]

RewriteCond %{REQUEST_FILENAME} !-f [NC]
RewriteCond %{REQUEST_FILENAME} !-d [NC]
RewriteRule ^(.+)$ index.php?page=$1 [QSA]
Anybody know how I can stop the mystery redirects?

Re: Some URLs redirect to home page instead of 404

Posted: Sun Nov 05, 2006 4:26 pm
by tsw
Grantovich wrote: RewriteCond %{HTTP_HOST} ^www\.(.*)$
RewriteRule ^(.*)$ http://%1/$1 [R,L]
Not sure, but those rules look like they might be the cause...

Re: Some URLs redirect to home page instead of 404

Posted: Sun Nov 05, 2006 5:42 pm
by Grantovich
tsw wrote:
Grantovich wrote: RewriteCond %{HTTP_HOST} ^www\.(.*)$
RewriteRule ^(.*)$ http://%1/$1 [R,L]
Not sure, but those rules look like they might be the cause...
Hmm... what that is supposed to do is strip the "www." from the start of the address, if present. I commented out this section and tried again, but it still redirected "site.com/?q=anything" to the home page.

EDIT: I really should not be saying redirect because that's not what it is. If you enter a /?q= URL into the browser, it returns the home page, but doesn't actually redirect you to the home page's "true" address.

Re: Some URLs redirect to home page instead of 404

Posted: Sun Nov 05, 2006 9:38 pm
by swgreed
Add this to your .htaccess

Code:

ErrorDocument 404 http://www.domain.com/404.htm
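
One caveat with this form, noted here as general Apache behavior rather than anything tested in this thread: when ErrorDocument is given a full URL, Apache sends the client a redirect to that URL, so the client does not receive the original 404 status. A local path keeps the status code. A minimal sketch, assuming a hypothetical /404.htm in the document root:

```apache
# Hypothetical local error page: /404.htm is assumed to exist in the document root.
# With a local path, Apache serves the page itself and the response keeps its 404 status.
ErrorDocument 404 /404.htm
```

Either way, this directive only applies once Apache has already decided the request is a 404, so it would not by itself change a URL that currently returns 200.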

Re: Some URLs redirect to home page instead of 404

Posted: Mon Nov 06, 2006 12:53 am
by Grantovich
swgreed wrote: Add this to your .htaccess

Code:

ErrorDocument 404 http://www.domain.com/404.htm
Doesn't seem to change anything. "http://domain.com/?q=whatever" still returns the home page.

Re: Some URLs redirect to home page instead of 404

Posted: Wed Nov 08, 2006 9:42 pm
by Grantovich
Well, I have some good news: It appears that the search engines are gradually realizing that all of the /?q= URLs return the same page, and they are removing these URLs as duplicate content. The process is glacially slow, of course.

I tried making my own RewriteRule that would return a 410 error if the URL contained "?q=" anywhere in it, but I must not have a very good understanding of the mechanism, because it didn't change anything. Then again, for all I know the problem could be in CMSMS and have nothing whatsoever to do with htaccess. Anybody have any ideas?

Re: Some URLs redirect to home page instead of 404

Posted: Thu Nov 09, 2006 9:07 pm
by Pierre M.
HTTP "Gone" is 410, not 404.
What about something like this:
RewriteRule ^?q=node/.* - [R=410,L]
I'm not quite sure, so check the Apache docs.

And avoid duplicated content (the same content available at multiple URLs): your site may be blacklisted as spam.
Hope it helps.

PM

Re: Some URLs redirect to home page instead of 404

Posted: Fri Nov 10, 2006 12:55 am
by Grantovich
Ow! Adding this line results in every page on the entire site giving a "500 Internal Server Error". Strange, because it looks legitimate given my extremely limited knowledge of htaccess.

By the way, I don't have any (intentionally) duplicated content. The problem is that the home page is returned, not redirected, for all of the old ?q= URLs. As far as the search engines are concerned, all of the old URLs are completely different pages, but with exactly the same content. There is no redirecting involved; the home page effectively exists at all of those locations. It's that particular attribute of the problem that's driving me batty.

Re: Some URLs redirect to home page instead of 404

Posted: Sat Nov 11, 2006 7:59 am
by Pierre M.
I said I was not quite sure ;-)
Here is another try:

1) Set up a static error page for 410:

ErrorDocument 404 /sorry_not_found.html
ErrorDocument 410 /sorry_gone.html

2) Declare the old URLs gone with mod_alias:

Redirect gone /q=node
according to http://httpd.apache.org/docs/2.2/mod/mo ... l#redirect

or use mod_rewrite (I previously forgot the '$'):
RewriteEngine On
RewriteBase /
RewriteRule ^?q=node/.*$ - [G,L]

...or maybe RewriteRule ^?q=node/.*$ /sorry_gone.html [G,L]

PM

Re: Some URLs redirect to home page instead of 404

Posted: Sat Nov 11, 2006 3:58 pm
by Grantovich
I tried every one of the following (turns out the "500 Internal Server Error" comes from not escaping the question mark):

Redirect gone /?q=
RedirectMatch gone /\?q=.*
RewriteRule ^\?q=.*$ - [G,L]
RewriteRule ^\?q=.*$ /gone.htm [G,L]
RewriteRule \?q=.* /gone.htm [G,L]
RewriteRule (.*)\?q=(.*) /gone.htm [G,L]

None of them had any effect whatsoever. It especially irritates me that not even "RedirectMatch" worked, since I'm quite positive I'm using it the right way, and the documentation agrees. Argh.
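
A likely explanation, based on documented mod_rewrite and mod_alias behavior and not confirmed anywhere in this thread: RewriteRule, Redirect and RedirectMatch all match against the URL path only. The query string (everything after the "?") is never part of the text they test, so a pattern containing \?q= cannot match a request for /?q=anything. In mod_rewrite the query string is available through a RewriteCond on %{QUERY_STRING}. A minimal sketch, assuming every retired URL has a query string that starts with "q=" and that these lines go above the CMSMS rewrite rule from the first post:

```apache
RewriteEngine on
RewriteBase /

# Assumption: all retired URLs look like /?q=node/32, i.e. the query string begins with "q=".
# RewriteCond tests the query string, which the RewriteRule pattern never sees.
RewriteCond %{QUERY_STRING} ^q=
# ".*" matches any path (including the empty one for "/"); "-" leaves the URL untouched;
# G makes Apache answer 410 Gone.
RewriteRule .* - [G,L]
```

If it behaves as intended, a header check on /?q=kiwi should report 410 instead of 200, while /parent/child should be unaffected, because neither the original request nor the rewritten index.php?page=... has a query string starting with "q=".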

Re: Some URLs redirect to home page instead of 404

Posted: Sun Nov 12, 2006 7:04 pm
by Pierre M.
Argh as you say.

So there are no more 500 errors? Good catch on escaping the question mark. Well done, I had missed it.
What is your OS, and which Apache version are you running?
And what about the client side? Have you tried wget in verbose mode?

I agree RedirectMatch should be the best option. But tell us what is happening: does the browser/wget get a 404, 3xx, 5xx, or something else?
(Maybe you should clear the client-side cache each time you test.)

Sorry to ask you to check this, but with RewriteRule ^\?q=.*$ - [G,L], do you have RewriteEngine On and a correct RewriteBase too?
Please check the name and location of your .htaccess too.
And the file permissions (.htaccess, gone.htm, 404.html...)

PM

Re: Some URLs redirect to home page instead of 404

Posted: Mon Nov 13, 2006 11:58 pm
by Grantovich
I'm not sure what server information I'm looking for, so I have just attached the result of the phpinfo() function on my site. Hopefully it will be of some use. (you will need to change the extension to .htm to view it)

http://www.seoconsultants.com/tools/headers.asp
I have used this site to test the return codes for pages that start with "/?q=". Whether or not I have the RewriteRule or RedirectMatch in the htaccess, it always returns the same thing: 200 OK. I try a different URL each time to avoid caching issues (like "/?q=watermelon", "/?q=pineapple", etc).

As you can see in the original htaccess at the top of the post, I have RewriteEngine on and RewriteBase /. The file permissions must be set correctly, because the original file works and does exactly what I want it to.

[deleted by administrator]

Re: Some URLs redirect to home page instead of 404

Posted: Tue Nov 14, 2006 12:15 pm
by Pierre M.
So, your IE answers 404 and somebody else answers 200. At least one of you is wrong. You must have good information on this to go on.

So I'll suggest it again: have you tried wget instead of IE?

wget -v -S -o log1 http://my.server.tld/known/working/static/url1.html
wget -v -S -o log2 http://my.server.tld/known/redirected/static/url2.html
wget -v -S -o log3 http://my.server.tld/?q=kiwi

Please report a matrix of your RedirectMatch/RewriteRule combinations against these 3 columns, with the status code (2xx-5xx), the content returned, and the log in each cell.

BTW, please check your username/password for the db in config.php too.

PM

Re: Some URLs redirect to home page instead of 404

Posted: Wed Nov 15, 2006 12:37 am
by Grantovich
Pierre M. wrote: So, your IE answers 404 and somebody else answers 200. At least one of you is wrong. You must have good information on this to go on.
I don't understand... what do you mean by "my IE answers 404 and somebody else answers 200"? Anyway, I was about to say I can't use wget because I'm on a shared server with no shell access, but then I remembered cron jobs. So here we go.

Key: htaccess line -- code returned by site.com/static.html -- code returned by site.com/parent/child (remapped by htaccess to site.com/index.php?page=child) -- code returned by site.com/?q=something

No change from original (see first post) -- 200 -- 200 -- 200
RedirectMatch gone /\q=.* -- 200 -- 200 -- 200
RedirectMatch gone (.*)\?q=(.*) -- 200 -- 200 -- 200
RewriteRule ^\?q=.*$ - [G,L] -- 200 -- 200 -- 200
RewriteRule ^\?q=.*$ /gone.htm [G,L] -- 200 -- 200 -- 200
RewriteRule \?q=.* /gone.htm [G,L] -- 200 -- 200 -- 200

The logs for all the static page requests look like this:

Code:

--19:20:01--  http://site.com/static.html
           => `static.html'
Resolving site.com... x.x.x.x
Connecting to site.com|x.x.x.x|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Date: Wed, 15 Nov 2006 00:20:01 GMT
  Server: Apache/1.3.37 (Unix) mod_auth_passthrough/1.8 mod_log_bytes/1.2
mod_bwlimited/1.4 PHP/4.4.3 FrontPage/5.0.2.2635.SR1.2 mod_ssl/2.8.28
OpenSSL/0.9.7a
  Last-Modified: Tue, 31 Oct 2006 01:03:45 GMT
  ETag: "12bc028-0-4546a0f1"
  Accept-Ranges: bytes
  Content-Length: 0
  Keep-Alive: timeout=15, max=100
  Connection: Keep-Alive
  Content-Type: text/html
Length: 0 [text/html]
The logs for all the rewritten page requests look like this:

Code:

--19:22:01--  http://site.com/parent/child
           => `child'
Resolving site.com... x.x.x.x
Connecting to site.com|x.x.x.x|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Date: Wed, 15 Nov 2006 00:22:01 GMT
  Server: Apache/1.3.37 (Unix) mod_auth_passthrough/1.8 mod_log_bytes/1.2
mod_bwlimited/1.4 PHP/4.4.3 FrontPage/5.0.2.2635.SR1.2 mod_ssl/2.8.28
OpenSSL/0.9.7a
  X-Powered-By: PHP/4.4.3
  Connection: close
  Content-Type: text/html; charset=UTF-8
Length: unspecified [text/html]
The logs for all the ?q= requests look like this:

Code:

--19:24:01--  http://site.com/?q=kiwi
           => `index.html?q=kiwi'
Resolving site.com... x.x.x.x
Connecting to site.com|x.x.x.x|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Date: Wed, 15 Nov 2006 00:24:02 GMT
  Server: Apache/1.3.37 (Unix) mod_auth_passthrough/1.8 mod_log_bytes/1.2
mod_bwlimited/1.4 PHP/4.4.3 FrontPage/5.0.2.2635.SR1.2 mod_ssl/2.8.28
OpenSSL/0.9.7a
  X-Powered-By: PHP/4.4.3
  Connection: close
  Content-Type: text/html; charset=UTF-8
Length: unspecified [text/html]
I get the exact same logs for all the htaccess lines listed above. In every case, the static page returns itself, the remapped page returns its correct location (site.com/parent/child is correctly remapped to site.com/index.php?page=child), and the ?q= URL returns the home page (exactly the same as going to just site.com). Please correct me if I've left out some information that is important.

Re: Some URLs redirect to home page instead of 404

Posted: Wed Nov 15, 2006 10:25 pm
by Pierre M.
If Apache always answers "200 OK", maybe your Internet Explorer (IE) is wrong when it reports 404?
PM