
exclude page from search engines

Posted: Tue Jun 01, 2010 3:24 pm
by edented
Is it possible to stop a search bot from accessing a single page, from within the admin panel?

I've created a page that notifies me via email when a visitor has accessed it (a download page), but it seems it's being visited by crawlers.

Thanks

Re: exclude page from search engines

Posted: Sun Jun 06, 2010 1:53 am
by NaN
Search engines are identified by their user agent.
You could create a UDT that checks the user agent; if it is a known search bot and the content property "searchable" is not set, return an empty page with a 403 header.
This UDT can be placed in the pagedata or in the template.
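
A minimal sketch of the user-agent part (just an idea, not tested; the bot list is only an example, and the check of the "searchable" property is left out since it depends on your setup):

Code: Select all

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$bots = array('Googlebot', 'bingbot', 'Slurp'); // example UA fragments
foreach ($bots as $bot) {
    if (stripos($ua, $bot) !== false) {
        // looks like a bot: send a 403 and stop rendering the page
        header('HTTP/1.0 403 Forbidden');
        exit;
    }
}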

You could also try a robots.txt.

Re: exclude page from search engines

Posted: Sun Jun 06, 2010 1:56 am
by calguy1000
Uhm, add a meta noindex tag... and add rel="nofollow" to the links that point to that page...

Re: exclude page from search engines

Posted: Sun Jun 06, 2010 8:11 am
by Peciura

Code: Select all

<meta name="robots" content="noindex, nofollow" />

Re: exclude page from search engines

Posted: Sun Jun 06, 2010 12:04 pm
by NaN
But the metadata does not prevent a bot from accessing the page.
It just won't be indexed.
Anyway, the whole page has already been processed by the time the bot receives the noindex/nofollow directive.
That means the script that notifies you via email may already have run, no matter what is in the metadata.

This is why I would use a plugin or UDT to prevent the page from being rendered, so no scripts are processed at all if the user agent is just a bot.

By the way, how does the script work?
Is it event driven, or do you have a plugin/UDT placed in your page?
Maybe you can just change your email script to check the user agent and then decide whether an email should be sent.

Re: exclude page from search engines

Posted: Sun Jun 06, 2010 6:33 pm
by edented
Thanks for the replies!
I have a UDT that captures the IP and user agent and sends a formatted email, plus an onload handler in the source that redirects to the download.
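
Roughly along these lines (a sketch, not my exact code; the address is a placeholder):

Code: Select all

// capture visitor details and mail them to me
$ip = $_SERVER['REMOTE_ADDR'];
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
mail('me@example.com', 'Download page visited', "IP: $ip\nUser agent: $ua");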

Strangely enough, the page was crawled twice (Google and Yahoo) almost immediately after it was created, but not since. I worried that I would catch a flood of non-human visitors, and I still might once I actually submit the site to search engines or add more links to the page.

I'll add a robots.txt entry like

Code: Select all

User-agent: *
Disallow: /index.php?page=somepagename

If it gets out of control, maybe I'll attempt to set up a bot trap:
http://www.kloth.net/internet/bottrap.php

As I'm more of a muddler than a programmer, scripting a UDT that traps bots and returns a 403 header might take me an inordinate amount of time.

Thanks again!

Re: exclude page from search engines

Posted: Sun Jun 06, 2010 7:26 pm
by NaN
edented wrote:
I have a UDT that captures the ip and useragent, and sends a formatted email, and an onload in the source that redirects to the download.
If your script already captures the user agent, you just need to check whether it matches any known bots. If not, send the email; otherwise do nothing.
You could create an array of known bots, and after the script has captured the user agent, check it like this:

Code: Select all


$bots = array('Googlebot', 'bingbot', 'Slurp'); // fragments of known bot UAs
$is_bot = false;
foreach ($bots as $bot) {
    // substring match: real user agents are long strings, not bare names
    if (stripos($useragent, $bot) !== false) $is_bot = true;
}
// only send the notification email for visitors that look human
if (!$is_bot) { /* send the email here */ }

(just an example, not tested code)

You can do the same thing with bad IPs (see the sketch below).
And whenever your bot trap (if you use one) identifies a new bot, you can add its user agent and/or IP address to the array.
That way you won't get emails when only bots crawl your site, and you don't need to hide the page from the search indexes at all.
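
The IP variant might look like this (again just a sketch; the addresses are placeholders):

Code: Select all

$bad_ips = array('192.0.2.10', '198.51.100.7'); // placeholder addresses
if (in_array($_SERVER['REMOTE_ADDR'], $bad_ips)) {
    return; // known bot IP: skip the email
}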