exclude page from search engines

edented · Post by **edented** » Tue Jun 01, 2010 3:24 pm

Is it possible to stop a searchbot from accessing a single page from within the admin panel?

I've created a page that notifies me via email when a visitor accesses it (a download), but it seems the crawlers are visiting it too.

Thanks

NaN · Post by **NaN** » Sun Jun 06, 2010 1:53 am

Search engines are identified by the user agent.
You could create a UDT that checks the user agent and, if it is a known search bot and the content property "searchable" is not set, returns an empty page with a 403 header.
This UDT can be placed in the pagedata or in the template.

You could also try a robots.txt.
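
A rough sketch of the user agent check described above, written as standalone PHP rather than as a finished UDT. The bot substrings, the function names and the `$pageIsSearchable` flag are all assumptions made for illustration; how the page's "searchable" property is actually read inside CMSMS is not shown here.

```php
<?php
// Hypothetical, non-exhaustive list of substrings seen in crawler user agents.
$knownBots = array('googlebot', 'slurp', 'bingbot');

// Returns true when the user agent contains any of the given substrings
// (case-insensitive), because real user agents are long strings.
function looks_like_bot($userAgent, array $botSubstrings)
{
    foreach ($botSubstrings as $needle) {
        if ($needle !== '' && stripos($userAgent, $needle) !== false) {
            return true;
        }
    }
    return false;
}

// Refuse the request only when a bot asks for a page that is not searchable.
function should_block($userAgent, $pageIsSearchable, array $botSubstrings)
{
    return !$pageIsSearchable && looks_like_bot($userAgent, $botSubstrings);
}

$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$pageIsSearchable = false; // assumption: stands in for the page's "searchable" property

if (should_block($userAgent, $pageIsSearchable, $knownBots)) {
    // Send the 403 before any output and stop, so no later script (e.g. the email) runs.
    header('HTTP/1.1 403 Forbidden');
    exit;
}
```

If this were pasted into a UDT that can run more than once per request, the function definitions would probably need to be wrapped in `function_exists()` checks or inlined.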

calguy1000 · Post by **calguy1000** » Sun Jun 06, 2010 1:56 am

uhm, add a meta noindex tag... add a nofollow attribute to the links to that page...

Peciura · Post by **Peciura** » Sun Jun 06, 2010 8:11 am

Code:

<meta content="noindex, nofollow" name="robots"/>

NaN · Post by **NaN** » Sun Jun 06, 2010 12:04 pm

But the metadata does not prevent a bot from accessing the page.
The page just won't be indexed.
In any case, the whole page has already been processed by the time the bot receives the noindex, nofollow tag.
That means the script that informs you via email may already have run, no matter what is in the metadata.

This is why I would do this with a plugin or UDT that prevents the page from being rendered and stops any scripts from being processed if the user agent is just a bot.

By the way how is the script working?
Is it event driven or do you have a plugin/UDT placed in your page?
Maybe you can just change your email script to check for the user agent and then decide if an email will be sent or not.

edented · Post by **edented** » Sun Jun 06, 2010 6:33 pm

Thanks for the replies-
I have a UDT that captures the ip and useragent, and sends a formatted email, and an onload in the source that redirects to the download.

Strangely enough, the page was crawled twice (Google and Yahoo) almost immediately after it was created, but not since. I was worried that I would catch a flood of non-human visitors, and I still might once I actually submit the site to search engines, or when I add more links to the page.

I'll add the robots.txt as

Code:

User-agent: *
Disallow: /index.php?page=somepagename

If it gets out of control, maybe I'll attempt to set up a bot trap:
http://www.kloth.net/internet/bottrap.php

As I'm more of a muddler than a programmer, scripting the UDT to trap bots and then return a 403 header might take me an inordinate amount of time.

Thanks again!
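
A note on the robots.txt rule above: as far as I know, a Disallow value is matched as a prefix of the requested URL (query string included), so the rule would also cover any longer page alias that starts the same way. It is also only advisory, so crawlers that ignore robots.txt will still request the page. The sketch below imitates that prefix matching in plain PHP; the function name `is_disallowed` is made up for illustration and wildcard extensions are not handled.

```php
<?php
// The Disallow value from the robots.txt shown in the post.
$rules = array('/index.php?page=somepagename');

// Returns true when the requested URL starts with any Disallow value,
// which is how a well-behaved crawler would read a plain (non-wildcard) rule.
function is_disallowed($requestUri, array $disallowRules)
{
    foreach ($disallowRules as $rule) {
        // An empty Disallow value means "nothing is disallowed".
        if ($rule !== '' && strpos($requestUri, $rule) === 0) {
            return true;
        }
    }
    return false;
}
```

With this rule, `/index.php?page=somepagename` and `/index.php?page=somepagename2` would both be skipped by a compliant crawler, while `/index.php?page=other` would not.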

NaN · Post by **NaN** » Sun Jun 06, 2010 7:26 pm

edented wrote:
I have a UDT that captures the ip and useragent, and sends a formatted email, and an onload in the source that redirects to the download.

If your script already captures the user agent, you just need to check whether it matches any known bot. If not, send the email. Otherwise, do nothing.
You could create an array of known bot names and, after the script has captured the user agent, check it like this:

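Code:

$bots = array('bot1', 'bot2', 'bot3'); // substrings of known bot user agents
$is_bot = false;
foreach ($bots as $bot) {
   if (stripos($useragent, $bot) !== false) $is_bot = true;
}
if (!$is_bot) {
   // send email...
}

(just an example, not complete code: the bot names are placeholders and the email call is left out; a substring check is used because real user agent strings are longer than the bot name)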

You can do the same thing with bad IPs.
And whenever your bot trap (if you use one) identifies a new bot, you can add its user agent and/or IP address to the array.
That way you won't get emails when only bots crawl your site, and you don't need to hide the page from search engines.
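
Putting the user agent list and the IP list together might look roughly like this. It is only a sketch: the function name `should_send_email`, the bot substrings and the IP addresses (taken from the documentation ranges) are assumptions for illustration, and the actual email call is left as a comment.

```php
<?php
// Illustrative values only; extend both lists as the bot trap finds new entries.
$botAgents = array('googlebot', 'slurp');
$badIps    = array('192.0.2.10', '198.51.100.7');

// Returns false for a listed IP or a user agent containing a listed substring.
function should_send_email($userAgent, $ip, array $botAgents, array $badIps)
{
    if (in_array($ip, $badIps, true)) {
        return false;
    }
    foreach ($botAgents as $needle) {
        if ($needle !== '' && stripos($userAgent, $needle) !== false) {
            return false;
        }
    }
    return true;
}

$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$ip        = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';

if (should_send_email($userAgent, $ip, $botAgents, $badIps)) {
    // The existing notification email would be sent here.
}
```

IPs are compared exactly, while user agents are compared by substring, since a crawler's full user agent string usually carries version numbers and URLs around its name.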

CMS Made Simple Forums
