exclude page from search engines

edented
Forum Members
Posts: 23
Joined: Sat Aug 29, 2009 4:24 am

exclude page from search engines

Post by edented »

Is it possible to stop a searchbot from accessing a single page from within the admin panel?

I've created a page that notifies me via email when a visitor has accessed the page (a download), but it seems that it's being visited by the crawlers.

Thanks
NaN

Re: exclude page from search engines

Post by NaN »

Search engines are identified by their user agent.
You could create a UDT that checks the user agent and, if it is a known search bot and the content property "searchable" is not set, returns an empty page with a 403 header.
This UDT can be placed in the pagedata or in the template.

You could also try a robots.txt.
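
A minimal sketch of the user-agent check described above, written as plain PHP. The function name is_known_bot and the bot list are assumptions for illustration, not part of CMSMS. The check of the "searchable" content property is left out because how to read it depends on the CMSMS version. Inside a UDT you would drop the opening <?php tag, and the header call only works if no output has been sent yet.

```php
<?php
// Hypothetical helper: returns true if the user agent contains a known bot name.
function is_known_bot($useragent, array $bots)
{
    foreach ($bots as $bot) {
        // Case-insensitive substring match, because user agents are long strings
        if (stripos($useragent, $bot) !== false) {
            return true;
        }
    }
    return false;
}

// Illustrative list only, far from complete
$bots = array('Googlebot', 'bingbot', 'Slurp');

// The user agent header may be missing, so fall back to an empty string
$useragent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (is_known_bot($useragent, $bots)) {
    // Answer with 403 and stop before the rest of the page is processed
    header('HTTP/1.1 403 Forbidden');
    exit;
}
```

Keep in mind that a bot can send any user agent it likes, so this only filters out crawlers that identify themselves honestly.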
calguy1000
Support Guru
Posts: 8169
Joined: Tue Oct 19, 2004 6:44 pm

Re: exclude page from search engines

Post by calguy1000 »

uhm, add a meta noindex tag... add a nofollow attribute to the links to that page...
Peciura

Re: exclude page from search engines

Post by Peciura »

Code:

<meta content="noindex, nofollow" name="robots"/>
NaN

Re: exclude page from search engines

Post by NaN »

But the metadata does not prevent a bot from accessing the page.
It just won't be indexed.
Anyway, the whole page has already been processed by the time the bot receives the noindex, nofollow directives.
And that means the script that informs you via email might already have run, no matter what is in the metadata.

This is why I would do this with a plugin or UDT that prevents the page from being rendered and stops any scripts from being processed if the user agent is just a bot.

By the way, how does the script work?
Is it event driven, or do you have a plugin/UDT placed in your page?
Maybe you can just change your email script to check the user agent and then decide whether an email will be sent or not.
edented

Re: exclude page from search engines

Post by edented »

Thanks for the replies-
I have a UDT that captures the ip and useragent, and sends a formatted email, and an onload in the source that redirects to the download.

Strangely enough, the page was crawled twice (Google and Yahoo) almost immediately after it was created, but not since. I was worried that I would catch a flood of non-human visitors, and I still might once I actually submit the site to search engines, or when I add more links to the page.

I'll add the robots.txt as

Code:

User-agent: *
Disallow: /index.php?page=somepagename

If it gets out of control, maybe I'll attempt to set up a bot trap:
http://www.kloth.net/internet/bottrap.php

As I'm more of a muddler than a programmer, scripting the UDT to trap bots and then return a 403 header might take me an inordinate amount of time.

Thanks again!
NaN

Re: exclude page from search engines

Post by NaN »

edented wrote:
I have a UDT that captures the ip and useragent, and sends a formatted email, and an onload in the source that redirects to the download.
If your script already captures the user agent, you just need to check whether it matches any known bot. If it does not, send the email. Otherwise, just do nothing.
You could create an array that contains all known bots and, once the script has captured the user agent, check it like this:

Code:


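$bots = array('bot1', 'bot2', 'bot3'); // ...add further known bot names here
$is_bot = false;
foreach ($bots as $bot) {
   // a user agent is a long string, so look for the bot name inside it
   if (stripos($useragent, $bot) !== false) { $is_bot = true; break; }
}
if (!$is_bot) {
   // send email...
}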

(just an example, no working code)

You can do the same thing with bad IPs.
And whenever your bot trap (if you use one) identifies a new bot, you can add its user agent and/or IP address to the array.
That way you won't get emails when only bots crawl your site, and you don't need to hide the page from the search engines' indexes.
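
To illustrate the combined check on user agent and IP address, here is a rough sketch. The function name should_notify, the bot names, and the IP address (taken from a range reserved for documentation) are all made up for the example. How the email itself is sent is left to the existing UDT.

```php
<?php
// Hypothetical helper: decide whether the notification email should be sent.
function should_notify($useragent, $ip, array $bots, array $bad_ips)
{
    // IP addresses can be compared exactly
    if (in_array($ip, $bad_ips, true)) {
        return false;
    }
    // Bot names are searched for inside the user agent string
    foreach ($bots as $bot) {
        if (stripos($useragent, $bot) !== false) {
            return false;
        }
    }
    return true;
}

// Illustrative lists; extend them whenever the bot trap catches something new
$bots    = array('Googlebot', 'Slurp');
$bad_ips = array('192.0.2.10');

// Either value may be missing, so fall back to an empty string
$useragent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$ip        = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';

if (should_notify($useragent, $ip, $bots, $bad_ips)) {
    // send the email here
}
```

Keeping both lists in one place means a new entry from the bot trap only has to be added once.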