Robots.txt
Posted: Thu Feb 24, 2005 9:17 am
by spoetnik
Would it be an idea to include a 'robots.txt' file in the default install?
Maybe for security reasons? Something like this:
Code:
User-agent: *
Disallow: /admin
Disallow: /doc
Disallow: /images
Disallow: /install
Disallow: /lib
Disallow: /modules
Disallow: /plugins
Disallow: /tmp
Disallow: /uploads
Re: Robots.txt
Posted: Thu Feb 24, 2005 6:21 pm
by sjg
spoetnik wrote:Would it be an idea to include a 'robots.txt' file in the default install?
Maybe for security reasons? Something like this:
Code:
User-agent: *
Disallow: /admin
Disallow: /doc
Disallow: /images
Disallow: /install
Disallow: /lib
Disallow: /modules
Disallow: /plugins
Disallow: /tmp
Disallow: /uploads
I think if we're worried about search engines and hackers being able to get into those areas, it'd be better to use the filesystem's permissions.
My understanding is that the kids who want to deface web sites either attack with pre-written scripts (in which case this won't help), or use robots.txt as a starting point to find out where interesting stuff may be hidden.
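For instance, on a *nix box you could simply make the install directory unreadable by the webserver once setup is finished - a rough sketch, assuming the files are owned by your own account rather than the webserver user (the path is just an example):
Code:
# only the owning account can read/enter the directory; the webserver gets nothing
chmod 700 /var/www/cmsms/install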
Re: Robots.txt
Posted: Fri Sep 23, 2005 2:32 am
by iNSiPiD
Damn straight.
My organisation's site gets more bad requests for robots.txt (which I studiously avoid using) than the index page!
Kids today - I dunno.

Re: Robots.txt
Posted: Fri Sep 30, 2005 12:48 am
by trick
robots.txt has its uses, particularly when you have a CMS that has a lot of links with query variables in them. The bots can get lost pretty quickly and go wandering around forever. I recently installed MediaWiki on my box and sent my search bot to index stuff - I came back 2 hours later (this is normally a 2-minute process) and it was still wandering around indexing the dumbest, most useless pages. I don't know how long he would have stayed there fooling around, but I stopped him, deleted his search database (which was massive because of all the dumb pages indexed), and set up a robots.txt file, which saved the day.
Edit: using a proper robots.txt file can help Googlebot not get lost as well.
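For example, a robots.txt along these lines keeps well-behaved bots away from the script entry point and all of its query-variable permutations, while leaving the rest of the site crawlable (the /index.php path is just an illustration of where such links usually live - adjust it to the actual install):
Code:
User-agent: *
# /index.php matches by prefix, so /index.php?title=...&action=... is covered too
Disallow: /index.php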
Re: Robots.txt
Posted: Tue Nov 01, 2005 8:36 am
by bh_scripts
For the reasons outlined above, robots.txt is a good idea. It is also possible to block direct access to the directories shown in the robots.txt. For example, to stop direct running of scripts in the lib directory, place a .htaccess file in that directory with the following directives (assuming an Apache webserver):
Code:
Order Deny,Allow
Deny from all
This will stop direct execution (i.e. typing the full URL) of any scripts in the /lib directory, but will not stop inclusion of scripts via include/require statements.
Repeat for any directories where direct execution is not wanted.
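If you have access to the main server config, the same thing can be done in one place instead of dropping an .htaccess file into every directory - a rough sketch, assuming Apache 2.x and a docroot of /var/www/html (both assumptions, adjust to suit):
Code:
# deny direct web access to these directories; PHP include/require still works
<DirectoryMatch "^/var/www/html/(lib|tmp|install)">
    Order Deny,Allow
    Deny from all
</DirectoryMatch>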
Re: Robots.txt
Posted: Tue Nov 01, 2005 10:40 am
by Ted
That's pretty nifty. Can you do stuff like etc?
Re: Robots.txt
Posted: Tue Nov 01, 2005 11:00 pm
by iNSiPiD
It's a VERY simple protocol that hasn't been updated since 1994. What you're asking is beyond its scope, I think, but there are better methods for securing directories. This is only a basic instruction for benevolent robots. Others will simply ignore it, or use it to their advantage.
http://www.robotstxt.org/wc/exclusion-admin.html
Re: Robots.txt
Posted: Wed Nov 02, 2005 5:37 am
by trick
I think wishy was talking about Apache configuration.
Re: Robots.txt
Posted: Wed Nov 02, 2005 5:39 am
by iNSiPiD
Ah, right you are. Yes. [cough][cough]
Yes, Apache config through .htaccess is the way.
Re: Robots.txt
Posted: Tue Nov 29, 2005 8:36 pm
by ZYV
Hi,
I agree with the guy who said that script-kiddies would normally start by looking at robots.txt to find something sweet that would normally be ignored by the search bots. So I've attached the robots.txt that I currently use. Hope that helps someone.
[attachment deleted by admin]