Robots.txt

Talk about writing modules and plugins for CMS Made Simple, or about specific core functionality. This board is for PHP programmers who are contributing to CMSMS, not for site developers.
spoetnik

Robots.txt

Post by spoetnik »

Would it be a good idea to include a 'robots.txt' file in the default install? Maybe for security reasons? Something like this:

Code:

User-agent:	*
Disallow:	/admin
Disallow:	/doc
Disallow:	/images
Disallow:	/install
Disallow:	/lib
Disallow:	/modules
Disallow:	/plugins
Disallow:	/tmp
Disallow:	/uploads
sjg

Re: Robots.txt

Post by sjg »

spoetnik wrote: Would it be a good idea to include a 'robots.txt' file in the default install? Maybe for security reasons?
I think if we're worried about search engines and hackers being able to get into those areas, it'd be better to use the filesystem's permissions.

My understanding is that the kids who want to deface web sites either attack with pre-written scripts (in which case this won't help), or use robots.txt as a starting point to find out where interesting stuff may be hidden.
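A sketch of the permissions approach sjg suggests, assuming a Unix host and a one-off script run by the site owner after installation (the directory names come from spoetnik's list; the script itself is my invention):

Code:

<?php
// one-off hardening: strip world access from directories that
// only the server itself needs to reach
foreach (array('install', 'tmp') as $dir) {
    chmod($dir, 0750); // owner: all, group: read/execute, others: none
}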
iNSiPiD

Re: Robots.txt

Post by iNSiPiD »

Damn straight.

My organisation's site gets more bad requests for robots.txt (which I studiously avoid using) than the index page!

Kids today - I dunno. ;)
trick

Re: Robots.txt

Post by trick »

robots.txt has its uses, particularly when you have a CMS that has a lot of links with query variables in them. The bots can get lost pretty quickly and go wandering around forever. I recently installed MediaWiki on my box and sent my search bot to index stuff - I came back 2 hours later (this is normally a 2 minute process) and it was still wandering around indexing the dumbest, most useless pages. I don't know how long it would have stayed there fooling around, but I stopped it, deleted its search database (which was massive because of all the dumb pages indexed), and set up a robots.txt file, which saved the day.

Edit: using a proper robots.txt file can help the Googlebot not get lost as well.
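A sketch of the kind of robots.txt trick describes, assuming a MediaWiki install where clean article URLs live under /wiki/ and the edit/history/diff views go through /index.php with query strings (both assumptions on my part):

Code:

User-agent: *
# Allow is a nonstandard extension, but the major crawlers honor it
Allow: /wiki/
# keeps bots out of the endless ?action=edit and ?diff=... variants
Disallow: /index.php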
bh_scripts

Re: Robots.txt

Post by bh_scripts »

For the reasons outlined above, robots.txt is a good idea. It is also possible to block direct access to the directories listed in robots.txt. For example, to stop direct running of scripts in the lib directory, place a .htaccess file in that directory with the following directive (assuming an Apache web server):


Code:

Order Deny,Allow
Deny from all

This will stop direct execution (i.e. typing the full URL) of any script in the /lib directory, but it will not stop inclusion of those scripts via include/require statements.

Repeat for any directories where direct execution is not wanted.
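To illustrate that last point (a sketch; helper.php and the paths are made up): with the .htaccess above in place, a browser requesting http://example.com/lib/helper.php gets 403 Forbidden, yet PHP on the same server still reads the file from disk, so includes keep working:

Code:

<?php
// index.php in the web root - not covered by the .htaccess in /lib.
// This include goes through the filesystem, not HTTP, so the
// "Deny from all" directive never enters the picture.
require_once dirname(__FILE__) . '/lib/helper.php'; // hypothetical file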
Ted

Re: Robots.txt

Post by Ted »

That's pretty nifty. Can you do stuff like etc?
iNSiPiD

Re: Robots.txt

Post by iNSiPiD »

It's a VERY simple protocol that hasn't been updated since 1994. What you're asking is beyond its scope, I think, but there are better methods for securing directories. This is only a basic instruction for benevolent robots. Others will simply ignore it, or use it to their advantage.

http://www.robotstxt.org/wc/exclusion-admin.html
trick

Re: Robots.txt

Post by trick »

I think wishy was talking about Apache configuration.
iNSiPiD

Re: Robots.txt

Post by iNSiPiD »

Ah, right you are. Yes. [cough][cough]

Yes, Apache config through .htaccess is the way.
ZYV

Re: Robots.txt

Post by ZYV »

Hi,

I agree with the poster who said that script kiddies would normally start by looking at robots.txt to find something sweet that would normally be ignored by the search bots. So I'm attaching the robots.txt that I currently use. Hope it helps someone.

[attachment deleted by admin]