The robots exclusion standard, or robots.txt protocol, is a convention for keeping cooperating web spiders and other web robots out of all or part of a website. The parts that should not be accessed are listed in a file called robots.txt, which must be placed in the top-level directory (root directory) of the website.
The protocol is purely advisory. It relies on the cooperation of the web robot, so marking an area of your site out of bounds with robots.txt does not guarantee privacy. Many website administrators have been caught trying to use the robots file to make private parts of a website invisible to the rest of the world. However, the file is necessarily publicly available, and anyone with a web browser can read it and see exactly which paths it lists.
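Because compliance is voluntary, the check lives entirely in the robot's own code. The following is a minimal sketch in Python of what a cooperating robot might do, using the standard library's urllib.robotparser module. The robot name ExampleBot, the rules and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, as a site might publish them in /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt, agent, urls):
    """Return only the URLs that the rules permit the given agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(agent, url)]


candidates = [
    "http://www.example.com/index.html",
    "http://www.example.com/private/notes.html",
]

# A cooperating robot filters its own work list; nothing on the server
# stops a robot that skips this step.
to_crawl = allowed_urls(ROBOTS_TXT, "ExampleBot", candidates)
print(to_crawl)
```

A robot that never calls a function like allowed_urls can still request the private URL, which is why robots.txt is no substitute for real access control.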
Overall, it is a good idea to have a robots.txt file, as all the major search engines look for one when their spiders, bots and crawlers arrive at your site. If you are in the habit of checking your server logs, having the file also reduces the number of 404 Not Found errors generated by spiders that visit your site but cannot find a robots.txt file.
Creating a robots.txt File
A robots.txt file can be created with any plain text editor, such as Notepad. Create a new text file and save it as robots.txt. The format for indicating which directories and files should NOT be indexed is:
User-Agent: Spider or robot name
Disallow: Directory or File Name
The Disallow line can be repeated for each directory or file you want to keep from being indexed, and the whole block can be repeated for each spider or robot you want to exclude. There are also a couple of shorthand methods, shown in the examples that follow.
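If the rules are kept somewhere other than a hand-edited file, the text can also be generated. The following Python sketch builds it from a mapping of robot names to disallowed paths. The function name build_robots_txt and the sample rules are illustrative assumptions, not part of any standard tool.

```python
def build_robots_txt(rules):
    """Build robots.txt text from {robot name: [disallowed paths]}."""
    blocks = []
    for agent, paths in rules.items():
        lines = [f"User-agent: {agent}"]
        # An empty Disallow value excludes nothing for that robot.
        lines += [f"Disallow: {path}" for path in paths] or ["Disallow:"]
        blocks.append("\n".join(lines))
    # A blank line separates the records for different robots.
    return "\n\n".join(blocks) + "\n"


text = build_robots_txt({
    "*": ["/cgi-bin/", "/tmp/"],
    "BadBot": ["/"],
})
print(text)
```

The resulting text would still need to be saved as robots.txt in the root directory of the site for robots to find it.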
Examples
This example allows all robots to visit all files: the wildcard "*" matches every robot, and the empty Disallow line excludes nothing.
User-agent: *
Disallow:
This example keeps all robots out:
User-agent: *
Disallow: /
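The difference between an empty Disallow line and "Disallow: /" is easy to overlook. As a rough check, the two files can be fed to Python's standard urllib.robotparser module and compared. The helper name can_visit, the robot name ExampleBot and the URL are placeholders chosen for this sketch.

```python
from urllib.robotparser import RobotFileParser


def can_visit(robots_txt, agent, url):
    """Parse the given robots.txt text and ask whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


ALLOW_ALL = "User-agent: *\nDisallow:\n"
DENY_ALL = "User-agent: *\nDisallow: /\n"
url = "http://www.example.com/index.html"  # placeholder URL

print(can_visit(ALLOW_ALL, "ExampleBot", url))  # True
print(can_visit(DENY_ALL, "ExampleBot", url))   # False
```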
The next example tells all crawlers not to enter four directories of a website:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
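Each Disallow value is treated as a path prefix: a URL is excluded when its path starts with that value. The following is a hand-written Python sketch of that comparison for the four directories above. The function name and the sample paths are illustrative, and real parsers also deal with details such as URL escaping.

```python
DISALLOWED = ["/cgi-bin/", "/images/", "/tmp/", "/private/"]


def is_excluded(path, disallowed=DISALLOWED):
    """Return True if the URL path starts with any disallowed prefix."""
    return any(path.startswith(prefix) for prefix in disallowed)


print(is_excluded("/images/logo.png"))      # True
print(is_excluded("/articles/tmp/a.html"))  # False: /tmp/ is not at the start
print(is_excluded("/index.html"))           # False
```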
Example that tells a specific crawler not to enter one specific directory:
User-agent: BadBot
Disallow: /private/
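A record that names one robot applies only to that robot. In this Python sketch, again using the standard urllib.robotparser module, the same URL is refused for BadBot but allowed for any other robot. OtherBot and the URL are hypothetical names used only for the comparison.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: BadBot",
    "Disallow: /private/",
])

url = "http://www.example.com/private/report.html"  # placeholder URL
print(parser.can_fetch("BadBot", url))    # False
print(parser.can_fetch("OtherBot", url))  # True
```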
Example for the default installation of PHP Fusion:
User-agent: *
Disallow: /administration/
Disallow: /images/
Disallow: /locale/
Disallow: /themes/
Disallow: /blank_config.php
Disallow: /config.php
Disallow: /edit_profile.php
Disallow: /footer.php
Disallow: /maincore.php
Disallow: /members.php
Disallow: /setuser.php
Disallow: /side_left.php
Disallow: /side_right.php
Disallow: /subheader.php
This article is licensed under the GNU Free Documentation License. It uses material from the Wikipedia article "Robots.txt".