Monday, October 12, 2009

What is a robots.txt file?

The Secret of robots.txt in the eyes of Search Engines!

Read on to create an error-free robots.txt


Although the robots.txt file plays an important part in how search engines crawl your site, and therefore in how well your pages can rank, many Web sites don't provide one.


If your Web site doesn't have a robots.txt file yet, read on to learn how to create one. If you already have a robots.txt file, read our tips to make sure that it doesn't contain errors.


What is robots.txt?


Robots.txt is a plain text (not HTML) file that you put on your site to tell search robots which pages you would like them not to visit. Obeying robots.txt is by no means mandatory for search engines, but they generally respect what they are asked not to do. It is important to understand that robots.txt cannot prevent anyone from crawling your site (it is not a firewall or a kind of password protection). Putting up a robots.txt file is like hanging a note that says “Please, do not enter” on an unlocked door: you cannot prevent thieves from coming in, but the good guys will not open the door and enter. That is why, if you have really sensitive data, it is naïve to rely on robots.txt to keep it from being indexed and displayed in search results.

Location of robots.txt


The location of robots.txt is very important. It must be in the main directory because otherwise user agents (search engines) will not be able to find it: they do not search the whole site for a file named robots.txt. Instead, they look only in the main directory (i.e. http://mydomain.com/robots.txt), and if they don't find it there, they simply assume that this site does not have a robots.txt file and therefore they crawl everything they find along the way. So, if you don't put robots.txt in the right place, do not be surprised that search engines index your whole site.
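
Because crawlers only ever request the file from the root of the host, the robots.txt address for any page can be worked out mechanically. Here is a minimal Python sketch of that rule using the standard urllib.parse module. The function name robots_url is made up for this illustration, and mydomain.com is the same placeholder domain used above.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the one location where a crawler looks for robots.txt."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Pages in different directories all map to the same robots.txt address.
print(robots_url("http://mydomain.com/tutorials/blank.htm"))
print(robots_url("http://mydomain.com/privatedir/page.html?id=3"))
```

Both calls print http://mydomain.com/robots.txt, which is why a copy of the file placed in a subdirectory is never read.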

How do I create a robots.txt file?


So let's get moving. Create a regular text file called "robots.txt", and make sure it's named exactly that. This file must be uploaded to the root accessible directory of your site, not a subdirectory. Only if you follow both of these rules will search engines interpret the instructions contained in the file. Deviate from them, and "robots.txt" becomes nothing more than a regular text file, like Cinderella after midnight.


Now that you know what to name your text file and where to upload it, you need to learn what to actually put in it to send commands off to search engines that follow this protocol (formally the "Robots Exclusion Protocol"). The format is simple enough for most intents and purposes: a "User-agent:" line to identify the crawler in question, followed by one or more "Disallow:" lines to disallow it from crawling certain parts of your site.
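
To make the format concrete, here is a small Python sketch that renders such a file from a dictionary of rules. The helper name build_robots_txt and the shape of its input are assumptions made for this illustration, not part of the protocol; the output follows the "User-agent:" plus "Disallow:" layout described above.

```python
def build_robots_txt(rules: dict[str, list[str]]) -> str:
    """Render {user agent: [disallowed paths]} as robots.txt text."""
    groups = []
    for agent, paths in rules.items():
        lines = [f"User-agent: {agent}"]
        # An empty list becomes a bare "Disallow:", which blocks nothing.
        lines += [f"Disallow: {path}" for path in paths] or ["Disallow:"]
        groups.append("\n".join(lines))
    # One blank line separates the group for each crawler.
    return "\n\n".join(groups) + "\n"


print(build_robots_txt({"*": ["/cgi-bin/", "/privatedir/"]}))
```

Save the printed text as robots.txt and upload it to the root directory of your site.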

1. Here's a basic "robots.txt":

User-agent: *
Disallow: /

With the above declared, all robots (indicated by "*") are instructed not to crawl any of your pages (indicated by "/"). Most likely not what you want, but you get the idea.

2. Google's Image bot:

Let's get a little more discriminatory now. While every webmaster loves Google, you may not want Google's Image bot crawling your site's images and making them searchable online, if only to save bandwidth. The declaration below will do the trick:

User-agent: Googlebot-Image
Disallow: /

3. Disallow selected directories and pages for all search engines

The following disallows all search engines and robots from crawling select directories and pages:

User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/
Disallow: /tutorials/blank.htm
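
If you want to check what these rules actually block before uploading them, Python's standard urllib.robotparser module can interpret them the way a well-behaved crawler would. This is only a sketch: the crawler name "AnyBot" and the test paths are invented for the example, and real search engines may differ in details from Python's parser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/
Disallow: /tutorials/blank.htm
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "AnyBot" is a made-up crawler name; it falls under the "*" group.
for path in ["/index.html", "/cgi-bin/search.cgi",
             "/tutorials/blank.htm", "/tutorials/other.htm"]:
    allowed = parser.can_fetch("AnyBot", "http://mydomain.com" + path)
    print(path, "allowed" if allowed else "blocked")
```

Only /cgi-bin/search.cgi and /tutorials/blank.htm are reported as blocked; the other page in /tutorials/ stays open because the rule names a single file.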

4. Target multiple robots in robots.txt.

You can conditionally target multiple robots in "robots.txt". Take a look at the following:

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/

This is interesting: here we declare that crawlers in general should not crawl any part of our site, EXCEPT for Google, which is allowed to crawl the entire site apart from /cgi-bin/ and /privatedir/. So the rules of specificity apply, not inheritance: a crawler follows only the most specific group that names it and ignores the general "*" group.
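
The specificity rule can be tried out with Python's standard urllib.robotparser module. The sketch below feeds it the two groups from this example, one per line, and asks about two crawlers. "SomeOtherBot" is a made-up name standing in for any crawler other than Google, and Python's parser is only an approximation of how a particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """True if the named crawler may fetch the path on mydomain.com."""
    return parser.can_fetch(agent, "http://mydomain.com" + path)


print(allowed("Googlebot", "/index.html"))      # True: not in Googlebot's list
print(allowed("Googlebot", "/cgi-bin/run.cgi"))  # False: listed for Googlebot
print(allowed("SomeOtherBot", "/index.html"))   # False: falls under "*"
```

Googlebot is judged only by its own group, so the "Disallow: /" written for "*" does not carry over to it.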

5. Where can I find user agent names?

You can find user agent names in your log files by checking for requests to robots.txt. Most often, all search engine spiders should be given the same rights. In that case, use "User-agent: *" as mentioned above.
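
As a rough illustration of the log-file approach, the Python sketch below counts the user agents that requested /robots.txt. It assumes your server writes the widely used "combined" access-log format; the sample lines, IP addresses and the name "ExampleBot" are invented for the example, so adjust the pattern to whatever your own logs look like.

```python
import re
from collections import Counter

# Hypothetical sample lines in the "combined" access-log format.
SAMPLE_LOG = '''\
192.0.2.10 - - [12/Oct/2009:06:25:24 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
198.51.100.7 - - [12/Oct/2009:06:26:02 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
192.0.2.10 - - [12/Oct/2009:09:10:11 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
203.0.113.44 - - [12/Oct/2009:11:41:37 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "ExampleBot/1.0"
'''

# Request line, status, size, referrer, then the quoted user agent at the end.
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def robots_txt_agents(log_text: str) -> Counter:
    """Count the user agents that requested /robots.txt."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match and match.group("path") == "/robots.txt":
            counts[match.group("agent")] += 1
    return counts


for agent, hits in robots_txt_agents(SAMPLE_LOG).most_common():
    print(hits, agent)
```

The name to put on a "User-agent:" line is the short token inside the logged string (for example "Googlebot"), not the whole string.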

Basic tips and tricks

1. How to stop all spiders from indexing any file

If you don't want search engines to index any file of your Web site, use the following:

User-agent: *
Disallow: /

2. How to allow all search engine spiders to index all files

Use the following content for your robots.txt file if you want to allow all search engine spiders to index all files of your Web site:

User-agent: *
Disallow:
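
The two files in tips 1 and 2 differ by a single character, so it is worth checking which one you have. The following Python sketch compares them with the standard urllib.robotparser module; the helper name can_crawl and the crawler name "AnyBot" are made up for the example.

```python
from urllib.robotparser import RobotFileParser

BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"


def can_crawl(robots_txt: str, path: str) -> bool:
    """Parse the given robots.txt text and test one path against it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # "AnyBot" is a made-up crawler name covered by the "*" group.
    return parser.can_fetch("AnyBot", "http://mydomain.com" + path)


print(can_crawl(BLOCK_ALL, "/index.html"))  # False: "/" matches every path
print(can_crawl(ALLOW_ALL, "/index.html"))  # True: empty Disallow blocks nothing
```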

3. Where to find more complex examples.

You can find such examples in the robots.txt files of the following websites:

http://www.cnn.com/robots.txt
http://www.nytimes.com/robots.txt
http://www.spiegel.com/robots.txt
http://www.ebay.com/robots.txt

Your Web site needs a proper robots.txt file if you want to have good rankings across all the search engines. Only if search engines know what to do with your pages can they give you a good ranking :).
