robots.txt file for websites
This is a special-purpose file read by search engine robots. It has to be located at the root (the highest level) of your site. Search engine robots follow the instructions given in this text file when they crawl and index your pages. Note that for a page to be indexed, there should be at least one link to it from an active page so that the search engine can register its presence. The robots.txt file is not the place to add links to different sections of the site; a site map is the best place for that purpose.

We need not specify the location of the robots.txt file anywhere within or outside our site. If your site name is www.mysite.com then the URL of your robots.txt file will be www.mysite.com/robots.txt . By default, search engines request this page first, so if you don't have it you will find a page not found error ( 404 ) against this URL in your server log. We will discuss the purpose of this file first and then look at some sample code.
Purpose of robots.txt file
Well-behaved search engines obey the instructions given by the webmaster through the robots.txt file, so we can use it to communicate with them. One purpose is to tell crawling robots not to index some page or part of the website. We can give this instruction to specific crawlers or to all search engine robots.
We may keep some area as an archive where we store copies of pages that exist in the main areas. We can prevent robots from crawling these duplicate pages by adding that path to the robots.txt file.
However, spam bots and other bad bots will not respect the directives in the robots.txt file.
Robots.txt is mostly used to tell robots what not to index rather than what to index.
Note that the robots.txt file is a public document, so anyone can open it and see its content. If you have a private URL which you don't want to expose, then it is not a good idea to restrict its indexing by adding it to the robots.txt file. Any hacker or unauthorized user can exploit it.
To allow all robots access to all directories and files
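A minimal sketch of such a file, assuming the standard User-agent and Disallow directives. The asterisk matches every robot, and an empty Disallow value blocks nothing.

```text
# Applies to every robot
User-agent: *
# Empty value: nothing is disallowed
Disallow:
```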
The above code allows all agents to crawl all pages. Now let us disallow all bots from crawling our site.
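A minimal sketch for blocking the whole site, assuming the same standard directives. A single slash matches every path on the site.

```text
# Applies to every robot
User-agent: *
# "/" matches every URL path, so nothing may be crawled
Disallow: /
```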
OK, now let us tell all bots not to crawl one directory ( the name of the directory is restrict-dir ).
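A minimal sketch for this case, assuming restrict-dir sits directly under the site root. The trailing slash limits the rule to that directory and everything inside it.

```text
# Applies to every robot
User-agent: *
# Block only the restrict-dir directory (assumed to be at the site root)
Disallow: /restrict-dir/
```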
In the above code we have told robots not to index or crawl the restrict-dir directory.
We can also block or allow a specific user agent. Now let us allow Googlebot to access all pages.
Now let us disallow Googlebot from one directory only.
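A minimal sketch, assuming the user agent token Googlebot that Google's main crawler identifies itself with. The rule group applies only to that robot and blocks nothing.

```text
# Applies only to Google's crawler
User-agent: Googlebot
# Empty value: Googlebot may crawl every page
Disallow:
```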
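A minimal sketch for this case, reusing the restrict-dir directory name from the earlier example and assuming it sits directly under the site root. Other robots are not affected by this group.

```text
# Applies only to Google's crawler
User-agent: Googlebot
# Block only the restrict-dir directory (assumed to be at the site root)
Disallow: /restrict-dir/
```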
From here you can easily understand how to use the robots.txt file. You can also use one of the many robots.txt generators available on the internet. Google has a robots.txt file generator inside its webmaster tools, but you must have a Google account to use it. Inside its webmaster tools, Google can also analyze your robots.txt file and tell you what is wrong with it. If you have accidentally blocked some part of your site, you can find out about it here.
Sitemap & robots.txt file
You can add the URL of your sitemap to the robots.txt file. This will help search engines pick up your sitemap file. Add this line at the end of your robots.txt file:
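A sketch of the sitemap line, assuming the sitemap is named sitemap.xml and sits at the root of www.mysite.com. Both the file name and the https scheme are placeholders; use the full URL of your own sitemap.

```text
# Full URL of the sitemap (hypothetical location, replace with your own)
Sitemap: https://www.mysite.com/sitemap.xml
```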
Try to locate the robots.txt file of this site and see the text inside it.