Posted by Dan Crow, Product Manager

I'm often asked how Google and other search engines work. One key question is: how does Google know which parts of a website the site owner wants to show up in search results? Can publishers specify that some parts of a site should be private and non-searchable? The good news is that those who publish on the web have a lot of control over which pages appear in search results.
The key is a simple file called robots.txt that has been an industry standard for many years. It lets a site owner control how search engines access their web site. With robots.txt you can control access at multiple levels: the entire site, individual directories, pages of a specific type, or individual pages. Effective use of robots.txt gives you a lot of control over how your site is searched, but it's not always obvious how to achieve exactly what you want. This is the first in a series of posts on how to use robots.txt to control access to your content.
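To give a flavor of that range, here is a sketch of a robots.txt file that blocks content at several of those levels (the paths are hypothetical, and the * and $ wildcard patterns are extensions that Googlebot understands rather than part of the original standard):

User-Agent: Googlebot
Disallow: /private/            # an entire directory
Disallow: /*.pdf$              # pages of a specific type
Disallow: /drafts/notes.html   # a single page

Blocking the entire site is even simpler: a single Disallow: / line, with nothing after the slash.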
What does robots.txt do?

The web is big. Really big. You just won't believe how vastly hugely mind-bogglingly big it is. I mean, you might think it's a lot of work maintaining your website, but that's just peanuts to the whole web. (With profound apologies to Douglas Adams.)
Search engines like Google read through all this information and create an index of it. The index allows a search engine to take a query from a user and show all the pages on the web that match it.
In order to do this, Google has a set of computers that continually crawl the web. They have a list of all the websites that Google knows about, and they read all the pages on each of those sites. Together these machines are known as the Googlebot. In general you want Googlebot to access your site so your web pages can be found by people searching on Google.
However, you may have a few pages on your site you don't want in Google's index. For example, you might have a directory that contains internal logs, or you may have news articles that require payment to access. You can exclude pages from Google's crawler by creating a text file called robots.txt and placing it in the root directory of your site (so it is served at, for example, http://www.example.com/robots.txt). The robots.txt file contains a list of the pages that search engines shouldn't access. Creating a robots.txt file is straightforward, and it allows you a sophisticated level of control over how search engines access your web site.
Fine-grained control

In addition to the robots.txt file -- which allows you to concisely specify instructions for a large number of files on your web site -- you can use the robots META tag for fine-grained control over individual pages on your site. To implement this, simply add specific META tags to HTML pages to control how each individual page is indexed. Together, robots.txt and META tags give you the flexibility to express complex access policies relatively easily.
A simple example

Here is a simple example of a robots.txt file:
User-Agent: Googlebot
Disallow: /logs/
The User-Agent line specifies that the next section is a set of instructions just for the Googlebot. All the major search engines read and obey the instructions you put in robots.txt, and you can specify different rules for different search engines if you want to. The Disallow line tells Googlebot not to access files in the logs sub-directory of your site. The contents of the pages you put into the logs directory will not show up in Google search results.
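Here is a sketch of how that might look with different rules for different crawlers (the drafts directory is hypothetical). Googlebot matches the first section and follows only those rules, while every other well-behaved crawler falls back to the * section:

User-Agent: Googlebot
Disallow: /logs/

User-Agent: *
Disallow: /logs/
Disallow: /drafts/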
Preventing access to a file

If you have a news article on your site that is only accessible to registered users, you'll want it excluded from Google's results. To do this, simply add a META tag to the HTML file, so it starts something like:
<html>
<head>
<meta name="googlebot" content="noindex">
...
This stops Google from indexing this file. META tags are particularly useful if you have permission to edit individual files but not the site-wide robots.txt. They also allow you to specify complex access-control policies on a page-by-page basis.
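As an aside, the same approach extends in two directions we haven't shown yet: use name="robots" instead of name="googlebot" to address all search engines at once, and combine several directives in one tag, separated by commas. For example:

<meta name="robots" content="noindex, nofollow">

Here noindex keeps the page out of the index, and nofollow tells crawlers not to follow the links on the page.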
Learn more

You can find out more about robots.txt at http://www.robotstxt.org and at Google's Webmaster help center, which contains lots of helpful information. We've also done several posts about robots.txt on our webmaster blog that you may find useful.
Next time...

Coming soon: a post detailing the use of the robots META tags, and another with specific examples for common cases.
Update: Added a sentence to paragraph 9 on access-control policies.