Matt Cutts has a good video today on Google Webmaster Central explaining how to prevent certain pages on your website from being crawled by the search engines.
You really need to be familiar with four methods of preventing the spiders from crawling your pages:
- robots.txt
- the noindex meta tag
- the nofollow tag
- password protection
Your htaccess file is a ticket to solving a lot of your search engine problems. Not all of them, but some of them. It’s a configuration file on your server that tells the web server how to handle requests for your pages, which in turn controls what browsers and search engine spiders see. One common use of this file is redirecting old web pages to new web pages. Frequently, webmasters will update their information and, when doing so, change the URL of a web page. If you do that, the old web page is still indexed, and when people try to visit it they will get a 404 error page. To prevent that from happening, you can add a 301 redirect command to your htaccess file that sends traffic to your new page.
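As a sketch of what that looks like in an Apache .htaccess file (the page paths here are placeholders, not from the video), a 301 redirect for a single moved page can be written either way:

```apache
# Simple permanent redirect using mod_alias (placeholder paths):
Redirect 301 /old-page.html /new-page.html

# Or the mod_rewrite equivalent, if that module is enabled:
RewriteEngine On
RewriteRule ^old-page\.html$ /new-page.html [R=301,L]
```

Either version returns an HTTP 301 status, which tells both visitors' browsers and the search engine spiders that the move is permanent.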
But the htaccess has other uses as well and you can actually use it to tell the search engines certain information that will prevent them from crawling your web pages. More on this later.
Perhaps the most common way to instruct search engines not to crawl certain pages of your website is the robots.txt file. You can use this file to tell all the search engines, or just some of them, not to crawl specific pages. You just give the URLs of the pages you don’t want to be crawled and specify which search engines are not allowed to crawl those pages.
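For instance, a minimal robots.txt (the directory and page names are placeholders) might block one directory for every crawler and one extra page for Googlebot alone:

```
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/old-page.html
```

The file lives at the root of your domain, and each `User-agent` block applies its `Disallow` rules only to the crawler it names.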
The noindex meta tag is a bit different from the robots.txt file. It tells the search engines not to show a page in their results. They’ll still crawl the page, but they won’t include it in their index, so anyone searching for a key term will not see that page on that search engine. Again, you can target specific search engines or make the tag apply to all of them.
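The tag itself is a one-liner placed in the page's `<head>`; a sketch of both forms:

```html
<!-- Applies to all crawlers: -->
<meta name="robots" content="noindex">

<!-- Or target a single crawler by name: -->
<meta name="googlebot" content="noindex">
```

Note that the crawler has to be able to fetch the page to see this tag, which is why the page still gets crawled even though it stays out of the index.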
The nofollow meta tag tells the search engines not to follow certain links. So if the only link pointing to one of your pages is a nofollow link, the spiders will never discover that spin-off page by following it. You can nofollow every link on a page with a meta tag, or just individual links.
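Both forms are short; a sketch with a placeholder URL:

```html
<!-- Page-level: crawlers should not follow any link on this page. -->
<meta name="robots" content="nofollow">

<!-- Link-level: nofollow just this one link. -->
<a href="/spin-off-page.html" rel="nofollow">Spin-off page</a>
```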
Finally, if you password protect certain pages, the search engines will not crawl them. They cannot guess your password so those pages are safe. Users of your website can get to them, but the search engines cannot. You can password protect your pages using the htaccess file that I discussed earlier.
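A minimal sketch of htaccess password protection using basic authentication (the file path and realm name are placeholders):

```apache
# Protect this directory with a username/password prompt.
AuthType Basic
AuthName "Restricted Area"
AuthUserFile /home/example/.htpasswd
Require valid-user
```

The `.htpasswd` file holds the allowed usernames and hashed passwords; it is typically created with Apache's `htpasswd` utility and kept outside the web root so it cannot be fetched directly.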
Keep in mind that there are complications with each of these methods. The safest and most powerful is htaccess password protection. The least effective is the nofollow tag, because while the links aren’t followed, the page itself still sits on a server somewhere. If you visit that page in your browser, then move on to another page on your website, and your analytics program publishes referrer links, that referrer link could get crawled and the page will still get traffic. Not a lot, but some, and you also run the risk of someone else linking to it directly. The same problem applies to noindex tags and robots.txt files, so be careful.
For more information on preventing your pages from being crawled, watch Matt Cutts’ video on that topic. He also discusses how to de-index certain URLs you have mistakenly indexed.