
Ultimate Guide to Robots.txt

Whether you’re developing your first website or you’ve been running one for a while, you’ve probably heard self-proclaimed SEO “experts” on forums and comment sections talking about the “robots text” file. Maybe you’re a little confused–just what is the robots.txt file, and how should you be using it? Should you even be using it?

Never fear! We’re here with another handy guide that will walk you through this critical element of SEO success, all in terms that nearly anybody can understand.

An Analogy to Get Us Started

This will sound like a random question at first, but we promise that we’re going somewhere with it:

Did you ever see the old television show “Mission: Impossible,” or the revamped movie series with notorious actor/Scientologist/Oprah’s-Couch-Jumper Tom Cruise? If not, all you need to know is that it was a spy television show that typically began with the main character being told his mission for the week–“Your mission, should you choose to accept it, is ______.”

Your robots.txt file is kind of like that classic television introduction, except instead of giving directions to a spy, you’re giving them to a search engine. This text file says to Google, “Your mission, should you choose to accept it, is to index the following pages of my website–but stay away from these other ones!”

Maybe you’re more of a literary person. In that case, much like in the classic novel Jane Eyre, your robots.txt file tells Google which rooms it can and cannot enter–it says, “You must never go in there, Google!”

Nike's robots.txt file

Robots.txt for the Absolute Beginner

Okay–silly pop culture references are fun, but let’s define the robots.txt file more precisely. If you already have a reasonable understanding of it, feel free to scroll down to the next section.

Your site’s robots.txt file tells search engines what you want crawled and what is forbidden. These instructions can target Googlebot, Googlebot-Image, or any other crawler or bot trying to crawl your website for content. The file can also define parameters to guide a bot’s behaviour, such as slowing down its crawl rate.
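To give you a feel for the format, here’s a minimal sketch of a robots.txt file (the folder name and the delay are placeholders, not recommendations):

User-agent: *
# Don't crawl anything under this folder
Disallow: /private/
# Ask crawlers to wait ten seconds between requests (not every engine honours this)
Crawl-delay: 10

We’ll unpack each of these directives, and their quirks, throughout this guide.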

This guide will cover how you can optimise your robots.txt file to boost your site’s performance and its position on search engine results pages (SERPs), and how to keep specific pages private and inaccessible.

Once you familiarise yourself with the primary directives of this file, robots.txt might seem pretty simple, but even a small mistake can cause a lot of damage to your site’s functionality. It’s important to understand all of the ins and outs as well as the potential pitfalls of your robots.txt file.

Why Would You Want to Block Certain Pages?

Perhaps you’re thinking, “I’ve got nothing to hide on my website–why would I ever care if a page was visible/indexed or not?” Well, it’s not only a matter of privacy and concealing potentially damning or personal info–it’s a matter of your site’s functionality.

There are many reasons you might want to block pages on your website from being indexed. Here are a few of the most common directories and pages you might want to keep private (we’ll show a sample file right after the list):

  • Admin pages, like /wp-admin/
  • Bins, like /cgi-bin/
  • Page scripts, like /scripts/
  • Shopping cart info, like /cart/
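
Here’s what blocking those four directories might look like in practice: a sketch, assuming the default paths above (yours may differ). One User-agent line, then one Disallow line per directory; nothing more is required.

User-agent: *
Disallow: /wp-admin/
Disallow: /cgi-bin/
Disallow: /scripts/
Disallow: /cart/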

You can also use robots.txt to eliminate the redundancies posed by duplicate pages and other subcategories and folders within your site that carry plenty of URLs, but not much valuable information for somebody looking up your site on Google.
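If your duplicates come from URL parameters (a common culprit), the major engines also understand the * wildcard, though it was never part of the original standard. A sketch, with made-up parameter names for illustration:

User-agent: *
Disallow: /*?sort=
Disallow: /*?ref=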

This is by no means an exhaustive list of the pages that sites might choose to keep un-indexed, but it’s a good start for beginners. Now, let’s get a little more technical.

What is the Robots.txt File’s Preferred “Language”?

Now that you’ve got some idea of the information you may not want indexed, let’s look at how to accomplish this goal.

We’ll start with the good news: you don’t need to know HTML, JavaScript, or any other coding language to create a robots.txt file that will meet all of your site’s needs. You will need to learn the directives required for this file, but the learning curve isn’t as steep as becoming “multi-lingual” in computer-speak.

The language required of your robots.txt file is, unsurprisingly, straightforward and robotic. Below, we’ll show you how to communicate this information in the right way.

Robots.txt is sometimes known as the “Robots Exclusion Protocol”–if, to paraphrase “The Big Lebowski,” you’re not into the whole “brevity thing.” It is one of your site’s most essential elements, and it’s stricter than an old Catholic school nun, so watch out for your knuckles!

On the plus side, there’s no grey area to the robots.txt file: something is either a 1, or it’s a 0 (or, in this case, “allowed” or “disallowed”).

Why so serious? Well, some of the protocol’s ironclad rules are due to how early in the internet’s history it was created–search engine spiders weren’t as advanced then as they are now, but the Robots.txt file has stuck around.

Do I Need a Robots.txt File?

Those running small, simple sites understandably want to minimise time spent coding and doing other work related to the site’s functionality. We’re not calling anyone lazy or judging that desire at all–we’d still suggest creating a robots.txt file, however.

After all, you never know where your site will end up in the coming weeks, months, or years. It might grow considerably more prominent and popular in time, and a properly written robots.txt file acts as a sort of SEO “preventative medicine,” keeping your search results from getting too unruly as you build your site.

Learning the Rules of robots.txt

Robots.txt files are all about rules, so it’s important to know how to write them in the right fashion. You can write these in nearly any text editor that saves plain ASCII or UTF-8 files; don’t use word processors like Google Docs or Microsoft Word.

Rule #1

User-agent: Googlebot (or whichever bot you’re “talking to”)
Disallow: /index.htm (or whatever part of your site you want to disallow)

Rule #2

User-agent: Googlebot
Allow: /

This example would allow Googlebot to crawl your entire site–an option best left to smaller sites without many images, comments, and other information.

To “talk to” every searchbot, use the symbol * for User-agent. Not every engine responds the same way (as we’ll look at later, Google decides that it makes the rules), but the asterisk is a good place to start for beginners.
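One wrinkle worth knowing: a crawler follows only the most specific User-agent group that matches it, not all of the groups combined, and (at least as Google interprets things) the longest matching path rule wins within a group. A sketch with placeholder paths:

User-agent: *
Disallow: /cgi-bin/
Disallow: /scripts/

User-agent: Googlebot
Disallow: /scripts/
Allow: /scripts/public.js

Here Googlebot skips the * group entirely; within its own group, the longer Allow rule outweighs the shorter Disallow, so /scripts/public.js stays crawlable.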

Noindex

While “disallow” prevents the page in question from being crawled, it may still get indexed (especially by Google, if other sites link to it). The unofficial Noindex directive was meant to take things a step further. A word of warning, though: Noindex was never part of the official robots.txt standard, and Google announced in 2019 that its crawlers no longer honour it, so the reliable way to keep a page out of the index is a noindex meta tag (or X-Robots-Tag header) on the page itself. With that caveat in mind, the following example tells search engines to keep the /secret/ directory from being indexed:

User-agent: *
Noindex: /secret/

How Your Site’s Robots.txt File Communicates

Search engines like Google index websites by “spidering” them (following links from site 1 to 2 to 3, etc.). When the spider encounters a new domain, the Robots.txt file is the first thing it looks at–this is where it gets its “marching orders.”

Understanding How Search Engines Work

Let’s expand on the concept we presented above. Simply put, a search engine has two real jobs:

  • “Crawling” through the web to discover and catalogue content (spidering, for those “in the know”)
  • Creating an index of that content to present to online searchers

When the search engine first encounters a new site, it looks around to see if the site has a robots.txt file. If the site doesn’t contain one of these files, the spider proceeds to crawl and index the whole site.

For small sites, this might not present much of a problem. For larger ones–especially those that contain comment sections, countless articles and images, and even an internal search feature–this can cause a jumbled mess.

Page redundancies (when duplicate pages are indexed) essentially make you your own competition in a Google search, lowering your site’s overall efficacy. If someone searches for a page on your site called “The Best Chocolate Biscuit Recipe in the World” and it pops up four times on Google, not only is this confusing for the reader, it wastes your site’s “page allowance” set by Google and other engines.

To quote hip-hop producer/Zen philosopher DJ Khaled, “Congratulations–you just played yourself.”

Notable Pros and Cons of Robots.txt

Given that the robots.txt file was dreamed up in the middle of the 1990s, at a time when Ace of Base was telling us how they’d seen “The Sign,” and baggy jeans and midriff-baring shirts were all the rage, some distinct pros and cons have appeared with age.

Pro: Budgeting Your Site’s Indexed Page Allowance

Every site on the internet gets a specific allowance, often called a “crawl budget” (much more generous than what your parents gave you as a child, most likely). This allowance dictates how many pages a search engine will crawl and index for a specific site.

If you block some of your pages, you can save your budget to use on other (more important) parts of your site. For example, maybe significant portions of your site are “under construction” and don’t reflect the overall quality of your other pages. If their SEO isn’t quite up to snuff yet, it’s beneficial to block the search engine from these parts of your site.
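A sketch of what that might look like, assuming the unfinished pages live under a hypothetical /under-construction/ folder (just delete the rule once they’re ready for prime time):

User-agent: *
Disallow: /under-construction/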

Pro: A Streamlined and Optimised Search Engine Presence

By removing duplicate content, repeat results, and irrelevant information like internal search pages, you can present the best of your pages to Google, Bing, or whichever search engine your customers might be using. This, in effect, will improve your SEO.

The other, secondary pages would likely still be available to someone who knows what they are looking for, but somebody who just entered “(Your Site)” into the search bar would get only the “cream of the crop.”
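Internal search results are a classic example of pages worth hiding. A sketch, assuming your site’s search lives at /search/ or uses an ?s= parameter (WordPress’s default; check what your own platform uses):

User-agent: *
Disallow: /search/
Disallow: /*?s=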

Con: Accidentally Blocking Content

A quick perusal of Quora and other question sites regarding the robots.txt file will give you a look at a lot of frantic site admins trying to figure out what went wrong. Usually, the story goes something like this:

I updated my robots.txt file, and now users are telling me they can’t find me on Google. Plz, help!!!

As we mentioned above, it’s easy to make mistakes within this file–whether through renaming it, relocating it, or creating the file in a word processor.

Con: Your Robots.txt Is Visible to Potential Hackers

If you were a pirate on a treasure hunt, you’d probably keep your treasure map somewhere safe, and not publish it in a book or newspaper, or nail it prominently to the ship’s mast to tell the whole world where your booty is hidden. (By the way, we’d like to join your pirate crew. As fun as helping people learn the fundamentals of SEO is, we too yearn for a life of adventure, rum, and scurvy. Okay, maybe not scurvy. And if you’re ever browsing an SEO company’s website, a good place to check for open positions is its robots.txt file. Wink wink.)

Why do we bring this up? The fact is: anybody “hiding” secure information online needs much more than a well-constructed robots.txt file to keep that information from getting into the wrong hands.

Because robots.txt is a publicly visible area of your site, it’s considered somewhat of a treasure map. Encryption, password protection, and other tools are still needed to keep user and admin data under wraps!

Crawl Rates and Crawl Demand

Search engines–Google specifically–want to crawl your site thoroughly, but they also aim to provide prompt, fast service for searchers, so they try not to hammer your server in the process.

If your server is down, having trouble, or merely slow, Google automatically adjusts its “crawl rate” over your pages. Your robots.txt file allows you to meet this problem head-on by slowing the crawl of pages that involve lots of data, high amounts of traffic, or other potential issues.

Slowing the crawl rate for Bingbot in robots.txt

Crawl rates and demand are also frequently increased if you move your site to a new domain–Google naturally “ups” its efforts to start re-indexing all of your information. This can cause many glitches for you and your readers/customers.
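If you want to apply the brakes yourself, the Crawl-delay directive is the tool, with one big caveat: Bing and Yandex honour it (the value is read as seconds), but Google ignores it, so you’d manage Google’s crawl rate through Search Console instead. A sketch along the lines of the screenshot above:

User-agent: Bingbot
Crawl-delay: 10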

Creating a Robots.txt

While there are plenty of tools that generate robots.txt files automatically, we recommend building your own: the extra oversight and customisation are well worth it, and it’s easy to do.

First, enter your URL. Then, place /robots.txt after it–i.e. http://www.yoursite.com/robots.txt

Do you have any information there already? Good–it’s most likely set to a pretty standard default setting.

If you’re not sure how to find your site’s source code, a good first bet is to go to your hosting website, log in, and navigate to its file management section. Your robots.txt file should be one of the first files you see.

To customise your text file, first, delete the original text.

If that URL turns up empty or returns a 404 error, you’ll need to create the file yourself. Open Notepad (if you use Windows), TextEdit (for Mac people), or a free online plain-text editor like Editpad.org.

Write your desired directives in the format we showed you above. For example, to block every search engine from crawling your wp-admin directory, you’d type the following:

User-agent: *
Disallow: /wp-admin/

That’s about it! You can write however many directives you want, and you don’t need to “end” the text with any special wording, code, or post-script. Seek out the help you need if you’re having trouble figuring out how to accomplish a certain task on your robots.txt file.

Testing Your Robots.txt

With all of the hard work you’ve done on optimising your site for search engines, it’s essential not to throw it all away with a robots.txt file that sabotages your site.

Google’s robots.txt Tester, found in Search Console, allows you to check whether or not this is the case. Just submit the URL you want to check, and the testing tool will highlight potential errors and problems with syntax. It’ll let you know when your robots.txt file is operating correctly.

Note that these changes are not added or saved to your site; you’ll have to do this separately on your own. Luckily, that’s as simple as copying and pasting the information into your site’s robots.txt file (replacing your outdated commands, of course).

How Google Continues to Change the Game (for Better or Worse)

To paraphrase J.R.R. Tolkien, Google will seemingly not rest until it has become the “One Search Engine to Rule Them All.” This single-minded desire has led the tech giant to make new rules, break old ones, and, in many cases, simply decide that the rules don’t apply to mighty Google. Robots.txt files are one of the places where this attitude shows.

Over the last couple of years, Google has started to change the rules (for itself, at least); they’ve decided that they can, at least partially, pretend that your robots.txt file doesn’t apply to them.

They’ve made these changes in the name of “efficiency” and “effectiveness”–in reality, it seems a lot more like they want to be the ultimate arbiter of what’s relevant and what’s not within your website’s pages.

The compromise they’ve made, however, at least has one upside: a disallowed page can still show up in results if enough other sites link to it, but Google will honour your robots.txt by leaving the “blocked” page’s description blank. How courteous of them!

While every effort has been made to present up-to-date, accurate information within this guide, it’s still a good idea to check in with Google, as their monolithic policies can change depending on the way the wind is blowing. Their official robots.txt documentation holds the master list of these and other guidelines.

Takeaway Points for Your Robots.txt File (Checklist)

Remember the song at the end of “The Breakfast Club,” the one that goes, “Don’t you…forget about me!”? Here are some points to remember whenever you deal with your site’s robots.txt file:

  • Name the robots.txt file “robots.txt.”
  • It’s important to make the robots.txt file accessible by placing it in your site’s top-level (root) directory.
  • Be careful when you change your robots.txt file–this can block vast portions of your site from search engines. With all of the hard work you’ve done to optimise your website, don’t let that hard work go to waste!
  • Don’t use the crawl-delay directive unless you need to (i.e. if you have a large site with a lot of visitors)
  • Your robots.txt file will be readily available for those interested in it. Repeat: this is not private information, so don’t store any of your deep, dark secrets in there.
  • Subdomains within a single site each require their own robots.txt file (see the example below)
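
For instance, assuming a main site and a blog subdomain (hypothetical URLs), each file only governs the host it lives on:

https://www.yoursite.com/robots.txt (rules for www.yoursite.com only)
https://blog.yoursite.com/robots.txt (rules for blog.yoursite.com only)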

Technical Terms You Need to Know

User-agent–The search engine “bot” you’re giving directions to
Disallow–The command that tells the bot not to crawl a given page or subset of pages
Allow–The command that permits the bot to crawl and index the page, showing it (and a description) on the search engine
Noindex–Tells the bot not to index the page or subset of pages (no longer honoured by Google; use a noindex meta tag instead)
Crawl-delay–A directive asking the bot to wait a given number of seconds between page requests (ignored by Google)

Whew! We know that we just threw a lot of information at you; don’t worry about absorbing it all at once. Understand the importance, the function, and the limitations of the robots.txt file, and your site will be much better off for it!


About The Author

Ajay Chavda is the co-founder of Weboptimizers, an SEO agency in Melbourne, and has been involved with SEO for over 15 years. Across the digital properties and security forums he has managed, his articles have been read by approximately 50 million unique visitors.
