Most businesses, website owners and information technology (IT) professionals are concerned with their presence and ranking on the Internet. Virtually any person or organisation that maintains a website or digital content on the web has an interest in how they rank on search engines such as Google or Bing (i.e., where they place within those engines’ search results). Managing that placement has therefore become a significant focus for anyone with content on the internet.
In the race to manage where a site places within these search engines, the natural next step is Search Engine Optimisation (SEO). There is a whole industry built around SEO: the art and craft of managing where you place within these rankings, and the steps necessary to improve those placements. In essence, that is what SEO is: identifying and making the improvements to your site that will move it up the rankings. To optimise a site this way, however, you first have to understand what the search engines actually do to find your site and content. That is where a web crawler, and your site’s crawlability, come in. By understanding what a web crawler is and what your site’s crawlability means, you can then take steps towards proper SEO. This guide will take you through web crawlers, crawlability and how they tie in with SEO.
What Is a (Web) Crawler?
To discover online content, the search engine sends out little probes called crawlers, bots or even spiders. (Hence the name “web crawler,” not to be confused with the Marvel Comics superhero Spider-Man.) Once the crawler crawls a site, a copy of the site is stored with the search engine, which then uses a sorter, a processor and an indexer (or other mechanisms) to index the site.
First, the search engine finds out about a particular site (this is the discovery phase) and adds it to an extensive list of Internet sites it wants to explore, held on the search engine’s large computers. The crawlers then go from site to site, as specified on that lengthy list, poking around to work out each site’s content and relevancy. This process runs 24/7. How these crawlers go about finding this content and relevancy is part of the crucial relationship they have with SEO.
These crawlers discover this content, and ultimately relevancy, by examining the structure of each site. In theory, the search engines could simply ‘scrape’ all the content (i.e., HTML, text, images, etc.) from every page on the Internet. But this would be very tedious, and not all content is meant to be exposed to public view. Thus, most webmasters add a discoverable robots.txt file to their site. Crawlers know to look for this file first on each site from their long list. Within the robots.txt file, the webmaster specifies which areas of the website the search engine crawlers are allowed to visit and gather information from.
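As a concrete illustration, here is a minimal sketch of how a crawler interprets a robots.txt file, using Python’s standard `urllib.robotparser` module. The rules and the example.com URLs are hypothetical:

```python
# A sketch of robots.txt handling: parse hypothetical rules with the
# standard library and ask whether a given URL may be crawled.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Pages outside the disallowed areas may be fetched...
print(parser.can_fetch("*", "https://example.com/blog/post"))    # True
# ...while disallowed areas are off-limits to well-behaved crawlers.
print(parser.can_fetch("*", "https://example.com/admin/login"))  # False
```

Real crawlers fetch the live file from `https://yoursite.com/robots.txt` rather than parsing an inline string, but the allow/disallow decision works the same way.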
The mapping that the robots.txt file provides assists these crawlers in gathering information. It makes the process more efficient because, instead of having to look everywhere on the site (e.g., through all the HTML, text and images), the crawlers are directed where to go and where to look. The crawlers return the HTML for a particular page and its content back to the search engine. Once the search engine has received this information, the site’s content is added to a sizeable index (which goes along with the extensive list mentioned earlier). The index itself is a database of all the HTML content that these crawlers return.
Again, the crawlers are continually working (like the ‘busy bees’ of a hive), and as part of this process they will return to a particular site over and over again. Each time the crawler comes back to the same site, it compares the new HTML information it has gathered with the version already saved in the index. If new content is available, the index gets updated. If not, the crawler moves on.
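The compare-and-update step above can be sketched with a toy in-memory index that stores a hash of each page’s HTML. Real search engine indexes are far more sophisticated; the URLs and HTML here are made up:

```python
import hashlib

# index maps URL -> content hash; a stand-in for the search engine's index.
index = {}

def page_fingerprint(html: str) -> str:
    """Hash a page's HTML so stored and freshly crawled versions can be compared."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def recrawl(url: str, html: str) -> bool:
    """Return True if the page changed since the last visit (and update the index)."""
    fingerprint = page_fingerprint(html)
    if index.get(url) == fingerprint:
        return False          # nothing new: the crawler moves on
    index[url] = fingerprint  # new content: the index gets updated
    return True

print(recrawl("https://example.com/", "<html>v1</html>"))  # True  (first visit)
print(recrawl("https://example.com/", "<html>v1</html>"))  # False (unchanged)
print(recrawl("https://example.com/", "<html>v2</html>"))  # True  (updated)
```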
However, how frequently these crawlers come back to a particular site depends in part on how much new content is periodically available. For example, if your site does not update its content very often, the crawlers will notice this when they compare it to the previous version of the site’s HTML stored in the index. The search engine takes note and may deem your site less important than another site with more frequent content updates. In that case, the search engine still directs the crawler to check on your site, but not as frequently as the site that updates its content more often. Ultimately, this downgrade in importance, and the resulting fewer visits from the search engine’s crawler, can impact your site’s SEO rankings.
What is Crawlability and What Factors Affect It?
A more in-depth look at crawlability starts with a definition. The crawlers mentioned above have a job to do. How well and fluidly they can accomplish their given tasks on a given website determines that site’s “crawlability.”
A site’s crawlability starts with whether the site invites or disallows the crawler to scan it. Think of it like Cinderella: if she does not get an invitation to the ball, she is not supposed to go. Again, this invitation is extended via the robots.txt file. If there is no such file on the site, the crawler does not know whether it should scan the site or move on. (Whether a crawler will scan a site without a robots.txt file depends on the particular search engine’s protocol.) With the robots.txt file absent, any scanning of the site becomes harder and more time-consuming. So this is another factor that not only affects a site’s crawlability but also folds into that same site’s SEO rankings.
If the robots.txt file is present, it provides a map of where the crawler may go to gather information to report back to the index (it can also point the crawler to the site’s XML sitemap). Crawlers start with this robots.txt mapping but then move around a site by jumping from hyperlink to hyperlink. If the crawler cannot fluidly follow these links (i.e., if there are bad or broken links on your site), it will not be able to crawl your site effectively, which in turn affects your site’s SEO rankings.
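The link-following behaviour described above can be sketched with Python’s standard `html.parser` module, which pulls out the `href` targets a crawler would queue up next. The page fragment is invented for illustration:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets, the way a crawler gathers the links it will follow next."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# An invented page fragment with two internal links.
page = '<p><a href="/about">About</a> and <a href="/contact">Contact</a></p>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # ['/about', '/contact']
```

A real crawler would then fetch each extracted URL in turn; a broken link is one of these targets returning an error instead of a page.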
Another ‘gatekeeper’ that determines whether a crawler will be allowed to do its work is the HTTP response. Each response carries a status code, which tells the crawler either to proceed or that access is disallowed. If the status code says that a page does not exist (for example, a 404), the crawler will not crawl that page.
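A toy sketch of how a crawler might react to common status codes follows. The exact behaviour varies by search engine; this mapping is illustrative, not any engine’s actual policy:

```python
def crawl_decision(status: int) -> str:
    """Illustrative mapping from an HTTP status code to a crawler's next move."""
    if 200 <= status < 300:
        return "crawl and index"
    if status in (301, 302, 307, 308):
        return "follow the redirect"
    if status in (404, 410):
        return "skip: page does not exist"
    if status == 503:
        return "retry later: server temporarily unavailable"
    return "skip"

print(crawl_decision(200))  # crawl and index
print(crawl_decision(404))  # skip: page does not exist
```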
Still another layer of directive may exist before the crawler gets down to work. There could be a robots meta tag within the page’s HTML head (or an equivalent X-Robots-Tag in the HTTP response headers) that disallows the search engine from indexing that specific page. In this case, most search engines will still crawl the page, but will not add its content to the search engine’s index.
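A minimal sketch of detecting a noindex directive in a page’s HTML head, again using Python’s standard `html.parser`. The sample page is hypothetical:

```python
from html.parser import HTMLParser

class RobotsMetaCheck(HTMLParser):
    """Flag a <meta name="robots" content="...noindex..."> directive if present."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            if (attributes.get("name", "").lower() == "robots"
                    and "noindex" in (attributes.get("content") or "").lower()):
                self.noindex = True

# A hypothetical page that asks engines not to index it.
page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
checker = RobotsMetaCheck()
checker.feed(page)
print(checker.noindex)  # True
```

The `follow` part of the directive is why the crawler may still traverse the page’s links even though the content stays out of the index.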
Tools and Tips
There are tools and tips out there to help improve your site’s crawlability and indexability, ultimately leading to improved SEO rankings. Some are third-party add-ins that assist with the SEO process, while others simply involve proper site maintenance.
As noted above, there is a whole industry around this SEO ranking management aspect of web presence. As part of that industry, third-party tools are available as plug-ins for your site to help improve its crawling, indexability and, ultimately, SEO ranking. Usually for a fee, these coded solutions often simulate the search engine crawler’s behaviour to help ensure the crawler has an easier job when it comes around to your site. Some plug-ins even interact with these crawlers to help them accomplish their task more efficiently, as well as helping your site rank higher in search results.
A non-programmed ‘tool’ of sorts is to hire an individual or company experienced in SEO ranking methodology. SEO specialists know what it takes to fine-tune your site for the crawler and crawlability aspects of web presence. Their work often yields increased SEO rankings and a ‘cleaned up’ site with regard to the crawler and its activities.
There are also numerous ‘tricks of the trade’ you can undertake to fine-tune your website yourself. These adjustments throughout the site help the crawler and improve crawlability, leading to greater indexability and better SEO rankings.
Some of these fine-tuning steps include:
- Managing site structure (including the interplay of the robots.txt file, robots meta tags and HTTP status codes)
- Verifying that internal links work
- Fixing any looped redirects present on the site (as these prevent crawling)
- Fixing any server errors that block navigation (these also prevent the crawler from moving around the site properly).
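The looped-redirect problem from the list above can be illustrated with a small sketch: a crawler follows a chain of redirects and gives up if it revisits a URL. The redirect table is a stand-in for real HTTP 301/302 responses:

```python
from typing import Optional

def resolve(start: str, redirects: dict, limit: int = 20) -> Optional[str]:
    """Follow a chain of redirects; return the final URL, or None on a loop."""
    seen = set()
    url = start
    while url in redirects:
        if url in seen or len(seen) >= limit:
            return None  # redirect loop (or over-long chain): the crawler gives up
        seen.add(url)
        url = redirects[url]
    return url

# A normal redirect resolves to a final page...
print(resolve("/old-page", {"/old-page": "/new-page"}))  # /new-page
# ...but a loop never reaches one, so nothing gets indexed.
print(resolve("/a", {"/a": "/b", "/b": "/a"}))           # None
```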
Finally, another tip for improving SEO rankings via crawlers is to update the content on your site frequently. As noted above, the search engines will send out crawlers to index your site more often if they regularly find new content to catalog.
Hopefully, this article has shed some light on how SEO rankings develop, including how crawlers and site crawlability factor into the overall ranking process. Beyond the goal of maintaining or improving SEO rankings, the tips outlined above for enhancing crawlability will also improve the overall functionality of your site. Improved functionality means a happier visitor experience, and this too can help your SEO rankings, because the more visitors a website gets, the more the search engines will want to crawl it.