Crawl Budgets: How to Stay in Google's Good Graces
A number of different definitions of "crawl budget" have been floating around, but there is no single term that describes everything the concept covers externally. Here, we will clarify what it means and how it relates to Googlebot.
First, it should be noted that crawl budget is not something most publishers will ever have to worry about. If new pages on a site tend to be crawled the same day they are published, a webmaster does not need to be concerned about crawl budget. Likewise, a website with fewer than a few thousand URLs will usually be crawled efficiently. Larger websites, and websites that auto-generate pages based on URL parameters, are the ones that benefit from prioritizing what gets crawled, when it gets crawled, and how many resources the server hosting the website can devote to crawling.
The Crawl Rate Limit
Googlebot is designed to be a good citizen of the web. Crawling is its main priority, but it also makes sure it does not degrade the experience of users visiting the website. This is called the crawl rate limit, which caps the maximum fetching rate for a given site. Put simply, it represents the number of simultaneous parallel connections Googlebot can use to crawl the website, as well as the time it waits between fetches. The crawl rate can move up and down based on a few things: if the website responds quickly for a while, the limit goes up and more connections can be used to crawl; if the website slows down or returns server errors, the limit goes down and Googlebot crawls the site less. This is called crawl health. Site owners can also set a limit in Search Console to reduce how much Googlebot crawls their website, though setting a higher limit will not automatically increase crawling frequency.
The Crawl Demand
Even when the crawl rate limit has not been reached, if there is no demand from indexing, there will not be much activity from Googlebot. Popularity and staleness play an important role in determining crawl demand. URLs that are more popular on the Internet tend to be crawled more often to keep them fresh in the index, and Google's systems also try to prevent URLs from going stale in the index. URLs that are rarely updated by their webmasters may not be crawled frequently, since there is no new information for Google to pick up. These sites have low crawl demand.
Site-wide events, like a website move, can also trigger an upsurge in crawl demand so that the content can be reindexed under the new URLs. Taking crawl rate and crawl demand together, crawl budget is defined as the number of URLs Googlebot can and wants to crawl.
Crawl Budget Factors
Analysis has shown that having many low-value-add URLs can negatively affect a site's crawling and indexing. These low-value URLs tend to fall into the following categories: on-site duplicate content, faceted navigation and session identifiers, soft error pages, infinite spaces and proxies, hacked pages, and low-quality and spam content.
When server resources are wasted on pages like these, crawl activity is drained away from pages that actually hold value, which can significantly delay the discovery of quality content on a website.
Optimization for Crawling and Indexing
New websites are created on the internet every day, and Google has only a finite amount of resources. Faced with a near-infinite amount of content online, Googlebot can only find and crawl a portion of it, and only a portion of that can be indexed. URLs act as a bridge between a search engine's crawler and a website: crawlers need to be able to cross the bridge (find and crawl the URL) in order to reach the site's content. If URLs are overly complicated or redundant, crawlers end up retracing their steps unnecessarily. When URLs are clean, organized, and lead directly to the intended content, crawlers spend their allotted time accessing that content instead of weeding through obsolete pages or looking at the same content over and over again at different URLs.
You should remove user-specific details from URLs, including session IDs and sort orders. Once removed from URLs, these can be stored in cookies instead. By doing this and then redirecting to a cleaner URL, you retain the information you need while reducing the number of URLs that point to the same content, resulting in more efficient crawling.
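As a hypothetical illustration (the domain and parameter names here are made up), a listing page that carries a session ID and sort order in the URL could move both into a cookie and 301-redirect to the clean version:

```
https://www.example.com/widgets?sessionid=A1B2C3&sort=price_asc
    → 301 redirect, session and sort preferences stored in a cookie →
https://www.example.com/widgets
```

Crawl activity then consolidates on a single URL rather than being spread across every possible session and sort combination.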
You should aim to disallow actions that Googlebot cannot perform. Use the robots.txt file to disallow crawling of login pages, shopping carts, contact forms, and other pages whose purpose is to do something a crawler cannot do. It is wise to have the crawler skip things like that and spend its time on the content that actually means something to search engines.
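A minimal robots.txt sketch along these lines might look like the following; the paths are placeholders and would need to match your site's actual structure:

```
User-agent: *
Disallow: /login/
Disallow: /cart/
Disallow: /contact/
```

Keep in mind that disallowing a path in robots.txt stops it from being crawled, but it does not by itself remove a URL from the index if other pages link to it.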
One URL should lead to one set of content. In a perfect world, there would be a one-to-one pairing of URL and content: each URL would lead to a unique piece of content, and that content would only be accessible through that one URL. The closer a site gets to this ideal, the more streamlined its crawling and indexing will be. If the CMS or site setup makes this hard to achieve, you can use the rel=canonical element to indicate the preferred URL for a particular piece of content.
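For example, if the same page were also reachable through a parameterized URL, a rel=canonical link in the head of that variant could point crawlers to the preferred version (the URLs below are placeholders):

```
<!-- On https://www.example.com/widgets/blue-widget?ref=homepage -->
<link rel="canonical" href="https://www.example.com/widgets/blue-widget" />
```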
Control infinite spaces. Does your website link to something like a calendar with an endless number of past and future dates, each with its own unique URL? Does your website have paginated data that returns a status code 200 when &page=3563 is appended to the URL, even though there are nowhere near that many pages? If so, you probably have infinite crawl spaces on the website, and crawlers may be wasting bandwidth trying to crawl them all.
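One way to keep crawlers out of such a space is to block the offending URL pattern in robots.txt; the path below is a placeholder for whatever your calendar URLs actually look like:

```
User-agent: *
Disallow: /calendar/
```

For paginated data, returning a 404 rather than a 200 for page numbers beyond the real range is another common way to signal that the space is not actually infinite.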
More to Know
Crawling is how websites make it into Google search results. An efficient crawl of a site helps it get indexed in Google search, and a properly indexed website is able to appear correctly in the search engine results pages.
Making a website faster improves the user experience while also increasing the crawl rate. For Googlebot, a fast website is a sign of healthy servers, so it can get more content over the same number of connections. Conversely, a significant number of 5xx errors or connection timeouts signals the opposite, and crawling slows down. Anyone concerned about this should refer to the Crawl Errors report in Search Console.
An increased crawl rate does not necessarily mean a website will rank better in search results. Google uses hundreds of signals to rank results, such as the quality of the content. Crawling is vital for showing up in the results at all, but it is not a ranking signal.
In general, any URL that Googlebot crawls counts toward a website's crawl budget. Alternate URLs, such as AMP or hreflang versions, and embedded content, like CSS and JavaScript, may also need to be crawled, and that consumes crawl budget too. Long redirect chains have a negative effect on crawling as well: if more than one redirect is not strictly necessary, avoid it, since chains are frustrating for users and eat up a decent amount of crawl budget.
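As a hypothetical example, a link that hops through an old HTTP URL and a non-www URL before reaching the final page costs two fetches; collapsing the chain into a single redirect costs one:

```
Chain:     http://example.com/page → 301 → https://example.com/page → 301 → https://www.example.com/page
Collapsed: http://example.com/page → 301 → https://www.example.com/page
```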
As for the crawl-delay directive, Googlebot does not process it, so it is of no use for managing Google's crawl rate.
Any URL that gets crawled affects the crawl budget, so even if a page marks a link as nofollow, the URL may still be crawled if another page on the website, or elsewhere on the Internet, links to it without nofollow.
With a crawl budget in place, Google prioritizes what to crawl, when to crawl it, and how many resources the server hosting the website can give to crawling. This matters most for larger websites and for sites that automatically generate pages based on URL parameters. Think of it like this: you have a filing cabinet full of documents. Some documents exist in two or more copies; others are one-of-a-kind originals. You have a fixed amount of time to go through all of these documents, one by one, and file them appropriately. The task would take significantly less time if you only had to sort the originals, since the copies serve no purpose. In the same way, when a website has duplicate content problems, its other content may not be crawled and indexed as thoroughly, meaning it is not reflected as well in the search engine results.
A crawl budget should not be wasted on duplicate content or content with little value; save it for the good stuff. This may not occur to those who are new to publishing on the web: some assume that the more often a piece of content appears, the more likely it is to be seen. That could not be further from the truth. Content keeps its importance when it is unique, factual, and high quality. When the same content shows up over and over again across the web, its quality is diluted, it stops being unique, and it may lose credibility.
Knowing what a crawl budget is and how it works does not only benefit crawling and indexing; when you understand it and build a website around it, the site as a whole benefits. The crawl rate limit can be improved by making the server hosting the website as responsive as possible. One of the most recommended ways to do this is to configure page caching with W3 Total Cache or a similar solution and to choose a host that uses RAM-based caching, such as TMDHosting or SiteGround. While a higher crawl rate helps all of a website's pages get indexed, Google has made it clear that a higher crawl rate does not equate to a higher ranking in the search engine results pages. Even so, the actions taken to optimize crawl rate may bring a slight ranking benefit as a side effect, simply because of the combined factors involved, like reducing duplicate pages.
Knowing what a crawl budget is and how to work with it is one more small thing webmasters can do to keep their site as healthy as possible, whether it is addressed at the conception of the website or during a redesign or clean-up. Google does not care whether a site's owners are aware of crawl budget, since crawling and indexing are fully automated, but a webmaster's actions can ensure that Googlebot works efficiently for their site.