XML Sitemap Best Practices
Last Edited September 11, 2023 by Garenne Bigby in Sitemaps
In the process of optimizing websites, an extremely important part of the puzzle is submitting a sitemap. The purpose of the sitemap is to ensure that search engines will discover all pages contained on the website, and will download them quickly when changes have been made. Here, you will discover more about why sitemaps are important, how to optimize them for search engines, and when to use an XML sitemap versus an RSS or Atom feed.
Sitemaps and RSS/Atom Feeds
A sitemap can be in XML, RSS, or Atom format. It is important to note the difference between the formats. An XML sitemap describes the entire set of URLs within a website, while an RSS or Atom feed describes only the recent changes. The implications of this are that:
- XML sitemaps will be large while RSS and Atom feeds are small and will contain only the most recent updates to the website.
- XML sitemaps will be downloaded less frequently than an RSS or Atom feed.
To ensure the most optimal crawling of your website, it is recommended to use both an XML sitemap as well as an RSS or Atom feed. The XML sitemap is made to give Google information about all of the individual pages on the website, while the RSS or Atom feed will provide all new updates to the website to the search engine, and will help Google to keep the content fresh within the index. It should be noted that submitting a sitemap or a feed will not guarantee the indexing of the URLs.
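For reference, a minimal XML sitemap listing two pages might look like the sketch below (the domain example.com and its URLs are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2023-08-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/about/</loc>
    <lastmod>2023-07-02</lastmod>
  </url>
</urlset>
```

The full sitemap would repeat one <url> entry per page on the site, while the companion RSS or Atom feed would carry only the entries that changed recently.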
The Best Practices
Essentially, XML sitemaps and RSS/Atom feeds are lists of URLs attached to some form of metadata. The two most vital pieces of information for Google are the specific URL and the last time of modification.
The URLs in an XML sitemap or RSS or Atom feed need to follow these specific guidelines:
- Only include URLs that may be fetched by Googlebot. A common mistake is to include a URL that is disallowed by robots.txt, which means it cannot be fetched by Googlebot at all.
- Also, do not include the URLs of webpages that do not exist.
- You must only include canonical URLs. Webmasters commonly make the mistake of including the URLs of duplicate pages; this increases the load on the server without improving indexing at all.
In an XML sitemap or RSS feed, you must specify a time of last modification for each URL. This modification time needs to be the last time that the content contained in the page was changed meaningfully. If a change in the content is meaningful enough to appear in the search results, then the time of that modification is what needs to be present in the sitemap.
Do not forget to set or update the last modification time in the correct manner. The correct format for XML sitemaps is the W3C Datetime format. Only update this time when there has been a purposeful change in the content. Do not make the mistake of setting the last modification time to the current time whenever the sitemap is served.
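As a sketch, both of the <lastmod> values below are valid W3C Datetime formats; the first is date-only, the second includes a time and timezone offset (the URLs are hypothetical):

```xml
<url>
  <loc>https://www.example.com/pricing/</loc>
  <lastmod>2023-09-01</lastmod>
</url>
<url>
  <loc>https://www.example.com/blog/launch/</loc>
  <lastmod>2023-09-01T14:30:00+00:00</lastmod>
</url>
```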
More on XML Sitemaps
An XML sitemap contains the URLs of all webpages on your website. Often, these files are large and not updated frequently. To get the most out of your XML sitemap, follow these guidelines:
- When utilizing a single XML sitemap, update it at least once a day if the website changes regularly, and then ping Google after you update it. Pinging simply means that the sitemap URL is submitted straight to the search engine, which returns status information along with any processing errors that are present.
- When dealing with multiple XML sitemaps, maximize the number of URLs per sitemap. The limit per sitemap is 50,000 URLs or 50 MB uncompressed, whichever limit is reached first. You will need to ping Google for each XML sitemap that has been updated, or only once if you use a sitemap index file that is updated.
- One common mistake is to put a small number of URLs into each XML sitemap; this makes it more difficult for Google to download all of the XML sitemaps in a reasonable amount of time.
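For a large site split across multiple sitemaps, a sitemap index file lets you ping Google once for all of them. A minimal sketch, again using a hypothetical domain and file names:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-products.xml</loc>
    <lastmod>2023-09-10</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-blog.xml</loc>
    <lastmod>2023-09-11</lastmod>
  </sitemap>
</sitemapindex>
```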
XML Sitemap Elements and Definitions
<urlset>: this tag is required, and is the document-level element within the sitemap. The remainder of the document after the <?xml version?> declaration has to be contained within this element.
<url>: this should go without saying, but this element is vital and definitely required. This is the parent tag for each individual entry.
<sitemapindex>: also required, and is a document-level element included in the sitemap index. The remainder of the document after the <?xml version> tag should be contained in this as well.
<sitemap>: definitely required. This is the parent element for each individual entry contained inside of the index.
<loc>: required as well. This element provides the full URL of the sitemap or page, including the protocol and trailing slash (if this is required by the website's hosting server). This value needs to be no more than 2,048 characters. Do know that any ampersands in the URL must be escaped as &amp;
<lastmod>: though not required for an XML sitemap, this shows the date on which the file was last modified. It may be given in the full date and time format, or in the date-only format.
<changefreq>: this is not required, but it tells how frequently the web page might change, such as always, hourly, daily, weekly, monthly, yearly, or never. When choosing “always”, this means that the documents will change each time that the website is accessed. “Never” is used when the files are archived, meaning that they will not be changed again in the future. This element is utilized only as a guide for crawlers, and does not determine how frequently a website is indexed, and it does not apply to <sitemap>.
<priority>: not required, but this element will display the priority of a specific URL in relation to other URLs on the website. This element will allow any webmaster to suggest to crawlers which page may be more important. The range is valid from 0.0 to 1.0, where 1.0 is the most important. The default value for this element is 0.5. It should be noted that an attempt to rate all pages on a website as high priority will not affect their listing in the search engines, as it only suggests this to crawlers and how important the pages are in relation to one another on a single website. This does not apply to the <sitemap> elements.
Support for the required elements is widespread, while support for those that are not required will vary across each search engine.
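Putting the elements above together, a single fully populated <url> entry might look like the following (the URL and the values are illustrative only):

```xml
<url>
  <loc>https://www.example.com/products/widget/</loc>
  <lastmod>2023-09-05</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.8</priority>
</url>
```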
More on RSS and Atom Feeds
These feeds should show the most recent updates to your website. Generally, they are small and frequently updated. Also recommended for these feeds:
- Adding the URL and time of modification to the feed when an existing page is meaningfully changed or a new page is added.
- All updates should be in the RSS/Atom feed so that Google will not miss them. A great way to do this is to enlist the help of a hub, such as WebSub (formerly known as PubSubHubbub). A hub can pass the new contents of a page on to all RSS readers and search engines quickly and efficiently.
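A minimal Atom feed carrying one recent update could look like the sketch below; the <updated> timestamps are what tell Google when the content last changed (the domain and titles are hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Site Updates</title>
  <link href="https://www.example.com/"/>
  <updated>2023-09-11T09:00:00Z</updated>
  <id>https://www.example.com/feed</id>
  <entry>
    <title>New pricing page</title>
    <link href="https://www.example.com/pricing/"/>
    <id>https://www.example.com/pricing/</id>
    <updated>2023-09-11T09:00:00Z</updated>
  </entry>
</feed>
```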
Using both XML sitemaps as well as RSS and Atom feeds is a great way to positively modify the crawling of a website for search engines, including Google. The vital information contained in these files is the canonical URL as well as the last time the pages were modified within the website. When both of these elements are used properly, they will notify search engines through the sitemap pings and feed hubs. This all allows the website to be crawled with optimum accuracy, therefore it will be accurately represented within the search results.
Using XML Sitemaps and RSS/Atom Feeds Simultaneously
When a website uses both XML sitemaps and RSS or Atom Feeds, it provides maximum coverage and extended discoverability for search engines. XML sitemaps need to contain only canonical URLs for the site, while the feeds will only contain the latest additions or the URLs that have been recently updated. Canonical URLs are the URLs that visitors will see. Many times the canonical URL will be used to describe the website's homepage.
One may wonder when exactly they should use both XML sitemaps and RSS or Atom feeds for their website. The benefit is that Google will prioritize new or recently updated URLs on your website. Google has noted that by employing RSS, they can be more efficient at keeping their index fresh.
Both protocol and subdomains can impact how URLs contained in a sitemap get crawled and indexed. The URLs that are included in an XML sitemap have to use the same protocol and subdomain that the sitemap itself is using. To be precise, https URLs listed inside a sitemap served over http will not be crawled from that sitemap. In the same vein, a URL on example.domain.com does not belong in the sitemap for www.domain.com. This problem is seen in many websites that employ many subdomains, or that have sections served over both http and https, like an ecommerce site. Many sites have started to change all URLs to https, but do not change the XML sitemaps to reflect this change. It is recommended to check any XML sitemap whose website has been changed recently.
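As an illustration, a sitemap fetched from http://www.example.com/sitemap.xml (a hypothetical address) should not list URLs like these, because neither entry matches the sitemap's own protocol and subdomain:

```xml
<!-- Sitemap fetched from http://www.example.com/sitemap.xml -->
<url>
  <loc>https://www.example.com/checkout/</loc> <!-- protocol mismatch: https vs http -->
</url>
<url>
  <loc>http://shop.example.com/cart/</loc> <!-- subdomain mismatch: shop vs www -->
</url>
```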
Additional XML Sitemap Tips
There will be times when a website has pages in different languages. When this is the case, many webmasters use hreflang. This tag makes it possible to tell Google which pages target which languages, so Google can surface the right pages based on the language or country of the person who is searching. You can either provide the hreflang annotations in each page's HTML code, page by page, or use the XML sitemap to supply them.
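When supplying hreflang through the sitemap, each <url> entry lists its alternate-language versions using xhtml:link elements, with the extra xhtml namespace declared on <urlset>. A sketch for an English/German page pair on a hypothetical domain:

```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://www.example.com/en/contact/</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.example.com/en/contact/"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://www.example.com/de/kontakt/"/>
  </url>
  <url>
    <loc>https://www.example.com/de/kontakt/</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.example.com/en/contact/"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://www.example.com/de/kontakt/"/>
  </url>
</urlset>
```

Note that each page lists every language version, including itself, and the annotations are repeated on all entries in the pair.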
Testing the XML sitemaps or other feeds can be done through Google Search Console (formerly Webmaster Tools). There is a simple button used for this process, and the functionality can find any problems quickly and efficiently.
When choosing to incorporate RSS or Atom feeds alongside a sitemap, these syndication feeds supplement the complete sitemap, since all new information is added to them as it is published and crawled.
Crawling the XML Sitemap
XML sitemaps need to be tested thoroughly before they are actively implemented to make sure that they will run smoothly once they go live. Many people choose to do this by crawling their own sitemaps. While doing this, it is possible to identify any tags that will cause problems, any non-200 header codes, and other issues that may have been overlooked. There are websites available that will crawl a sitemap for the webmaster; it is up to the user to determine how they would like to do this.
XML sitemaps can be used in many ways to maximize the SEO efforts of a website. When you understand exactly how and why XML sitemaps work, you can inform the search engines of all relevant URLs on the website, and know when to use a sitemap versus an RSS or Atom feed. XML sitemaps are fed directly to search engines, so it is vital that they are done right before they go live, especially for websites that are larger or more complex. Ideally, a webmaster would implement both an XML sitemap and a syndication feed in order to ensure that the website has the best structure possible, and that all new content is able to be discovered through the search engines.