Remediate.Co

Frequently Asked Questions About Sitemaps

Last Edited September 11, 2023 by Garenne Bigby in Sitemaps


Sitemaps are one of the easiest ways for webmasters to tell major search engines which pages on their website are open for crawling and indexing. Put simply, a sitemap is an XML file that lists the URLs of a site along with metadata about each URL, so that search engines can crawl the site more accurately. This metadata can include how often a page changes, when it was last updated, and its priority in relation to other pages on the site.
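To make the structure concrete, here is a minimal sketch of generating such a file with Python's standard library. The function name build_sitemap and the dictionary layout are assumptions made for illustration; the element names (urlset, url, loc, lastmod, changefreq, priority) and the namespace come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Return sitemap XML as UTF-8 bytes.

    entries is a list of dicts; 'loc' is required, the other keys are optional.
    (Hypothetical helper, not part of any library.)
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                # ElementTree entity-escapes the text when serializing.
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    print(build_sitemap([
        {"loc": "http://www.sample.com/", "lastmod": "2016-11-09",
         "changefreq": "daily", "priority": 0.8},
    ]).decode("utf-8"))
```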

Search engine crawlers generally discover pages by following links within a site and from other sites. A sitemap supplements this: crawlers that support the protocol can gather every URL listed in the sitemap and learn about it from the associated metadata. Using a sitemap does not guarantee that a web page will be included in a search engine's results, but it gives the crawler important data so that it can work more accurately. Whether you create the sitemap from scratch or with a sitemap generator, there are guidelines to follow so that the sitemap leads to a successful crawl and successful indexing.


How should time be specified?

You should use W3C Datetime encoding for the lastmod timestamps and for all other dates and times used within this protocol, for example 2016-11-09T14:12:14+00:00. This encoding is a profile of ISO 8601 that allows you to leave out the time component, so 2016-11-09 is also valid. If your website changes frequently, especially if it changes daily, it is best practice to include the time component so that crawlers get more complete information about when each page was last modified.
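Both forms can be produced from a script. The sketch below uses Python's datetime module; the helper names w3c_datetime and w3c_date are made up for this example.

```python
from datetime import datetime, timezone

def w3c_datetime(dt):
    """Full W3C Datetime with time and UTC offset, e.g. 2016-11-09T14:12:14+00:00."""
    if dt.tzinfo is None:
        # The full form needs an offset, so naive datetimes are rejected here.
        raise ValueError("dt must be timezone-aware")
    return dt.replace(microsecond=0).isoformat()

def w3c_date(dt):
    """Date-only form, e.g. 2016-11-09."""
    return dt.date().isoformat()

if __name__ == "__main__":
    stamp = datetime(2016, 11, 9, 14, 12, 14, tzinfo=timezone.utc)
    print(w3c_datetime(stamp))
    print(w3c_date(stamp))
```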


How are URLs represented within a sitemap?

As in any XML file, all data values in a sitemap (URLs included) have to use entity escape codes for the ampersand, single quote, double quote, greater-than, and less-than characters. All URLs also need to follow the RFC 3986 standard for URIs, the RFC 3987 standard for IRIs, and the XML standard. When you use a script to generate URLs, you can normally URL-escape them as part of the script. Even so, it is still necessary to entity-escape them afterwards.
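The two escaping steps can be sketched as follows. The function name sitemap_loc is hypothetical, and the set of characters treated as safe is an assumption: it keeps URL delimiters and existing percent-escapes untouched, which means a stray literal % would not be encoded.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

def sitemap_loc(url):
    """URL-escape a URL, then entity-escape it for use inside <loc>."""
    # Step 1: percent-encode anything outside the characters kept as "safe".
    # Non-ASCII characters are encoded as UTF-8 bytes first.
    url_escaped = quote(url, safe=":/?#[]@!$&'()*+,;=%~")
    # Step 2: entity-escape &, <, > and both quote characters for XML.
    return escape(url_escaped, {"'": "&apos;", '"': "&quot;"})

if __name__ == "__main__":
    print(sitemap_loc("http://www.sample.com/\u00fcmlat.php?q=name&id=1"))
```

If the sitemap is written with an XML library that escapes text on its own, only the first step should be applied by hand; otherwise the ampersands would be escaped twice.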


How big can the sitemap be?

Sitemaps should not be any larger than 10MB and may contain a maximum of 50,000 URLs. (The current sitemaps.org protocol has since raised the size limit to 50MB uncompressed, so 10MB is a conservative target.) These limits are in place to make sure that web servers do not get overwhelmed by serving extremely large files. If a website contains more than the maximum number of URLs, or its sitemap would be larger than the size limit, you will need to create more than one sitemap file and then list these files in a sitemap index file. It is a good idea to use a sitemap index file even if your website is currently small but you plan to grow it, meaning that it could eventually exceed the URL or size limit. A sitemap index file can hold up to 1,000 sitemaps and cannot go over 10MB (50,000 sitemaps and 50MB under the current protocol). Sitemaps can be compressed using gzip.
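Splitting a long URL list into files that respect the 50,000-URL limit could look like the sketch below. The function name and the sitemap-{n}.xml naming pattern are assumptions for illustration; the byte-size limit is not checked here and would need a separate test on the generated files.

```python
MAX_URLS_PER_SITEMAP = 50_000  # URL limit per sitemap file

def split_into_sitemaps(urls,
                        base="http://www.sample.com/sitemap-{n}.xml",
                        size=MAX_URLS_PER_SITEMAP):
    """Map each sitemap location to the slice of URLs it should contain.

    The returned locations are what a sitemap index file would list.
    (Hypothetical helper; the naming pattern is only an example.)
    """
    return {
        base.format(n=start // size + 1): urls[start:start + size]
        for start in range(0, len(urls), size)
    }

if __name__ == "__main__":
    pages = [f"http://www.sample.com/page{i}" for i in range(120_000)]
    for location, chunk in split_into_sitemaps(pages).items():
        print(location, len(chunk))
```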


Does it matter which character coding method is used when generating a Sitemap file?

Absolutely. A sitemap file has to use UTF-8 encoding. In addition, URLs themselves may only contain a limited set of characters, so anything outside that set, such as accented letters and some punctuation marks, has to be URL-escaped before it is placed in the sitemap.


Where should the sitemap be placed?

It is highly recommended that your sitemap be placed in the root directory of your web server, so that it looks like http://www.sample.com/sitemap.xml. There may be situations in which you want separate sitemaps for different paths on the website, for example if security permissions in your organization split write access between directories. It is assumed that if you have permission to upload http://www.sample.com/path/sitemap.xml, you also have permission to provide metadata for URLs under http://www.sample.com/path/. All of the URLs listed within the sitemap have to be on the same host as the sitemap. If the sitemap is located at http://www.sample.com/sitemap.xml, it cannot include URLs from http://www.subdomain.sample.com. If the sitemap is located at http://www.sample.com/folder/sitemap.xml, it cannot include URLs from http://www.sample.com.
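The location rule can be checked mechanically. The following is a minimal sketch with a hypothetical function name; it also requires the same protocol (http or https), which is what the sitemaps.org documentation asks for in addition to the same host.

```python
import posixpath
from urllib.parse import urlsplit

def in_sitemap_scope(sitemap_url, page_url):
    """Return True if page_url may be listed in the sitemap at sitemap_url."""
    sitemap, page = urlsplit(sitemap_url), urlsplit(page_url)
    # Same protocol and same host (host names compare case-insensitively).
    if (sitemap.scheme, sitemap.netloc.lower()) != (page.scheme, page.netloc.lower()):
        return False
    # The page must live in or below the directory that holds the sitemap.
    base_dir = posixpath.dirname(sitemap.path)
    if not base_dir.endswith("/"):
        base_dir += "/"
    return (page.path or "/").startswith(base_dir)

if __name__ == "__main__":
    print(in_sitemap_scope("http://www.sample.com/folder/sitemap.xml",
                           "http://www.sample.com/folder/page.html"))
```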


How does the “priority” tag in the XML sitemap influence the ranking of a page in search results?

The “priority” tag in the sitemap only indicates the importance of a page in relation to the other URLs on your own site and does not affect the ranking of the page in search results. In short, giving a URL a high priority will not place it ahead of pages from other websites. It only tells search engines which of your own pages you consider most important.


What happens after my sitemap is created?

Once the sitemap has been created, you will need to let search engines know about it. This is done by submitting it directly, pinging the search engine, or adding the location of the sitemap to your robots.txt file. Once the search engine has been notified of the sitemap, it is able to crawl and index the URLs listed in it. Indexed pages are stored by the search engine and returned when a search query is performed for which the site is a relevant match.
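For the robots.txt route, the file needs a line of the form "Sitemap: " followed by the full sitemap URL. A small sketch of adding that line from a script is shown here; the function name is hypothetical and the robots.txt content is passed in as a string rather than read from disk.

```python
def add_sitemap_directive(robots_txt, sitemap_url):
    """Return robots.txt content with a Sitemap line for sitemap_url.

    The content is returned unchanged if the line is already present.
    """
    directive = f"Sitemap: {sitemap_url}"
    lines = robots_txt.splitlines()
    if any(line.strip().lower() == directive.lower() for line in lines):
        return robots_txt
    lines.append(directive)
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(add_sitemap_directive("User-agent: *\nDisallow:\n",
                                "http://www.sample.com/sitemap.xml"))
```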


Do URLs within the sitemap need to be absolutely specified?

In short, yes. You will need to include the protocol (http or https) within your URL. You will also need to use a trailing slash in the URL if the web server requires one. As an example, http://www.sample.com/ is an acceptable URL for a sitemap, while www.sample.com is not.
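A quick way to catch URLs that are missing the protocol is to parse them before writing the sitemap. This is a sketch with an assumed helper name, using only Python's standard library.

```python
from urllib.parse import urlsplit

def is_absolute_url(url):
    """Return True if url has an http or https protocol and a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

if __name__ == "__main__":
    for candidate in ("http://www.sample.com/", "www.sample.com", "/page.html"):
        print(candidate, is_absolute_url(candidate))
```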


Do I need to include the frameset URLs or the URLs of the frame contents when pages on my site use frames?

Yes, absolutely include both URLs for these. By including both, you are ensuring accuracy and success for your sitemap(s).


Do I need to list both http and https versions of URLs?

No. It is only necessary to list one version of a URL within the sitemap. Including multiple versions of the same URL can result in fragmented crawling of the site. When choosing which version to use, check whether you have canonicalized the URLs and go with the canonical version. Staying consistent avoids confusion about which version of each page should be used.
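If a URL list has been collected from several sources, it may contain both versions of the same page. The sketch below collapses them to one; it assumes that https is the canonical version and that both versions serve the same content, which you would need to confirm for your own site. The function name is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(urls, scheme="https"):
    """Rewrite every URL to one protocol and drop the resulting duplicates."""
    seen, result = set(), []
    for url in urls:
        # Replace only the protocol; host, path and query stay as they are.
        canonical = urlunsplit(urlsplit(url)._replace(scheme=scheme))
        if canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result

if __name__ == "__main__":
    print(canonicalize([
        "http://www.sample.com/a",
        "https://www.sample.com/a",
        "http://www.sample.com/b",
    ]))
```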


My website has millions of URLs. Can I submit only those that have recently changed?

Yes. You can list the URLs that change frequently in a small number of sitemaps and then use the lastmod tag within the sitemap index file to flag those sitemaps. This makes it possible for search engines to periodically crawl only the sitemaps that have changed. Limiting the number of URLs that are crawled also saves bandwidth. That being said, you may submit as many sitemaps and sitemap index files as you feel necessary, as long as each one meets all requirements, does not contain too many URLs (50,000), and is not too big (10MB).
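A sitemap index file with a lastmod value per sitemap could be generated along these lines. This is a sketch: the function name and the example file names are assumptions, while the sitemapindex, sitemap, loc and lastmod elements are defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap_index(sitemaps):
    """Return sitemap index XML as UTF-8 bytes.

    sitemaps is an iterable of (loc, lastmod) pairs, where lastmod is a
    W3C Datetime string saying when that sitemap file last changed.
    """
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc, lastmod in sitemaps:
        node = ET.SubElement(index, "sitemap")
        ET.SubElement(node, "loc").text = loc
        ET.SubElement(node, "lastmod").text = lastmod
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    print(build_sitemap_index([
        ("http://www.sample.com/sitemap-recent.xml", "2016-11-09T14:12:14+00:00"),
        ("http://www.sample.com/sitemap-archive.xml", "2016-01-01"),
    ]).decode("utf-8"))
```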


If URLs on the site have a session ID, should it be removed?

Yes, you will need to remove the session ID from the URL. Leaving it in can hinder the crawling of the site or cause the same page to be crawled redundantly. A session ID is specific to a single visit and often changes with each visit to the website, so a URL that contains one is not stable, and that is the source of the problems it may cause. The same page can always be reached through a URL that has no session ID attached, and that is the URL to list.
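Stripping session IDs from query strings can be sketched as below. The parameter names in SESSION_PARAMS are assumptions and should be replaced with whatever your own application uses; session IDs embedded in the path (for example after a semicolon) are not handled by this sketch.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed parameter names; adjust to match your own application.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def strip_session_id(url):
    """Return url without any session ID query parameters."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in SESSION_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

if __name__ == "__main__":
    print(strip_session_id("http://www.sample.com/page.php?id=7&PHPSESSID=abc123"))
```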


Does the location of a URL within a sitemap influence how it is used in the sitemap?

No, the position of a URL within a sitemap does not influence how it is used or interpreted by search engines or crawlers. URLs placed at the beginning of a sitemap do not hold more importance than those placed further down the list. The only thing that influences their priority is the priority tag, when it is attached to the URL in the proper way, as discussed above.


Do sitemaps have to be gzipped?

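No. Under the sitemap protocol gzip compression is optional, although it is a good way to save bandwidth when serving large sitemaps. Gzip is a free compression format and utility that is used for both compressing and decompressing files. Compressing a sitemap does not get around the size limits: they apply to the uncompressed file, so each sitemap may still contain no more than 50,000 URLs and be no larger than 10MB before compression.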
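Compressing a sitemap from a script takes a single call in Python. The sketch below assumes, as the sitemaps.org documentation states, that the size limit is measured on the uncompressed file; the function name and the default limit (taken from the 10MB figure used in this article) are illustrative.

```python
import gzip

def gzip_sitemap(xml_bytes, max_uncompressed=10 * 1024 * 1024):
    """Return gzip-compressed sitemap bytes.

    Raises ValueError if the uncompressed sitemap exceeds the size limit,
    because compression does not lift that limit.
    """
    if len(xml_bytes) > max_uncompressed:
        raise ValueError("uncompressed sitemap exceeds the size limit")
    return gzip.compress(xml_bytes)

if __name__ == "__main__":
    packed = gzip_sitemap(b"<urlset></urlset>")
    print(len(packed), "bytes compressed")
```

The result would typically be saved as sitemap.xml.gz and referenced under that name.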


Is there an XML schema that can be used to validate an XML sitemap?

Yes, this schema can be found at http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd for sitemaps, and http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd for sitemap index files.


Where can I find answers to other questions about the sitemap protocol?

It would be best to review documentation provided by each individual search engine regarding the use and submission of sitemaps. It is likely that each one has specific tips and tricks to get the most out of the sitemaps that you are creating.


How is lastmod date computed?

For static files, this is the date on which the file itself was last updated. You can use the GNU date command (the version found on most Linux systems) to retrieve this date:

$ date --iso-8601=seconds -u -r /home/foo/www/bar.php
>> 2016-10-26T08:56:39+00:00

For many dynamic URLs, the lastmod date can be computed easily from the time the underlying data was last changed, or approximated from any periodic updates where applicable. Even an approximate date or timestamp helps crawlers avoid fetching URLs that have not changed since they were last crawled. The result is lower bandwidth and CPU requirements for your web servers. Lastmod stands for last modified, and it is especially helpful for pages that rarely change, so that bandwidth is not wasted on crawling information that is not new.
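The static-file lookup can also be done inside a sitemap generator script. The following sketch reads a file's modification time and formats it as a W3C Datetime in UTC; the function name and the example path are placeholders.

```python
import os
from datetime import datetime, timezone

def file_lastmod(path):
    """Return the file's last modification time as a W3C Datetime string in UTC."""
    mtime = os.path.getmtime(path)  # seconds since the epoch
    stamp = datetime.fromtimestamp(mtime, tz=timezone.utc)
    return stamp.replace(microsecond=0).isoformat()

if __name__ == "__main__":
    # This script's own path is used here so that the example runs anywhere.
    print(file_lastmod(__file__))
```

For a dynamic page, the same formatting could be applied to the timestamp of the most recently changed record behind that page.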


All in All

Creating a successful sitemap is just one part of the process. After it has been created, you will need to submit it and then allow it to be crawled and indexed by search engines. Once your sitemap has been processed, you have a better chance of having the site discovered and reaching your target audience. The more accurate metadata you supply, the better the chance that your site will be presented by the search engine for a relevant search query. Do not get overwhelmed by all of the rules and standards that are in place. They exist so that sitemaps keep working as well as they can and continue to benefit their users. Creating sitemaps isn't difficult, and there are generators available on the internet. Whichever route you take for creating the sitemap, do not forget the important guidelines, and double-check that they are being followed. In no time you will be creating sitemaps that get you crawled and indexed successfully.

Author: Garenne Bigby
Website: http://garennebigby.com
Founder of DYNO Mapper and Former Advisory Committee Representative at the W3C.
