Sitemaps are the best and easiest way for webmasters to inform major search engines about the pages on their websites that are open for crawling and indexing. Put simply, a sitemap is an XML file listing the URLs of a site, along with metadata about each URL so that search engines can crawl the site more intelligently. This metadata can include how often a page changes, when it was last updated, its priority in relation to other pages on the site, and more.
Search engine crawlers generally discover pages by following links from site to site. A sitemap supplements this process: crawlers that support the protocol can gather all of the URLs from the sitemap and learn about them along with their associated metadata. Using a sitemap does not guarantee that a page will be included in a search engine's results, but it provides valuable data that helps the crawler work more accurately. Certain guidelines exist so that the sitemaps webmasters create remain helpful in the long run. Whether you are creating the sitemap from scratch or with a sitemap generator, remember that these guidelines exist to ensure a successful sitemap, which in turn leads to a successful crawl and successful indexing.
You should use W3C Datetime encoding for lastmod time stamps and all other dates and times used within this protocol, for example 2016-11-09T14:12:14+00:00. This encoding also allows you to omit the time component of the ISO 8601 format, so 2016-11-09 would be valid as well. Note that if your website changes frequently, it is best practice to include the time component so that crawlers can gather more precise information about when the site was last modified, especially if it changes daily.
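As a sketch, a minimal sitemap showing both timestamp styles might look like the following (the URLs are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Full W3C Datetime with time component and UTC offset,
       appropriate for pages that change frequently -->
  <url>
    <loc>http://www.sample.com/news/</loc>
    <lastmod>2016-11-09T14:12:14+00:00</lastmod>
  </url>
  <!-- Date-only form, valid when the time of day is not significant -->
  <url>
    <loc>http://www.sample.com/about/</loc>
    <lastmod>2016-11-09</lastmod>
  </url>
</urlset>
```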
As with all XML files, any data values in a sitemap (URLs included) have to use entity escape codes for specific characters: ampersand (&), single quote ('), double quote ("), greater than (>), and less than (<). All URLs also need to follow the RFC-3986 standard for URIs, the RFC-3987 standard for IRIs, and the XML standard. When you use a script to generate URLs, you can normally URL-escape them as part of the script. Even so, it is still necessary to entity-escape them for the XML file.
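For example, a raw URL containing an ampersand must have that character entity-escaped before it is placed in the sitemap (the URL here is a placeholder):

```xml
<!-- Raw URL:      http://www.sample.com/view?cat=books&id=17 -->
<!-- Escaped form: the & becomes &amp; inside the XML -->
<url>
  <loc>http://www.sample.com/view?cat=books&amp;id=17</loc>
</url>
```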
Sitemaps should be no larger than 10MB and may contain a maximum of 50,000 URLs. These limits are in place to make sure that web servers do not get overwhelmed by serving extremely large files. If a website contains more than the maximum number of URLs or its sitemap is larger than 10MB, you will need to create more than one sitemap file and then list those files in a sitemap index file. It is suggested to use a sitemap index file even if your website is currently small but you plan to grow it larger, meaning it could eventually include more than 50,000 URLs or exceed 10MB. A sitemap index file can hold up to 1,000 sitemaps but cannot itself go over 10MB. Sitemaps can be compressed using gzip.
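A sitemap index file simply lists the locations of the individual sitemap files. A minimal sketch, with placeholder URLs, might look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.sample.com/sitemap1.xml</loc>
    <lastmod>2016-11-09</lastmod>
  </sitemap>
  <!-- Individual sitemaps listed in an index may be gzip-compressed -->
  <sitemap>
    <loc>http://www.sample.com/sitemap2.xml.gz</loc>
    <lastmod>2016-10-01</lastmod>
  </sitemap>
</sitemapindex>
```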
Absolutely: a sitemap file has to use UTF-8 encoding. This is a requirement of the protocol, and it ensures that every crawler reading the sitemap interprets all characters consistently, including characters beyond basic alphanumerics such as punctuation marks and non-ASCII letters.
It is highly recommended that you place your sitemap at the root directory of your HTML server, so that it looks like http://www.sample.com/sitemap.xml. There may be situations in which you want to place different sitemaps for different paths on the website, for example if security permissions in your organization separate write access to various directories. It is assumed that if you have permission to upload http://www.sample.com/path/sitemap.xml, you also have permission to provide metadata under http://www.sample.com/path/. All of the URLs listed within a sitemap have to be on the same host as the sitemap. If the sitemap is located at http://www.sample.com/sitemap.xml, it cannot include URLs from http://www.subdomain.sample.com. If the sitemap is located at http://www.sample.com/folder/sitemap.xml, it cannot include URLs from http://www.sample.com.
The “priority” tag in the sitemap indicates the importance of a page only in relation to the other URLs on your own site; it does not affect the page's ranking in search results. In short, marking a URL with a high priority does not give it priority over pages on other websites across the internet, only over the other pages contained within your own site.
Once the sitemap has been created, you will need to let search engines know about it. This is done by submitting it directly, pinging the search engine, or adding the location of the sitemap to your robots.txt file. Once a search engine is notified of the sitemap, it can crawl and index the pages it lists. Once indexed, those pages can be returned when a search query is performed and the site is a relevant match.
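The robots.txt route is the simplest of the three: a single Sitemap line anywhere in the file tells supporting crawlers where to find it. A sketch, with a placeholder URL:

```
# robots.txt
Sitemap: http://www.sample.com/sitemap.xml
```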
In short, yes. You will need to include the protocol (http or https) in your URL, and you will need to include a trailing slash if your web server requires one. For example, http://www.sample.com/ is a valid URL for a sitemap, while www.sample.com is not.
Yes, absolutely include both URLs in this case. By including both, you ensure that your sitemap accurately reflects your site.
No. It is only necessary to list one version of a URL in the sitemap. Including multiple versions of the same URL can result in fractured crawling of the site. When choosing which one to use, consider whether you have canonicalized any of the URLs and go with the canonical version. Consistency works in your favor and avoids confusion over which version of a page to use.
It is possible to list only the URLs that change frequently in a small number of sitemaps and then use the lastmod tag within the sitemap index file to point to those sitemaps. This makes it possible for search engines to periodically crawl only the sitemaps that have changed. Limiting the number of URLs that are recrawled also benefits your bandwidth. That said, you may submit as many sitemaps and sitemap index files as you feel necessary, as long as each one meets the requirements: no more than 50,000 URLs and no larger than 10MB.
Yes, you should remove session IDs from your URLs. Including them can hinder crawling of the site or make it redundant, since crawlers may see the same page under many different URLs. A session ID often changes with each visit to the website, so a URL containing one is not stable, hence the problems it can cause. It is almost always possible to obtain a URL for the same page without a session ID attached, and that is the URL you should list.
No, the position of a URL within a sitemap does not influence how it is used or interpreted by search engines or crawlers. URLs placed at the beginning of a sitemap do not hold more importance than those further down the list. The only thing that influences a URL's priority is the priority tag attached to it in the proper way, as discussed earlier.
Yes, sitemaps may be compressed with gzip to save bandwidth when they are served. Gzip is a free utility (and file format) used for both compressing and decompressing files. Note, however, that the size limits apply to the uncompressed file: no more than 50,000 URLs and no larger than 10MB.
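As a sketch of the compression step, the portable form of gzip writes the compressed copy to a new file while leaving the original in place (the sitemap.xml created here is a placeholder purely for demonstration):

```shell
# Create a placeholder sitemap purely for this demonstration.
printf '%s' '<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"></urlset>' > sitemap.xml

# -c writes compressed data to stdout, leaving sitemap.xml intact.
gzip -c sitemap.xml > sitemap.xml.gz

# -t verifies that the compressed file is valid gzip data.
gzip -t sitemap.xml.gz
```

The resulting sitemap.xml.gz can then be served (and listed in a sitemap index) in place of the uncompressed file.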
Yes, this schema can be found at http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd for sitemaps, and http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd for sitemap index files.
It would be best to review documentation provided by each individual search engine regarding the use and submission of sitemaps. It is likely that each one has specific tips and tricks to get the most out of the sitemaps that you are creating.
For static files, the lastmod value is the date the actual file was last updated. You can use the UNIX date command to retrieve this date:
$ date --iso-8601=seconds -u -r /home/foo/www/bar.php
For many dynamic URLs, you can compute a lastmod date easily based on when the underlying data last changed, or you can use an approximation based on periodic updates where applicable. Even an approximate date or timestamp helps crawlers avoid re-crawling URLs that have not changed since they were last visited, which reduces the bandwidth and CPU requirements of your web servers. Lastmod stands for "last modified," and it is especially helpful for pages that rarely change, so that bandwidth is not wasted crawling information that is not new.
Creating a successful sitemap is just a portion of the process. After it has been created, you will need to submit it and allow it to be crawled and indexed by search engines. Once your sitemap has been indexed, you have a better chance of having your site discovered and reaching your target audience. Note that the more metadata you supply, the better the chance that your site will be presented by the search engine for a relevant search query. Do not be overwhelmed by all of the rules and standards; they exist so that sitemaps keep working as well as they can and continue to benefit their users. Creating sitemaps isn't difficult, and there are generators available on the internet to help. Whichever route you decide to take for creating the sitemap, just don't forget the important guidelines, and double-check that they are being followed. In no time you will be creating sitemaps that get your site crawled and indexed successfully.