Is your XML sitemap giving too much to Google?

An XML sitemap is one of the most important tools for making sure search engines can find and index the pages of your website. It’s like a roadmap that gives search crawlers directions to every nook and cranny of your site. But sometimes you may not want every page of your site to be listed in search results. Fortunately, there are ways to control what Google indexes and what it displays in search results.

In this post, we map out a few important steps that ensure Google is indexing only the content you want public and is not indexing content you want to keep private. 

Understanding the XML sitemap

There are two kinds of sitemap: HTML sitemaps and XML sitemaps. The HTML sitemap is usually linked from the footer of a website. It is an extensive list of the key pages on the site, but not every page (individual articles or news items are typically left out). It exists to get visitors to the right general area of the website so they can more easily navigate to the specific page they want. The XML sitemap, on the other hand, typically lists everything. By default, every published page on a website will be included in the XML sitemap. This is the sitemap that gets submitted to search engines to give them a complete list of the pages that should be indexed.
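To make the format concrete, here is a minimal sketch of what an XML sitemap contains, built with Python’s standard library. The example.com URLs and the build_sitemap helper are illustrative assumptions, not taken from any particular sitemap generator. Real sitemaps often carry extra optional fields such as a last-modified date.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL.

    Hypothetical helper for illustration only.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Placeholder URLs (assumptions for the example).
example_sitemap = build_sitemap([
    "https://example.com/",
    "https://example.com/blog/first-post/",
])
print(example_sitemap)
```

Each page gets its own url entry, which is why an unconfigured generator can end up listing far more than you intended.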

Pages that should not be in a sitemap

Every website has at least a few pages that shouldn’t be visible to the general public and therefore should not be included in the XML sitemap. For instance, the page that administrators use to log in to the website shouldn’t appear in search results. Many automatic sitemap generators are already set up to skip those pages, though some need to be configured to do so.

In some cases, a website has pages that aren’t administrative but that you still might not want showing up in search results. This is especially common with WordPress sites because WordPress automatically creates pages for elements like tags, categories, and authors. And if you’re using a tool like Yoast SEO, which offers XML sitemap creation in its feature set, it will include every published item in the sitemap unless configured otherwise.

How to exclude pages from the sitemap

Before submitting a sitemap to search engines, scan it to make sure it contains only the pages that should appear in search results. Some XML sitemap generators let you exclude pages by type, and some let you exclude individual pages.
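If the sitemap is large, scanning it by eye gets tedious. The following is a rough sketch of how that check could be automated in Python. The list of unwanted path fragments and the sample sitemap are assumptions for illustration, so adjust them to match the URL structure of your own site.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical path fragments that should not appear in the sitemap.
UNWANTED_FRAGMENTS = ("/wp-admin/", "/wp-login.php", "/tag/", "/author/")

# Sample sitemap with placeholder URLs (assumption for the example).
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/first-post/</loc></url>
  <url><loc>https://example.com/tag/news/</loc></url>
  <url><loc>https://example.com/author/admin/</loc></url>
</urlset>"""


def find_unwanted_urls(sitemap_xml, fragments=UNWANTED_FRAGMENTS):
    """Return sitemap URLs that contain any of the unwanted fragments."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", NS)
        if loc.text
    ]
    return [url for url in urls if any(f in url for f in fragments)]


flagged = find_unwanted_urls(SAMPLE_SITEMAP)
for url in flagged:
    print("Review before submitting:", url)
```

A simple substring match like this is deliberately crude. It only flags candidates for review, and the decision about what to exclude still belongs in the sitemap generator’s settings.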

For WordPress/Yoast users

The Yoast SEO plugin can exclude pages by type, but the toggles that include or exclude each type of content from the sitemap can be tricky to find if you’re not familiar with the tool. If that’s you, here is where to find those settings.

In the Yoast menu, choose Search Appearance.

[Screenshot: Yoast sidebar menu]

The toggles to include or exclude your content from the sitemap can be found in the Content Types, Taxonomies, and Archives tabs. To exclude a type of content (e.g. categories, tags, authors) from search results, find the toggle that shows that content type in search results and switch it to “No”, which will remove any content of that type from the XML sitemap.

[Screenshot: Yoast Search Appearance screen]

More ways to exclude content from search results

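For content that should remain hidden from the public eye, simply excluding it from the XML sitemap is not sufficient, because search engines can still discover a page through links. A robots.txt file tells search engine crawlers which paths not to crawl, using the “Disallow” directive. Keep in mind that robots.txt controls crawling rather than indexing: a disallowed URL can still show up in search results if other pages link to it, and the file itself is publicly readable. To reliably keep a page out of search results, use a “noindex” robots meta tag and leave the page crawlable so search engines can see the tag. Genuinely private content should sit behind a login.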
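One way to confirm which URLs a set of robots.txt rules blocks from crawling is Python’s standard urllib.robotparser module. The sketch below is only an illustration. The rules, the paths, and the is_crawl_allowed helper are assumptions, and the check covers crawl rules only, not whether a page ends up in search results.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt rules (placeholder paths, assumptions for the example).
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /private/
"""


def is_crawl_allowed(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given rules allow user_agent to crawl url.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = is_crawl_allowed(ROBOTS_TXT, "https://example.com/private/report.html")
allowed = is_crawl_allowed(ROBOTS_TXT, "https://example.com/blog/first-post/")
print("private page crawlable:", blocked)
print("blog post crawlable:", allowed)
```

Running a check like this against the URLs in your sitemap can also reveal contradictions, such as a page that is listed in the sitemap but disallowed in robots.txt.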

In Conclusion

While search engines do index sites based on what they discover with their crawlers, the XML sitemap and robots.txt file give website administrators a certain measure of control over what gets indexed and what doesn’t. But it’s important to make sure these tools are correctly configured and you’re not giving Google more than you should.