Recently one of my clients wanted to generate a site map using an online generator that has a limit of 500 pages per site map. Their site isn’t particularly large and easily has fewer than 500 pages on it. But when we attempted to generate the map it hit the 500 page limit – very mysterious! We decided to open up the document and see what was going wrong.
Fluff and Nonsense
There were a few culprits. The first was a handful of misspelt links, which were easy to track down using one of the many broken link finders on the web.
The second one was the WordPress blog that was hosted on the same domain: there were categories and tags and multiple pages alongside all of the individual posts and whatnot. We decided that only about half of this was useful and that the rest, specifically the categories and tag pages, should be ignored if possible.
And last but not least there were the search results. The main website contains a search page that brings up the profiles of various business affiliates. Because the site contains links to different search results, and those search results themselves link to related searches, the site map ended up with hundreds of different complicated search configurations, which were mostly unnecessary duplicates of very simple searches.
Solution #1 – robots.txt
The blog issue was resolved by using the robots.txt file to instruct robots on what they should be looking at or, more specifically, what they shouldn’t.
User-Agent: *
Disallow: /blog/category/
Disallow: /blog/tag/
This basically says “Hey, all robots, please stay away from the /blog/category/ and /blog/tag/ pages okay?”.
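If you want to check rules like these before you upload them, here is a minimal sketch using Python's standard library robots.txt parser. The bot name "ExampleBot" and the sample blog paths are made up for illustration, and a real crawler may read the file slightly differently.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt above, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blog/category/
Disallow: /blog/tag/
"""


def is_allowed(path, user_agent="ExampleBot"):
    """Return True if the rules above let `user_agent` crawl `path`.

    "ExampleBot" is a hypothetical crawler name; the "*" group applies to it
    because no more specific group exists.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Hypothetical paths: two that should be blocked, one that should not.
    for path in ("/blog/category/news/", "/blog/tag/seo/", "/blog/my-first-post/"):
        print(path, "allowed" if is_allowed(path) else "blocked")
```

Rules match by path prefix, so everything underneath /blog/category/ and /blog/tag/ is covered, while individual posts stay crawlable.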
Okay, that’s one bunch of pages dealt with!
Solution #2 – Robots meta tag
Next we want to stop search results pages from being indexed – but only for certain results. We decided to allow very simple one-parameter searches to be indexed. But we can’t use a robots.txt file for this because we need to decide whether the page should be indexed as it loads. So what we can do is use the robots meta tag.
First of all we use PHP to look at the query parameters and decide whether this page should be indexed. If the answer is no, we put this in the page’s <head>:
<meta name="robots" content="noindex,follow">
This tag tells robots that they should not index the page. Notice also that we have the word “follow” in there – that’s because we still want robots to follow and index the links on the page, just not this page itself.
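The decision itself can be very small. We did ours in PHP, but here is a rough sketch of the same idea in Python, assuming "simple" means at most one non-empty query parameter. The function names, the /search URLs and the parameter names are invented for the example and are not the client's real ones.

```python
from urllib.parse import urlsplit, parse_qs

# The tag from above, emitted only for pages we do not want indexed.
NOINDEX_TAG = '<meta name="robots" content="noindex,follow">'


def should_index(url, max_params=1):
    """Return True for simple searches (at most `max_params` parameters).

    Assumption: complexity is judged purely by the number of distinct,
    non-empty query parameters. parse_qs drops blank values by default and
    groups repeated keys, so "?a=1&a=2" counts as one parameter.
    """
    params = parse_qs(urlsplit(url).query)
    return len(params) <= max_params


def robots_meta_tag(url):
    """Return the noindex tag for complex searches, or "" for simple ones."""
    return "" if should_index(url) else NOINDEX_TAG


if __name__ == "__main__":
    # Hypothetical search URLs.
    for url in ("/search?category=plumbers",
                "/search?category=plumbers&region=north&sort=name"):
        print(url, "->", robots_meta_tag(url) or "(no tag, index as normal)")
```

Whatever robots_meta_tag returns gets printed inside the page’s <head>, so simple searches come out with no tag at all and are indexed as normal.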
And that is how you talk to robots.