
Block Googlebot Crawl by Folder Depth

Some sites have deep, deep, duplicative architecture. Usually this is the result of a faceted navigation. This is especially true for enterprise platforms. And like any healthy relationship, you can’t go in expecting them to change. Sometimes you’ll need to admit defeat and use an appallingly ugly but kind of elegant band-aid.

In short – picking the appropriate robots.txt disallow rule from the following can work:

/
/*/
/*/*/
/*/*/*/
/*/*/*/*/
/*/*/*/*/*/
/*/*/*/*/*/*/
/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/ etc...

This blocks crawl by folder depth. You might be thinking “this is awful, why would you ever want to do this?”
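
To make the matching concrete (the example URLs here are hypothetical): because * matches any sequence of characters, including further slashes, a rule containing N slashes matches any URL path containing at least N slashes. So Disallow: /*/*/*/ would block /shoes/mens/red/trainers but leave /shoes/mens/red crawlable – and a trailing slash counts towards the total, so /shoes/mens/red/ would also be caught.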

Example Crawl Distribution

Whether this makes sense for you will depend on Googlebot’s requests by depth. The following crawl distribution is fairly typical for sites where the only landing pages are generated by faceted navigation:

Depth  1: ++++++++++
Depth  2: ++++++++++++++
Depth  3: ++++++++++++++++++++++++++++++++++++++++++++++++++++
Depth  4: ++++++++++++++++++++++++++++++++++++++++++++++
Depth  5: +++
Depth  6: ++++
Depth  7: ++++
Depth  8: ++
Depth  9: +
Depth 10: +
Depth 11: +
Depth 12: +
Depth 13: +
Depth 14: +
Depth 15: +
Depth 16: +
Depth 17: +
Depth 18: +
Depth 19: +
Depth 20: +
Depth 21: +

Typically all the content you could possibly want crawled and indexed will be available at a depth of 3 – 5. Being rather conservative and blocking only the very obvious depths of junk can prevent hundreds of thousands (in some cases millions) of novel pages from being crawled. Googlebot loves novelty, and will continue to visit and revisit these useless pages, increasing the proportion of your site it knows to be junk.
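
If you want to see this distribution for your own site, a rough sketch like the following will produce it from a standard access log. The log path, the combined-log request format, and the simple user-agent check are assumptions – verify genuine Googlebot traffic via reverse DNS if precision matters:

import re
from collections import Counter
from urllib.parse import urlsplit

LOG_PATH = "access.log"  # assumption: combined-format access log, one request per line

# Pull the request path out of the quoted request, e.g. "GET /a/b/c HTTP/1.1".
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*"')

depth_counts = Counter()

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "Googlebot" not in line:  # crude filter; reverse-DNS-verify for accuracy
            continue
        match = REQUEST_RE.search(line)
        if not match:
            continue
        path = urlsplit(match.group("path")).path
        # Folder depth = number of non-empty path segments (query strings ignored).
        depth = len([segment for segment in path.split("/") if segment])
        depth_counts[depth] += 1

# Print one row per depth, scaled to a 50-character bar.
if depth_counts:
    widest = max(depth_counts.values())
    for depth in sorted(depth_counts):
        bar = "+" * max(1, round(50 * depth_counts[depth] / widest))
        print(f"Depth {depth:>2}: {bar} ({depth_counts[depth]})")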

Sign Me Up

As always, you should first consider the value of any external links whose equity you are about to cut off. Links are expensive. You should also be very confident in any ‘Allow:’ rules you are setting up.

Start with a longer pattern than you think is necessary. You can always remove a ‘*/‘ if you need to.
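
For illustration, a conservative first pass might look something like this – the chosen depth and the Allow exception are hypothetical, so adapt them to your own crawl data and test before relying on the Allow taking precedence:

User-agent: Googlebot
# Hypothetical starting depth: blocks any path containing seven or more slashes
# (roughly six-plus folders deep).
Disallow: /*/*/*/*/*/*/
# A longer, more specific Allow for a deep path you still want crawled;
# Google generally honours the more specific matching rule.
Allow: /en/products/shoes/mens/trainers/sale/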

3 thoughts on “Block Googlebot Crawl by Folder Depth”

  1. Assuming you’ve got full technical control of the site, is there ever any point blocking in robots.txt? The pages can still be indexed and the blocked pages still soak up the juice.

    In the case of a faceted navigation it’s far better to stop them being found (remove href, use JS event).

    Or sod the crawl budget and let pages be crawled but noindex them so that juice still flows.

    Thoughts?
    Jon

    1. Hey Jon,

      “Assuming you’ve got full technical control of the site, is there ever any point blocking in robots.txt?” – it’s a big assumption. If you did have full technical control (and there weren’t associated development costs with making changes), then you’d not need to block with robots.txt because your house would be in order. Usually it’s worth having as a safety net. As you’ll know, it’s a lot easier to push through with clients, and it’s one of the quickest SEO changes to reverse. Even in the perfect development scenario we still have Google’s crawling from memory to account for, and there’s little downside to being zealous with robots.txt so long as you keep the site’s external links in mind.

      “The pages can still be indexed and the blocked pages still soak up the juice.” – If I understand correctly, the linked-to pages you’re blocking can in some sense ‘soak up juice’, but the novel yet-to-be-encountered URLs aren’t going to (because there is no chance for Googlebot to encounter them). It’s this deeper, duplicative stuff we’re so concerned with.

      “In the case of a faceted navigation it’s far better to stop them being found (remove href, use JS event).” – Strongly agree, but would be cautious about JS solutions (I’ve had a few change from spec in the development process into being quite SEO-unfriendly).

      “Or sod the crawl budget and let pages be crawled but noindex them so that juice still flows.” – I know a few people are fond of this, but it’s not something I’ve seen working as well as restricting crawl. The smaller the site is, the more viable letting it flow could be (but I’d bet against it). A site that’s generating a few million URLs from a set of hundreds of URLs-we-care-about is more problematic – if we’re liberal with crawl then Googlebot spends the vast majority of its time crawling junk.

      Hope that makes sense!
      Thanks for the comment,
      Oliver

      1. Hi Oliver

        Great points and I think I agree based on your scenarios.

        Although:

        “If I understand correctly, the linked-to pages you’re blocking can in some sense ‘soak up juice’, but the novel yet-to-be-encountered URLs aren’t going to (because there is no chance for Googlebot to encounter them).”

        They will – they get the juice when Google sees they’re linked to, not when Google crawls them. The only way they’d not get juice is if Google never saw them because the parent level was blocked from crawling, i.e. blocking level 6 from crawl will stop levels 7–100 being crawled (if that’s the only way in).

        Is that what you meant? If so I agree!

        Cheers for your reply
        Jon.
