Block Googlebot Crawl by Folder Depth

Some sites have deep, deep, duplicative architecture. Usually this is the result of a faceted navigation. This is especially true for enterprise platforms. And like any healthy relationship, you can't go in expecting them to change. Sometimes you'll need to admit defeat and use an appallingly ugly but kind of elegant band-aid. In short, picking the appropriate robots.txt disallow rule from the following can work:

/
/*/
/*/*/
/*/*/*/
/*/*/*/*/
/*/*/*/*/*/
/*/*/*/*/*/*/
/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/
etc...

This blocks [...]
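To make that concrete, here is a minimal robots.txt sketch for one rung of the ladder above; the user-agent and the choice of depth are illustrative assumptions, not a recommendation for any particular site:

# robots.txt (sketch) - assumes the duplication only starts four folders deep
User-agent: Googlebot
# Five literal slashes. Googlebot treats "*" as "any run of characters"
# (slashes included), so this matches any path containing five or more
# slashes - roughly, URLs nested four or more folders deep.
Disallow: /*/*/*/*/

Each rung effectively caps the crawlable folder depth; whichever rule you pick, test it against a sample of real URLs in a robots.txt tester before deploying it.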

Crawl > Indexation

One of the most pervasive SEO beliefs I encounter is roughly "if your links aren't getting indexed, they might as well not exist". This causes people to try and get their links indexed, either by using a link indexing service or by blasting links to their links. Although this behaviour can be useful, I think the idea that motivates it is wrong, for a few reasons. If you believe your links count for more when they're indexed, please consider the following:

Look in your webmaster tools (sorry, "Search Console"). Download all the 'links to your site'. Check the indexation status of all of these links (a rough way to spot-check this is sketched below). You now have a list of links that Google knows about, but which aren't indexed.

Have an unsuccessful reconsideration request. Look at the example links. Check to see if they're indexed. Think about deindexed [...]
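For the 'check the indexation status' step, a rough bash sketch along these lines can work for spot-checking. Assumptions: links.txt holds one linking URL per line from the export, and Google's "did not match any documents" message still appears for zero-result site: queries. Google throttles automated searches quickly, so this is only fit for a handful of URLs, not a whole link profile:

# Spot-check indexation of a few URLs via site: queries (fragile by design).
while read -r url; do
  result=$(curl -s -G -A "Mozilla/5.0" --data-urlencode "q=site:${url}" "https://www.google.com/search")
  if echo "$result" | grep -q "did not match any documents"; then
    echo "NOT INDEXED: ${url}"
  else
    echo "probably indexed: ${url}"
  fi
  sleep 10   # be polite; you'll still hit a captcha eventually
done < links.txt

For anything beyond a spot-check, a proper indexation-checking tool is the saner option.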

Blocking and Verifying Applebot

Earlier today Apple confirmed the existence of their web crawler, Applebot. This means that we'll be seeing it crop up a little more in server log analysis.

Filtering Server Logs to Applebot

As anyone crawling the web can spoof their user agent to Applebot, we can use the IP range Apple have given us to separate genuine visits from the rogue ones. Currently, legitimate Applebot visits will come from an IP between 17.0.0.0 and 17.255.255.255. The actual range is probably substantially smaller than this. We can pull what we need from our server logs using the following {linux|mac|cygwin} commands in our bash terminal.

First, filter to everyone claiming to be Applebot:

grep 'Applebot' access.log > apple.log

Then, filter to the 17.[0-255].[0-255].[0-255] IP range:

grep -E '17\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)' [...]
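The two steps can also be collapsed into one pass. A sketch, assuming the default common/combined log format (client IP as the first field) and arbitrary filenames; anchoring on the start of the line also avoids accidentally matching addresses like 217.x.x.x:

# keep lines that claim to be Applebot AND originate from 17.0.0.0/8
grep 'Applebot' access.log | grep -E '^17\.' > applebot.log

# quick sanity check: which URLs is (apparently genuine) Applebot requesting?
# $7 is the request path in common/combined log format
awk '{print $7}' applebot.log | sort | uniq -c | sort -rn | head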

Server Logs, Subdomains, ccTLDs

Server logs have a few major drawbacks, one of which I hope to address today. It's not an elegant solution, but it (more or less) works. Firstly, please read this post for an overview of server logfile analysis for SEO and you'll hopefully see where I'm coming from. I think access logs are probably the best source of information available for diagnosing onsite SEO issues.

A Problem

If you have a little experience with server logs, you've probably encountered the following:

188.65.114.122 - - [30/Sep/2013:08:07:05 -0400] "GET /resources/whitepapers/retail-whitepaper/ HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
188.65.114.122 - - [30/Sep/2013:08:07:06 -0400] "GET /resources/whitepapers/retail-whitepaper/ HTTP/1.1" 301 "-" "Mozilla/5.0 (compatible; [...]
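Two near-identical requests, one returning 200 and one 301, and nothing in either line to say which hostname served them: the stock common/combined log format simply doesn't record the subdomain or ccTLD. If you control the server, one way to get that information into the logs is to include the virtual host in the log format. A minimal Apache sketch, using the vhost_combined format that ships with Debian/Ubuntu Apache configs:

# Prepend the virtual host (and port) to each line so requests to
# different subdomains / ccTLDs are distinguishable.
# %v is the canonical ServerName; use %{Host}i instead to log the raw
# Host header if a single vhost answers for many subdomains.
LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
CustomLog ${APACHE_LOG_DIR}/access.log vhost_combined

nginx can do the equivalent by adding $host to a custom log_format directive.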