Host A Website Inside Robots.txt

My previous post talked through a minor undocumented robots.txt rule. I'm fond of niche SEO weirdness, so I was very pleased to hear that Alec Bertram discovered that it's possible to host a website inside a valid robots.txt file. A live example of this contribution to SEO history can be viewed here. What's happening: text following the hash ('#') character is ignored by search engines in robots.txt files. Abusing this rule, Alec made a website that validates as a robots.txt file. Rather than repeat him, I'll just copy the text from his website. What's going on here: when parsing robots.txt files, search engines ignore anything that's after a hashtag in a robots.txt file - this turns it into a comment. However, a crawler will read anything on the page when it doesn't think it's looking at [...]
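
A minimal sketch of the mechanism, using Python's standard urllib.robotparser (the file contents below are illustrative, not Alec's actual file): everything after a '#' is stripped before parsing, so "page content" hidden in comments never interferes with the directives.

from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: the "website" lives in comment lines,
# while the directives remain perfectly valid.
robots_txt = """\
# <h1>Hello from inside robots.txt!</h1>
# <p>Everything after a hash is a comment, so parsers never see it.</p>
User-agent: *
Disallow: /private/
# <p>The directives above still work as normal.</p>
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The comments are ignored; only the directives affect crawling decisions.
print(parser.can_fetch("*", "https://example.com/private/page"))  # False
print(parser.can_fetch("*", "https://example.com/public/page"))   # True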

De-index Pages Blocked in Robots.txt

Sometimes we want things that are already indexed removed from the index. We remove pages from the index using the noindex directive. This directive is implemented through the on-page meta element or, more rarely, through the X-Robots-Tag HTTP header (more here). The standard approach (not mine) for removing pages we don't want crawled from the index is: apply a noindex meta tag to the offending pages. The noindex tag won't be read unless the page is crawled by Googlebot, so leave the page open to crawling and wait for Google to crawl it, hopefully resulting in the pages dropping out of the index. Once the pages drop out of the index, implement the robots.txt restrictions. All the while, the offending pages (pages we don't want crawled) are being crawled. This may include previously [...]
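
If it helps, here's a small standard-library sketch for checking which noindex signal a page is actually serving before the robots.txt restrictions go in. The URL is a placeholder, and the meta-tag regex is a deliberate simplification.

import re
import urllib.request

def noindex_signals(url):
    """Report whether a URL serves noindex via the X-Robots-Tag header or a robots meta tag."""
    with urllib.request.urlopen(url) as response:
        headers = response.headers
        body = response.read().decode("utf-8", errors="replace")

    header_noindex = "noindex" in (headers.get("X-Robots-Tag") or "").lower()
    # Simplified check: assumes name="robots" appears before content="..." in the tag.
    meta_noindex = bool(re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\'][^"\']*noindex',
        body, re.IGNORECASE))
    return {"x-robots-tag": header_noindex, "meta-robots": meta_noindex}

# Placeholder URL for illustration.
print(noindex_signals("https://example.com/offending-page/"))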

Server Logs After Excel Fails – BrightonSEO 2016

This is the write-up of the talk I gave at BrightonSEO on 22nd April 2016. Slides for the talk can be downloaded here, though reading the post is probably a better use of your time. You can link to this page or the homepage ( ͡° ͜ʖ ͡°). About me: I worked at builtvisible between 2011 and 2015, initially as "Junior SEO Executive" and most recently as "Senior Technical SEO Consultant". Since then I've been freelance, mostly working with agencies. So if you're an agency that needs some Technical SEO support... Outline: in this presentation I want to quickly cover the following: talk a little about access logs; talk about some command line tools; show you some ways to apply those tools when dealing with more access logs than Excel can reasonably handle. Assumptions: the [...]
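
The talk itself uses command-line tools, but as a rough illustration of the sort of job they do, here's a Python sketch that counts Googlebot requests per URL. It assumes the common "combined" log format, and access.log is a placeholder filename.

import collections
import re

# Rough parser for the "combined" access log format.
LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

counts = collections.Counter()
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LINE.match(line)
        # Count hits whose user-agent claims to be Googlebot.
        if match and "Googlebot" in match.group("agent"):
            counts[match.group("path")] += 1

# Top 20 most-requested paths.
for path, hits in counts.most_common(20):
    print(hits, path)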

Wayback Machine for Historical Redirect Chains

I'm fairly obsessive about cutting down on redirect chains. One of the biggest challenges in doing this is finding enough historical data. A few developers leave, the ecommerce manager disappears under mysterious circumstances, and the organisation no longer has access to this information. The following technique is very useful once the orthodox sources have been exhausted. You've probably seen archive.org before, and even used the Wayback Machine to diagnose problems. It's one of the best SEO tools available, and one of the best things on the internet. Most of you will be well aware of this screen. Fewer people are aware of this feature: it gives us a list of the unique pages archive.org has catalogued since it's been running. Let's say we're on a migration project, and [...]
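
The screenshots aren't in this excerpt, but the same list of unique URLs can also be pulled programmatically. The sketch below queries the Wayback Machine's CDX API (my assumption about the interface behind that feature), with example.com standing in for the real domain.

import json
import urllib.parse
import urllib.request

# Ask the Wayback CDX API for one row per unique URL it has catalogued.
params = urllib.parse.urlencode({
    "url": "example.com",
    "matchType": "domain",        # include subdomains
    "collapse": "urlkey",         # one row per unique URL
    "fl": "original,statuscode",  # just the fields we care about
    "output": "json",
    "limit": "50",
})
with urllib.request.urlopen("http://web.archive.org/cdx/search/cdx?" + params) as r:
    rows = json.loads(r.read().decode("utf-8"))

# With output=json, the first row is the header row.
for original, statuscode in rows[1:]:
    print(statuscode, original)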

ICANN Drop Your Domain

If you've ever taken a look at more competitive SERPs, you've likely run into the completely bogus whois data that's used to preserve anonymity. This is frustrating but makes sense; no-one wants to be directly linked to spam-fuelled domains, and spammers don't want to link their domains together. Interestingly, this opens the domains up to a serious vulnerability. You're breaking the rules if you provide fake whois information, and breaking these rules (even accidentally) can get your site disabled, and even make your domain name available for purchase by others. In this blog post we're going to report one of my own domains to ICANN (Internet Corporation for Assigned Names and Numbers). Here's the ICANN procedure in brief. For our test, I've registered fakewhois.xyz (you can [...]
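
Before (and after) filing the report, it's worth checking exactly what whois data the domain is serving. Here's a rough standard-library sketch of the whois protocol (RFC 3912); whois.nic.xyz as the registry server for .xyz is my assumption, and the domain queried is simply the post's test domain.

import socket

def whois(domain, server, port=43):
    # RFC 3912: open TCP port 43, send the query plus CRLF,
    # then read until the server closes the connection.
    with socket.create_connection((server, port), timeout=10) as sock:
        sock.sendall((domain + "\r\n").encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

print(whois("fakewhois.xyz", "whois.nic.xyz"))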

Bulk Inspect HTTP Response Headers

There are plenty of SEO reasons you might want to look at HTTP headers. Google love offering them as an alternative implementation for a number of directives, including:

Vary: User-Agent
Canonical
Hreflang Implementation
X-Robots (noindex, nofollow)

Link: <http://es.example.com/>; rel="alternate"; hreflang="es"
Link: <http://www.example.com/>; rel="canonical"
X-Robots-Tag: googlebot: nofollow
Vary: User-Agent

If anyone's doing anything a little sneaky, you can sometimes spot it in the file headers. There are a number of tools that let you inspect single headers, including your browser (press F12 and poke about to get something like the following). When you need to check which pages on a domain aren't returning the correct hreflang headers, this method [...]
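
As a rough sketch of the bulk version (not necessarily the exact method the post goes on to describe), the Python below sends a HEAD request to each URL in a hand-rolled list and prints the headers above. The URLs are placeholders, and some servers will only answer a GET.

import urllib.error
import urllib.request

# Headers worth checking for the directives listed above.
INTERESTING = ("x-robots-tag", "link", "vary", "content-type")

urls = [
    "https://example.com/",
    "https://es.example.com/",
]

for url in urls:
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request) as response:
            print(url, response.status)
            for name, value in response.getheaders():
                if name.lower() in INTERESTING:
                    print("  " + name + ":", value)
    except urllib.error.HTTPError as error:
        print(url, error.code)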