A post on scraping the Wayback Machine for URLs to redirect.
If you’ve ever taken a look at more competitive SERPs, you’ve likely run into the completely bogus whois data that’s used to preserve anonymity. This is frustrating but makes sense; no-one wants to be directly linked to spam-fuelled domains, and spammers don’t want to link their domains together.
Interestingly, this opens the domains up to a serious vulnerability.
You’re breaking the rules if you provide fake whois information, and breaking these rules (even accidentally) can get your site disabled, and even make your domain name available for purchase by others.
In this blog post we’re going to report one of my own domains to ICANN (Internet Corporation for Assigned Names and Numbers). Here’s the ICANN procedure in brief:
For our test, I’ve registered fakewhois.xyz (you can register a site on this questionable TLD for $1 on Namecheap, with free WhoisGuard):
We aren’t using the free whoisguard. Post registration, the whois records are updated with my registrar to the following:
Mr. Fake
123 Fake Street
Springfield; NA W8 1BF
GB
tel: +44.1234567891
fax: +1.1234567891
[email protected]
The majority of fake whois data is set up with completely fake accounts, which do not forward to a monitored account. As a result they are vulnerable to this method.
The email used will forward to my own, but it will be ignored. We want to see how the registrar acts, and if they raise the issue of the reported fake data at the registrar account level (rather than simply contacting [email protected]).
Domains under WHOIS protection still reveal their information to ICANN, just not the general public. Still, since my plan is to anonymously snitch on myself here, I make this information public:
Once this fake information is public and verified with external whois services, we report the site to ICANN:
Date of submission: 24.11.2015
One week after submission, ICANN respond with the following:
Thank you for submitting a Whois inaccuracy complaint concerning the domain name http://fakewhois.xyz. Your report has been entered into ICANN's database. For reference your ticket ID is: OUM-161-68604.
A 1st Notice will be sent to the registrar, and the registrar will have 15 business days to respond.
For more information about ICANN's process and approach, please visit http://www.icann.org/en/resources/compliance/approach-processes.
Sincerely,
ICANN Contractual Compliance
There are plenty of SEO reasons you might want to look at HTTP headers. Google love offering them as an alternative implementation for a number of directives, including:
- Vary: User-Agent
- Hreflang implementation
- X-Robots-Tag (noindex, nofollow)
Link: <http://es.example.com/>; rel="alternate"; hreflang="es"
Link: <http://www.example.com/>; rel="canonical"
X-Robots-Tag: googlebot: nofollow
Vary: User-Agent
If anyone’s doing anything a little sneaky, you can sometimes spot it in the file headers.
There are a number of tools that let you inspect single headers, including your browser (press F12 and poke about to get something like the following).
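If you'd rather stay in the terminal, here is a rough sketch that keeps only the headers mentioned above. The function name seo_headers is my own invention, it assumes curl is installed, and it only looks for Link, X-Robots-Tag and Vary.

```sh
# Hypothetical helper: keep only the SEO-relevant response headers
# (Link, X-Robots-Tag, Vary) from raw HTTP headers read on stdin.
seo_headers() {
    # HTTP header lines end in CRLF; drop the CR so the output is clean,
    # then match the header names case-insensitively.
    tr -d '\r' | grep -iE '^(link|x-robots-tag|vary):'
}
```

Something like `curl -sI http://www.example.com/ | seo_headers` should then print just those lines. Note that `-I` sends a HEAD request, and some servers answer HEAD differently from GET.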
A few months ago I bought a cheap tablet running Android and Windows 10 (no regrets so far). With this came the desire to run full versions of Windows-specific applications portably, using either a tethered connection or readily accessible WiFi.
This is impractical, given that tablets have limited on-board storage. We may only have 8GB to work with. But SD cards are cheap (I’ve seen branded 128GB micro SD cards for £35 at the moment).
The release of Windows 10 included a disabled ‘install to SD card‘ feature pegged for a future release, so I was unable to write this post until then. The ‘Threshold 2‘ or ‘November‘ update re-enabled this feature.
Server logs have a reverential status among Technical SEOs. We believe they give us information on how Googlebot actually behaves, and let us diagnose issues we otherwise could not uncover. Although we can piece this as-it-happens information together by ordering timestamps, have you known anyone to actually do this? Wouldn’t it be nicer to simply watch Googlebot as it crawls a website instead? Wouldn’t that make a great screensaver for the obsessive?
To get decent mileage out of this article you’ll need access to a Linux or Mac installation, or a Windows machine with the GNU Core Utils installed (the easiest thing is GOW). And you’ll need some live server access logs. If you just wish to test this out, you can use a free account on something like Cloud9.io to set up an apache server, or similar.
The first thing we need is an ability to read new entries to our log file as they come in. Once you are comfortable, enter the following command into your terminal:
tail -f /var/log/apache2/access.log
You’ll need to update the filepath with wherever your current access log is located. ‘tail -f‘ monitors the access.log file for changes, displaying them in the terminal as they happen. You can also use ‘tail -F‘ if your access.log often rotates.
This displays all logged server access activity. To restrict it to something more useful, the information must be ‘piped’ to another command, using the ‘|’ character. The output of the first command becomes the input of the second command. We’re using grep to limit the lines displayed to only those mentioning Googlebot:
tail -f /var/log/apache2/access.log | grep 'Googlebot'
This should match the user agent. You can read up on filtering access logs to Googlebot for more information, and on why relying on the user agent alone for analysis isn’t the best idea. It’s very easy to spoof:
Once you’re running the above command, you can visit the server whilst spoofing your user agent to Googlebot. Your activity will display live in the terminal as you crawl. This means it’s correctly limiting the activity to real and fake Googlebot, which is enough for this demonstration.
We could end the article here. Our terminal is reporting to us whole access.log lines. This can make us feel more like a hacker, but it isn’t particularly useful. What do we actually care about seeing? Right now, I think our live display should be limited to the requested URL, and the server header response code. Something like:
/ 200
/robots.txt 304
/amazing-blog-post 200
/forgotten-blog-post 404
/forbidden-blog-post 403
/ 200
So we need to constrain our output. We can do this with the AWK programming language. By default this parses text into fields using the space separator. Say our access log is as follows:
website.co.uk 22.214.171.124 - - [27/Oct/2015:23:09:05 +0000] "GET /robots.txt HTTP/1.1" 304 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 126.96.36.199
Looking at this example log, we want the 8th and 10th fields as our output:
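A minimal sketch of pulling those two fields out, assuming your log matches the example line above. The leading hostname column shifts everything along by one compared with a stock combined log, so check your own field numbers. The function name url_and_status is just for illustration. The Googlebot match is done inside awk rather than with a separate grep, because grep buffers its output when writing to a pipe and the display stops feeling live.

```sh
# Hypothetical helper: print "URL status" for each Googlebot request.
# Assumes the requested URL is field 8 and the status code is field 10,
# as in the example log line.
url_and_status() {
    awk '/Googlebot/ { print $8, $10 }'
}
```

It can then sit on the end of the live feed: `tail -f /var/log/apache2/access.log | url_and_status`. If you prefer to keep grep in the chain, GNU grep has a `--line-buffered` option for exactly this situation. Some awk builds also buffer what they read from a pipe, so if nothing shows up, that is the first thing I would check.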
Like most of my posts this is not worth implementing in any manner. This is unlimited budget SEO. Works in theory SEO. This is almost make-work.
There are no brakes on the marginal gains train.
We believe that broken links leak link equity. We also believe that pages provide a finite amount of link equity, and anything hitting a 404 is wasted rather than diverted to the live links.
The standard practice is to swoop in and suggest a resource that you have a vested interest in to replace the one that’s died. There is a small industry dedicated to doing just this. It works, but requires some resource.
If we instead get the broken links removed, the theory goes, we increase the value of all the links remaining on the page. You can increase the value of external links to your site by getting dead links on the pages that host them removed.
You could, of course, broken link-build here instead. You already have a link, and if you have the resources it’s going to be superior in equity terms. But the methodology is the same – and you’ve probably not gone out of your way before to broken linkbuild on pages you already have a link from.
- Get a list of URLs linking to your site.
- Extract and pull headers for every link on these pages.
- If appropriate, contact the pages hosting dead links or dead redirects suggesting removal.
To make our task as easy as possible we will need a list of broken links with the pages that spawned them.
Pull your links from Majestic, Moz, Ahrefs, and Search Console, then crawl them to find the ones that are still live. Whittle these down to the ones you actually like: cross reference with your disavow file and make sure the links are only going to live pages on your site. I’d heavily suggest limiting this to followed URLs if you want any value out of this technique. Save what’s left as linksilike.txt.
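The disavow cross-reference can be roughly sketched as below. The function name drop_disavowed and the file names are mine. It assumes the usual disavow file layout (one `domain:example.com` or one full URL per line, `#` for comments), that a `domain:` line should also cover subdomains, and that every linking URL includes its scheme.

```sh
# Hypothetical helper. $1: file of linking URLs, one per line.
# $2: disavow file. Prints the linking URLs that are not disavowed.
drop_disavowed() {
    awk '
        { sub(/\r$/, "") }                      # tolerate Windows line endings
        FILENAME == ARGV[1] {                   # first file read: the disavow list
            if ($0 == "" || $0 ~ /^#/) next     # skip blanks and comments
            if ($0 ~ /^domain:/) {
                sub(/^domain:/, "")
                domains[tolower($0)] = 1
            } else {
                urls[$0] = 1
            }
            next
        }
        $0 == "" { next }
        {
            if ($0 in urls) next                # this exact URL is disavowed
            split($0, parts, "/")               # http://host/path: parts[3] is the host
            host = tolower(parts[3])
            sub(/:.*$/, "", host)               # drop any port
            while (host != "") {                # test the host, then each parent domain
                if (host in domains) next
                if (!sub(/^[^.]*\./, "", host)) break
            }
            print
        }
    ' "$2" "$1"
}
```

For example, `drop_disavowed alllinks.txt disavow.txt > linksilike.txt`, where alllinks.txt stands in for whatever you exported from the tools above.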
Extract Links and Crawl for Errors
We now need to take linksilike.txt and crawl each page, extracting every link as we go. We then need to get the server header responses from each of these links. Although writing a script to do this is probably the best way, the easiest way is to use tools we already have at our disposal. They’re overkill when all we need is the server header responses, but they work.
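For the scripted route, here is a rough sketch of what that could look like with curl. All three function names and the 10-second timeout are my own choices, and the link extraction is a crude pattern match rather than a real HTML parser, so treat it as a starting point only.

```sh
# Hypothetical helpers; names and timeouts are illustrative.

# Read HTML on stdin and print each absolute http(s) link found in an
# href="..." attribute, de-duplicated. Relative links are ignored.
extract_links() {
    grep -oiE 'href="https?://[^"]+"' \
        | sed -E 's/^[^"]*"//; s/"$//' \
        | LC_ALL=C sort -u
}

# Print the status code for a URL using a HEAD request.
# curl reports 000 when it gets no response at all (dead host, timeout).
status_of() {
    curl -s -o /dev/null -I --max-time 10 -w '%{http_code}' "$1"
}

# For one linking page, print: page <TAB> link <TAB> status.
check_page() {
    curl -s --max-time 10 "$1" | extract_links | while read -r link; do
        printf '%s\t%s\t%s\n' "$1" "$link" "$(status_of "$link")"
    done
}
```

Running `while read -r page; do check_page "$page"; done < linksilike.txt > results.tsv` would give one row per outbound link, which you can then filter for 4xx, 5xx and 000 responses. Some servers mishandle HEAD requests, so it is worth re-checking anything suspicious with a normal GET before emailing anyone.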
I originally wrote that you can’t use Screaming Frog in list mode to do this. You can. You need to crawl one level deeper than the list: first switch to list mode, then set the crawl depth limit to one (Configuration > Spider > Limits). If you do it the other way round you’ll end up writing misinformation on the internet.
Let’s pretend we have something against Screaming Frog, so we download Xenu – it’s what was used before Screaming Frog.
This is a heavier task than you might expect. A sample of the links from Search Console to this site yielded ~15k URLs to be checked. There is a 64-bit Xenu beta available if you require more RAM. Here are the settings I’ve found most useful:
Check ‘Check External URLs’ under options.
File > Check URL list (test) to load in your URLs and start crawling.
Once complete, hit CTRL-R a few times to recrawl all the error pages (404s will remain so, generic errors may be updated to more useful ones).
The main issue here is that by default Xenu isn’t going to give us the information we want in a nice TSV export. It does, however, offer us what we want in the standard HTML report – we get a list of URLs that link to us, each followed by the errors that stem from it. They typically take the following form:
http://01100111011001010110010101101011.co.uk/paid-links/
  http://sim-o.me.uk/
    \_____ error code: 410 (the resource is no longer available)
  http://icreatedtheuniverse/
    \_____ error code: 12007 (no such host)
  http://www.carsoncontent.com/
    \_____ error code: 12007 (no such host)
  http://www.seo-theory.com/
    \_____ error code: 403 (forbidden request)
  http://www.halo18.com/
    \_____ error code: 12007 (no such host)

http://seono.co.uk/2013/04/30/whats-the-worst-link-youve-ever-seen/
  http://seono.co.uk/xmlrpc.php
    \_____ error code: 405 (method is not allowed)
  https://twitter.com/jonwalkerseo
    \_____ error code: 404 (not found)
  http://www.alessiomadeyski.com/who-the-fuck-is-he/
    \_____ error code: 404 (not found)
  https://www.rbizsolutions.com.au/
    \_____ error code: 12029 (no connection)
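If you copy that part of the report into a text file, one item per line, something like the following could turn it into the TSV we wanted in the first place. This is only a sketch: the function name xenu_to_tsv is made up, and it relies on one assumption about the report, namely that every broken link is immediately followed by its `error code:` line while the linking page is not.

```sh
# Hypothetical helper: read Xenu-style report text on stdin and print
# linking page <TAB> broken link <TAB> error, one row per broken link.
xenu_to_tsv() {
    awk '
        { sub(/\r$/, ""); gsub(/^[ \t]+|[ \t]+$/, "") }   # trim each line
        $0 == "" { next }                                 # skip blank lines
        /error code:/ {
            code = $0
            sub(/^.*error code:[ \t]*/, "", code)         # keep "404 (not found)"
            if (pending != "") print source "\t" pending "\t" code
            pending = ""
            next
        }
        {
            # A URL that was not followed by an error line was a linking page.
            if (pending != "") source = pending
            pending = $0
        }
    '
}
```

Usage would be along the lines of `xenu_to_tsv < report.txt > broken-links.tsv`, with report.txt being whatever you named the pasted text.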