Watch Googlebot Crawling

Server logs have a reverential status among Technical SEOs. We believe they give us information on how Googlebot actually behaves, and let us diagnose issues we otherwise could not uncover. Although we can piece this as-it-happens information together by ordering timestamps, have you known anyone to actually do this? Wouldn’t it be nicer to simply watch Googlebot as it crawls a website instead? Wouldn’t that make a great screensaver for the obsessive?

To get decent mileage out of this article you’ll need access to a Linux or Mac machine, or a Windows machine with the GNU Coreutils installed (the easiest route is GOW). You’ll also need some live server access logs. If you just wish to test this out, you can use a free account on something like Cloud9.io to set up an Apache server, or similar.

The first thing we need is the ability to read new entries in our log file as they come in. Once you’re set up, enter the following command into your terminal:

tail -f /var/log/apache2/access.log

You’ll need to update the filepath to wherever your access log is actually located. ‘tail -f’ monitors the access.log file for changes, displaying new lines in the terminal as they happen. You can use ‘tail -F’ instead if your access.log rotates frequently.

This displays all logged server access activity. To restrict it to something more useful, the output must be ‘piped’ to another command using the ‘|’ character. The output of the first command becomes the input of the second. We’re using grep to limit the lines displayed to those mentioning Googlebot:

tail -f /var/log/apache2/access.log | grep 'Googlebot'

This matches on the user agent. You can read up on filtering access logs to Googlebot for more information, and why relying on the user agent alone for analysis isn’t the best idea. It’s very easy to spoof:

Once you’re running the above command, you can visit the server whilst spoofing your user agent to Googlebot. Your activity will display live in the terminal as you crawl. This means it’s correctly limiting the activity to real and fake Googlebot, which is enough for this demonstration.
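If you do need to check whether a hit came from the real crawler rather than an imitator, the documented approach is a reverse DNS lookup on the requesting IP, followed by a forward lookup to confirm it. A quick sketch (the IP here is just an illustrative one from Google’s crawl range):

host 66.249.66.1
host crawl-66-249-66-1.googlebot.com

The first lookup should return a hostname ending in googlebot.com or google.com, and the second should resolve back to the original IP. If either fails, it isn’t Googlebot.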

We could end the article here. Our terminal is reporting whole access.log lines to us. This can make us feel more like a hacker, but it isn’t particularly useful. What do we actually care about seeing? Right now, I think our live display should be limited to the requested URL and the server response code. Something like:

/ 200
/robots.txt 304
/amazing-blog-post 200
/forgotten-blog-post 404
/forbidden-blog-post 403
/ 200

So we need to constrain our output. We can do this with the AWK programming language, which by default splits each line into fields on whitespace. Say our access log line is as follows:

website.co.uk 173.245.50.107 - - [27/Oct/2015:23:09:05 +0000] "GET /robots.txt HTTP/1.1" 304 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 173.245.50.107

Looking at this example log, we want the 8th and 10th fields as our output:

$1  website.co.uk
$2  173.245.50.107
$3  -
$4  -
$5  [27/Oct/2015:23:09:05
$6  +0000]
$7  "GET
$8  /robots.txt
$9  HTTP/1.1"
$10 304
$11 0
$12 "-"
$13 "Mozilla/5.0
$14 (compatible;
$15 Googlebot/2.1;
$16 +http://www.google.com/bot.html)"
$17 173.245.50.107
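To sketch where this is heading: printing just those two fields means tacking awk onto the existing pipeline, something like the following (--line-buffered stops grep holding back output when it’s in the middle of a pipe; adjust the field numbers if your log format differs):

tail -f /var/log/apache2/access.log | grep --line-buffered 'Googlebot' | awk '{print $8, $10}'

This gives the URL-and-status-code display described above, updating live as Googlebot requests pages.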

Broken Link Destruction for Better Rankings

Like most of my posts, this is not worth implementing in any manner. This is unlimited-budget SEO. Works-in-theory SEO. This is almost make-work.

There are no brakes on the marginal gains train.

Theory

We believe that broken links leak link equity. We also believe that pages provide a finite amount of link equity, and anything hitting a 404 is wasted rather than diverted to the live links.

The standard practice is to swoop in and suggest a resource that you have a vested interest in to replace the one that’s died. There is a small industry dedicated to doing just this. It works, but requires some resource.

If we instead get the broken links removed, the theory goes, we increase the value of every link remaining on the page. In other words, you can increase the value of existing external links to your site by getting the dead links removed from the pages that host them.

You could, of course, do broken link building here instead. You already have a link from the page, and if you have the resources it’s going to be superior in equity terms. But the methodology is the same – and you’ve probably not gone out of your way before to broken link build on pages you already have a link from.

Practice

  • Get a list of URLs linking to your site.
  • Extract and pull headers for every link on these pages.
  • If appropriate, contact the owners of pages hosting dead links or dead redirects and suggest removal.

To make our task as easy as possible we will need a list of broken links with the pages that spawned them.

Get Links

Pull your links from Majestic, Moz, Ahrefs, and Search Console. Crawl these to confirm which are live. Whittle the list down to the ones you actually like – cross-reference with your disavow file and make sure the links only point to live pages on your site. I’d heavily suggest limiting this to followed links if you want any value out of this technique.
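A minimal sketch of the combining and whittling, assuming you’ve exported the linking URLs from each tool into plain one-URL-per-line text files (the filenames here are made up) and pulled the domains from your disavow file into disavow-domains.txt:

cat majestic.txt moz.txt ahrefs.txt gsc.txt | sort -u | grep -viFf disavow-domains.txt > linksilike.txt

sort -u removes duplicates; grep -viFf drops any line containing a string from the disavow list. You’ll still need to check that the remaining links point at live pages on your site.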

Extract Links and Crawl for Errors

We now need to take linksilike.txt and crawl each page, extracting every link as we go. We then need to get the server header responses for each of these links. Writing a script to do this is probably the best way, but the easiest way is to use tools we already have at our disposal – a full crawl is overkill when all we want is the header responses, but it gets the job done.
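If you do go the script route, a rough shell sketch – not production code, assuming linksilike.txt holds one URL per line and only looking at absolute hrefs – might be:

while read -r page; do
  # fetch the linking page and pull out absolute hrefs
  curl -sL "$page" | grep -oiE 'href="https?://[^"]+"' | cut -d'"' -f2 | sort -u |
  while read -r link; do
    # HEAD request each link and record the status code against its source page
    # (curl reports 000 when it cannot connect at all)
    code=$(curl -s -o /dev/null -I -w '%{http_code}' "$link")
    printf '%s\t%s\t%s\n' "$page" "$link" "$code"
  done
done < linksilike.txt > link-status.tsv

Anything in link-status.tsv with a response you aren’t happy with is a candidate for the outreach step.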

At first glance you can’t use Screaming Frog in list mode to do this – you need to crawl a depth one deeper than the list – but you can (correct me if I’m wrong): first switch to list mode, then set the crawl depth limit to one (Configuration > Spider > Limits). If you do it the other way round you’ll end up writing misinformation on the internet.

Let’s pretend we have something against Screaming Frog, so we download Xenu – it’s what was used before Screaming Frog.

This is a heavier task than you might expect. A sample of the links from Search Console to this site yielded ~15k URLs to be checked. There is a 64-bit Xenu beta available if you require more RAM. Here are the settings I’ve found most useful:

Check ‘Check External URLs’ under options.

File > Check URL list (test) to load in your URLs and start crawling.

Once complete, hit CTRL-R a few times to recrawl all the error pages (404s will remain 404s; generic errors may be updated to more useful ones).

The main issue here is that by default Xenu isn’t going to give us the information we want in a nice TSV export. It does, however, offer what we want in the standard HTML report – a list of the URLs that link to us, each with the errors stemming from them. These typically take the following form:

http://01100111011001010110010101101011.co.uk/paid-links/
	http://sim-o.me.uk/
	  \_____ error code: 410 (the resource is no longer available)
	http://icreatedtheuniverse/
	  \_____ error code: 12007 (no such host)
	http://www.carsoncontent.com/
	  \_____ error code: 12007 (no such host)
	http://www.seo-theory.com/
	  \_____ error code: 403 (forbidden request)
	http://www.halo18.com/
	  \_____ error code: 12007 (no such host)

http://seono.co.uk/2013/04/30/whats-the-worst-link-youve-ever-seen/
	http://seono.co.uk/xmlrpc.php
	  \_____ error code: 405 (method is not allowed)
	https://twitter.com/jonwalkerseo
	  \_____ error code: 404 (not found)
	http://www.alessiomadeyski.com/who-the-fuck-is-he/
	  \_____ error code: 404 (not found)
	https://www.rbizsolutions.com.au/
	  \_____ error code: 12029 (no connection)
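If you’d rather have that as a TSV anyway, a small awk sketch can reshape it – assuming you’ve saved the relevant report section as plain text (report.txt is a made-up filename):

awk '
  /error code:/ { sub(/.*error code: */, ""); print page "\t" link "\t" $0; next }
  /^[^ \t]/     { page = $1; next }
  NF            { link = $1 }
' report.txt > broken-links.tsv

The first rule fires on error lines and emits the linking page, the dead link, and the error; the second notes which page we’re currently under (unindented lines); the third remembers the most recent link URL.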

Speeding Up Default WordPress Part 2 – Images

You can read Part One (Speeding Up Default WordPress) here.

Image files are still the bulk of page weight for most blogs. They are the majority of page weight for the average page on the internet.

They will account for an even higher proportion of this page’s weight, given it’s full of screenshots about image weight.

Although it’s possible to squeeze the most speed out by delving into the guts of WordPress and cutting the chaff, for now we’re sticking to the things we can control ourselves, with an emphasis on ease. This article mostly lists errors I wish I hadn’t made with the images on this website, and plugins that do things for us.

Preserve Link Equity With File Aliasing

The standard ‘SEO Friendly’ way to change a URL is with a 301 ‘moved permanently’ redirect. Search engines then attribute value to the destination page. If we believe redirects are lossy, this value is nearly as much as the original’s (assume 85-95%).

If we want optimal squeezing-every-last-drop-out SEO, we’re better off updating a resource on the same URL instead of redirecting that URL to a new location.

Stay with me.

But what if the resources are fundamentally different? Say I’ve enthusiastically converted a PDF to html. The filetypes are different. I’ve got to move from /resources/my-guide.pdf to /resources/my-guide, right?

Not so.

  • Someone requests a .pdf file we have painstakingly converted into html.
  • We serve them the .html version on the original (.pdf) URL.
  • We retain all historical ranking benefit that URL possesses.
  • URL may rank better due to the format change (additional semantic markup possible).
  • URL will likely rank worse for ‘{query} .filetype’ queries.

Sounds great, but how do we “serve them the html version on the original filetype URL”?

Server Response Headers

We could rename file.html to file.pdf, but we’re just going to throw errors if we do that. First we examine the actual server response headers for legitimate files using curl (this can also be done in the browser). This is an actual pdf:

curl -I http://URL1.pdf
HTTP/1.1 200 OK
Date: Wed, 28 Oct 2015 16:35:49 GMT
Content-Type: application/pdf
Content-Length: 51500

And this is an HTML page being an HTML page:

curl -I http://URL2
HTTP/1.1 200 OK
Date: Wed, 28 Oct 2015 16:35:43 GMT
Content-Type: text/html

This is an HTML page masquerading as a pdf on a pdf URL. Note the filesize:

curl -I http://URL3.pdf
HTTP/1.1 200 OK
Date: Wed, 28 Oct 2015 16:35:13 GMT
Content-Type: application/pdf
Content-Length: 4680325

Note the difference? The impostor is being treated as a pdf, and the browser attempts to interpret it accordingly. Given an html file can’t be opened in a pdf viewer, we get an error.

So we need to inform the browser that the pdf is not a pdf.

Overwriting the Server Header

Using the folder’s local .htaccess we overwrite the content type returned in the header for .pdf requests. Create a new .htaccess file and add the following:

AddType text/html .pdf

This tells the server that anything ending in .pdf is a text-based html file. Not a pdf.

curl -I https://ohgm.co.uk/test/chicken2/potato.pdf
HTTP/1.1 200 OK
Date: Wed, 28 Oct 2015 17:40:32 GMT
Content-Type: text/html

PDF to HTML (and SEO)

Last week I read Emma Barnes’ post on the Branded3 blog. It got me thinking. Essentially, pdfs rank fine but are a pain to track properly in Analytics, so translating them into a friendlier format like html is preferable.

Before you start reading the post please note: This is a curiosity (or dead end). This is not a viable ranking strategy. This is a waste of your time.

I initially thought that translating pdf files to complete webpages probably wasn’t worth the time expenditure for developers in most cases. The resource already ranks, right?

Recursively Optimise Images

We know that images account for a lot of the internet’s page weight.

We know that speed is good, and that page weight is not good for speed. We also know that lossless image optimisation exists; that smart people have made it possible to get smaller images of the same perceivable quality at the cost of processing power.

Unfortunately, our standalone content (I have pure “Content Marketing” content in mind here) is often fragmented over a number of directories. Image compression tools (there are many) are often drag-and-drop affairs set to process single images and filetypes by default. This is not good if we’re trying to bake image optimisation into an organisation. When our images live in multiple folders within a project, it’s disheartening for anyone to seek them out to process. This post aims to remedy that.
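As a sketch of the sort of thing this leads to – assuming jpegoptim and optipng are installed and you run it from the project root – the following losslessly recompresses every JPEG and PNG in every subfolder (note that --strip-all also discards EXIF data, so leave it off if you need that):

find . -type f \( -iname '*.jpg' -o -iname '*.jpeg' \) -exec jpegoptim --strip-all {} +
find . -type f -iname '*.png' -exec optipng -o2 {} +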