Server logs have a few major drawbacks, one of which I'd like to address today. It's not an elegant solution, but it (more or less) works. First, please read this post for an overview of server logfile analysis for SEO, and you'll hopefully see where I'm coming from. I think access logs are probably the best source of information available for diagnosing onsite SEO issues.
If you have a little experience with server logs, you’ve probably encountered the following:
18.104.22.168 - - [30/Sep/2013:08:07:05 -0400] "GET /resources/whitepapers/retail-whitepaper/ HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
22.214.171.124 - - [30/Sep/2013:08:07:06 -0400] "GET /resources/whitepapers/retail-whitepaper/ HTTP/1.1" 301 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Server logfiles provide us with the URI Stem (the portion of the URL after the host and any port number), rather than the full URL. For this blog post, that would be:
As logs give us URI references rather than URLs, you are essentially getting everything from the third slash onward (the one that follows the host name).
One of my clients has all of their ccTLDs configured so that they share server logs. In a format like the one above, server logs do not allow us to see which domain serviced the request. If you're dealing with a site with a blog.domain.com setup, you won't be able to tell if the main site or the blog serviced the request from the URI reference alone. The same goes for the http:// and http://www. versions.
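To make that concrete, here is a minimal sketch that pulls the requested path and the status code out of each log line with awk. It assumes the space-separated layout of the sample lines above, where the path is field 7 and the status is field 9. The function name `uri_and_status` is hypothetical, and a custom log format would need different field numbers.

```shell
# uri_and_status: read log lines on stdin, print "<path> <status>" per line.
# Assumes the layout of the sample lines: path in field 7, status in field 9.
uri_and_status() {
  awk '{ print $7, $9 }'
}
```

Note that nothing in either field identifies the host, which is the gap the rest of this post works around.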
I use the following method to gain some insight.
Firstly, cut down your server logs to size using whatever tools you’re comfortable with. I like grep or the filters in Gamut Log Parser.
grep "Googlebot" filename.log > filteredoutput.log
From this output, deduplicate by requested URI.
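One way to do the deduplication on the command line is sketched below. It assumes the same field positions as the sample log lines (path in field 7, status in field 9) and keeps the first line seen for each path. The function name `dedupe_by_uri` and the output file name are made up for illustration.

```shell
# dedupe_by_uri: print "<path> <status>" for the first occurrence of each path.
# Usage (hypothetical file names): dedupe_by_uri filteredoutput.log > deduped.txt
dedupe_by_uri() {
  # seen[$7]++ is 0 (false) the first time a path appears, so !seen[$7]++ is true once.
  awk '!seen[$7]++ { print $7, $9 }' "$1"
}
```

Keeping only the first line per path discards later requests that returned a different status. If you would rather keep every distinct path and status pair, key the array on `$7 FS $9` instead of `$7`.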
Prepend whichever subdomain or ccTLD (or protocol: http or https) you are interested in checking to each URI in this list, then acquire status codes for the resulting URLs. For most SEOs, this will be done using Screaming Frog in list mode, or SEOtoolsforexcel.
Compare the status codes in the server logs with the status codes you've just acquired, using something like the EXACT function in Excel:
If a URL returns FALSE, its status codes differ, so the request in the logs was probably not served by the subdomain you checked. Use this method to filter out undesirable subdomains and ccTLDs from your dataset.
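If you would rather stay on the command line, the step above might look like the sketch below. Both function names are hypothetical. `prefix_host` builds a URL list you could paste into a list-mode crawler, while `live_statuses` is a rough curl-based stand-in for that crawler, not a replacement for Screaming Frog. The curl flags used are `-s` (silent), `-o /dev/null` (discard the body) and `-w '%{http_code}'` (print the response code).

```shell
# prefix_host: read paths on stdin, print full URLs for the given origin.
# Usage (hypothetical): cut -d' ' -f1 deduped.txt | prefix_host "http://blog.domain.com"
prefix_host() {
  awk -v origin="$1" '{ print origin $0 }'
}

# live_statuses: read paths on stdin, request each one against the given origin,
# and print "<path> <status>". Redirects are not followed, so a 301 stays a 301.
live_statuses() {
  while IFS= read -r path; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$1$path")
    printf '%s %s\n' "$path" "$code"
  done
}
```

Run it once per subdomain, ccTLD or protocol you want to test, and keep each result in its own file.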
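The same comparison can be sketched outside Excel. The example below assumes two hypothetical text files, each holding one "path status" pair per line: one taken from the logs and one from checking a single subdomain. The function name `compare_statuses` is made up for illustration, and the TRUE/FALSE column mirrors what EXACT would give you.

```shell
# compare_statuses: $1 = "<path> <status>" pairs from the logs,
#                   $2 = "<path> <status>" pairs from the live check.
# Prints "<path> <log status> <live status> TRUE|FALSE" for paths present in both.
compare_statuses() {
  awk 'NR == FNR { logged[$1] = $2; next }
       $1 in logged {
         print $1, logged[$1], $2, (logged[$1] == $2 ? "TRUE" : "FALSE")
       }' "$1" "$2"
}
```

Piping the output through `grep FALSE` would list the paths whose logged status does not match the subdomain you checked.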
Use this as you will.
This method does rely on error handling being more or less functional on the target domain (if *.domain.tld/example always returns a 200 OK, you won't have as much luck).