
Filter Server Logs to Googlebot

Googlebot is one of the most impersonated bots around. Scraping tools will often use this particular user-agent to bypass restrictions, as not many webmasters want to impede Googlebot’s crawl behaviour.

When you’re doing server log analysis for SEO purposes, you might mistake a scraping tool’s ‘crawl everything’ behaviour for Googlebot behaving erratically and ‘wasting crawl budget’. It’s important not to rely solely on the reported user-agent string for any analysis you’re conducting; combine it with other information to confirm that what you’re seeing is actually Googlebot (and not you auditing the site with Screaming Frog).

In this post I’d like to offer snippets of code you can play around with to achieve this. You probably shouldn’t just copy and paste code from the internet into your command line.

If your dataset is under a million rows, it might be easier to persist with Excel, but you’ll miss out on the authentication magic (unless you’re willing to try this). The section below is recommended if you want to filter an unwieldy dataset down to a manageable size.
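If you’re not sure how many rows you’re dealing with, wc (a default on OSX and Linux, and bundled with GOW) will count them for you:

# count rows across all logs in the current folder
wc -l *.log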

Filtering Server Logs to Googlebot User Agent

First you’ll want to filter down to every request claiming to be Googlebot. This removes legitimate user traffic and cuts down on the number of lookups you have to make later. You need this information for your analysis, anyway.

We’ll be using grep, which is a default utility in OSX and Linux distributions. Other utilities like awk or Sift will also do the job quickly. They’re both significantly faster than Excel for this task.

If you’re on Windows, get GOW (GNU On Windows) to give you access to some of the tools commonly available on Linux distributions and OSX (alternatively, you can use Cygwin). You’ll need to open a terminal window in the folder containing the files you wish to search. Once GOW is installed, you can hold CTRL+SHIFT and right-click in the folder containing the file/s you wish to analyse, which gives you the option to open a command window there.

Grep uses the following format:

grep options pattern input_file_names

If you’re stuck at any point you can type the following into the command line:

grep --help

Right now, we don’t need any of the optional flags enabled:

grep "Googlebot" *.log >> output.log

This will append each line containing ‘Googlebot‘ from the file/s specified into a file in the same folder called output.log. In this case, it would search each of the ‘.log’ files in the current folder (‘*’ is useful if you’re working with a large number of server log files). Note that output.log itself matches ‘*.log’, so re-running the command will pick up its own output; use a different extension if that bothers you. On Windows, file extensions may be hidden by default, but the ‘ls‘ command will reveal their True Names. The double-quote characters are optional delimiters, but are good practice as they work on Mac, Windows, and Linux.

If you just wanted to filter a large dataset down to the Googlebot user-agent, you can stop here.

Filtering Logfiles to Googlebot’s IP

You’ll now have a single file containing all log entries claiming to be Googlebot. Typically SEOs would then validate against an IP range using regex. The following ranges are from chceme.info:

From            To
64.233.160.0    64.233.191.255
66.102.0.0      66.102.15.255
66.249.64.0     66.249.95.255
72.14.192.0     72.14.255.255
74.125.0.0      74.125.255.255
209.85.128.0    209.85.255.255
216.239.32.0    216.239.63.255

We can use this information to parse our server logs with a hideous regular expression:

((\b(64)\.233\.(1([6-8][0-9]|9[0-1])))|(\b(66)\.102\.([0-9]|1[0-5]))|(\b(66)\.249\.(6[4-9]|[7-8][0-9]|9[0-5]))|(\b(72)\.14\.(1(9[2-9])|2([0-4][0-9]|5[0-5])))|(\b(74)\.125\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(209\.85\.(1(2[8-9]|[3-9][0-9])|2([0-4][0-9]|5[0-5])))|(216\.239\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)

For readability:

# Match One of the following C Block IP Ranges
# Avoid matching 3 digit A Block Ranges

((\b(64)\.233\.(1([6-8][0-9]|9[0-1])))|
(\b(66)\.102\.([0-9]|1[0-5]))|
(\b(66)\.249\.(6[4-9]|[7-8][0-9]|9[0-5]))|
(\b(72)\.14\.(1(9[2-9])|2([0-4][0-9]|5[0-5])))|
(\b(74)\.125\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|
(209\.85\.(1(2[8-9]|[3-9][0-9])|2([0-4][0-9]|5[0-5])))|
(216\.239\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))

# Add D Block IP 0-255

\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)

Grep can do this once we invoke the extended regex flag (-E). “egrep” also works as an alias for this:

grep -E "((\b(64)\.233\.(1([6-8][0-9]|9[0-1])))|(\b(66)\.102\.([0-9]|1[0-5]))|(\b(66)\.249\.(6[4-9]|[7-8][0-9]|9[0-5]))|(\b(72)\.14\.(1(9[2-9])|2([0-4][0-9]|5[0-5])))|(\b(74)\.125\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(209\.85\.(1(2[8-9]|[3-9][0-9])|2([0-4][0-9]|5[0-5])))|(216\.239\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)" output.log > genuinearticle.log

This gets us closer, but it’s not Google’s preferred method. It’s also very fragile – if these ranges change, you need to rewrite the expression.
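Incidentally, if you only need the addresses themselves rather than the full log lines, grep’s -o flag prints just the matched portion of each line. A minimal sketch using the same expression (regex_ips.txt is a name I’ve picked):

grep -Eo "((\b(64)\.233\.(1([6-8][0-9]|9[0-1])))|(\b(66)\.102\.([0-9]|1[0-5]))|(\b(66)\.249\.(6[4-9]|[7-8][0-9]|9[0-5]))|(\b(72)\.14\.(1(9[2-9])|2([0-4][0-9]|5[0-5])))|(\b(74)\.125\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(209\.85\.(1(2[8-9]|[3-9][0-9])|2([0-4][0-9]|5[0-5])))|(216\.239\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)" output.log | sort -u > regex_ips.txt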

Reverse DNS Googlebot

Given Google like to jump their services around, they suggest the following:

To verify Googlebot as the caller:

  1. Run a reverse DNS lookup on the accessing IP address from your logs, using the host command.
  2. Verify that the domain name is in either googlebot.com or google.com.
  3. Run a forward DNS lookup on the domain name retrieved in step 1 using the host command on the retrieved domain name. Verify that it is the same as the original accessing IP address from your logs.

They also note:

I don’t think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.

A second DNS lookup using the output of the first acts as verification, but adds complexity.

host & dig

Although other utilities exist (notably nslookup), dig appears to be the most available and reliable across the three platforms. It’s a default on OSX, along with host. dig and host utilities for Windows can be downloaded from ISC.org. If not already included, they can be added to most Debian-based distributions with:

apt-get install dnsutils

Once installed, we can use the utilities for reverse and regular DNS lookups on our list of IP addresses. The host command outputs the following:

host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

dig, using the -x (reverse lookup) and +short modifiers, returns friendlier output that can be fed back into the tool as-is:

dig -x 66.249.66.1 +short
crawl-66-249-66-1.googlebot.com.

#checking against the output
dig crawl-66-249-66-1.googlebot.com. +short
66.249.66.1

Take the list of unique IP addresses you have and run the first query against each. Then run the second, using the output from the first. You can write a loop in your preferred shell to achieve this.
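Here’s a minimal sketch in bash. It assumes the client IP is the first whitespace-separated field of your log format (true for the common and combined formats); unique_ips.txt and lookups.txt are just names I’ve picked:

# 1. Pull the unique IPs claiming to be Googlebot
#    (assumes the IP is the first field of each log line)
awk '{print $1}' output.log | sort -u > unique_ips.txt

# 2. Reverse lookup each IP, then forward lookup the result,
#    recording "original-ip ptr-hostname forward-ip" per line.
#    head -1 takes the first record if several come back.
while read -r ip; do
  ptr=$(dig -x "$ip" +short | head -1)
  [ -n "$ptr" ] && echo "$ip $ptr $(dig "$ptr" +short | head -1)" >> lookups.txt
done < unique_ips.txt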

Afterwards, grep this new data for ‘googlebot.com’ or ‘google.com’. Note the lookups come back in lowercase and grep is case-sensitive by default, so the capitalised versions won’t match; escaping the dots also stops them matching any character.

grep -E "googlebot\.com|google\.com" input > output

As this is a short list, it will usually be quickest to check that each forward lookup returned the original IP, and reintegrate the output, using something like Excel if you are more comfortable there. Enjoy.
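If you’d rather stay in the terminal, grep can do the reintegration too. Assuming you’ve saved the verified addresses one per line in a file (verified_ips.txt is a name I’ve made up), -f reads patterns from that file, -F treats them as fixed strings rather than regex, and -w stops one IP matching as a substring of a longer one:

# keep only log lines from verified Googlebot IPs
grep -wFf verified_ips.txt output.log > genuine.log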

Taking this further

Requests that present themselves as “Googlebot” by user-agent, but don’t come from a verified Googlebot IP, can be saved to an impostors file. The match-invert flag in grep (‘-v’) can do this for us, as sketched below. You’ll also have in your hands IPs with spoofed reverse DNS (if you didn’t prefilter your IP addresses). Your mileage with this information may vary, but it’s useful to have on hand, especially if you want to block them in future.
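A sketch of that, reusing the hypothetical verified_ips.txt from above:

# lines claiming to be Googlebot, but not from a verified IP
grep -vwFf verified_ips.txt output.log > impostors.log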

 
