Earlier today Apple confirmed the existence of their web crawler, Applebot. This means we’ll be seeing it crop up a little more often in server log analysis.
Filtering Server Logs to Applebot
As anyone can spoof their user agent to “Applebot” while crawling the web, we can use the IP range Apple have given us to validate these rogue visits. Currently, legitimate Applebot visits will come from an IP between 17.0.0.0 and 17.255.255.255 (the actual range is probably substantially smaller than this). We can pull the lines we need from our server logs using the following commands in a bash terminal (Linux, Mac, or Cygwin):
First, filter to everyone claiming to be Applebot:
grep 'Applebot' access.log > apple.log
Then, filter to the 17.[0-255].[0-255].[0-255] IP range:
grep -E '^17\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)' apple.log > apple4real.log
It’s not quite perfect, but it does the job. Note the leading ^ anchor: without it, an IP like 217.0.0.1 would slip through. It assumes the client IP is the first field on each log line (as in Apache’s common and combined log formats), so adjust accordingly if your format differs.
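To get a feel for how much user-agent spoofing is going on, you can compare the line counts of the two files:

# requests claiming to be Applebot vs. those verified against Apple's 17.0.0.0/8 range
wc -l apple.log apple4real.log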
Blocking Applebot
I’m not sure why you’d want to prevent Applebot from crawling, but should you need to block the crawler, you can use a standard robots.txt directive:
User-agent: Applebot
Disallow: /forbid-apple/
Helpfully, they seem to honour Googlebot crawl directives in lieu of Applebot-specific directives. Many bots are finicky about the case used in the “User-agent” value, so be sure to use a capital ‘A’ in ‘Applebot’ to avoid the odd situation where they respect your ‘Googlebot’ directives over your ‘applebot’ directives. If you have any questions about this you can email Apple directly…
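As a quick sketch of that fallback behaviour, with both groups present Applebot should follow its own group rather than the Googlebot one (the /no-google/ path is just a placeholder):

# Applebot falls back to the Googlebot group only when no Applebot group exists
User-agent: Googlebot
Disallow: /no-google/

User-agent: Applebot
Disallow: /forbid-apple/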
If you’re feeling more adventurous, or Apple aren’t behaving themselves, you can use .htaccess (or web.config, nginx config, etc.) to do the blocking, with something like the following:
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} Applebot [NC]
RewriteRule ^ - [F]
This would return a 403 Forbidden. Alternatively, you could use the [G] flag on the RewriteRule to return a 410 Gone response instead (I wouldn’t recommend doing this). Blocking an IP range would be possible too, but we probably don’t want to block legitimate users on “17.*.*.*”. When blocking by user agent, user-agent spoofers are the main collateral damage, which we can live with.
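For the nginx route mentioned above, a rough equivalent might look like this (an untested sketch; it goes inside the relevant server block):

# return 403 Forbidden to any user agent containing "applebot" (case-insensitive)
if ($http_user_agent ~* "applebot") {
    return 403;  # or 410 for the equivalent of the [G] flag
}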
Although we aren’t likely to be pandering to Applebot in the same manner as we do Googlebot, it’s going to be very interesting to see how its crawl behaviour differs, especially given that it will default to Googlebot directives.
You may also see AppleNewsBot in your logs if you’ve created a publication in Apple News or if a user has added your RSS feed.
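The same user-agent filtering from earlier works for it too (applenews.log is just an example file name):

grep 'AppleNewsBot' access.log > applenews.log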