Like many SEOs I’ve been spending some time with the new Search Console. This got me thinking – how trustworthy is the Last Crawled data, and what can it tell us?
Crawl dates in this new version are given down to the second:
If accurate, this is (ignoring UI delays) incredible for testing. We’ve never had this level of granularity before. It also allows us to test some long-held intuitions.
To start I looked at pages excluded with a noindex tag, and their last crawl dates:
Since all are set to noindex, they aren’t crawled as frequently. Would a Fetch and Render count as a crawl for the interface?
The initial selection and submissions were made on the 30th of January. The reports are updated daily, but using data which may be delayed by up to one week (I’ve not seen consistent behaviour here yet).
If none of the dates change, then we can be sure that Fetch and Render does not count as a crawl for Search Console’s purposes. Looking from the 13th of Feb, the graph goes up to the 10th of Feb. Only a single URL from the Fetch and Render submissions has been crawled:
Notice that the crawl timing is 17 mins after submission, and that only the 1st submitted URL was crawled over the two weeks. This may be a coincidence, but I think it’s more likely to be intelligent crawl scheduling tagging these URLs with a tiny positive signal.
Fetch and Render submissions do not count for ‘last crawl dates’ reported in search console, nor do they meaningfully encourage crawl.
But what if we also request indexing?
Crawled Not Currently Indexed Due to Complete Lack of Interest
According to the new Search Console, these are my most terrible posts:
Nothing prevents their appearing in the index but a lack of quality. On Feb 5th I submitted each of these 10 for indexing using Fetch and Render to see if this would trigger ‘crawl’:
Surprisingly, for 6 of the submitted articles the pages were reported as crawled and indexed:
The old Search Console version’s timestamps are recorded in GMT-8 (the time in Mountain View, CA). The new Search Console interface reports in local time (for me, GMT).
Keeping this in mind, check out the timestamps. Now check out the timestamps on submission:
The crawl times do not align with the submission times (even allowing for the 8-hour difference), meaning that the submitted URLs are given to the scheduler rather than added to the front of the crawl queue.
Here are the Googlebot requests to ‘intentions-and-permissibility‘ around this period:
The first two requests are Fetch and Render. The first highlighted request is the pictured Fetch and Render request:
12:55AM Mountain View Time = 08:55AM in UK Time.
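The conversion above is easy to get wrong by hand, so here is a minimal sketch of it in Python. It assumes the fixed offsets described in this post (old Search Console in GMT-8, UK on GMT in February). The function name and the calendar date are placeholders for illustration; only the time of day comes from the screenshot.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets as described in the post (assumption: no daylight saving in play).
MOUNTAIN_VIEW = timezone(timedelta(hours=-8), "GMT-8")
UK_WINTER = timezone.utc  # the UK is on GMT in February


def old_console_to_uk(stamp: str) -> datetime:
    """Convert an old-Search-Console style timestamp (GMT-8) to UK time."""
    local = datetime.strptime(stamp, "%Y-%m-%d %I:%M%p")
    return local.replace(tzinfo=MOUNTAIN_VIEW).astimezone(UK_WINTER)


# The date here is a placeholder; only "12:55AM" is taken from the post.
converted = old_console_to_uk("2018-02-05 12:55AM")
print(converted.strftime("%H:%M"))
```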
This logged request aligns perfectly. The next highlighted request, at 08:57:05, is either just after my submission for indexing (possible), or a crawl they have made of their own volition.
The third highlighted request aligns with the ‘last crawl date‘ pictured from the Search Console report:
09:26:34 in Search Console = 09:26:33 in Server Logs
But the final request, 5 mins later and using the smartphone UA is not reported as the last crawled time in search console. Even accounting for a reporting delay (it’s the 13th as I write this), the crawl should have appeared by now.
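For anyone wanting to repeat this comparison, the following is a rough sketch of how the log check could be scripted. It assumes logs in the common "combined" format and identifies Googlebot by user agent string only (no reverse DNS verification). The sample lines, IP addresses, year, and the smartphone request time are invented for illustration and are not the real server logs; `googlebot_requests` and `matches_report` are hypothetical helper names.

```python
import re
from datetime import datetime, timedelta

# Invented sample lines in combined log format, NOT the real server logs.
SAMPLE_LOG = """\
203.0.113.5 - - [05/Feb/2018:09:26:33 +0000] "GET /intentions-and-permissibility/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
203.0.113.5 - - [05/Feb/2018:09:31:40 +0000] "GET /intentions-and-permissibility/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
203.0.113.9 - - [05/Feb/2018:09:40:00 +0000] "GET /intentions-and-permissibility/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/58.0"
"""

LINE = re.compile(
    r'\[(?P<time>[^\]]+)\] "GET (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def googlebot_requests(log_text, path_fragment):
    """Yield (timestamp, kind) for requests whose UA claims to be Googlebot."""
    for line in log_text.splitlines():
        m = LINE.search(line)
        if not m or "Googlebot" not in m["ua"] or path_fragment not in m["path"]:
            continue
        when = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
        kind = "smartphone" if "Mobile" in m["ua"] else "desktop"
        yield when, kind


def matches_report(reported, requests, tolerance=timedelta(seconds=2)):
    """Return the logged requests within `tolerance` of the reported crawl time."""
    return [(t, k) for t, k in requests if abs(t - reported) <= tolerance]


hits = list(googlebot_requests(SAMPLE_LOG, "intentions-and-permissibility"))
reported = datetime.strptime("05/Feb/2018:09:26:34 +0000", "%d/%b/%Y:%H:%M:%S %z")
for when, kind in matches_report(reported, hits):
    print(when.isoformat(), kind)
```

The small tolerance is there because the interface and the logs can disagree by a second, as they do above.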
If I had to guess, this lack of Smartphone Crawl counting as last crawl could be down to ohgm.co.uk not being in the mobile first index (~20% crawl is smartphone). Maybe it is as simple as that. But the new search console is not reporting (for this website at least) mobile crawl data under ‘last crawled’.
Each of the subsequently indexed URLs had additional requests after the first Fetch and Render. This makes some sense. The indexing request made from the interface can be placed significantly later than the initial Fetch & Render request, since the ‘request indexing’ button stays in place for a while.
To avoid mischief (e.g. getting Google to index page A, immediately serving page B to users), a second crawl is required. Even URLs which are immediately indexed are done with a second crawl. Google simply can’t trust Fetch and Render data without opening themselves up to our bullshit.
Four of the submitted URLs did not get indexed. The crawl dates in Search Console have not updated for the URLs either:
In the server logs the URLs have one request each for the period, which matches the Fetch and Render made in Search Console. And since this submission doesn’t seem to count for indexing purposes, the indexing request must have been ignored, or didn’t amass enough priority to get scheduled in two weeks.
If this is the case, then ‘Request Indexing’ merely adds more priority to a URL in the queue. It’s no guarantee that Google will even properly crawl the thing for reconsideration.
This seems in line with Google’s general approach to Hints and Directives (which are also hints). This isn’t criticism – this is the way things have to be on this scale. But it does indicate that the ‘submit to index’ button really isn’t the panacea many of us take it to be (as I write this there is some Twitter discussion around Google removing the wording around limits to this option).
Requests to URLs Blocked in Robots.txt
Given my predilection for robots.txt, I find this section the most interesting. Check the crawl timings on these indexed pages when sorted by last crawl date:
They’re all nice and unique. Now check the crawl timings on these blocked in robots.txt URLs Google is complaining about:
Notice the two blocks with identical timings. These ‘crawls’ were made in the same second. Obviously Googlebot isn’t crawling these URLs, because it’s very well behaved and always obeys robots.txt* (and I checked the server logs).
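Spotting these blocks by eye works for a handful of rows. For a longer export, a sketch like the one below groups URLs by their reported second. The dictionary is a hypothetical stand-in for an export of the report, with made-up paths and times, and `same_second_blocks` is an illustrative name.

```python
from collections import Counter

# Hypothetical export of the report: URL path -> 'last crawled' string.
# Paths and times are made up for illustration.
last_crawled = {
    "/blocked/a": "2018-02-09 03:14:07",
    "/blocked/b": "2018-02-09 03:14:07",
    "/blocked/c": "2018-02-09 03:14:07",
    "/blocked/d": "2018-02-08 17:02:51",
}


def same_second_blocks(crawl_times, minimum=2):
    """Map each timestamp shared by at least `minimum` URLs to those URLs."""
    counts = Counter(crawl_times.values())
    return {
        stamp: sorted(url for url, t in crawl_times.items() if t == stamp)
        for stamp, n in counts.items()
        if n >= minimum
    }


blocks = same_second_blocks(last_crawled)
for stamp, urls in blocks.items():
    print(stamp, urls)
```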
Attempting to crawl a URL blocked in robots.txt looks a bit like this:
- URL comes up for crawl from the scheduler.
- Before making a request Googlebot checks against the cached robots.txt. A fresh one isn’t downloaded against each request (but 24 hours is the longest they’ll use an old one).
- Googlebot finds that the URL is not open for crawling according to its idiosyncratic interpretation of the robots.txt file, and the URL is not requested.
- The next scheduled URL is tested.
These are very nice examples of scheduling in action. Since no request is actually being made, just a test against a cached file, Googlebot can speed through scheduled blocked URLs. Thousands can be done in under a second if the scheduler happens to bundle them in this order.
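The steps above can be sketched roughly as follows. This is a toy model and not Googlebot’s implementation: it uses Python’s standard `urllib.robotparser`, whose matching rules differ from Google’s own interpretation. `CachedRobots`, `run_schedule`, the robots.txt contents, and the example.com URLs are all hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser

MAX_AGE_SECONDS = 24 * 60 * 60  # the post's figure for the oldest copy used


class CachedRobots:
    """Keeps one parsed robots.txt and refreshes it only when it is too old."""

    def __init__(self, fetch_robots_txt):
        self._fetch = fetch_robots_txt  # hypothetical callable returning the file text
        self._parser = None
        self._fetched_at = 0.0
        self.downloads = 0

    def allowed(self, user_agent, url, now=None):
        now = time.time() if now is None else now
        if self._parser is None or now - self._fetched_at > MAX_AGE_SECONDS:
            self._parser = RobotFileParser()
            self._parser.parse(self._fetch().splitlines())
            self._fetched_at = now
            self.downloads += 1
        return self._parser.can_fetch(user_agent, url)


def run_schedule(urls, robots, request):
    """Request allowed URLs; blocked ones are only tested against the cache."""
    skipped = []
    for url in urls:
        if robots.allowed("Googlebot", url):
            request(url)
        else:
            skipped.append(url)  # no HTTP request is made for these
    return skipped


ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"
robots = CachedRobots(lambda: ROBOTS_TXT)
requested = []
skipped = run_schedule(
    [
        "https://example.com/private/a",
        "https://example.com/private/b",
        "https://example.com/post/",
    ],
    robots,
    requested.append,
)
print(skipped, requested, robots.downloads)
```

Because blocked URLs cost only a lookup against the cached file, a run of them can be cleared almost instantly, which would fit the identical timestamps in the report.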
Bulletpoints to Aid Reading Comprehension
- Although Fetch and Render submission is Googlebot at full capabilities, it doesn’t actually count for indexing purposes. A submission may tag the URL with ‘mild interest’ for any scheduling.
- Fetch and Render with Request Indexing is a much stronger signal, but it does not force scheduling.
- A separate crawl after Fetch and Render is done for the actual indexing, and this makes sense to avoid shenanigans from people like us.
- We can trust the crawl times reported in the new search console. Those crawl times are the crawls actually used for indexing.
- The new search console reports last crawl time in local time. The old search console reported it in Google time. This is a nice feature.
- ‘Attempted Crawl’ timings for URLs blocked in robots.txt are included in last crawl timings.
* Earlier in this article I made a reference to Googlebot always being well behaved. I’d like to clarify that this was a joke. If you enjoy this sort of thing consider applying to speak at my creepy basement SEO conference.