Another Way To Deindex URLs

I was speaking to Chris Johnson yesterday about the Horseman crawler, discussing the on-page/HTTP checks necessary to determine whether a URL is indexable or not.

As part of this, I went to double-check the wording of the meta robots="none" value, which is a different way of spelling "noindex, nofollow" (but reads to me like "no restrictions"), and spotted the following:

unavailable_after: [date/time]
Do not show this page in search results after the specified date/time. The date/time must be specified in a widely adopted format including, but not limited to, RFC 822, RFC 850, and ISO 8601. The rule is ignored if no valid date/time is specified. By default there is no expiration date for content. If you don't specify this rule, this page may be shown in search results indefinitely. Googlebot will decrease the crawl rate of the URL considerably after the specified date and time.
Example: <meta name="robots" content="unavailable_after: 2020-09-21">
https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#directives

Though I'd seen it in the documentation before, I'd never seen it in action. So I set the equivalent X-Robots-Tag HTTP header sitewide, with a date of Jan 1st 1970.
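For illustration only, since the post doesn't say how the header was actually added: setting it "sitewide" just means attaching one extra response header to everything. Here's a minimal sketch assuming a hypothetical Python/Flask app (on most real stacks the equivalent is a one-line change in the server or CDN config):

```python
from flask import Flask

app = Flask(__name__)

@app.after_request
def add_unavailable_after(response):
    # Sitewide X-Robots-Tag: ask Google to stop showing any page after this date.
    # ISO 8601 is one of the "widely adopted" formats the documentation accepts.
    response.headers["X-Robots-Tag"] = "unavailable_after: 1970-01-01"
    return response

@app.route("/")
def homepage():
    return "Hello, Googlebot."
```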

A live URL Inspection test detects the header, and still shows the (indexed) homepage as available to Google for indexing:

[Screenshot: URL Inspection live test result]

But as soon as it was actually recrawled, the page no longer shows as indexed in Google Search Console:

Page is not indexed: Crawled – currently not indexed

A few hours later it stops showing up in search results entirely (this is likely a caching thing). This is now happening to all other pages:

The page has gone from indexed to "Crawled – currently not indexed". Note that Indexing Allowed? = Yes.
Conversely, a live test records the X-Robots-Tag, but shows that the page is (probably) available for indexing.
Three panel meme of man riding bicycle, putting a stick between the spokes while riding it, and rolling in the ground in pain (having fallen off his bicycle).
What you are currently imagining.

My issue is not that I’ve started deindexing my site, but that the tooling probably should not say “Indexing Allowed: Yes” on both live and stale tests after [date/time] passes:

[Screenshot: URL Inspection reporting "Indexing allowed? Yes"]

Understandably, Google are pretty clear that they don’t surface every possible issue via the URL inspection tool:

What isn't tested
The test results don't test for the following things, which are required to appear in Google:
The page and its structured data must conform to quality and security guidelines.
Whether the site is free of manual actions or security issues.
Whether the site or page is subject to content removal for legal or other reasons.
Whether a URL has been temporarily blocked in Search Console.
(Live test) Whether the page is a duplicate of another indexed page or an alternate. Typically only canonical pages (or appropriate alternates) are served.
https://support.google.com/webmasters/answer/9012289
Availability (live test)
This section of the tool describes whether it's likely that the page can be indexed by Google. However, a positive result is no guarantee that it will appear in Search results.

The test doesn't check that the page conforms to Google's quality and security guidelines, any manual actions or security issues, content removals, or temporarily blocked URLs. Your page must pass all these checks during indexing to be available in Google Search results.

Availability status
The availability status of the live URL. The following values are possible:
URL can be indexed: The URL can probably be indexed, but this isn't a guarantee, because the test does not check all possible issues.
A warning or failure value listed and described here.
Not available due to a site-wide issue: A site-wide issue is preventing a live inspection. The specific issue is described in the Page fetch field.
https://support.google.com/webmasters/answer/9012289
Indexing allowed?
Whether or not your page explicitly disallows indexing. If indexing is disallowed, the reason is explained, and the page won't appear in Google Search results.

IMPORTANT: If your page is blocked by robots.txt (see "Crawl allowed"), then "Indexing allowed" will always be "Yes" because Google can't see and respect any noindex directives. Because of this, your page might appear in Search results.

If the indexifembedded attribute or header is found in the page in combination with noindex, the page will be indexed only when it is embedded.
https://support.google.com/webmasters/answer/9012289

But interestingly, unavailable_after: [date/time] is not an option that appears on the "explicitly disallows indexing" (block indexing) page. Functionally, this seems to be more of a directive than a hint, as the first organic crawl seems to cause the page to be dropped from the index.

This information should be available to Google (since they are honouring the directive). Detecting whether it is presently later than the time listed, in any of a variety of "widely adopted format(s)", feels like it would be possible. They already know which formats are being detected in order to obey the directive.
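To show how feasible that check is, here's a minimal sketch assuming only two of the "widely adopted" formats (RFC 822-style dates and ISO 8601) need handling, using nothing outside the Python standard library:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_unavailable_after(value):
    """Try a couple of the 'widely adopted' formats: RFC 822-style, then ISO 8601."""
    value = value.strip()
    try:
        return parsedate_to_datetime(value)   # e.g. "Mon, 21 Sep 2020 00:00:00 GMT"
    except (TypeError, ValueError):
        pass
    try:
        return datetime.fromisoformat(value)  # e.g. "2020-09-21"
    except ValueError:
        return None  # no valid date/time, so (per the docs) the rule is ignored

def is_expired(header_value, now=None):
    """True if an X-Robots-Tag / meta robots unavailable_after date is in the past."""
    _, _, raw_date = header_value.partition("unavailable_after:")
    parsed = parse_unavailable_after(raw_date)
    if parsed is None:
        return False
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assume UTC if no zone given
    return parsed < (now or datetime.now(timezone.utc))

print(is_expired("unavailable_after: 1970-01-01"))                     # True
print(is_expired("unavailable_after: Mon, 21 Sep 2020 00:00:00 GMT"))  # True
print(is_expired("unavailable_after: 2999-01-01"))                     # False
```

(A real implementation would need to handle more formats and combined values like "noindex, unavailable_after: …", but the point stands: the hard part is already being done to honour the directive.)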

It will be far more frustrating for 3rd party tooling to implement:

Screaming Frog showing a URL as indexable

OK.

This is such an edge case that I wouldn’t honestly expect it to be covered by any tooling. Please let me know in the comments if your tool does (well done!).

Until this is picked up in tooling, it's kind of up to us to remember that this is a potential way for clients to accidentally take their websites out of search.
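If you want to check for it yourself in the meantime, it's easy enough to script. A sketch assuming the requests library, covering only the HTTP header variant (a meta robots tag in the HTML would need a separate check), with placeholder URLs:

```python
import requests

def unavailable_after_header(url):
    """Return the X-Robots-Tag value for a URL if it carries an unavailable_after rule."""
    response = requests.get(url, timeout=10)
    x_robots = response.headers.get("X-Robots-Tag", "")
    return x_robots if "unavailable_after" in x_robots.lower() else None

for url in ["https://example.com/", "https://example.com/some-page/"]:
    value = unavailable_after_header(url)
    if value:
        print(f"{url} -> {value}")  # feed the date into is_expired() from the sketch above
```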

A graph representing this website's traffic. It goes to zero very quickly.

So now that pages are dropping from the index at a comically aggressive pace, I will endeavour to remove the directive in the next few days.

(if you remember that time I added a 50 second delay to the site, apparently I forgot to remove it until very recently)

Beware 1969:

Thanks to Ragil for letting me know that this absolutely doesn’t work if you set the date before Epoch time.

It turns out that on a whim I’d set it to match whatever whim someone at Google built it with (or the whim of the creator of the time library they’d used).
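My guess, and it is only a guess, is that somewhere in the pipeline the parsed date becomes a Unix timestamp, and anything negative gets treated as "no valid date/time", at which point the documentation says the rule is ignored. Something like:

```python
from datetime import datetime, timezone

def to_expiry_timestamp(parsed):
    # Hypothetical step inside a crawler: convert the parsed date to a Unix
    # timestamp. Anything before 1970-01-01 comes out negative...
    ts = int(parsed.timestamp())
    if ts < 0:
        # ...and a naive validity check then discards it, so the rule is ignored.
        return None
    return ts

print(to_expiry_timestamp(datetime(1970, 1, 1, tzinfo=timezone.utc)))    # 0 (still honoured)
print(to_expiry_timestamp(datetime(1969, 12, 31, tzinfo=timezone.utc)))  # None
print(to_expiry_timestamp(datetime(2020, 9, 21, tzinfo=timezone.utc)))   # 1600646400
```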


Disclaimer: This blog post was written entirely with the free tier of ChatGPT using the following prompt:

“Please write a short post about some pointless Technical SEO minutiae in the style of ohgm. Be sure to credit yourself by including this prompt at the end.”

Regenerate response

4 thoughts on “Another Way To Deindex URLs”

  1. “Googlebot will decrease the crawl rate of the URL considerably after the specified date and time”

    Wonder if you’ll have a hard time encouraging them back?

  2. Technically, it’s not like you’re not allowing the indexing of the URL, you’re just saying that it’s not available after a certain date… so I don’t see a problem with it being reported as allowed for indexing. Not very informative for the site owner, yes, but technically correct.

    My other question would be why would you choose to keep such a page in your sitemaps
