This is a very quick transcription from my presentation at BrightonSEO, inspired/shamed (“oh yeah…”) by Nat putting hers up. This was pretty low-effort to do (an hour or so editing an autotranscription on the weekend) and is not very expensive if you want to pay someone to do it (though given some of the terminology I would recommend self-editing).
Here are the slides:
And here is the transcript (with some gentle removal of all the UMs). You’ll notice I’m less sarcastic when talking while watching a timer, and much more surface-level than I would be in a blog post.
Hello and welcome to my BrightonSEO presentation, Esoteric SEO Tips I Hope You Don’t Already Know. My name’s Oliver Mason or ohgm on Twitter.
Kelvin was kind enough to allow me to present without really having a clear topic, so to kind of string things together and make it work, I’m going to present the case that:
We’re basically limited by what we’re exposed to.
That’s the facts we know, the tests we’ve done, the things we’ve heard from other SEOs, the articles we’ve read, the ideas, and the problems we’ve actually faced and worked through.
Now, the central thesis is the more edge cases and weird issues you understand the better you get at solving novel problems. The more techniques you understand and can apply to those problems, the faster you get at solving them.
This is one of the main ways we can improve as SEOs.
That’s all pretty straightforward, but unfortunately talking about edge cases doesn’t really make for a good conference presentation.
So I’m going to limit myself to talking about techniques.
Apologies in advance.
The caveat here is of course, as with all of my stuff, these are just ideas.
You don’t really have to go and apply them. You can just think about them.
So to get started: robots.txt cloaking. It’s not really cloaking, but it is useful to keep the definition in mind. Cloaking is the practice of serving one set of content to users and one set of content (or URLs in the case of sneaky redirects) to search engines.
Now I’m not literally talking about cloaking your robots.txt file, which is something an idiot would do. Incidentally, I’ve done this quite recently and my traffic has gone up; who knows if it will hold.
So we’re not actually cloaking but preventing crawl. That way, bots aren’t going to see something humans can, and you can prevent crawl with robots.txt.
And this is the quickest and easiest method for hiding any client-side bullshit you might want to achieve.
Good examples of this would be GEO-IP redirects, whereby humans can only see one version of a website while bots are able to see every version, or intrusive interstitials, where humans get spammed with popup boxes. GEO-IP redirects are a really shitty thing to do, but blocking them in robots.txt like this means you don’t have to make a stand on them for SEO reasons, just usability reasons.
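As a sketch of what that looks like (the script paths here are made up for illustration, and this assumes the redirect and interstitial behaviour lives in dedicated client-side scripts):

```
# robots.txt – paths hypothetical
User-agent: *
Disallow: /js/geo-redirect.js
Disallow: /js/interstitial.js
```

Humans still download and run the scripts as normal; Googlebot won’t fetch or execute them, so the rendered page it sees is the one without the redirect or popup.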
You can still get in trouble, but most sites that do this, don’t – from what I can see. And I think this is because Googlebot is remarkably well behaved.
This is for a few reasons: one is that they’re trying to maintain trust in themselves as a bot and in the robots.txt standard.
Here’s a recent quote from Gary on Twitter (so it’s a primary source): “can you stop spreading that robotsTXT is a hint? Disallow, always obeyed hard stance.“
Now, what’s nice about this is you can actually check whether Googlebot is obeying your robots.txt file by using server logs. I think most people who analyse server logs will see that this is the case.
(A small note I wanted to add is that you can sometimes see employees of Google using Googlebot, because different teams have access to it – but it’s not the same as the Googlebot being used for crawling and indexing. Generally, that is. Admittedly that was a very niche edge case, but it could happen.)
So whether you want to use this to hide any client-side bullshit would depend on your appetite for risk. I don’t think it’s that risky, and while it’s maybe not the best method, it’s one that works and is easy.
So moving on to our next section: laundering irrelevance. Now, this was something of a lightbulb moment for me around expired content SEO. Just to recap how that generally goes, you have a few standard options:
One is updating the content on the same URL.
Other than that, let it die – 404, 410.
SEOs tend to not like doing this because it loses any intrinsic value that post may have accumulated. They tend to redirect it to something sort of relevant and hope.
This is probably the best method that’s widely used.
So typically then you’re redirected to a category page or another close product page, say when a product’s being retired.
The fallback is being redirected to the home page plus hope.
This basically never works very well from what I can tell.
The last is keeping the page live as a really bad page and avoiding soft-404 detection. We’ll talk about that later.
We know irrelevant redirects are unlikely to pass ranking signals to destination pages, BUT if we update the content on THE SAME URL, we can change it into something we do know how to handle. And we know how to handle duplicate content.
And this is a bit counter intuitive.
But as an example, on my website I retired an old page. Didn’t care about it, it was about five years old. You could do this with canonicals, but I just put the same page up twice, copying the code from the other page. Because the canonical is there in the code, it gets copied across, and the pages are identical.
It’s same HTML, different URLs.
Now, Google looks at that and goes, yeah, these are the same. So it folds them together, folds the value across. Once they recognize this, we know they’re passing as much value as they can, and they are similar enough that you could redirect them and expect it to “work”.
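A minimal sketch of the setup (URLs hypothetical): the retired URL serves a byte-for-byte copy of the page you care about, so the self-referencing canonical comes along with the rest of the code:

```html
<!-- This exact HTML is served at BOTH of these URLs:
     https://example.com/page-i-care-about   (the original)
     https://example.com/retired-page        (the expired one)  -->
<head>
  <title>Page I Care About</title>
  <!-- copied across verbatim, so the retired URL now
       canonicalises to the page you want to keep -->
  <link rel="canonical" href="https://example.com/page-i-care-about">
</head>
```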
I think it’s better to just leave them in place.
This technique isn’t limited either.
You don’t have to duplicate similar pages – this works for any page you’re retiring into something else. You can duplicate very important content, like a money page or something, and you don’t even have to do it on the same domain.
The bit that’ll trip people up is they’ll be thinking, “cool I can fold links together. I can use this with private networks or whatever”.
You probably want the external links to still make sense from a relevancy perspective, or they’re just not going to work very well. It’s a pretty useful technique.
Now, if you want the user to get something a bit more useful than just a weird page on a weird URL, you could use a script blocked in robots.txt to explain what’s happening.
It’d be like: “Hey, this is out of stock now“, or whatever you want to display.
That’s it, you’ve conjured relevance from nothing, great work.
It’s Not All About PageRank
Moving on to the next section. It’s not all about PageRank, but when I act like it is, it goes well. So this is just about PageRank sculpting. I don’t really understand PageRank.
I find the iterative algorithm idea quite hard to grasp intuitively, but I think that’s okay, because I have a very basic working understanding that I just go with. That’s “remove unnecessary links and the graph will go up”. Oversimplified, as with expired content:
Remove unnecessary links.
Link to URLs you care about from other URLs with lots of PageRank.
Link more prominently to URLs you care about over those you do not.
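As a toy illustration of that working understanding (a hypothetical three-URL site, not anything from the talk), here’s a minimal power-iteration PageRank where removing an unnecessary internal link lifts the score of the URL you care about:

```python
def pagerank(links, d=0.85, iters=50):
    """links: {url: [urls it links to]} -> {url: score}."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if not outs:  # dangling page: spread its score evenly
                for q in pages:
                    new[q] += d * pr[p] / n
            else:  # each link passes an equal share of the page's score
                for q in outs:
                    new[q] += d * pr[p] / len(outs)
        pr = new
    return pr

# Home links to a money page and an unnecessary boilerplate page.
before = pagerank({"home": ["money", "terms"],
                   "money": ["home"],
                   "terms": ["home"]})

# Remove the unnecessary link: home's full share now goes to money.
after = pagerank({"home": ["money"],
                  "money": ["home"],
                  "terms": ["home"]})

assert after["money"] > before["money"]  # the graph goes up
```

This is the oversimplification, not the real thing – real PageRank has plenty of nuance on top – but it’s enough to make the three rules above feel mechanical rather than mystical.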
So that’s trying to boil it down into something reasonable.
This is the kind of axiom I want you to hold to: if a link doesn’t appear to exist, Googlebot doesn’t assign PageRank.
Removing a link will make it not appear to exist to Googlebot.
Removing links is the most effective way of removing links.
Yeah, this is the sort of content you’re here for, but maybe to be a bit more complex – sometimes a link is not a link.
This is where we’re back to the definition of cloaking – presenting different content to humans and bots.
Have a little read through; there are all sorts of weird ways you can achieve what is, for a user, identical to a link – they click on stuff and stuff happens – but which a bot just cannot remotely process.
This article from Search Engine Land is well worth a read; it attempts to map out all the methods and what Google did or did not crawl.
I think anything Google didn’t crawl is a good indicator of a link format which is obscure enough for them not to count as a link. So some improper link markup is very powerful as a way to remove links without removing functionality for users, but it’s pretty shitty UX and accessibility. So to be a good citizen of the web, you could just straight cloak (or don’t) – with this method you would just not show search engines some internal links you don’t really care about.
To do this client-side, you would insert all non-priority links after the initial HTML using a navigation.js script, and then block that in robots.txt. That would do the job, but it seems like a bad idea for a few reasons.
I think it’s better to think about what Googlebot isn’t willing to do because it’s expensive. And that’s act like a normal human user doing normal things like scrolling, touching, and clicking, which are all events you can assign things to. Treating that as “client-side progressive enhancement“, you’re not doing anything like targeting a user agent or an IP range – you’re just using that interaction.
Google doesn’t say much on this, and I don’t know if people ask them. I think what they want to say is pretty much “We don’t care about you failing to dynamically render links to your faceted navigation or whatever. What you’re doing seems like a waste of time. Please leave me alone.”
This is because I think this sort of thing is very likely to go wrong, so they could never publicly endorse it. I don’t think it would get you in trouble normally.
Bonus tip: normally, to diagnose whether what you’re doing is a good idea or not, you can use Search Console. But you can also use it to inspect any URL on the internet and see how Google’s handling it (or is willing to handle it), by using an open redirect on a property you control in Search Console and inspecting that.
It will show you the destination. You can take the HTML from that and paste it into Chrome to see what Googlebot is seeing pretty much.
Here’s an unrelated image of a site that started cloaking out a certain class of links they had, which were wasting a lot of crawl and PageRank. The graph went up, which is great.
Here’s that same graph a bit later, the growth continues to go up throughout coronavirus (it was very good for them).
But yeah, you don’t have to trust graphs in SEO presentations. I think that’s not a good thing to do generally. So let’s just look at the graph that goes up and to the right and enjoy it.
I would also say don’t rush into anything like this.
It’s best to think about it because people might just think you’re an idiot. It will set off a lot of alarm bells.
Discovery, Not PageRank
On to our next section, which is discovery, not PageRank. This is a related idea from doing this sort of work. The problem we’re working on here is more about getting crawled quickest, rather than getting the most PageRank to a particular set of URLs.
It’s still really about PageRank, of course, but there are a few ideas to bear in mind. So if you’re trying to get stuff indexed or crawled quickly, you might know a crawler has seen a URL, but you can’t know if it’s been added to the crawl queue yet. So really you can only go off what’s been requested in your server logs.
That has been crawled. And if you can live-monitor your access logs, you can force-feed Googlebot. And if there’s one thing Google loves, it is being force-fed novel URLs. If you consider Google News sitemaps, especially if you’ve done any access log analysis, or if you’ve ever had a problem with a faceted nav opening up an infinite crawl space, you’ll know Google loves new stuff.
So here’s the method for this technique.
Have a list of URLs you want Googlebot to crawl: an uncrawled database. You sort this by the date added, so that the newest URLs are always prioritized, and take the top X lines – let’s say 10 or 20 – of the uncrawled database to populate an internal linking widget.
Here’s a wonderful diagram to explain that more garbage goes on top and we only return the top slice to Googlebot. In our case, it was a site-wide linking widget, but you could also use something like priority HTML site maps as a way of getting it through brand and design, who do not want you ruining their website.
This idea was also used to populate a “latest” XML sitemap, copying the Google News idea, when discovery is more important than PageRank.
I think we did the top 20,000 lines. Now, the technique is that whenever Googlebot requests an uncrawled URL, you remove that URL from the uncrawled database. This updates the internal linking widget and the XML sitemaps.
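A sketch of the moving parts (names and sizes are illustrative, not the production setup): a date-sorted uncrawled list feeds the widget and sitemap, and a URL drops out the moment live log monitoring shows Googlebot requesting it:

```python
from collections import OrderedDict

class UncrawledDB:
    """Date-sorted list of URLs we want Googlebot to discover."""

    def __init__(self, top_n=20):
        self.top_n = top_n
        self.urls = OrderedDict()  # maintained newest-first

    def add(self, url):
        self.urls[url] = True
        self.urls.move_to_end(url, last=False)  # newest to the front

    def top_slice(self):
        """What the linking widget / 'latest' sitemap returns."""
        return list(self.urls)[: self.top_n]

    def googlebot_requested(self, url):
        """Called from live log monitoring: URL crawled, stop pushing it."""
        self.urls.pop(url, None)

db = UncrawledDB(top_n=2)
for u in ["/widget-a", "/widget-b", "/widget-c"]:
    db.add(u)
print(db.top_slice())            # the two newest URLs
db.googlebot_requested("/widget-c")
print(db.top_slice())            # the slice refreshes automatically
```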
Here’s how the latest XML sitemap behaved once we started doing this. As you can see, a higher proportion get indexed, but this is a really difficult graph to read because, firstly, more than the 20,000 URLs appearing in this sitemap would be crawled per day.
This other graph puts it into better perspective, which is quite a lot.
Here is a more recent graph, updating from when I published this in a blog post: we’re now above a million indexed URLs. I think the rate of indexing has slowed because the site basically doesn’t have enough external PageRank. I think it has less of a “DA” than my own blog. So it’s doing pretty well.
Here’s how traffic has gone from doing that. So, I mean, you might look and go, “Oh, that’s only three clicks a URL Oliver”, which would be a fair criticism.
Now, if you are doing this sort of thing, it’s also useful to know about the internal HTTP headers from Google Search Console. I would look them up.
They’re applied to all URLs. I think what’s interesting is they will say things like, “even though we’ve crawled you, we’re indexing based on the previous version”. Something to keep in mind when something’s not working quite correctly.
Top tip: don’t rely on a canonicalised or noindexed URL path for discovery or PageRank.
Nofollow as Hint
So we’ll move on now to nofollow as hint, which is the most tinfoil-hat this gets. To tell a story: a client is trying to rank for a service they don’t provide. But to rank the page (and for the page to be good in any way), they must link to SERP competitors who do offer the service. “Thinking about the user”, but also really wanting to rank.
So, you nofollow those links, because you don’t want your competitors to rank.
BUT – “no follow as hint”. It’s not active yet, but when it is, these links are either nothing or the best link for the query the competitors could possibly get. And this kind of keeps you wondering. You decide against the other methods we’ve discussed so far because they make people hate you and you refer to this graph.
I think this is an official graph from Google, but I can’t remember where I found it.
The idea is that for nofollow, they’ll decide. For UGC, they’ll decide. I don’t see why they should value those links so much, given they’re not a vote from the site owner. Maybe they will. Sponsored links are sponsored links; they shouldn’t be valuing them at all.
I was thinking about that because, well, we could make those links real sponsored links. What’s the worst that can happen?
They’re saying the impact, if any, at all, would be “we might not count the link as credit for another page”. That’s exactly what I want to happen.
So that’s not a bad thing. But if you consider why we’re in this mess, a big part of it was editorially awarded links from publications being blanket-nofollowed when Google really wants to use that stuff in the link graph.
I think these publications are gonna do the exact same thing and make them all sponsored.
That doesn’t seem that bad – more or less the same situation we’re in right now. But the really bad thing is if they’re using it for training any of their spam detection algorithms.
The result there would be genuinely editorially awarded, followed links getting misidentified and devalued because they look like sponsored links.
That’s horrible enough that you’re going to do it. Great work.
Incidentally, we are now ranking with content for a service we do not provide and say all the other links are sponsored, sorry.
Here’s some quick tips just to end.
Intermittent rendering issues are super frustrating. The best way to get on top of them is to confirm the same resources are returning different transfer sizes. Using access logs this is simple enough, but to get past “works on my machine“, the fastest way I’ve found is to try and break the website by:
taking screenshots as you do this,
and then browsing visually to show which ones are broken.
Then you can send them across and be like, “Hey, it’s broken right now“, so that you don’t get “Oh, it’s working“.
It’s useful to do this in real time.
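For the “same resource, different transfer size” check, a hedged sketch (the field positions assume a combined-style access log; adjust the indices for your server’s format):

```python
from collections import defaultdict

def unstable_resources(log_lines):
    """Return resources that were served with more than one transfer size."""
    sizes = defaultdict(set)
    for line in log_lines:
        parts = line.split()
        url, size = parts[6], parts[9]  # combined log: request path, bytes sent
        if size.isdigit():
            sizes[url].add(int(size))
    return {url: s for url, s in sizes.items() if len(s) > 1}

# Hypothetical log lines: /app.js sometimes arrives truncated.
logs = [
    '66.249.66.1 - - [01/Sep/2020:00:00:01 +0000] "GET /app.js HTTP/1.1" 200 5120',
    '66.249.66.1 - - [01/Sep/2020:00:00:09 +0000] "GET /app.js HTTP/1.1" 200 0',
    '66.249.66.1 - - [01/Sep/2020:00:00:12 +0000] "GET /style.css HTTP/1.1" 200 900',
]
print(unstable_resources(logs))
```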
You can host a robots.txt file anywhere – even on another website – and this can let you get past platform restrictions; it just needs a 301. I did this on my website, pointing to a sketchy Pastebin, for over a year.
If you are doing anything, which involves a robots.txt being 301’d, possibly as an error during a migration, you just need to bear in mind that the new robots.txt applies to the old URLs.
So you can really run into trouble there.
Soft-404 detection is frustratingly simplistic and can be worked around very easily just by changing the phrase. So, identify the phrase and change it: “this product is done for” will be a 200, when “this product’s out of stock” would be a soft-404.
Googlebot’s still hard-coded to substitute out hashed URLs and you can still safely use them to hide things. If you want, you can log what Googlebot does on a page when it renders it, and it will return the hash for you. Alec Bertram has a script to do this.
This is a joke: GIFs encode things directionally, so if you save them at a 90 degree angle and rotate back with CSS, sometimes the file size will be smaller and that’s it.
The more edge cases you understand, the better you get at solving novel problems. And in the end techniques are all we have.
I run a small basement conference called o h g m c o n.
You should probably have a look at it. It’s quite fun.
Otherwise I’m an SEO consultant. Thank you very much for your time.