I’m fairly obsessive about cutting down on redirect chains. One of the biggest challenges in doing this is finding enough historical data: a few developers leave, the ecommerce manager disappears under mysterious circumstances, and suddenly the organisation no longer has access to this information. The following technique is very useful once the orthodox sources have been exhausted.
You’ve probably seen archive.org before, and may even have used the Wayback Machine to diagnose problems. It’s one of the best SEO tools available, and one of the best things on the internet. Most of you will be well aware of this screen:
Fewer people are aware of this feature:
This gives us a list of the unique pages archive.org has catalogued for a site since it began running. Let’s say we’re on a migration project, and we want to make sure the historical URLs Googlebot keeps crawling are not going through redirect chains unnecessarily. Using this list it’s easy to point each first URL straight at its final destination, and to fix any long-forgotten URLs that 404 unnecessarily. Google does not forget: if archive.org has found a URL, you can be pretty sure Google has too.
This top level information is very helpful for sites that have changed in structure or domain multiple times over the years. This fills in the gaps in organisational knowledge.
So we want this in a format we can use a crawler on. By default the page visibly loads the first 50 entries. This can be changed via drop-down to 100, which isn’t enough for most cases.
So my usual choice, Scrape Similar, fell down for me here, acting (reasonably) as a screen scraper rather than a source code scraper. Thankfully there’s no AJAX trickery to contend with, and the URLs are sitting right there in the HTML.
Most of the readers here will be familiar with SEO Tools for Excel. This is incredibly quick and avoids any custom scripting work. Install it, open up Excel, and paste in the following:
This XPath grabs all the hyperlink text (not the links to archive.org versions) from the tables. Replace with your website as appropriate.
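If Excel isn’t available, the same extraction can be sketched with Python’s standard library. The markup assumed below (the archived URL appearing as the anchor text of each listing row) is illustrative — check the live listing page’s HTML before relying on this selector logic:

```python
from html.parser import HTMLParser

class WaybackLinkParser(HTMLParser):
    """Collect the visible anchor text (the original URLs) from a
    Wayback Machine listing page. The sample markup below is a
    simplified assumption, not the exact live-page structure."""

    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        # The listing shows the archived URL as the link text, so keep
        # only anchor text that looks like a URL (skips nav links etc.).
        if self._in_anchor and data.strip().startswith("http"):
            self.urls.append(data.strip())

sample = ('<table><tr><td>'
          '<a href="/web/2012/http://www.seomoz.org/blog">http://www.seomoz.org/blog</a>'
          '</td></tr></table>')
parser = WaybackLinkParser()
parser.feed(sample)
print(parser.urls)  # ['http://www.seomoz.org/blog']
```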
This call on seomoz.org returned around 150k URLs into an Excel document. As you might expect, on analysis there were a large number of chained redirects (some resulting in five redirects).
At this point we’d update the .htaccess to redirect each unique URL that is currently chained to the appropriate final destination.
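As a sketch, the simplest form is one rule per mapping — the paths below are hypothetical examples, not real SEOmoz URLs. At this scale, Apache’s RewriteMap lookup file is usually the saner option than thousands of individual directives:

```apache
# Hypothetical examples: send each chained URL straight to its final destination.
Redirect 301 /old-category/widgets /widgets
Redirect 301 /blog/2009/some-post /blog/some-post

# For tens of thousands of mappings, a RewriteMap file scales better:
# RewriteMap redirects txt:/etc/apache2/redirect-map.txt
# RewriteRule ^(.*)$ ${redirects:$1|$1} [R=301,L]
```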
There are performance questions to be weighed here, and you will need to get the relevant server administrators on your side (so perhaps don’t push the obsessive-timewasting-marginal-gains angle). As SEOmoz has since migrated, we wouldn’t have to be too concerned with a performance hit from 150k mapped redirects.
If you’d like to quickly view the current domain in the web archive, use the following:
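A minimal bookmarklet along these lines (a sketch — it simply prefixes the current URL with the Wayback Machine’s “all captures” path):

```javascript
javascript:void(window.location='https://web.archive.org/web/*/'+location.href)
```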
Create a new bookmark, paste the above code as the location.
Query The CDX API
You can see this in action here:
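The query itself is just a GET request to the public CDX endpoint. A sketch in Python, using the documented CDX server parameters (`url`, `matchType`, `fl`, `collapse`):

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query(domain):
    """Build a CDX API query for every unique URL archived under a domain.

    Parameters follow the public Wayback CDX server documentation:
    - matchType=domain  -> include subdomains
    - fl=original       -> return only the original URL column
    - collapse=urlkey   -> deduplicate repeated captures of the same URL
    """
    params = {
        "url": domain,
        "matchType": "domain",
        "fl": "original",
        "collapse": "urlkey",
    }
    return CDX_ENDPOINT + "?" + urlencode(params)

# Fetching the list is then a single request, e.g.:
#   from urllib.request import urlopen
#   urls = urlopen(cdx_query("seomoz.org")).read().decode().splitlines()
print(cdx_query("seomoz.org"))
```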
All you need to do to use this data is a quick find-and-replace to eliminate the port numbers and any other cruft. Once that’s done, crawl the list to eliminate redirect chains and dead ends.
Taking This Further
If you want an even bigger dataset, you can batch download every page ever archived and extract the internal links. Deduplicating these will give you a very large historical dataset, perfect for crawling to eliminate redirect chains.
For this process I use the Wayback Machine Downloader to download a copy of all URLs matching certain criteria (e.g. older than a given timestamp). From these I gather a list of unique internal links. However, this is a bit excessive for most, and the method above combined with orthodox methods will get you >80% of the way. Don’t become a domain necromancer. Turn back now.
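The link-gathering step above can be sketched with the standard library: parse each downloaded page, resolve relative hrefs, and keep only URLs on your own domain in a set (which handles the deduplication for free). The base URL and sample page here are illustrative:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class InternalLinkParser(HTMLParser):
    """Collect unique internal links from one archived HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base = base_url
        self.domain = urlparse(base_url).netloc
        self.links = set()  # a set deduplicates as we go

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base, href)  # resolve relative links
        if urlparse(absolute).netloc == self.domain:
            self.links.add(absolute)

page = '<a href="/about">About</a> <a href="http://elsewhere.com/">x</a>'
p = InternalLinkParser("http://www.seomoz.org/")
p.feed(page)
print(sorted(p.links))  # ['http://www.seomoz.org/about']
```

Run across every file the downloader saves, the union of these sets is the historical dataset to feed your crawler.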