Many SEOs don’t really trust the Site:URL command. Most SEOs also don’t trust the “About X results” numbers that appear when you make a Google search. I didn’t either, and had always thought that they must be pulled out of the air, and pretty much useless.
Have you ever scrolled to the end of the results to see how closely the numbers match up? For example:
We know that Moz has more than 700 pages in the index (anyone from Moz, please comment your business-critical WMT numbers below), and that the real number is probably a lot closer to 226,000 than it is to 700.
One place Google does give us trustworthy indexation information is in the Index Status report in Webmaster Tools. The main howler of an assumption in this post is that the data in Webmaster Tools might be trustworthy. Non-round numbers seem trustworthy to me, and I don’t really see why they’d include them for no reason at all. Their indexation figures regarding sitemaps are accurate. The aim is to compare the numbers this private data throws out, with the publicly available results of Site:URL searches.
With that in mind, let’s compare some Site:URL searches to the relevant Google Webmaster Tools output. To be clear, I’m comparing the figures of indexed pages from WMT > Google Index > Index Status:
with the number provided in these screenshots:
|WMT Indexed||Google Results||Raw Difference||Percentage|
The thing worth seeing here is the tendency for Google to overestimate the number of pages in the search results, even when it reports rather specific numbers. The numbers are off, but they’re not that off. I no longer think they’re that off for ordinary queries either, which surprised me.
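For anyone wanting to repeat the comparison on their own properties, the "Raw Difference" and "Percentage" columns can be sketched in a few lines of Python. This is only a rough illustration: the function name `compare_counts` is made up, the sample figures are placeholders rather than numbers from the dataset, and I am assuming the percentage is taken relative to the WMT figure.

```python
def compare_counts(wmt_indexed: int, google_results: int) -> tuple[int, float]:
    """Return (raw difference, percentage difference) for one site.

    Assumption: the percentage is relative to the WMT "indexed" figure.
    A positive value means the site: search reports more pages than WMT.
    """
    if wmt_indexed <= 0:
        raise ValueError("wmt_indexed must be a positive number of pages")
    raw = google_results - wmt_indexed
    pct = raw * 100 / wmt_indexed
    return raw, pct


if __name__ == "__main__":
    # Placeholder figures for illustration only, not data from this post.
    sample = [
        ("site-a", 1_000, 1_100),
        ("site-b", 40_000, 39_000),
    ]
    for name, wmt, est in sample:
        raw, pct = compare_counts(wmt, est)
        print(f"{name}: raw difference {raw:+d}, percentage {pct:+.1f}%")
```

A positive percentage matches the overestimation pattern described above; a negative one would mean the site: search reported fewer pages than WMT.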
Now, it should be made clear (and in fact, Google do make it clear) that just because something is “indexed” does not mean it will appear in the search results. This might account for some of the difference. I suspect that, in addition to this, a time delay between the numbers reported in Webmaster Tools and the live index plays a part, too.
Site #13 was the only really anomalous result in the dataset. I think I might have an explanation for it, but I would have to be infuriatingly vague to give it: lots of subdomains.
These numbers do change from day to day, by a small margin, especially for the larger domains. But so do the numbers of indexed pages on these domains.
There doesn’t seem to be any difference in the competition data returned using different Google TLDs: I saw the same “about X results” and “Ungefähr X Ergebnisse” for site searches for a German website in Google.com, Google.co.uk, and Google.de. I found this consistency held when using proxies from a number of countries. Trying this again later, I was able to replicate the consistency between ccTLDs, but not the number itself (the new figure was within ~2% of the old figure).
Still, Google doesn’t give us all the entries in their index from site: searches. We knew this already. For scraping, this is why people use “stop words”, which aim to influence Google’s ordering of the results, so you can attack the same problem (getting all the information in Google’s index for a given query) from multiple angles. I haven’t been successful using generic stop words, so a crawl followed by an index check is probably the best approach for checking indexation of a domain you don’t own.
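As a rough sketch of the "crawl, then index check" idea, the snippet below turns a list of crawled URLs into one site: query per URL, plus the matching search URL. It does not fetch anything. The helper names `index_check_query` and `search_url` are hypothetical, and dropping the scheme and query string from each URL is my own simplifying assumption, not a rule from Google.

```python
from urllib.parse import urlencode, urlsplit


def index_check_query(url: str) -> str:
    """Build a site: query for a single crawled URL.

    Assumption: scheme, query string and fragment are dropped,
    leaving host + path only.
    """
    parts = urlsplit(url)
    if not parts.netloc:
        raise ValueError(f"not an absolute URL: {url!r}")
    return f"site:{parts.netloc}{parts.path}"


def search_url(query: str, google_host: str = "www.google.com") -> str:
    """Return the search URL for a query on the given Google host."""
    return f"https://{google_host}/search?{urlencode({'q': query})}"


if __name__ == "__main__":
    # Placeholder URLs standing in for the output of your own crawl.
    crawled = [
        "https://www.example.com/",
        "https://www.example.com/blog/post?utm_source=feed",
    ]
    for url in crawled:
        query = index_check_query(url)
        print(query, "->", search_url(query))
```

Each query then has to be checked by hand or with whatever tool you already use; a URL whose query returns no results is a candidate for "crawled but not indexed".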
If I were to repeat this test, I would use Scrapebox’s Competition Finder. I recommend you use it if you are going to check against a lot of domains:
So, do you now trust the numbers a little more? Can you replicate the results using properties in your own Webmaster Tools? Do you think this means we can trust the numbers a little more for ordinary queries?