Rand posted about a month ago on Google’s indexation cap. He wrote:
Google (very likely) has a limit it places on the number of URLs it will keep in its main index and potentially return in the search results for domains.
I have inside information that this is more than “very likely”: it is exactly right, and has been since at least 2006. I got this information by accident; it was forwarded to me as part of a response to a question I posed via friendly channels at Google. The email quoted below is the full, unchanged response from a well-known SEO personality who worked at Google.
The problem that inspired my inquiry was almost exactly what Rand describes: a 40% traffic drop, rankings on “head” terms that held steady, traffic from thousands of long-tail searches gone, and the sinking realization that large parts of my site had gone Supplemental. Here’s the answer I got:
On XX/XX/06, XXXXXXXXXXXXX wrote:
The main issue is the dramatic decline of backlinks (which is likely causing the PageRank issue). The duplicate content issue isn’t that there are lots of pages that are exactly the same, but that each page isn’t unique enough (too much boilerplate that is the same from page to page and not enough content that is different). But fixing that is not going to help too much. It’s mostly the backlink problem.
If the issue on backlinks is that other sites are linking to 404 pages and not actual pages, putting in 301 redirects for every incorrect link (which they can get a list of in webmaster tools crawl errors) will help.
I’ll see if I can find out if the # of backlinks actually did drop or if we just changed our algorithms to discount many of them. I’m not sure what, if anything, we’ll be able to tell him about what I find out though.
Supplemental results aren’t results that don’t change much, they are results that don’t have enough PageRank to make it into our main index (we can’t tell him that, of course). [emphasis added]
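As an aside on the redirect advice in the email: the step of pairing each broken inbound link with a real page can be roughed out in a few lines. This is only a sketch under my own assumptions. The paths are made up, the fuzzy matching via Python's `difflib` is my choice rather than anything the email suggests, and the `cutoff` value is arbitrary. The list of 404s would come from the crawl errors report; the output is a review list, not server configuration.

```python
from difflib import get_close_matches


def suggest_redirects(broken_paths, live_paths, cutoff=0.8):
    """Map each 404 path to its closest live path, or None if nothing is close."""
    suggestions = {}
    for broken in broken_paths:
        # Take the single best match at or above the similarity cutoff.
        matches = get_close_matches(broken, live_paths, n=1, cutoff=cutoff)
        suggestions[broken] = matches[0] if matches else None
    return suggestions


# Hypothetical data: paths that other sites link to but that return 404,
# and the pages that actually exist on the site.
broken_paths = ["/widgets/blue-widgit", "/about-us.htm", "/zzz"]
live_paths = ["/widgets/blue-widget", "/about-us.html", "/contact"]

redirects = suggest_redirects(broken_paths, live_paths)
for old, new in redirects.items():
    if new is None:
        print(f"no close match for {old}; review by hand")
    else:
        print(f"301 {old} -> {new}")
```

Anything without a close match is left for manual review, since a wrong redirect is arguably worse than none.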
So to sum up, the key points are:
- Google imposes a cap on the number of pages you can have in the Main index (unless you have infinite inbound links, see next point)
- The cap is determined by the number of backlinks to each page (PageRank)
- Google’s Main Index includes only pages with “sufficient PageRank”
- Everything else (pages with “insufficient PageRank”) goes into Google’s Supplemental Index
- Google has never publicly confirmed this
Questions I still have:
- In looking at “backlinks,” with respect to the indexation question, is Google doing a simplistic link count or is “backlinks” a euphemism for PageRank? (My guess is that it’s the latter, backlinks = PageRank).
- Does Google care whether the backlinks/PageRank to a page come from internal or external sources? (My guess is that internal links still influence the “backlinks” to a certain page and thus it’s possible for a site to influence which pages are in the Main index versus Supplemental by “managing” PageRank via their link graph).
- How might Google have changed this algorithm in the past 3 years?
So if your link graph influences which pages are in the Main index (and hopefully I’ve said that generally enough to avoid wading into the debate over nofollow’s efficacy), there are some very striking implications which I’m sure others will explore.
For starters, maybe you deprive some pages of PR (links) so you can concentrate the PR on more valuable pages you can actually “lift” from the Supplemental results (i.e. pages associated with valuable keywords or where the SERP is weak enough to achieve a top 3 ranking). I’ll leave the tactical question regarding whether nofollow tags are effective ways to do this to others.
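To make the idea concrete, here is a toy sketch, not a model of what Google actually does. It runs a textbook power-iteration PageRank over a four-page site and treats any page at or above a cutoff as being in the “Main index.” Everything in it is an assumption for illustration: the page names, the damping factor of 0.85, the cutoff of 0.2, and the premise that internal links count at all. It also models removing links outright, which sidesteps the nofollow question rather than answering it.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict of page -> list of linked pages."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


def main_index(links, threshold):
    """Pages whose toy PageRank meets the hypothetical cutoff."""
    return {p for p, r in pagerank(links).items() if r >= threshold}


THRESHOLD = 0.2  # invented cutoff; the real one, if any, is unknown

# Before: the home page links to everything, and everything links back.
before = {
    "home": ["money", "terms", "privacy"],
    "money": ["home"],
    "terms": ["home"],
    "privacy": ["home"],
}

# After: the home page links only to the valuable page.
after = {
    "home": ["money"],
    "money": ["home"],
    "terms": ["home"],
    "privacy": ["home"],
}

print("before:", sorted(main_index(before, THRESHOLD)))
print("after: ", sorted(main_index(after, THRESHOLD)))
```

In this toy graph the “money” page sits below the cutoff while the home page splits its links three ways, and clears it once the home page links to it alone. Whether a real site behaves this way depends on all the open questions above.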