Google Insider: Yes, PageRank Determines Your Indexation Cap

Rand posted about a month ago on Google’s indexation cap.  He wrote,

Google (very likely) has a limit it places on the number of URLs it will keep in its main index and potentially return in the search results for domains.

I have inside information that this is more than “very likely”; it is exactly right, and has been since at least 2006. I got this information by accident; it was forwarded to me as part of a response to a question I posed via friendly channels at Google. The email quoted below is the full, unchanged response from a well-known SEO personality who worked at Google.

The problem that inspired my inquiry was almost exactly what Rand describes: a 40% traffic drop, fine rankings on “head” terms but vanished traffic from thousands of long-tail searches, and the sinking realization that large parts of my site had gone Supplemental. Here’s the answer I got:

On XX/XX/06, XXXXXXXXXXXXX wrote:

The main issue is the dramatic decline of backlinks (which is likely causing the PageRank issue). The duplicate content issue isn’t that there are lots of pages that are exactly the same, but that each page isn’t unique enough (too much boilerplate that is the same from page to page and not enough content that is different). But fixing that is not going to help too much. It’s mostly the backlink problem.

If the issue on backlinks is that other sites are linking to 404 pages and not actual pages, putting in 301 redirects for every incorrect link (which they can get a list of in webmaster tools crawl errors) will help.

I’ll see if I can find out if the # of backlinks actually did drop or if we just changed our algorithms to discount many of them. I’m not sure what, if anything, we’ll be able to tell him about what I find out though.

Supplemental results aren’t results that don’t change much, they are results that don’t have enough PageRank to make it into our main index (we can’t tell him that, of course). [emphasis added]

So to sum up, the key points are (I’ll sketch the model in code right after this list):

  • Google imposes a cap on the number of pages you can have in the Main index (unless you have infinite inbound links, see next point)
  • The cap is determined by the number of backlinks to each page (PageRank)
  • Google’s Main Index includes only pages with “sufficient PageRank”
  • Everything else (pages with “insufficient PageRank”) goes into Google’s Supplemental Index
  • Google has never publicly confirmed this
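
To make that model concrete, here’s a minimal sketch (mine, not Google’s; the site structure, the pagerank() helper, and the 0.05 cutoff are all invented for illustration) of how a PageRank-threshold split between a Main and a Supplemental index could work:

```python
# Toy model only -- not Google's algorithm. Compute PageRank over a small
# internal link graph, then bucket pages into a hypothetical "Main" vs
# "Supplemental" index using an assumed threshold.
from collections import defaultdict

def pagerank(links, damping=0.85, iterations=50):
    """Plain iterative PageRank; `links` maps each page to the pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        incoming = defaultdict(float)
        for page, targets in links.items():
            for target in targets:
                incoming[target] += rank[page] / len(targets)
        rank = {p: (1 - damping) / len(pages) + damping * incoming[p]
                for p in pages}
    return rank

# Hypothetical site: a home page, two category pages, ten thin product pages.
site = {"home": ["cat-a", "cat-b"],
        "cat-a": [f"product-{i}" for i in range(1, 6)],
        "cat-b": [f"product-{i}" for i in range(6, 11)]}
site.update({f"product-{i}": ["home"] for i in range(1, 11)})

THRESHOLD = 0.05  # assumed cutoff; nobody outside Google knows the real rule
ranks = pagerank(site)
print("Main:        ", sorted(p for p, r in ranks.items() if r >= THRESHOLD))
print("Supplemental:", sorted(p for p, r in ranks.items() if r < THRESHOLD))
```

In this toy graph the thin product pages land below the cutoff while the home and category pages stay above it, which is roughly the “not enough PageRank to make it into our main index” behavior the email describes.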

Questions I still have:

  • In looking at “backlinks” with respect to the indexation question, is Google doing a simplistic link count, or is “backlinks” a euphemism for PageRank? (My guess is that it’s the latter: backlinks = PageRank.)
  • Does Google care whether the backlinks/PageRank to a page come from internal or external sources? (My guess is that internal links still influence the “backlinks” to a certain page and thus it’s possible for a site to influence which pages are in the Main index versus Supplemental by “managing” PageRank via their link graph).
  • How might Google have changed this algorithm in the past three years?

So if your link graph influences which pages are in the Main index (and hopefully I’ve said that generally enough to avoid wading into the debate over nofollow’s efficacy), there are some very striking implications which I’m sure others will explore.

For starters, maybe you deprive some pages of PR (links) so you can concentrate the PR on more valuable pages you can actually “lift” out of the Supplemental results (i.e., pages associated with valuable keywords, or pages where the SERP is weak enough to achieve a top-3 ranking). I’ll leave to others the tactical question of whether nofollow tags are an effective way to do this.
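
For what it’s worth, here’s what that tactic looks like in the toy model above (same invented pagerank() helper and 0.05 cutoff, so treat it as a thought experiment rather than a recipe): prune the internal links to pages you don’t care about and let the freed-up link equity flow to the one page you want to lift.

```python
# Continuing the toy model above (same pagerank() helper and 0.05 threshold).
# Prune internal links so cat-b's equity flows to a single "money" page.
pruned = dict(site)
pruned["cat-b"] = ["product-6"]   # keep only the page we want to lift
for i in range(7, 11):
    pruned.pop(f"product-{i}")    # the de-linked pages drop out of the toy graph

before = pagerank(site)["product-6"]
after = pagerank(pruned)["product-6"]
print(f"product-6 PageRank: {before:.3f} -> {after:.3f}")  # crosses the assumed 0.05 cutoff
```

In the toy graph, product-6’s score roughly quadruples and clears the assumed cutoff; whether real-world internal linking moves pages out of Supplemental the same way is exactly the open question above.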

19 Replies to “Google Insider: Yes, PageRank Determines Your Indexation Cap”

  1. Not to simplify the complex, but the bots have to make a hard choice as to when a page goes supplemental. If they didn’t, any ecom site would be tacitly encouraged to just create as many product pages as was technically feasible (flood technique), so that if site A (Widgets-R-Us.com) has 4,329 product pages, then the folks running site B (Widgets-Now.com) will just up their product line in some way so as to have 43,329 product pages, or ten times as many product pages as their competitor. In the absence of any external linking signals this means site B has a mathematically better chance (10 to 1) of appearing in the results compared to site A.

    If I’m a florist and I sell custom arrangements, I am only limited by my own creativity as to how many arrangements and thus product pages I can create.
    It could be tens of thousands. Are such boilerplate product pages of any value to anyone? That’s a tough question. Depends on who you ask, I suppose.

    Fascinating and useful Jeremy!

    Eric Ward – Link Marketing Strategist

  2. That’s an insightful comment Eric. Not to kiss up or anything, but…I’m a big fan of your etiologic content linking strategy. I think this situation begs the question: Did the pages that disappeared into the supplemental index really deserve to be in the SERPs? As you mentioned, Google obviously has to draw the line somewhere, so it’s certainly an argument for creating value in the pages that are most important to your business. If there’s no value, we can hardly blame Google for throwing them in the supplemental index.

    That aside, I definitely think internal linking strategies could be enough to raise a valuable content page out of the supplemental index, especially in the case of a strong domain. But it’s also an argument for creating new content on a regular basis, as Google seems to have a natural time-factor associated with some long-tail searches.

  3. Just to give you a bit more specifics on the SERP in question:

    http://www.google.com/search?q=gypsum+apartments

    This is one of those situations where it’s hard to say what should be in the SERPs because there actually aren’t any apartments in Gypsum. The #1 search result is pretty bad, and I think our page (presently #2) does a better job by showing stuff from the closest nearby cities.

    To get out of Supplemental, we both made the page content more unique (utilizing UGC, stats, and other non-deterministic data) AND interlinked the pages from within the rest of the site. Initially we didn’t see any movement from the content optimization alone. Eventually, I think the links made the difference.

  4. It is not so simple… it never is

    Supplemental is more or less PageRank-based, though it can be spread pretty thin.
    The primary index isn’t just PageRank; there are a lot of trust factors, and it can be boiled down to allocating a domain the amount of traffic it can afford to “pay” for.

  5. This is hardly ground-breaking news. Matt Cutts essentially revealed this information at SMX Advanced in 2007. But as Jill Whalen noted in her comment on Sphinn, the email does not indicate any correlation between the amount of PageRank a “site” has and how many of its pages are indexed in Google.

    There is almost certainly an indirect correlation between PageRank and indexation and that might just be a natural consequence of crawling priorities. Google may not have to algorithmically tie the indexing to PageRank.

    We have yet to see any proof that Google (or any other major search engine) handles domain-level or site-level valuations.

  6. I don’t see how the e-mail or anything else in the article shows that “Google imposes a cap on the number of pages you can have in the Main index”. If backlinks were dropped for specific pages, they moved to Supplemental because of the lack of backlinks, not because the site had too many pages in the Main Index. What am I missing?

  7. First off, nice blog 🙂

    I’m not saying you’re wrong about this cap thing, but I think you have a problem: the email you’re citing may substantiate some rumors, but it doesn’t really prove that an indexation cap exists, and it doesn’t really prove any of your key points.

    Seems to me you could use this email to justify almost any theory you wanted about backlinks, not just the concept of the indexation cap. I could say this email proves Google bases crawl depth decisions on PageRank and I wouldn’t be any more right or wrong than you are.

    Once again, I’m not saying you’re wrong, J, but it seems this email is a long, long, long way off from a smoking gun.

    @eric ward – I know you’re trying to simplify the concept, but don’t you think you might be overstating the incentive for e-comm sites to try and use the flood technique? Seems to me that, in the absence of linking signals, the amount of unique/distinct content and searcher demand would have a much greater bearing on a site’s likelihood of appearing in SERPs than just the number of pages

  8. Ahh, the classic “forgot to delete something from the email before forwarding it” move. We’ve all done that before!

    As Eric mentioned above, the principle seems to make sense and has been suggested before by some fairly reputable sources, but it’s a good reminder for anyone who might have lost some links and seen their long-tail traffic nosedive.

  9. @Michael You’re right in your BSB post when you say that the email doesn’t prove that there’s a theoretical limit to the number of pages you can get into the index. If you have infinite (or enough to be the functional equivalent of infinite) backlinks, you could theoretically have infinite pages (thus, no cap).

    For sites that are spreading their peanut butter thin (passing just enough PageRank to pages so that they have sufficient backlinks to stay in the Main index), a negative “adjustment” in the treatment of backlinks could cause some of their pages to fall into Supplemental, which, as the site owner, feels pretty much the same as having a cap (sorry for reiterating what you pointed out on your blog, but it’s a really key point).

    In my case, when this happened, the issue was that the pages in question were relatively isolated from the rest of the site (they were small market city pages, so they only had links to them internally, from our ‘state’ page). The solution that got the pages back in the index was to increase their interlinking from within the rest of the site (specifically we linked to them from our larger city/market pages which themselves had a lot of deep inlinks), which pulled the submarket pages up enough to get them out of Supplemental (plus since the links were geographically-based, I like to think some semantic value was passed to the pages). Great comments, so much in SEO is a question of the practical consequences.

    @Amy and @Jason Thanks for the comments! I think the email shows that whether a page goes Supplemental is a function of its backlinks (and then I made the leap to interpret that as 1) PageRank, 2) a finite, spreadable commodity, and 3) something that can ultimately get so small it’s insufficient to help the pages that receive it get out of Supplemental). As a practical matter, on really large sites (the one I was dealing with now has >1M pages), you end up with clusters of tens of thousands of pages that are only getting PR from other pages on the site, so the PR available to pass into those clusters becomes a limiter.

    Again, I’m assuming that the Peanut Butter Principle is true. If so, the PR of those other pages (who link to the cluster) becomes the determinant of whether enough juice is passed to those clusters of pages to stay in the Main index. The more PR they have, the more peanut butter there is to spread, so the more pages you can get in the Main index.

    One interesting view on this, and I think Matt might have said this at SMX Advanced ’07, is that new sites aren’t placed in any sort of sandbox, it’s just that they don’t have enough backlinks to get into the Main index.

  10. It’s amazing how blog comment sections are the scientific labs of our time. I wonder how different our world would be if DOCTORS would spend as much time brainstorming with other doctors, breaking down ideas and figuring out their true meanings, what affects them, etc.

    The medical field has a lot to learn from the SEO industry.

  11. Hi,

    For those who know little about SEO from the technical point of view, my advice is to keep writing regularly, preferably original, useful, and relevant content.

  12. @Adolfo this is what I love about this field! So many people interacting and working together in an open and honest way to better understand the environment that we work in.

    @Jeremy great post. The true lesson here for me is the need to continue developing content that is good for my audience; the all-important backlinks will follow (not that I am suggesting you stop looking for backlink opportunities as well).

  13. Honestly, I don’t think this is accurate. I have a site that’s had its PageRank bounce around a lot, from 4 to zero. It has no effect on traffic though: I definitely get traffic coming into zero-PR pages, and I can see they are actually pretty high in the main index.

  14. @Inisheer –

    I agree. I have pages with zero PR that rank, but they are not ranking for competitive keywords. As for your PR bouncing around, you may be getting different ranks through Google’s different indexes. Not all of their indexes agree on PR.
