Re: Web growth

From: Jim Pitkow (pitkow@parc.xerox.com)
Date: Wed, Mar 17 1999


Date: Wed, 17 Mar 1999 13:47:15 PST
To: "'www-wca@w3.org'" <www-wca@w3.org>
From: Jim Pitkow <pitkow@parc.xerox.com>
Message-Id: <99Mar17.134837pst."147517"@mailback.parc.xerox.com>
Subject: Re: Web growth


Intranets are a different beast altogether and since they are not publicly
inspectable or measurable, I would take the view that they are out of our
characterization scope.

The presence of robots.txt is quite low actually, though I don't have an
exact number (but will try to get one).  Either way, the notion of size for
me centers around what is publicly accessible, so robots.txt should not
influence this much.

At 12:47 PM 3/17/99 , Johan Hjelm wrote:
>Two factors may affect this calculation: 
>1: The definition of pages. How do we characterise a page that is generated
>in response to a cookie set at an earlier visit when I come back ("Welcome
>back Johan Hjelm")? Is it one of the 700 million, even if it didn't exist
>before and never will again? How do we account for frames? (well, these are
>my common gripes)
>2: The domain investigated. Do you take intranets into account? In that
>case, we may underestimate it. Does the Alexa robot respect robots.txt and
>robot metatags (I think I remember it does)? In that case, is it reasonable
>to expect that 50 % of all publicly accessible pages are on servers that
>restrict access? It may be high - but not unreasonable.
>
>It seems to me we need to investigate these aspects before we can say
>anything definite. 
>
>Johan
>
>At 12:36 1999-03-17 -0800, Jim Pitkow wrote:
>>
>>Yeah, that's the trouble with fitting three data points.  The latest I heard
>>from Alexa was that they got around 200-300 million pages during their
>>last crawl, so 700 million seems a bit high.
>>
>>At 10:48 AM 3/17/99 , Lavoie,Brian wrote:
>>>Ed and I did some back-of-the-envelope calculations in regard to the growth
>>>numbers Jim posted:
>>>
>>>We fitted three different trendlines (power, linear, and exponential)
>>>through the three data points from Compaq SRC for the number of Web pages.
>>>Interestingly, the R-squared for each was about the same, although the
>>>exponential had the best fit (use 120 as the scalar, 0.0829 as the growth
>>>rate, in terms of months). Using the exponential trend and extrapolating to
>>>Mar. 99 suggests there are about 743 million Web pages currently. Is this
>>>figure plausible? Well, in July 1998, Vinton Cerf estimated there were
>>>about 350 million pages, so given the above extrapolation, in 8 months the
>>>number of Web pages would have doubled, which is pretty close to the
>>>doubling rate Jim estimated. So there may in fact be about three-quarters
>>>of a billion Web pages out there now.
>>>
>>>Brian Lavoie
>>>OCLC
>>> 
>
>************************************************************
>                     Johan HJELM
>       Ericsson Research, User Applications Group 
>         Currently visiting engineer at the W3C
>             The World Wide Web Consortium
>                     hjelm@w3.org
>   http://www.w3.org/People/W3Cpeople.html#Hjelm
>    Fax +1-617-258 5999, Phone +1-617-263-9630
>   MIT/LCS, 545 Tech. Sq. Cambridge MA 02139 USA 
>        opinions are personal, always my own, 
>  and not necessarily those of Ericsson or the W3C. 
>============================================================
>
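
For reference, the exponential extrapolation quoted above (scalar 120,
growth rate 0.0829 per month) can be checked with a few lines of Python.
This is only a sketch under stated assumptions: the thread does not give
the time origin of the fit, so the value of 22 months used below is an
assumption, picked because it reproduces the quoted figure of about 743
million pages. The function and constant names are illustrative and do not
come from the original calculation.

```python
import math

# Parameters quoted in the thread: scalar 120 (taken here to mean millions
# of pages) and a growth rate of 0.0829 per month.
SCALAR_MILLIONS = 120.0
MONTHLY_RATE = 0.0829

# ASSUMPTION: the thread does not state the time origin of the fit.
# t = 22 months is used only because it reproduces the quoted estimate
# of about 743 million pages for March 1999.
MONTHS_TO_MAR_1999 = 22


def pages_millions(months):
    """Exponential trend N(t) = scalar * exp(rate * t), in millions of pages."""
    return SCALAR_MILLIONS * math.exp(MONTHLY_RATE * months)


def doubling_time_months():
    """Months needed for the trend to double: ln(2) / rate."""
    return math.log(2) / MONTHLY_RATE


if __name__ == "__main__":
    print(round(pages_millions(MONTHS_TO_MAR_1999)))
    print(round(doubling_time_months(), 1))
```

Under these assumptions the trend gives roughly 743 million pages, and the
implied doubling time is a little over 8 months, which is consistent with
the "doubled in 8 months" comparison against the July 1998 estimate.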