Re: survey of top web sites from David Dailey on 2007-04-25 (public-html@w3.org from April 2007)

From: David Dailey <david.dailey@sru.edu>
Date: Wed, 25 Apr 2007 12:05:24 -0400
To: Maciej Stachowiak <mjs@apple.com>,sean@elementary-group.com
Cc: Dan Connolly <connolly@w3.org>,public-html@w3.org
Message-Id: <6.2.5.6.1.20070425110307.01e2bf08@sru.edu>

At 10:30 PM 4/24/2007, Maciej Stachowiak wrote:

>Alexa is also believed to be a misleading indicator of overall 
>traffic in some cases. I have seen posts where people showed 
>internal server logs showing their traffic going up, even as Alexa 
>reported their traffic going down. I think it is still a useful 
>resource but we have to be aware that it might be non-representative 
>to a significant degree.

I agree. And furthermore...

Surveying the most popular sites (by visits, links, duration of visits, 
familiarity, ...) gives one view of HTML as it is practiced by popular 
sites. It is natural to ask "are popular sites representative of the 
web as a whole?"

There are at least two differences between popular and "other" sites 
that we might expect:

1. Popular sites are probably less likely to engage in "adventurous" 
   behavior (unless you are one of the companies represented on the 
   W3C HTML WG, of course) -- that is, they are less likely to push 
   the frontiers and edges of use cases. Too much is at stake to be 
   very experimental.
2. They are more likely to be coded well.

Limiting an investigation to those sorts of sites might give a false 
sense of security about just how robust the standards are, that is, 
about how many sites might fail.

In addition to these popular sites, those working on the survey might 
also want to accumulate a collection of outlier cases.

I suspect there are probably two Zipf laws at work (see 
http://en.wikipedia.org/wiki/Zipf's_law): one concerning high-frequency 
sites (the top 200 probably represent, what, maybe 3% of overall web 
traffic? -- just a guess), the other concerning the frequency of use 
of certain features (over the combined vocabularies of HTML, CSS, 
JavaScript, and DOM). For example, using input type=file in 
conjunction with image.onload to interrogate ranges of consecutively 
numbered files on the client's machine is something I think some folks 
are arguing should be broken; breaking it will not break many cases, 
but it will break a few worthwhile instances.
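
To make the first of those Zipf laws concrete, here is a minimal sketch 
of how one might estimate the traffic share of the top k sites. It 
assumes a pure Zipf model in which traffic at rank r is proportional to 
1 / r**s. The exponent s, the total number of sites n, and the function 
names are all assumptions made for illustration, not measured values, 
and the result is quite sensitive to both s and n, so it neither 
confirms nor refutes the guess above.

```python
def harmonic(n: int, s: float = 1.0) -> float:
    """Generalized harmonic number H(n, s) = sum over r = 1..n of 1 / r**s."""
    return sum(1.0 / r ** s for r in range(1, n + 1))


def top_k_share(k: int, n: int, s: float = 1.0) -> float:
    """Fraction of total traffic going to the k highest-ranked of n sites,
    assuming traffic at rank r is proportional to 1 / r**s (pure Zipf)."""
    if n < 1 or not 0 <= k <= n:
        raise ValueError("need n >= 1 and 0 <= k <= n")
    return harmonic(k, s) / harmonic(n, s)


if __name__ == "__main__":
    # Hypothetical web sizes and exponents; none of these come from real data.
    for n in (10_000, 1_000_000):
        for s in (0.8, 1.0):
            print(n, s, round(top_k_share(200, n, s), 3))
```

The same function could be applied to the second Zipf law by ranking 
features instead of sites, given a count of how often each feature 
appears in a crawl.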

In addition to the relatively homogeneous population of web sites 
found in the top 200, some additional methodologies might make 
sense: random sampling of web sites, quasi-random sampling of 
"interesting" web sites (such as afforded by StumbleUpon), and 
collections of fringe cases. If anyone is interested, I've got some 
naughty fringe cases.
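
As a rough sketch of how such a mixed sample might be assembled, the 
snippet below combines three strata: the top-ranked sites, a uniform 
random sample of the remaining sites, and a hand-collected list of 
fringe cases. The function name, its parameters, and the idea of 
starting from a single ranked list of sites are assumptions for 
illustration only; the quasi-random "interesting sites" source is not 
modeled here.

```python
import random
from typing import Dict, List, Sequence


def build_survey_sample(
    ranked_sites: Sequence[str],
    fringe_cases: Sequence[str],
    top_n: int = 200,
    random_n: int = 200,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Return three strata: the top_n highest-ranked sites, a uniform
    random sample of the remaining sites, and the given fringe cases."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    top = list(ranked_sites[:top_n])
    rest = list(ranked_sites[top_n:])
    # Sample without replacement from everything outside the top stratum.
    sampled = rng.sample(rest, min(random_n, len(rest)))
    return {"top": top, "random": sampled, "fringe": list(fringe_cases)}


if __name__ == "__main__":
    # Hypothetical site list, ordered from most to least popular.
    sites = [f"site{i}.example" for i in range(1000)]
    sample = build_survey_sample(sites, ["fringe.example"], top_n=5, random_n=10)
    for stratum, members in sample.items():
        print(stratum, len(members))
```

Keeping the strata separate, rather than merging them into one list, 
lets failure rates be reported per stratum, so the popular sites do not 
mask what happens elsewhere.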

cheers,
David Dailey
http://srufaculty.sru.edu/david.dailey/javascript/JavaScriptTasks.htm

Received on Wednesday, 25 April 2007 16:05:37 UTC