- From: David Dailey <david.dailey@sru.edu>
- Date: Wed, 25 Apr 2007 12:05:24 -0400
- To: Maciej Stachowiak <mjs@apple.com>, sean@elementary-group.com
- Cc: Dan Connolly <connolly@w3.org>, public-html@w3.org
At 10:30 PM 4/24/2007, Maciej Stachowiak wrote:

> Alexa is also believed to be a misleading indicator of overall
> traffic in some cases. I have seen posts where people showed
> internal server logs showing their traffic going up, even as Alexa
> reported their traffic going down. I think it is still a useful
> resource but we have to be aware that it might be non-representative
> to a significant degree.

I agree. And furthermore...

Surveying the most popular sites (by visits, links, duration of visits, familiarity, ...) gives one view of HTML as it is practiced by popular sites. It is natural to ask "are popular sites representative of the web as a whole?" There are at least two differences between popular and "other" sites that we might expect:

1. Popular sites are probably less likely to engage in "adventurous" behavior (unless you are one of the companies represented on the W3C HTML WG, of course) -- that is, they are less likely to push the frontiers and edge cases of use. Too much is at stake to be very experimental.

2. They are more likely to be coded well.

Limiting an investigation to those sorts of sites might tend to give a false sense of security about just how robust the standards are, vis-à-vis how many sites might fail. In addition to these popular sites, those working on the survey might also want to consider accumulating a collection of outlier cases as well.

I suspect there are probably two Zipf laws at work (see http://en.wikipedia.org/wiki/Zipf's_law): one concerning high-frequency sites (the top 200 probably represent, what, maybe 3% of overall web traffic? -- just a guess; a back-of-the-envelope check is sketched below my signature), the other concerning the frequency with which certain features are used (over the combined vocabularies of HTML, CSS, JavaScript, and DOM). Using, for example, input type=file in conjunction with image.onload to interrogate ranges of consecutively numbered files on the client's machine (which I think some folks are arguing should be broken) will not break many cases, but it will break a few worthwhile instances; a rough sketch of that technique also appears below.

In addition to the relatively homogeneous population of web sites found in the top 200, some additional methodologies might make sense: random sampling of web sites, quasi-random sampling of "interesting" web sites (such as afforded by StumbleUpon), and collections of fringe cases. If anyone is interested, I've got some naughty fringe cases.

cheers,
David Dailey
http://srufaculty.sru.edu/david.dailey/javascript/JavaScriptTasks.htm
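P.S. To put a rough number on the "maybe 3%" guess above, here is a back-of-the-envelope sketch, assuming traffic follows an idealized Zipf distribution over an assumed 100,000,000 sites. Both the exponent and the site count are made-up illustration values, so read the output as an order of magnitude, not a measurement.

    // Share of total "traffic" captured by the top K of N sites under a
    // pure Zipf law, where the weight of the site at rank r is 1/r^s.
    function zipfShare(K, N, s) {
      var top = 0, total = 0;
      for (var r = 1; r <= N; r++) {
        var w = Math.pow(r, -s);
        total += w;
        if (r <= K) top += w;
      }
      return top / total;
    }

    // With the assumed N = 100,000,000:
    //   zipfShare(200, 100000000, 1.0)  -> roughly 0.31
    //   zipfShare(200, 100000000, 0.8)  -> roughly 0.05
    // So whether the top 200 carry 3% or 30% of traffic depends almost
    // entirely on the exponent and the length of the tail -- which is
    // rather the point about needing to sample beyond them.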
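P.P.S. For anyone who hasn't run into the input type=file / image.onload trick mentioned above, here is a minimal sketch of the general idea. It assumes the circa-2007 behavior in which the file input's value exposes the full local path and a page may point an Image at file: URLs; the ids, the numbering pattern, and the probe range are made up for illustration, and a browser that "breaks" this behavior will simply find nothing.

    <!-- The visitor picks one file, say C:\photos\img007.jpg; the script
         then probes the neighboring numbers (img001.jpg ... img020.jpg)
         and notes which candidates fire onload, i.e. exist on disk. -->
    <input type="file" id="picker">
    <script type="text/javascript">
    document.getElementById('picker').onchange = function () {
      var path = this.value;                       // full path, if exposed
      var m = path.match(/^(.*?)(\d+)(\.\w+)$/);   // prefix, number, extension
      if (!m) return;
      var prefix = m[1], width = m[2].length, suffix = m[3];
      for (var n = 1; n <= 20; n++) {
        var num = String(n);
        while (num.length < width) num = '0' + num;
        var candidate = prefix + num + suffix;
        var probe = new Image();
        probe.onload = (function (name) {
          return function () {
            // A numbered sibling of the chosen file exists on the
            // visitor's machine.
            window.console && console.log('found: ' + name);
          };
        })(candidate);
        probe.src = 'file:///' + candidate.replace(/\\/g, '/');
      }
    };
    </script>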
Received on Wednesday, 25 April 2007 16:05:37 UTC