Re: survey of top web sites from Karl Dubost on 2007-05-28 (www-archive@w3.org from May 2007)

From: Karl Dubost <karl@w3.org>
Date: Mon, 28 May 2007 09:38:31 +0900
To: David Dailey <david.dailey@sru.edu>
Cc: connolly@w3.org, www-archive@w3.org, st@isoc.nl, zdenko@ardi.si, sean@elementary-group.com
Message-Id: <7735994D-BE05-4061-B9E0-1CF4CE8FFCF3@w3.org>
Hi David,

I am going back through old emails and trying to move things forward.

Le 1 mai 2007 à 00:41, David Dailey a écrit :
> As I mentioned
> (http://lists.w3.org/Archives/Public/public-html/2007Apr/1544.html),
> Sander and
[…]

> My idea was to form a stratified sample of web pages at each of  
> several points of the spectrum of web pages: a) top 200, b) Alexis  
> 500, c) random, and d) "weird" or fringe cases that would be  
> assembled by hand. And then to cross that with a variable  
> representing instances of either standards or browsers

It seems a good start.
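
As a rough illustration of the stratified sample described above, here
is a minimal Python sketch. The stratum names, the placeholder URLs, the
function name stratified_sample and the fixed seed are all assumptions
made for illustration; the real lists would have to come from the
sources named in the quoted text.

```python
import random


def stratified_sample(strata, per_stratum=50, seed=2007):
    """Draw up to `per_stratum` URLs at random from each stratum.

    `strata` maps a stratum name to a list of candidate URLs.
    The seed is fixed so that the same draw can be reproduced later.
    """
    rng = random.Random(seed)
    sample = {}
    for name, urls in strata.items():
        # Never ask for more items than the stratum contains.
        k = min(per_stratum, len(urls))
        sample[name] = rng.sample(list(urls), k)
    return sample


# Placeholder candidate lists (hypothetical URLs, not real survey data).
strata = {
    "top200": [f"http://top.example/{i}" for i in range(200)],
    "ranked500": [f"http://ranked.example/{i}" for i in range(500)],
    "random": [f"http://random.example/{i}" for i in range(120)],
    "weird": [f"http://weird.example/{i}" for i in range(50)],
}

sample = stratified_sample(strata)
for name, urls in sample.items():
    print(name, len(urls))
```

Fixing the seed is a design choice: it lets someone else regenerate the
same list of pages when checking the results.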

> Your approach (to what may ultimately be a different problem)  
> considers a number of things I didn't. Though the browser sniffing  
> stuff you mention is something I was thinking about. I don't know  
> if one can robotically parse a document so that it looks like it  
> would in Opera, FF, Safari, IE, etc. or not.

It is possible to fake the User-Agent string, so we could find out
whether a Web site sends different versions depending on the browser.
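
A minimal sketch of that idea in Python, using only the standard
library. The User-Agent values below are made-up placeholders, not real
browser strings, and the helper names (build_request, fetch,
bodies_by_user_agent, varies_by_user_agent) are hypothetical. This is
one possible approach, not the survey's actual bot.

```python
import hashlib
import urllib.request

# Placeholder User-Agent strings (assumptions); a real run would use the
# strings actually sent by the browsers under study.
USER_AGENTS = {
    "browser_a": "ExampleBrowserA/1.0",
    "browser_b": "ExampleBrowserB/1.0",
}


def build_request(url, user_agent):
    """Build a GET request that announces the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=10):
    """Fetch the raw response body for `url` with a faked User-Agent."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def bodies_by_user_agent(url, user_agents=USER_AGENTS, fetcher=fetch):
    """Fetch `url` once per User-Agent and return {name: body}."""
    return {name: fetcher(url, ua) for name, ua in user_agents.items()}


def varies_by_user_agent(bodies):
    """Return True if at least two of the fetched bodies differ."""
    digests = {hashlib.sha256(body).hexdigest() for body in bodies.values()}
    return len(digests) > 1


# Usage (needs network access, so it is left commented out):
# bodies = bodies_by_user_agent("http://www.example.org/")
# print(varies_by_user_agent(bodies))
```

A byte-level difference is only a hint. Pages with rotating content
(ads, timestamps) differ on every request, so a site flagged this way
still needs a second fetch with the same User-Agent, or a manual look,
before concluding that it sniffs browsers.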

> I was rather naively assuming a fleet of grad students would fill  
> out that part of the experimental design by hand.

The fleet is the most difficult thing to find these days, as we have
noticed over the last few days. There is always someone to comment or
disagree (which can be constructive too), but only a few people help
with the actual work of moving forward.

> The other thing that is relevant to the discussion I think is the  
> issue of the many different kinds of web content (sorta like you  
> mention) -- blogs, news feeds, ordinary web pages, wikis, HTML  
> fragments, print, email, etc. That could get complicated fast it  
> seems.

Yes, it gets complicated right from the start.
The problem in statistics is always how to create a good sample.
Polling organizations have exactly the same problem. If we do not know
exactly what is in the sample, or if the sample is biased, we will
introduce a bias into our data.
But let's not make things complicated at the start. I suggest we follow
your initial path and then analyze what is wrong with the sample (what
kind of bias does it have?), rather than trying to create a good sample
upfront.

> Also germane to the discussion may be some of the stuff that I  
> think the folks interested in usability studies might be concerned  
> with. See for example
> http://lists.w3.org/Archives/Public/public-html/2007Apr/0962.html,
> in which the classes of pages are further
> classified into types by author types (e.g. search engines v  
> corporate etc.)
>
> It may make some sort of sense to convene a conversation unioning  
> both the survey and the usability folks, since some of the  
> methodological concerns may in fact overlap. Just an idea --  
> thinking out loud.

Yes, for further development.

So we need:

* the list of Web sites in the sample
* the source code of the bots that scrape the content of the Web sites
* documentation of the analysis methods
* tarball files with the results



>
> David
> --------<quote>---------------------
> The other two folks I mentioned [zdenko and sean, cc-ed above] are  
> involved in the business of sampling the 200 sites, so it might be  
> best to get them involved as well. I didn't sign up for this  
> particular task since standards effectiveness is a more tangential  
> concern of mine. (though I am really glad someone is looking at it.)
>
> I would tend to think the methodology oughta look something like this
>
>        method of evaluation
>       standards      browsers
>       S1 S2 S3      B1 B2 B3 B4
> p p1
> a p2
> g p3
> e p4
> s p5
>
> where both standards and browsers are used as repeated measures for  
> pages.
> Pages are randomly chosen within categories C={Top200/50,
> Alexis500/50, random50, weird50}
>
> One samples 50 of each category and then one has a classical mixed  
> model analysis of variance with repeated measures and only one  
> random effects variable. Dependent variable can be either discrete  
> (+ or -) or continuous. Doesn't much matter last time I studied  
> statistics. Then we have a somewhat striated sample that can be  
> compared across sampling strategies.
>
> But the idea is to sample as divergent a group of pages as possible.
>
> To get the random 50 -- I'm not sure what the best methodology is  
> -- I suggested StumbleUpon (but it has its own idiosyncrasies) -- I
> remember some search engines have a "find a random page" feature so  
> one might be able to track down how they do that. Someone on our  
> group must know.
>
> To get a weird 50 -- I have a couple of eclectic collections:
> http://srufaculty.sru.edu/david.dailey/javascript/various_cool_links.htm
> is one;
> http://srufaculty.sru.edu/david.dailey/javascript/JavaScriptTasks.htm
> is another.
>
> Both are peculiar in the sense that they attempt to probe the  
> boundaries of what is possible with web technologies -- some are  
> heavily Flash some are heavily JavaScript -- many don't work across  
> browsers and in many cases I don't know why. Too busy to track it  
> all down. (some of my pages are several years old and used to work  
> better than they do now). My emphasis has been far less on  
> standards than on what works across browsers -- the standards and  
> browsers generally seem to have so little to do with one another.
>
> A proper methodology for weird sites: have a group of volunteers  
> explain what they are looking for (a collection of fringe cases)  
> and let others contribute to a list. I don't know. A simpler  
> methodology: have a group of volunteers just sit and come up with a  
> list of sites believed to push the frontier.
> ------------</quote>--------------------

-- 
Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager, QA Activity Lead
   QA Weblog - http://www.w3.org/QA/
      *** Be Strict To Be Cool ***
Received on Monday, 28 May 2007 00:38:37 UTC