Re: survey of top web sites

Doing a survey is tricky but very interesting. We need to clearly  
define the methodology so that we know how to interpret the results.  
Some previous surveys published only the compiled results, which  
makes them difficult to interpret.

On 25 Apr 2007, at 07:22, Dan Connolly wrote:
> "Clarification would be needed on the top200 vs. top200-US sites  
> survey
> suggestions. The latter one would clearly produce skewed results, but
> the former one should also not be more than a tiny source of input, as
> top sites usually don't build HTML pages, they buy them instead."
>  --

Some notes I had in mind about it.

# bias in "top" choices

Let's say we take a sample of 200 Web sites. If they are chosen  
according to PageRank, the Alexa index, etc., we will get a picture  
of the state of the Web that is mainly driven by consulting companies  
and heavy content management systems. Add to that the fact that some  
big companies have more than one official Web site. So the statistics  
will mainly show the implementation problems of commercial CMSes.

If we choose the top 200 Web sites of the "blogosphere" (though many  
of them are not detected), we will get a picture that is mostly  
English and dominated by WordPress or Movable Type, but more likely  
to be close to Web Standards, which defeats the purpose of the WHATWG.

The sample could also be organized by the tools producing the  
content, and so on.

If we take a random sample of the Web, it will show yet another kind  
of statistics, including legacy content that has not been modified  
for years and will never be modified again.

So the sample has to be clearly defined, or there could be more than  
one sample.

# not only Home page

Let's not limit the survey to the home page of the Web site. The home  
page is often a "business window": some people might fix the home  
page and not the rest of the site. So I would at least fetch all the  
links from the home page and crawl at least one level deep; a Web  
site is not its home page. That would also give more variety in the  
sample.
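To illustrate the first step of such a crawl, here is a minimal sketch in Python that collects the links of a home page so they can be fetched in turn (the `links_one_level` name and the example markup are mine, not part of any existing tool):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects href targets of <a> elements, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(urljoin(self.base_url, value))

def links_one_level(home_html, home_url):
    """Return the URLs linked from the home page: the first level of the crawl."""
    collector = LinkCollector(home_url)
    collector.feed(home_html)
    return sorted(collector.links)

# Hypothetical home-page markup, for illustration only.
page = '<a href="/about">About</a> <a href="news/today.html">News</a>'
print(links_one_level(page, "http://example.org/"))
# ['http://example.org/about', 'http://example.org/news/today.html']
```

Each collected URL would then be fetched and analyzed the same way as the home page, giving the one-level-deep sample.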

# User-Agent sniffing and content delivery

This one is tricky and a kind of nightmare. User-Agent sniffing is  
used a lot, and people using one user agent will not receive the same  
page as those using another. That has to be tested by making the  
crawler fake its User-Agent header and comparing what we receive.
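A rough sketch of that test in Python: fetch the same URL while claiming to be different user agents, and flag the site if the bodies differ. The User-Agent strings and function names here are my own illustration, not an existing crawler:

```python
import hashlib
import urllib.request

# Hypothetical User-Agent strings the crawler could impersonate.
USER_AGENTS = [
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "W3C-Survey-Crawler/0.1",
]

def fetch_as(url, user_agent):
    """Fetch url while pretending to be the given user agent; return the body bytes."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read()

def sniffing_detected(bodies):
    """True if the response bodies differ between user agents (sniffing likely)."""
    digests = {hashlib.sha1(body).digest() for body in bodies}
    return len(digests) > 1

# Usage sketch (needs network access):
#   bodies = [fetch_as("http://example.org/", ua) for ua in USER_AGENTS]
#   print(sniffing_detected(bodies))
```

One caveat: bodies can also differ because of timestamps, session IDs or rotating ads, so a real survey would have to normalize those before comparing.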

# Content Negotiation

Related to the previous point on User-Agent sniffing. There is also  
content negotiation based on
	languages (French, English, etc.)
	localization (detecting IP addresses)
Sometimes you end up with very different types of content.
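To make the language part concrete, here is a simplified sketch of how a server might pick a variant from the Accept-Language header. It is a rough approximation of the HTTP q-value negotiation described in RFC 2616, not the implementation of any particular server:

```python
def pick_language(accept_language, available):
    """Choose the best available language variant for an Accept-Language header.

    Simplified q-value parsing: real servers follow RFC 2616, section 14.4.
    """
    preferences = []
    for part in accept_language.split(","):
        piece = part.strip().split(";")
        tag = piece[0].strip().lower()
        quality = 1.0  # absent q parameter means q=1
        for param in piece[1:]:
            key, _, value = param.partition("=")
            if key.strip() == "q":
                quality = float(value)
        preferences.append((quality, tag))
    # Try the client's preferences from highest to lowest quality.
    for quality, tag in sorted(preferences, reverse=True):
        if quality > 0 and tag in available:
            return tag
    return None

print(pick_language("fr;q=0.9, en;q=0.8", ["en", "fr"]))  # fr
```

The consequence for the survey is that the crawler's own Accept-Language header silently decides which of these variants it will ever see.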

# The English assumption

Many products are implemented with the English assumption, which  
means we don't see part of the Web just because we are not looking  
for it. It happens with content negotiation, and it happens also with  
class names, etc. It would be very difficult to draw conclusions  
about a Japanese speaker's class names written in romaji, or in  
pinyin, to understand whether they represent a general trend of class  
naming and therefore a need to standardize them.

# Results interpretation

When a Web site sends different representations for one URI because  
of all the possible combinations above, it becomes difficult to  
create statistics, or at least raw statistics. Hans Rosling showed  
very well in his TED Talk how statistics can hide meaningful results  
and lead to false conclusions.

But all that said, yes, it would be very useful to have a survey.
If the methodology is well explained, it becomes useful for everyone,  
and the samples can be adapted to local usage.

I wonder how mature htmlib is now. Anne?
Last time I tried to make stats about the HTML elements in documents  
on some Web sites, I ran into a bug (which has since been fixed).

Karl Dubost -
W3C Conformance Manager, QA Activity Lead
   QA Weblog -
      *** Be Strict To Be Cool ***

Received on Monday, 30 April 2007 02:44:04 UTC