- From: Karl Dubost <karl@w3.org>
- Date: Mon, 30 Apr 2007 11:43:45 +0900
- To: Dan Connolly <connolly@w3.org>
- Cc: public-html@w3.org
Doing a survey is tricky but very interesting; we need to define the
methodology clearly so that we know how to interpret the results.
Some previous surveys published only the compiled results, which
makes them difficult to interpret.
On 25 Apr 2007, at 07:22, Dan Connolly wrote:
> "Clarification would be needed on the top200 vs. top200-US sites
> survey
> suggestions. The latter one would clearly produce skewed results, but
> the former one should also not be more than a tiny source of input, as
> top sites usually don't build HTML pages, they buy them instead."
> -- http://www.w3.org/2002/09/wbs/40318/tel26Apr/results
Some notes I had in mind about it.
# bias in "top" choices
Let's say we take a sample of 200 Web sites. If they are chosen
according to PageRank, the Alexa index, etc., we will get a picture
of the state of the Web that is mainly driven by consulting companies
and heavy content management systems. Add to that the fact that some
big companies have more than one official Web site. So the statistics
will mainly show the implementation problems of commercial CMSes.
If we choose the top 200 Web sites of the "blogosphere" (though many
of them are not detected), we will get a picture that is mostly
English, dominated by WordPress or Movable Type, but more likely to
be close to Web standards, defeating the purpose of the WHATWG.
The sample could also be drawn according to the tools producing the
content, and so on.
If we take a random sample of the Web, that also yields another kind
of statistics, including legacy content which has not been modified
for years and will not be modified anymore.
So the sample has to be clearly defined, or there could be more than
one sample.
# not only Home page
Let's not limit the survey to the home page of each Web site. The
home page is often a "business window": some people might fix the
home page and not the rest of the site. So I would at least fetch all
links from the home page and go at least one level deep; a Web site
is not its home page. That would also give more variety in the
markup.
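Going one level deep from the home page could be sketched roughly
like this (a minimal sketch; the class and function names, and the
use of Python's standard library, are my own assumptions, not part of
any existing survey tool):

```python
# Minimal sketch: collect the home page plus every same-site page it
# links to, so the survey is not limited to the "business window".
# All names here are illustrative, not from an existing survey tool.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> elements."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def one_level_deep(home_url, home_html):
    """Return the set of same-host URLs linked from the home page."""
    parser = LinkCollector()
    parser.feed(home_html)
    host = urlparse(home_url).netloc
    pages = {home_url}
    for href in parser.links:
        absolute = urljoin(home_url, href)
        # Keep only pages on the same host: one level deep, same site.
        if urlparse(absolute).netloc == host:
            pages.add(absolute)
    return pages
```

Feeding it a home page with internal and external links keeps only
the home page plus its same-site neighbours, which the crawler would
then fetch in a second pass.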
# User-Agent sniffing and content delivery
This one is tricky, a kind of nightmare. User-Agent sniffing is used
a lot, and people using one user agent will not receive the same page
as those using another. That has to be tested by making the crawler
fake its User-Agent string, so we can see what we receive and whether
there are differences.
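The User-Agent test above could look like this (a sketch with
urllib from Python's standard library; the fake UA strings and
function names are my own illustrative assumptions):

```python
# Sketch: fetch one URL while faking several User-Agent strings, then
# report whether the served pages differ. The UA strings below are
# illustrative assumptions, not a recommended crawler identity.
import hashlib
import urllib.request

FAKE_AGENTS = {
    "ie": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "firefox": "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/2.0",
    "bot": "SurveyCrawler/0.1 (hypothetical)",
}

def fetch_as(url, user_agent):
    """Fetch url with a faked User-Agent header (needs network)."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def sniffing_detected(bodies):
    """Given {label: page bytes}, True if any two served pages differ."""
    digests = {hashlib.sha1(body).digest() for body in bodies.values()}
    return len(digests) > 1
```

In practice the comparison would have to be fuzzier than a strict
byte hash (timestamps and session ids change between fetches), but
the principle is the same: same URI, several faked agents, diff the
results.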
# Content Negotiation
Related to the previous point on User-Agent sniffing: there is
content negotiation based on
- language (French, English, etc.)
- localization (detecting IP addresses)
- Accept headers
Sometimes you end up with very different types of content.
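A crawler could exercise the header-based part of this (language and
Accept negotiation; IP-based localization cannot be faked from
headers) with something like the sketch below. The function name and
header values are my own assumptions:

```python
# Sketch: one set of request headers per negotiated variant, so the
# crawler can fetch each URI several times and compare what it gets.
# Header values here are illustrative assumptions.
import urllib.request

def negotiation_variants(languages=("fr", "en", "ja")):
    """Return {language: headers} for each variant to request."""
    return {
        lang: {"Accept-Language": lang, "Accept": "text/html"}
        for lang in languages
    }

def fetch_variant(url, headers):
    """Fetch url with the given negotiation headers (needs network)."""
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Each URI would then be fetched once per variant, and the bodies
compared the same way as for User-Agent sniffing.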
# The English assumption
Many products are implemented with the English assumption, which
means we don't see part of the Web simply because we are not looking
for it. It happens with content negotiation, and it also happens with
class names, etc. It would be very difficult to draw conclusions
about the class names a Japanese speaker writes in romaji (or a
Chinese speaker in pinyin), and to understand whether they represent
a general trend in the use of class names and therefore a need to
standardize them.
http://www.globalvoicesonline.org/2007/04/16/japan-number-1-language-of-bloggers-worldwide/
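To avoid the English assumption when counting class names, the survey
could at least tally non-ASCII tokens separately instead of silently
dropping them. A sketch (class and function names are my own
illustrative assumptions):

```python
# Sketch: tally class attribute tokens across documents and flag the
# non-ASCII ones, so non-English naming habits still show up in the
# statistics. Names here are illustrative assumptions.
from collections import Counter
from html.parser import HTMLParser

class ClassTally(HTMLParser):
    """Count every token appearing in a class="" attribute."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.counts.update(value.split())

def tally_classes(documents):
    """Count class tokens over many documents; report non-ASCII ones."""
    tally = ClassTally()
    for html in documents:
        tally.feed(html)
    non_ascii = {token for token in tally.counts if not token.isascii()}
    return tally.counts, non_ascii
```

The non-ASCII bucket is exactly the part an English-only analysis
would miss, and it is the part that matters for deciding whether a
naming trend is really worldwide.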
# Results interpretation
When a Web site sends a different representation for one URI because
of all the possible combinations above, it becomes difficult to
create statistics, or at least raw statistics. Hans Rosling showed
very well in his TED talk how statistics can hide meaningful results
and lead to false conclusions.
But with all of that said: yes, it would be very useful to have a
survey. If the methodology is well explained, then it becomes useful
for everyone, and the samples can be adapted to local usage.
I wonder how mature html5lib is now? Anne?
Last time I tried to compile statistics about HTML elements in
documents on some Web sites, I ran into a bug (which has been fixed
since).
--
Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager, QA Activity Lead
QA Weblog - http://www.w3.org/QA/
*** Be Strict To Be Cool ***
Received on Monday, 30 April 2007 02:44:04 UTC