- From: Karl Dubost <karl@w3.org>
- Date: Mon, 30 Apr 2007 11:43:45 +0900
- To: Dan Connolly <connolly@w3.org>
- Cc: public-html@w3.org
Doing a survey is tricky but very interesting. We need to define the methodology clearly so that we know how to interpret the results. Some previous surveys gave only the compiled results, which makes them difficult to interpret.

On 25 Apr 2007, at 07:22, Dan Connolly wrote:

> "Clarification would be needed on the top200 vs. top200-US sites survey
> suggestions. The latter one would clearly produce skewed results, but
> the former one should also not be more than a tiny source of input, as
> top sites usually don't build HTML pages, they buy them instead."
> -- http://www.w3.org/2002/09/wbs/40318/tel26Apr/results

Some notes I had in mind about it.

# Bias in "top" choices

Say we take a sample of 200 Web sites. If they are chosen according to PageRank, Alexa index, etc., we will get a picture of the state of the Web that is mainly driven by consulting companies and heavyweight content management systems, plus the fact that some big companies have more than one official Web site. So the statistics will mainly show the implementation problems of commercial CMSes.

If we choose the top 200 Web sites of the "blogosphere" (though many of them are not detected), we will get a picture that is English-language and dominated by WordPress or Movable Type, but more likely to be close to Web standards, defeating the purpose of the WHATWG. The sample could also be drawn according to the tools producing the content, etc.

If we take a random sample of the Web, it shows yet another kind of statistics, including legacy content that has not been modified for years and will never be modified again.

So the sample has to be clearly defined, and there could be more than one sample.

# Not only the home page

Let's not limit the survey to the home page of each Web site. The home page is often a "business window": some people might fix the home page and not the rest of the site. So I would at least fetch all links from the home page and go at least one level deep. A Web site is not its home page.
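The one-level crawl just described could be sketched as follows. This is a minimal illustration using only the Python standard library; the helper names (`extract_links`, `crawl_one_level`) are my own, not part of any existing survey tool:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> element seen in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return the absolute URLs of all links found in an HTML page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl_one_level(home_url):
    """Fetch the home page, then every page it links to (depth 1)."""
    pages = {}
    home_html = urlopen(home_url).read().decode("utf-8", "replace")
    pages[home_url] = home_html
    for url in extract_links(home_html, home_url):
        if url not in pages and url.startswith("http"):
            try:
                pages[url] = urlopen(url).read().decode("utf-8", "replace")
            except OSError:
                pass  # unreachable links are simply skipped
    return pages
```

A real crawler would also need to respect robots.txt, throttle its requests, and deduplicate URIs, but even this sketch shows that a one-level crawl is cheap to do.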
That would also give more variety in the markup.

# User-Agent sniffing and content delivery

This one is tricky, and a kind of nightmare. User-Agent sniffing is used a lot, and people using one user agent will not receive the same page as those using another. That has to be tested by having the crawler fake its User-Agent header, so we can see what we receive and whether there are differences.

# Content negotiation

Related to the previous point on User-Agent sniffing. There is content negotiation based on:

- languages (French, English, etc.)
- localization (detecting IP addresses)
- Accept headers

Sometimes you end up with very different types of content.

# The English assumption

Many products are implemented with the English assumption, which means we don't see part of the Web just because we are not looking for it. It happens with content negotiation. It happens also with class names, etc. It would be very difficult for a Japanese speaker to draw conclusions about class names written in romaji or pinyin, and to understand whether they represent a general trend in class-name usage and therefore a need to standardize them.
http://www.globalvoicesonline.org/2007/04/16/japan-number-1-language-of-bloggers-worldwide/

# Results interpretation

When a Web site sends a different representation for one URI because of all the possible combinations above, it becomes difficult to produce statistics, or at least raw statistics. Hans Rosling showed very well at TED how statistics can hide meaningful results and lead to false conclusions.

But with all of that said: yes, a survey would be very useful. If the methodology is well explained, it becomes useful for everyone, and the samples can be adapted to local usage.

I wonder how mature htmlib is now. Anne? Last time I tried to make statistics about HTML elements in documents on some Web sites, I ran into a bug (which has been fixed since).
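To make the two ideas above concrete, here is a small sketch of what such raw element statistics might look like, and how the crawler could fetch the same URI under different User-Agent and Accept-Language headers to detect sniffing. It uses only the Python standard library (a real survey would want a proper HTML5 parser), and the function names are illustrative:

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ElementCounter(HTMLParser):
    """Tallies how often each element name occurs in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def element_stats(html):
    """Return a Counter of element-name frequencies for one page."""
    parser = ElementCounter()
    parser.feed(html)
    return parser.counts


def fetch_as(url, user_agent, accept_language="en"):
    """Fetch one URI while pretending to be a given user agent."""
    req = Request(url, headers={"User-Agent": user_agent,
                                "Accept-Language": accept_language})
    return urlopen(req).read().decode("utf-8", "replace")
```

Comparing `element_stats(fetch_as(uri, ua1))` against `element_stats(fetch_as(uri, ua2))` for the same URI would surface exactly the User-Agent and language differences described above, before any aggregate statistics are computed.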
-- 
Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager, QA Activity Lead
QA Weblog - http://www.w3.org/QA/
*** Be Strict To Be Cool ***
Received on Monday, 30 April 2007 02:44:04 UTC