
Re: Web page stats

From: Karl Dubost <karl@w3.org>
Date: Fri, 29 Sep 2006 14:05:47 +0900
Message-Id: <EE4E5A84-CE10-4E1D-8955-574E81ED7543@w3.org>
Cc: Ian Hickson <ian@hixie.ch>
To: www-qa@w3.org

On 29 Sep 06 at 04:59, Bjoern Hoehrmann wrote:
> Defining what a representative sample is and checking
> whether a certain sample meets the definition is a rather non-trivial
> exercise here. How many Amazon article pages do you include, or how
> do you weight them, how do you filter out automatically generated spam
> blogs, how do you detect, say, Wikipedia mirrors, and so on.

Indeed, there are many parameters in the equation.
That is why I have asked Ian Hickson for more details: I really
think they are as important as the derived statistics published
in the [previous survey][1]. When the sample is not given or
clearly identified, it is very difficult to draw meaningful
conclusions.

For *each page*, we need all of this information. That means a large
set of data, but I think it is necessary. Giving only percentages
would not help in analyzing relationships between tuples or triples
of data.

Web page data:

    - HTTP Date.
      Why: to see whether a given page improves over time,
           and to characterize the population across time: the Web
           of 5 years ago versus the Web of now.
      Issue: some dynamic Web sites do not set this header correctly.
             A partial solution: visit the page a few times over one
             year and verify the date against the MD5 hash of the page.
    - MIME type sent.
    - Is it well-formed (for the XHTML ones)?
    - Is it valid?
    - Access to the page: DNS, connection.
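As a sketch, the per-page collection could look like this in Python (a minimal illustration under my own assumptions; the field names and the use of `urllib` and `xml.etree` are not anything specified in this thread):

```python
import hashlib
import urllib.request
import xml.etree.ElementTree as ET

def is_well_formed(body: bytes) -> bool:
    """Check XML well-formedness (meaningful for the XHTML pages)."""
    try:
        ET.fromstring(body)
        return True
    except ET.ParseError:
        return False

def collect_page_data(url: str) -> dict:
    """Fetch one page and record the per-page metadata listed above."""
    with urllib.request.urlopen(url) as resp:
        body = resp.read()
        headers = resp.headers
    return {
        "url": url,
        "http_date": headers.get("Date"),          # HTTP Date header
        "mime_type": headers.get("Content-Type"),  # MIME type sent
        "md5": hashlib.md5(body).hexdigest(),      # compare across visits
        "well_formed": is_well_formed(body),
    }
```

Re-running such a collector on the same URL a few times over a year and comparing the stored MD5 values against the reported Date would implement the partial solution described above for sites that misreport the header.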

Do people see other types of data? For now, I would like to focus on
the meta level of the document rather than on statistics about
element/attribute demography.

It is also important to specify the tool and the version used for
validation. Validators have bugs too. If we want to be consistent,
we have to be careful about which tool we are using.
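A small illustration of carrying that provenance with every result (the tool name and version here are purely hypothetical placeholders, not values from this thread):

```python
def tag_validation_result(url: str, valid: bool,
                          tool: str, tool_version: str) -> dict:
    """Attach the validator's identity to each result so that runs
    made with different tools or versions are never mixed silently."""
    return {
        "url": url,
        "valid": valid,
        "tool": tool,
        "tool_version": tool_version,
    }

# Hypothetical example values:
result = tag_validation_result("http://www.example.org/", True,
                               tool="W3C Markup Validator",
                               tool_version="0.7.x")
```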

[1]: http://code.google.com/webstats/

> So, as of one week ago, 18% of W3C Members had a homepage that passed
> the W3C Markup Validator, compared to 9% when I started the survey 2
> years ago, and pages with 10 or less errors are up from 28% to 43%.

Very interesting.
Thanks for this, Bjoern. I will not draw quick conclusions, but it
is at least encouraging. It would require a bit more exploration. I
think there is room to develop a regular survey with a clearly
identified sample. I will see what we (W3C + external participants)
can do in this area. I am gathering requirements.

> So I don't know much about what Karl is asking for either, but it  
> seems
> justified to say that for up to date pages, authors pay more attention
> to syntax problems than they did some years ago;

My questions were ill-formed; your mail helped to clarify them.

> and no matter how you
> look at it, less than 20% of pages are "valid" or "well-formed" or  
> "con-
> forming" under some definition of those terms, in which case picking a
> good sample to derive meaningful results becomes rather important.


Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager, QA Activity Lead
   QA Weblog - http://www.w3.org/QA/
      *** Be Strict To Be Cool ***
Received on Friday, 29 September 2006 05:06:09 UTC
