Re: Google Code: Web Authoring Statistics from Karl Dubost on 2006-02-06 (public-evangelist@w3.org from February 2006)

From: Karl Dubost <karl@w3.org>
Date: Mon, 6 Feb 2006 12:44:52 +0900
To: "'public-evangelist@w3.org' w3. org" <public-evangelist@w3.org>
Cc: Ian Hickson <ian@hixie.ch>
Message-Id: <F3813B7F-E34F-4C7C-9F47-6CAB8F1F4DDB@w3.org>
On 06-02-02 at 18:36, Pid wrote:
> While I'm not yet sure that there's a case for HTML5, it does address
> something I've been thinking about for a while.  The use of  
> semantically
> attributed elements denoting functions "header", "footer", "nav",  
> "menu"
> are (already widely accepted to be) in widespread use - the report is
> more evidence of that.

OK, let's look at the work done by Ian Hickson (Google) on this.
First of all, I have to say that this is a study I had been wishing
to see for a very long time, so it's quite cool to have these data
available.

For example, there is the list of the most common elements used in
Web pages:
http://code.google.com/webstats/2005-12/pages.html
It says that Web pages generally use around 19 elements, and then
those 19 elements are listed.

I would have been happy to have more than the list of 19 elements. I
would have liked to see the top 35 elements or, as we do in
astrophysics to measure the relevant data for a bell curve
(Gaussian), the full width at half maximum. It means that you remove
what is always used because it is necessary, and you remove what is
almost never used.

From Ian's graph, I would ask which elements fall between 12 and 27,
if it's too complicated to give at least the list of the 35
elements.
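
A minimal sketch, in Python, of the "width at half maximum" idea applied
to a histogram. The function name and the sample counts are invented for
illustration (chosen so the result lands on 12 and 27); they are not
data from Ian's study.

```python
def half_maximum_range(histogram):
    """Return (low, high): the smallest and largest keys whose count
    is at least half of the peak count."""
    half = max(histogram.values()) / 2
    above = [key for key, count in histogram.items() if count >= half]
    return min(above), max(above)


# Hypothetical histogram: number of distinct elements on a page
# -> number of pages with that many elements (made-up values).
pages_by_element_count = {
    5: 10, 10: 40, 12: 55, 15: 80, 19: 100, 23: 75, 27: 52, 30: 30, 40: 5,
}

low, high = half_maximum_range(pages_by_element_count)
print(low, high)  # 12 27
```

Everything outside that range is either the long tail or the rarely
reached extremes, which is the part one would set aside.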

Another interesting indicator:

* presentational elements
	b
	font
	table

Yes, I have added table because I'm quite sure the high frequency of
table is not due to tabular data, but to the use of tables for layout.

I wish that browser developers had first implemented "display: table;"
in CSS in an interoperable way. Maybe that would have avoided
this. :) Maybe not.


About classes.
http://code.google.com/webstats/2005-12/classes.html

*Most pages* do *not* use class attributes. That's interesting. It's
difficult to analyze this, specifically in terms of legacy documents,
etc. Maybe breaking the stats down by each document's date of last
update would help too. For example, a 7-year-old page with legacy
code doesn't mean exactly the same thing as a page which has been
created now.
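
A minimal sketch of such a breakdown, in Python, assuming each crawled
page comes with its HTTP Last-Modified header and a flag saying whether
it uses any class attribute. The function name and the sample records
are hypothetical, not taken from Ian's study.

```python
from collections import defaultdict
from email.utils import parsedate_to_datetime


def class_usage_by_year(pages):
    """pages: iterable of (last_modified_header, uses_class) pairs.
    Returns {year: fraction of that year's pages using class attributes}."""
    totals = defaultdict(int)
    with_class = defaultdict(int)
    for last_modified, uses_class in pages:
        # Last-Modified uses the RFC 1123 date format, which this parses.
        year = parsedate_to_datetime(last_modified).year
        totals[year] += 1
        if uses_class:
            with_class[year] += 1
    return {year: with_class[year] / totals[year] for year in totals}


# Invented sample records, for illustration only.
sample = [
    ("Tue, 02 Feb 1999 10:00:00 GMT", False),
    ("Wed, 03 Mar 1999 10:00:00 GMT", False),
    ("Mon, 06 Feb 2006 03:45:05 GMT", True),
    ("Sun, 05 Feb 2006 12:00:00 GMT", False),
]
print(class_usage_by_year(sample))  # {1999: 0.0, 2006: 0.5}
```

Last-Modified is only a proxy for the age of the markup, since a server
can report a recent date for a page whose code is old.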

So why no class attributes? A series of *hypotheses* (these are not
assertions):
	* Elements are enough most of the time
	* People don't care about semantics
	* Authoring tools *do not* provide an easy way to edit semantics  
without entering the source.
	* People don't know what class attributes are used for
	* others?

There's a message here for communities like microformats or RDF/A:
the way we edit our data is important, and it seems hard to create
class names or any other specific attributes. Editing tools seem to
be important in this case.

I liked the list of class names very much. For example, "msonormal"
seems to show that people use WYSIWYG tools just to save their
documents and don't know about the source code.
Something is missing here that I would like to see in a future version:

	- the list of class names by language.
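
One way such a per-language list could be tallied, sketched in Python
with the standard html.parser module. It assumes the language is taken
from the lang (or xml:lang) attribute on the html element, which many
pages do not declare; the class and function names and the sample pages
are hypothetical.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser


class ClassByLangParser(HTMLParser):
    """Collect class names from one page and note the lang declared on <html>."""

    def __init__(self):
        super().__init__()
        self.lang = "unknown"
        self.class_names = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "html":
            declared = attributes.get("lang") or attributes.get("xml:lang")
            if declared:
                # Keep only the primary subtag: "fr-CA" -> "fr".
                self.lang = declared.split("-")[0].lower()
        # A class attribute holds space-separated names.
        self.class_names.extend((attributes.get("class") or "").split())


def tally_classes_by_language(pages):
    """Map language -> Counter of class names, counting each name once per page."""
    tallies = defaultdict(Counter)
    for markup in pages:
        parser = ClassByLangParser()
        parser.feed(markup)
        parser.close()
        tallies[parser.lang].update(set(parser.class_names))
    return tallies


# Invented sample pages, for illustration only.
sample_pages = [
    '<html lang="fr"><body><div class="pieddepage">...</div></body></html>',
    '<html lang="en"><body><div class="footer nav">...</div></body></html>',
    '<html lang="en-US"><body><p class="footer">...</p></body></html>',
]
tallies = tally_classes_by_language(sample_pages)
print(tallies["en"].most_common(1))  # [('footer', 2)]
```

Pages without a declared language all end up under "unknown", so a real
survey would need some other way to detect the language.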

Certainly the top 20 will be English names because of the number of
English documents, and also because of the bias introduced by search
engines with linguistic indexing. So I'm very interested to know what
it looks like in other languages. Why? Because it has consequences.
If you start to make an element for something which was a class name,
you will constrain the user to one semantic model, and where people
before may have used "pieddepage" in a class attribute, they will
have to use "footer" in an element. So we have to be careful about that.

Ian gives a list of correspondences between class statistics and
HTML 5; it would be cool to give the XHTML 2 equivalents too.

For example, for "nav", the equivalent in XHTML 2.0 is "nl"
http://www.w3.org/TR/xhtml2/mod-list.html#edef_list_nl
"header" will be "h" in XHTML 2.0, etc.

There is something I'm doubtful about, though. The following class names
	text, content, main, body (mapped to "article")
are not used most of the time for an "article" in WebApps 1.0 or a
"section" in XHTML 2.0
http://www.w3.org/TR/xhtml2/mod-structural.html#edef_structural_section
But they are used when we create a layout. They are somehow more
presentational than semantic. Or let's say it's the main section
where the text will be (outside of the menu, footer, header), and it
can contain more than one article.

I think that, in the case of "article" or "section", this is what
people will do.

(here you can do "section" or "article")
<section class="main">
	<section class="indiv">
	</section>
	<section class="indiv">
	</section>
</section>

So I'm not sure the class name is related to this element at all.

I'm surprised by the presentational element "small" in WebApps 1.0.
Why not keep "font" in this case? Especially when it is said
later on "Beyond the top 20, many of the classes are of a
presentational nature (clear, style2, bold...), and most of the
values that don't fall into that bucket are synonyms for the top 20".
Why small more than the others?

class="title": here too, a better analysis would be interesting. For
example, I use this a lot for the titles of movies, books, etc. I
wonder if title will be used for a microformat at some point.

Ian says: "The rest of the top 20 classes are either presentational  
or otherwise meaningless (msonormal, for example, which is one of the  
classes that Microsoft Office uses in its "HTML" output). "

Well, could we see them and decide why they are meaningless? :)

As for class="link", it happens when you create a menu and you
want to style hover effects, etc. I'm not saying it's good, but I see
a lot of web designers doing it. For example, look at this article by
Molly E. Holzschlag
	http://molly.com/articles/markupandcss/1999-09-class.php
and you will find a lot of examples of links. Classes are used as an
indicator of behaviour.

I agree with what is said here: "These probably deserve a little more
study."


Again, for this page about HTTP headers
http://code.google.com/webstats/2005-12/httpheaders.html
I would like to see the MIME type stats crossed with doctypes too,
and also broken down by type of Web server. Are some web servers
better configured than others? Or easier to configure?

* Page headers
http://code.google.com/webstats/2005-12/pageheaders.html

Ian says: "The most-used attribute on html elements is xmlns, from  
misguided people using XHTML but sending it as text/html. They even  
(just) outnumber the people who specify the lang  attribute!"

Hehe, it does not seem that harmful. I mean, if we look at the
pragmatic side, people are really using xmlns="" and (!!!!)
xml:lang="", which means that namespaces do not seem to be that evil
or difficult when they are inserted by editors. So it seems to show
that editing tools are really important and that they *can do* the
job. The fact that these attributes are served with the wrong MIME
type doesn't disqualify them. It would be like disqualifying all HTML
elements served without a doctype. So I really think that the
comments in this excerpt should be neutral when analyzing the stats.

=> head="profile"

Yes, again, authoring tools. Even if it's simple to add it to the
header, people do not edit source code manually most of the time.
Again, this is a strong message to the microformats and RDF/A
communities. Having to type things explicitly doesn't always work.

XFN is no doubt the most popular: Tantek launched the first
microformat with it, and made the profile attribute popular with it.
Thanks to Tantek for waking up one of the forgotten HTML attributes.
It also shows that a feature of HTML/XHTML being unpopular doesn't
mean it's not useful; often it's just not well known, people lack
use cases for it, and tools do not provide an easy way to add it.


* metadata
http://code.google.com/webstats/2005-12/metadata.html

It's interesting to see that meta keywords and description are here
and that they are used, which seems to indicate that they have been
known for a very long time. If I look at editing tools, or
template-based web site generators, the user often has the
possibility to edit them.


For the kind of common mistakes we see in this kind of data, I
wonder if a module for the Log Validator would be helpful.


* table element

Ian says something interesting about typos. When we find a page with
a mistyped element or attribute, it seems a very good indicator that
part of the page has been written by hand, as opposed to with an
authoring tool. That would also help in compiling the statistics.
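
A rough sketch, in Python, of how such typos could be spotted, under the
assumption that an unknown element name which closely resembles a known
one is a typo. KNOWN_ELEMENTS is a deliberately short, illustrative
subset of HTML 4 element names; the function name and the 0.75 cutoff
are hypothetical choices, not anything from Ian's study.

```python
import difflib

# Short illustrative subset of HTML 4 element names (not the full list).
KNOWN_ELEMENTS = {
    "a", "b", "body", "br", "div", "font", "head", "html",
    "img", "p", "span", "table", "td", "th", "title", "tr",
}


def likely_typos(element_names, known=KNOWN_ELEMENTS, cutoff=0.75):
    """Map each unknown element name to the known name it most resembles.
    Unknown names with no close match are left out (they may be
    proprietary elements rather than typos)."""
    typos = {}
    for name in element_names:
        name = name.lower()
        if name in known:
            continue
        matches = difflib.get_close_matches(name, sorted(known), n=1, cutoff=cutoff)
        if matches:
            typos[name] = matches[0]
    return typos


print(likely_typos(["tabel", "TABLE", "marquee"]))  # {'tabel': 'table'}
```

A page with at least one entry in that result would then be counted as
probably containing hand-written markup.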


* link
http://code.google.com/webstats/2005-12/linkrels.html

It shows again the influence of tools.


* a
http://code.google.com/webstats/2005-12/element-a.html

Ian says: "From the point of view of changes to the specifications,  
these findings are quite important. The rarity of rev and coords  
suggests that those features could be removed from HTML without any  
difficulty. In contrast, the ping attribute, proposed in HTML5,  
didn't appear on the list at all, so it is likely that adding it will  
not cause any problems on existing sites."

So, if I understand, sometimes it's good to add a feature in WebApps
1.0 because it's used everywhere in classes. And sometimes it's good
to add a feature because, even if not used, it will not be harmful.
But there are things in XHTML 2.0 which should not be used/created
because they are not used. I have difficulty following the
logic ;) I think there are interesting things to see in both
specifications.


* Editors

And here is why authoring tools are important. Just do this search:
http://www.google.com/search?q=%22Welcome+to+Adobe%22

Though it seems that AlltheWeb doesn't give so much importance to
the title. Good or bad? In this case, good.
http://www.alltheweb.com/search?q=%22Welcome+to+Adobe%22


This study raises many, many questions, and for that fact alone it's
really cool. :)




-- 
Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager, QA Activity Lead
   QA Weblog - http://www.w3.org/QA/
      *** Be Strict To Be Cool ***
Received on Monday, 6 February 2006 03:45:05 UTC