Re: Comments on HTML WG face to face meetings in France Oct 08 from noah_mendelsohn@us.ibm.com on 2008-11-13 (public-html@w3.org from November 2008)

From: <noah_mendelsohn@us.ibm.com>
Date: Thu, 13 Nov 2008 11:18:53 -0500
To: Dean Edridge <dean@dean.org.nz>
Cc: public-html <public-html@w3.org>, www-tag@w3.org
Message-ID: <OFFE3EAA94.2389B871-ON85257500.0056C56F-85257500.00599E97@lotus.com>
(with some trepidation I'll follow your lead in cross-posting to 
public-html and www-tag;  fyi, I do not regularly read the former, though I 
do look at archives occasionally)

Dean Edridge writes:

> Sorry, but I don't get this "clean content" thing.

I don't want to start a long flame war here, as I felt I had a good chance 
to express my feelings at the F2F, but I'll be glad to clarify what I 
intended (speaking for myself, not the TAG).  Let's start with some things 
that I think we all agree on.  In particular, HTML5 as drafted provides that 
browsers will accept quite a range of input as text/html.  For example, 
all of the following will be parsed into DOMs, and presented to users if 
retrieved as text/html:

a) <!-- clearly OK -->
   <html>
   <body>
   <div>
   <p>Para</p>
   </div>
   </body>
   </html> 

b) <html>
   <body>
   <div>
   <p>Para</div>   <!-- note bad nesting of tags -->
   </p>  <!-- note bad nesting of tags -->
   </body>
   </html>

c) <html>
   <body>
   <!-- quoted attr -->
   <img src="http://example.com/img.jpg">
   </body>
   </html>

d) <html>
   <body>
   <!-- unquoted attr -->
   <img src=http://example.com/img.jpg>
   </body>
   </html>

e)  XXXXXX (Isn't obviously HTML at all,
            but browser will presumably
            build a DOM and render XXXXXX)

The best examples I have of 'unclean' are (b), in which the close tags are 
in the wrong order, and (e), which has no tags at all.  As far as I know, 
an HTML browser will accept both of these, build a DOM for them, allow 
scripting of that DOM, and render output on the screen per the HTML 5 
Recommendation. 

Perhaps all of those are therefore what we mean by legal or clean HTML 5, 
but I don't think so.  (a) seems to me to be legal HTML in a sense that 
(b), for example, is not.  If I wrote an HTML editor and it put out 
content in the form of (b), I hope you'd tell me my editor was buggy, and 
that the tags should be properly nested.
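
To make the distinction between (a) and (b) concrete, here is a minimal 
sketch of a nesting check, using the tokenizer in Python's standard 
html.parser module.  The names NestingChecker, is_properly_nested and 
VOID_ELEMENTS are hypothetical and chosen only for illustration.  It 
assumes every non-void element has an explicit close tag, which is 
stricter than the HTML 5 draft (end tags such as </p> may be omitted), so 
it is not a conformance checker.

```python
from html.parser import HTMLParser

# Assumption for this sketch: a short, incomplete list of elements
# that never take a close tag.
VOID_ELEMENTS = {"img", "br", "hr", "meta", "link", "input"}


class NestingChecker(HTMLParser):
    """Tracks open elements on a stack and records mismatched close tags."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open elements, innermost last
        self.errors = []   # close tags that did not match the innermost element

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <img/> opens nothing, so ignore it.
        pass

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.errors.append(tag)


def is_properly_nested(text):
    """Return True if every close tag matches the innermost open element."""
    checker = NestingChecker()
    checker.feed(text)
    checker.close()
    return not checker.errors and not checker.stack


example_a = "<html><body><div><p>Para</p></div></body></html>"
example_b = "<html><body><div><p>Para</div></p></body></html>"

print(is_properly_nested(example_a))
print(is_properly_nested(example_b))
```

Under these assumptions the sketch accepts (a) and rejects (b).  It says 
nothing useful about (e): with no tags at all there is nothing to 
mis-nest, so a real language definition would need more than a nesting 
rule.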

So, that being the case, when there's a language as important as HTML 5, I 
think it's a good thing for there to be a high quality specification that 
gives very clear answers to questions such as:

* What documents are part of the language (or legal in the language if you 
prefer) and which ones are not?
* What is the correct interpretation of the legal documents?

In short, this would be just a language specification, as distinct from 
the existing HTML 5 draft, which focusses on consuming and rendering HTML 
5 as well as consuming and rendering other input.  Note that, in 
principle, a language specification is not just for authors.  It's a 
specification of what the language >is<.  No doubt, the most common 
consumers of HTML 5 will be browsers, which will be much more liberal in 
what they accept, but the language specification should be referenced by 
anyone who wants to either produce or consume clean, legal, HTML (e.g. no 
badly nested tags).  Usually, such a language specification will say 
nothing about documents like (b) that aren't in the language, except to 
make clear that they aren't.

Since the current HTML 5 draft is focussed to a significant degree on what 
browsers consume, it provides for processing and building DOMs from (a-e). 
 The question I raised in Mandelieu is whether it would be beneficial to 
have a first class, standalone specification for just the legal HTML.  My 
understanding is that Ian and the group in fact plan to attempt to create 
such a specification, primarily by extracting sections from the existing 
draft.  That's good.  If I have a concern, it's that I think having such a 
specification is important, and there is reason to wonder whether 
extracting it semi-automatically from bits of a larger specification will 
produce a high quality result.  As I said in Mandelieu, I think it's fine 
to try, and we'll see how it goes.

> If the TAG wants authors to quote their
> attributes or something like that, then they 
> should just come out and say it :-)

I have no strong opinion on whether unquoted attributes (d) should be 
allowed in the "clean" language, or whether they should be viewed as more 
like (b), i.e. poorly formed content that is nonetheless processed by 
browsers.  My guess is that, given the state of 
deployment of HTML, unquoted attributes should be legal in HTML.  It was 
certainly not my intention to argue to the contrary in Mandelieu.  I 
really had in mind examples more like (b) and (e).
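
For what it's worth, a tolerant tokenizer recovers the same attribute 
value from the quoted and unquoted forms in (c) and (d).  The following is 
a small sketch using Python's standard html.parser module; AttrCollector 
and collect_attrs are hypothetical names, and the behaviour shown is that 
of this one library, not a statement of what the HTML 5 draft requires.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collects (name, value) attribute pairs from every start tag."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


def collect_attrs(text):
    """Return the list of (name, value) pairs found in the given markup."""
    collector = AttrCollector()
    collector.feed(text)
    collector.close()
    return collector.attrs


quoted = collect_attrs('<img src="http://example.com/img.jpg">')
unquoted = collect_attrs('<img src=http://example.com/img.jpg>')

# Both forms yield the same single src attribute.
print(quoted == unquoted)
```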

As I say, I think I've had a chance to note my concern, and the HTML 
working group has been clear that they want to try this in part by 
extracting from the normative draft, and also by continuing with Lachlan's 
work.  So, I don't particularly care to burn a lot of anyone's energy 
pursuing this at the moment; I hope everyone will give a bit of thought to 
my suggestions and move on down whatever path they deem best.

BTW:  I think Lachlan's draft looks like it's off to a very good start; my 
concern with it, if any, is that I think it's important for a language 
like HTML 5 to have a normative specification, and I understand that 
Lachlan is aiming for a non-normative primer (also very important and 
useful).  Thank you.

Noah

--------------------------------------
Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------
Received on Thursday, 13 November 2008 16:19:56 UTC