
Re: Comments on HTML WG face to face meetings in France Oct 08

From: Jonas Sicking <jonas@sicking.cc>
Date: Fri, 14 Nov 2008 03:07:06 -0800
Message-ID: <63df84f0811140307p18c091c4tf38e80333eab0947@mail.gmail.com>
To: noah_mendelsohn@us.ibm.com
Cc: "Dean Edridge" <dean@dean.org.nz>, public-html <public-html@w3.org>, www-tag@w3.org

On Thu, Nov 13, 2008 at 8:18 AM,  <noah_mendelsohn@us.ibm.com> wrote:
>> Sorry, but I don't get this "clean content" thing.
>
> I don't want to start a long flame war here, as I felt I had a good chance
> to express my feelings at the F2F, but I'll be glad to clarify what I
> intended (speaking for myself, not the TAG).  Let's start with some things
> that I think we all agree.  In particular, HTML5 as drafted provides that
> browsers will accept quite a range of input as text/html.

The XML spec also accepts quite a range of input as text/xml. Most of
it is invalid XML though. Same thing for HTML5. HTML5 is a bit laxer,
though, because of what it has inherited from the HTML4 specification:
I don't think we want to make something that was valid HTML4 invalid
HTML5, at least in general.

> For example,
> all of the following will be parsed into DOMs, and presented to users if
> retrieved as text/html:
>
> a) <!-- clearly OK -->
>   <html>
>   <body>
>   <div>
>   <p>Para</p>
>   </div>
>   </body>
>   </html>
>
> b) <html>
>   <body>
>   <div>
>   <p>Para</div>   <!-- note bad nesting of tags -->
>   </p>  <!-- note bad nesting of tags -->
>   </body>
>   </html>
>
> c) <html>
>   <body>
>   <!-- quoted attr -->
>   <img src="http://example.com/img.jpg">
>   </body>
>   </html>
>
> d) <html>
>   <body>
>   <!-- unquoted attr -->
>   <img src=http://example.com/img.jpg>
>   </body>
>   </html>
>
> e)  XXXXXX (Isn't obviously HTML at all,
>            but browser will presumably
>            build a DOM and render XXXXXX)
>
> The best example I have of 'unclean' are (b), in which the close tags are
> in the wrong order, and (e), which has no tags at all.

Disregarding the <title> issue, HTML5 will only consider (a), (c) and
(d) valid (and maybe (e) too if you add the <title>, since all other
tags are optional as per HTML4; I'm not quite sure).

> As far as I know,
> an HTML browser will accept both of these, build a DOM for them, allow
> scripting of that DOM, and render on the screen output per the HTML 5
> Recommendation.

Yes, the difference between HTML5 and XML is the error handling. XML
requires completely bailing out as soon as an error is hit, whereas
HTML5 tells the consumer how to create the DOM.
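
To illustrate the contrast, here is a minimal Python sketch that feeds
example (b) to two standard-library parsers. It is only an analogy:
`xml.etree.ElementTree` stands in for an XML consumer, and
`html.parser.HTMLParser` is a lenient tokenizer, not an implementation
of the HTML5 parsing algorithm. The names `BAD_NESTING`,
`EventCollector`, `xml_accepts` and `lenient_events` are made up for
this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Example (b): </div> and </p> are closed in the wrong order.
BAD_NESTING = "<html><body><div><p>Para</div></p></body></html>"


class EventCollector(HTMLParser):
    """Records the tokens a lenient parser sees instead of aborting."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def xml_accepts(markup):
    """Return True if the XML parser accepts the markup, False if it bails out."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def lenient_events(markup):
    """Tokenize the markup without giving up on badly nested tags."""
    collector = EventCollector()
    collector.feed(markup)
    collector.close()
    return collector.events
```

Under these assumptions, `xml_accepts(BAD_NESTING)` is false because
the XML parser stops at the mismatched `</div>`, while
`lenient_events(BAD_NESTING)` still reports every tag and the text
"Para".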

HTML5 is not at all lax in what it considers valid though.

An interesting note about XML error handling is that it doesn't define
what to do with the parts that were parsed before the error was hit.
The result is that different browsers do different things. So once
consumers implement the HTML5 parsing algorithm, HTML5 should actually
work more consistently across consumers than XML does.

> Perhaps all of those are therefore what we mean by legal or clean HTML 5,
> but I don't think so.  (a) seems to me to be legal HTML in a sense that
> (b), for example, is not.  If I wrote an HTML editor and it put out
> content in the form of (b), I hope you'd tell me my editor was buggy, and
> that the tags should be properly nested.

When you say 'legal', do you mean anything different from 'valid'
(which is the word HTML4 and XML have traditionally used)?

No, HTML5 does not say that all of the above are legal. It follows
what you are suggesting should be legal.

> So, that being the case, when there's a language as important as HTML 5, I
> think it's a good thing for there to be a high quality specification that
> makes very clear answers to questions such as:
>
> * What documents are part of the language (or legal in the language if you
> prefer) and which ones not?
> * What is the correct interpretation of the legal documents?

The HTML5 spec currently does define this. Anything that isn't valid,
such as wrongly nested tags, results in one or more parse errors
being raised.
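
As a rough sketch of "report the error, but keep going", the following
Python checks tag nesting with a simple stack and records a message
for each end tag that does not match the innermost open element. This
is an assumption-laden toy, not the HTML5 tree construction algorithm:
the recovery rule, the `VOID` set, and the names `NestingChecker` and
`nesting_errors` are invented here, and the errors it reports need not
match the parse errors the HTML5 draft defines for the same input.

```python
from html.parser import HTMLParser

# Hypothetical, incomplete list of elements that never take an end tag.
VOID = {"img", "br", "hr", "meta", "link", "input"}


class NestingChecker(HTMLParser):
    """Tracks open elements and records nesting errors without aborting."""

    def __init__(self):
        super().__init__()
        self.open_elements = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return  # e.g. <img ... /> produces an end-tag event; ignore it
        if self.open_elements and self.open_elements[-1] == tag:
            self.open_elements.pop()
            return
        self.errors.append(f"unexpected </{tag}>")
        if tag in self.open_elements:
            # Recover: close everything up to and including the match.
            while self.open_elements.pop() != tag:
                pass
        # Otherwise there is nothing to close; ignore the stray end tag.


def nesting_errors(markup):
    """Return the list of nesting errors found; an empty list means none."""
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.errors
```

With this toy, example (a) yields no errors, while example (b) yields
one for the `</div>` and one for the `</p>`, yet parsing continues to
the end of the input in both cases.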

> In short, this would be just a language specification, as distinct from
> the existing HTML 5 draft, which focusses on consuming and rendering HTML
> 5 as well as consuming and rendering other input.

Do you really mean 'specification' or 'documentation' here? If
'specification', why isn't documentation enough?

> Note that, in
> principle, a language specification is not just for authors.  It's a
> specification of what the language >is<.  No doubt, the most common
> consumers of HTML 5 will be browsers, which will be much more liberal in
> what they accept, but the language specification should be referenced by
> anyone who wants to either produce or consume clean, legal, HTML (e.g. no
> badly nested tags).  Usually, such a language specification will say
> nothing about documents like (b) that aren't in the language, except to
> make clear that they aren't.

Aah, now we are getting to the crux of the matter!

Why shouldn't a language specification define error handling? This is
an honest question and not meant to be confrontational. The HTML5
draft and you already seem to mostly agree on what is valid and what
is an error, such as badly nested tags. What you are asking for is a
specification where error handling is removed. Does that sound
correct?

I agree that having a document that explains how to author HTML is
important. However, that does not sound like a specification to me, but
rather documentation. I believe there is precedent for W3C to publish
documentation to aid specifications in the form of Primers and Notes;
that sounds exactly like what we need here.

One problem with having multiple 'specifications' for HTML, one with
and one without error handling, is what do you do if they disagree?
I.e. if one of them says that something is invalid and the other says
it isn't? This is why I think it's better to have one specification,
plus supportive non-normative documentation. This way there are no
uncertainties about what is valid.

/ Jonas
Received on Friday, 14 November 2008 11:07:46 GMT
