Re: Write-up about semantics in HTML5 from A List Apart from Thomas Broyer on 2009-01-07 (public-html@w3.org from January 2009)

From: Thomas Broyer <t.broyer@ltgt.net>
Date: Wed, 7 Jan 2009 17:45:50 +0100
To: public-html <public-html@w3.org>
Message-ID: <a9699fd20901070845j3df99f23mdfbaf85be4382d5@mail.gmail.com>
On Wed, Jan 7, 2009 at 1:43 PM, Julian Reschke wrote:
>
> Ian Hickson wrote:
>>
>> (There are a number of things that XML can't do because of its limitations
>> in extensibility. For example, authors can't extend it to represent non-tree
>> structures, they can't extend it to have error recovery, they can't extend
>> it to have true multivalued-attributes, they can't extend it to allow
>> them to correctly define validity in the face of namespaces, and they can't
>> extend it to allow them to define validity for non-enumerated attribute
>> values. This isn't a criticism of XML, it's just a description of the design
>> choices made by the XML working group. It's normal for a language to have a
>> constrained extensibility model.)
>
> All true.
>
> But in XML based languages you can extend the vocabulary,

Only when the vocabulary has been defined to be extensible; otherwise
your document won't validate. DTDs do not allow plugging in attributes
other than the defined ones, and only allow "foreign" child elements
when the content model is ANY. It's almost the same with XML Schema,
except that you can opt in to foreign attributes (possibly constrained
by namespace) and allow foreign child elements while still
validating/constraining the other child elements. Again, it's almost
the same with RelaxNG, with added expressiveness re. deterministic vs.
ambiguous content models.
As an example, Atom explicitly allows (i.e. does not flag as an error)
any attribute and/or element not defined in the spec, and further
defines specific extensibility points (so that "generic" Atom
processors can map those to internal models different from the Infoset).

XML in itself does not make vocabularies extensible in any way. Even
in the absence of a DTD, the processing of an "unknown" attribute or
element, of an unknown attribute value or element content, or of
CDATA/PCDATA found where it's not expected, is left totally
unspecified. These are the responsibility of vocabulary definitions,
and most of them do not allow "foreign content/metadata"; this
includes XHTML 1.x and XHTML 2.0.

What XML does allow, however (but only when you add Namespaces in XML),
is reusing pieces of already defined vocabularies to build new ones
(Open Document, XHTML 2 reusing XForms, etc.)
(well, it all depends on what you call a "vocabulary")

> and this you can't in HTML. At least not the way it's currently defined.

Because XML syntax is "self-expressive", which is not the case for
HTML right now: HTML parsing depends on the vocabulary (void elements,
special/scoping/formatting/phrasing elements).

I don't know SGML well, but this seems no different from optional
tags being defined in the DTD: if HTML5 were still SGML-based, then to
add a new element (particularly a new "void element": EMPTY
content model with optional end tag) you'd have to update the DTD,
and because no one downloads DTDs but uses local catalogs instead,
you'd have the same deployment problem.

I agree that it is not ideal that any future HTML version (including
HTML5) introducing a new void element would introduce a discrepancy in
document processing (HTML6 documents using those new void elements
cannot be used with an HTML5 parser/processor; or at least they may be
parsed to different DOMs).
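
To make the discrepancy concrete, here is a minimal sketch in Python. It
is a toy tree builder on top of the standard html.parser module, not the
HTML5 parsing algorithm; ToyTreeBuilder, parse and the void_elements
parameter are names made up for this illustration. The only thing it
models is whether the parser knows that an element is void.

```python
from html.parser import HTMLParser


class ToyTreeBuilder(HTMLParser):
    """Builds (tag, children) tuples; a toy, NOT the HTML5 algorithm."""

    def __init__(self, void_elements):
        super().__init__()
        # Hypothetical knob: the set of tags this "parser version" treats as void.
        self.void_elements = frozenset(void_elements)
        self.root = ("#root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        if tag not in self.void_elements:
            # Unknown or non-void element: it stays open until an end tag.
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop up to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1][1].append(data.strip())


def parse(markup, void_elements):
    builder = ToyTreeBuilder(void_elements)
    builder.feed(markup)
    builder.close()
    return builder.root


MARKUP = '<menu><command label="Save"><command label="Open"></menu><p>after</p>'

aware = parse(MARKUP, {"command"})   # parser that knows <command> is void
legacy = parse(MARKUP, set())        # parser that has never heard of <command>
print(aware)
print(legacy)
```

In the "aware" run the two <command> elements are siblings; in the
"legacy" run the second one ends up nested inside the first, which is
the kind of DOM difference described above.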

As already mentioned, one thing we could do is prohibit the
introduction of void elements, but a) as Ian said, it would make things
harder to read (<command></command>) and b) it would not address all
use cases, as you would also need to prohibit the introduction of new
scoping/formatting/phrasing elements, and that's probably not desired
(we want <section> to auto-close any open <p>, but for compatibility
with non-HTML5 UAs, authors still have to explicitly close their <p>
before opening a new <section>).
So, as a "compatibility measure", authors would probably have to use
<newvoidelt></newvoidelt>. The HTML5 parsing algorithm should
perhaps not flag this as a parse error (I guess it currently is a
parse error); or at least validators should flag it as a "warning" or
"info" rather than an "error".
But 20 years from now, all UAs (at least browsers) will probably be
HTML5-compliant, and documents produced at that date will be able to
use <newvoidelt> or <newvoidelt/> without fear of incompatibilities.
This is much the same as the <script><!-- ... //--></script> and
<style><!-- ... --></style> syntax, which worked around old browsers
that would otherwise have shown the script and stylesheet as plain
text, but is now totally useless (and harmful when users try to
switch to XHTML, where it effectively hides the content inside comments).
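
A small sketch of that last point, using only the Python standard
library. The helper names (ScriptText, script_text_as_html,
script_text_as_xml) are made up for this illustration, and html.parser
is only a rough stand-in for an HTML UA. The same bytes are read once
with HTML-style rules, where <script> content is raw text, and once as
XML, where the <!-- ... --> wrapper is a real comment.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SNIPPET = "<script><!--\nalert('hi');\n//--></script>"


class ScriptText(HTMLParser):
    """Collects the character data reported inside <script>."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.chunks.append(data)


def script_text_as_html(markup):
    parser = ScriptText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


def script_text_as_xml(markup):
    # ElementTree's default parser drops comments, so a script body that is
    # wrapped in <!-- ... --> leaves the element with no text at all.
    return ET.fromstring(markup).text or ""


print(repr(script_text_as_html(SNIPPET)))
print(repr(script_text_as_xml(SNIPPET)))
```

Read as HTML, the script source (comment markers included) is still
there for the script engine; read as XML, the element is empty.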


Maybe instead of this debate on "theories", we should first
investigate what authors need to do to preserve compatibility when
such new elements are introduced. And we can do this right now without
"predictions" of future needs, as HTML5 introduces void elements
(<command> and <source> come to mind) and "special" elements
(<section> for instance, see above). Moreover, that work is *needed*
for anyone who wants to use those new elements (and omit some optional
tags, in the case of <section>).
It should be quite easy to modify html5lib for those tests so that it
skips the special processing for these elements (falling back to the
"any other start tag" and "any other end tag" cases in the tree builder
algorithm instead).
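
As a rough idea of what such a test could look like, here is a
self-contained Python sketch. It does not use html5lib or its
internals; ToyBuilder, tree and the special parameter are made-up names,
and the builder only models one rule (which start tags close an open
<p>). Passing special={"section"} stands in for an HTML5-aware parser,
and the default stands in for one where <section> falls through to the
"any other start tag" case.

```python
from html.parser import HTMLParser


class ToyBuilder(HTMLParser):
    """Toy tree builder; only models 'which start tags close an open <p>'."""

    def __init__(self, special=()):
        super().__init__()
        # <p> always closes an open <p>; `special` lists extra tags that do too.
        self.closes_p = {"p"} | set(special)
        self.root = ("#root", [])
        self.stack = [self.root]

    def _pop_to(self, tag):
        # Close the nearest open `tag` and everything opened inside it.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                return

    def handle_starttag(self, tag, attrs):
        if tag in self.closes_p:
            self._pop_to("p")
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        self._pop_to(tag)  # stray end tags are silently ignored

    def handle_data(self, data):
        if data.strip():
            self.stack[-1][1].append(data.strip())


def tree(markup, special=()):
    builder = ToyBuilder(special)
    builder.feed(markup)
    builder.close()
    return builder.root


OMITTED = "<p>intro<section><p>body</p></section>"       # relies on auto-close
EXPLICIT = "<p>intro</p><section><p>body</p></section>"  # author closes <p>

for name, markup in (("omitted", OMITTED), ("explicit", EXPLICIT)):
    same = tree(markup) == tree(markup, {"section"})
    print(name, "-> same tree in both modes:", same)
```

With the optional </p> omitted, the two modes build different trees
(the <section> ends up inside the first <p> in the "legacy" mode); with
the </p> written out, they agree, which is the compatibility measure
authors would have to apply.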

-- 
Thomas Broyer
Received on Wednesday, 7 January 2009 16:46:31 UTC