Re: Fwd: HTML5 and XHTML2 combined (a new approach) from Benjamin Hawkes-Lewis on 2009-01-26 (www-html@w3.org from January 2009)

From: Benjamin Hawkes-Lewis <bhawkeslewis@googlemail.com>
Date: Mon, 26 Jan 2009 19:22:25 +0000
To: Giovanni Campagna <scampa.giovanni@gmail.com>
CC: www-html@w3.org
Message-ID: <497E0D71.2010901@googlemail.com>

On 26/1/09 12:48, Giovanni Campagna wrote:
> Benjamin Hawkes-Lewis 1)
> if you conversely think that HTML5 and XHTML2 have the same
> destinataries, then you agree that you need one language with both features

I don't think they have the same "destinataries".

For one thing, it's pretty obvious that HTML5 is meant to be deployable 
to popular text/html browsers, and XHTML2 (at least as a whole) is not.

>     What we're discussing is helping implementers, isn't it? If so, then
>     implementation issues are relevant.
>
> Changes to monolithic languages affect all implementation (module based
> or not). Changes to certain modules affect only same implementations

Ah. Is your argument about "change" that XHTML modularization is good 
because if a user agent implements (say) List Module but not 
Metainformation Module, and W3C releases a new version of 
Metainformation Module, then the user agent developers don't have to 
make any changes unless they actually want to implement Metainformation 
Module? That's clearly true, though of course if implementations don't 
implement XHTML as modules this theoretical advantage evaporates in 
practice.

Can you point to an actual example of this saving a user agent developer 
work? Or any other real as opposed to theoretical example of a user 
agent implementor benefiting from XHTML modularization?

>     How does having an implementation of XBL count as implementing
>     href-on-any-element for XHTML 2? Authors being able to fake
>     href-on-any-element with XBL isn't the same thing. Therefore, it
>     would require code changes.
>
> One of XBL2 use-case is to provide UA processing of markup languages. So
> an implementation could actually use XBL2 for action, as it uses CSS for
> styling. But this tangential to the discussion.

A change to Firefox's default CSS _is_ a code change.

So even if Firefox implemented (X)HTML features using XBL 2 (which it 
doesn't), implementing additional features using XBL 2 would still be a 
code change.

>     DOM is the abstract model that serializations express. So if an
>     implementation is parsing a serialization, it's producing a DOM,
>     regardless of whether it supports scripting.
>
> What about SAX parsers? They don't build any DOM. An implementation is
> required to build an Infoset (abstract concept), not a DOM (a set of
> objects implementing certain interfaces)

Hmm. As far as I understand, SAX-esque parsers for text/html have to 
build DOM trees then emit SAX-esque events traversing that tree in order 
to implement the error handling required:

http://hsivonen.iki.fi/introducing-sax-tree/
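
A minimal sketch of that tree-then-events shape, in Python. It is not taken from the article above: xml.dom.minidom is only a stand-in for an error-correcting text/html tree builder, and EventRecorder, emit_events and parse_via_tree are hypothetical names chosen for illustration.

```python
from xml.dom import minidom
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


class EventRecorder(ContentHandler):
    """Hypothetical handler that simply records the events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content))


def emit_events(node, handler):
    """Walk an already-built tree and replay it as SAX-style events."""
    if node.nodeType == node.ELEMENT_NODE:
        attrs = {a.name: a.value for a in node.attributes.values()}
        handler.startElement(node.tagName, AttributesImpl(attrs))
        for child in node.childNodes:
            emit_events(child, handler)
        handler.endElement(node.tagName)
    elif node.nodeType == node.TEXT_NODE:
        handler.characters(node.data)


def parse_via_tree(markup, handler):
    # Phase 1: build the whole tree first (assumption: minidom stands in
    # for a tree builder that applies the required error handling).
    document = minidom.parseString(markup)
    # Phase 2: only now traverse the finished tree and emit events.
    handler.startDocument()
    emit_events(document.documentElement, handler)
    handler.endDocument()
```

The consumer still sees a SAX-style interface, but no event is emitted until the tree is complete, so any tree corrections have already been applied.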

Does HTML5 impose any definite requirements on a SAX parser of the XML 
serialization that you believe are unnecessary? (And can you cite the 
parts of the draft imposing said requirements?)

>     What is a UA that does not implement scripting required to implement
>     by (unmodularized) HTML5 that you believe such a UA should not be
>     required to implement?

Did you have an answer for this question? It would be helpful if you 
did, since I can imagine such criticisms producing useful changes to the 
specification.

>     Note the differing conformance requirements
>     for differing types of user agent noted in:
>
>     http://www.whatwg.org/specs/web-apps/current-work/#conformance-requirements
>
> What for new user-agents not covered in that section? Will the spec be
> errated any time someone invents a new tool to use with (X)HTML?

Can you imagine a user agent that the section does not cover?

What conformance criteria would you propose to cover such an agent, and 
what text would you use to express those criteria?

>     As for authors, I don't understand your concern. Authors are free to
>     read the parts of documents they are interested in, just as they are
>     free to read the documents they are interested in.
>
> Yes, but it is easier to read only the page containing the feature I
> want, then to search an heavy and difficult document, with the features
> I need spreaded all across and concept I may not have any idea what they
> mean.

Here's what the draft says about its audience:

http://www.whatwg.org/specs/web-apps/current-work/#audience

"This specification is intended for authors of documents and scripts 
that use the features defined in this specification, and implementors of 
tools that are intended to conform to this specification, and 
individuals wishing to establish the correctness of documents or 
implementations with respect to the requirements of this specification.

This document is probably not suited to readers who do not already have 
at least a passing familiarity with Web technologies, as in places it 
sacrifices clarity for precision, and brevity for completeness. More 
approachable tutorials and authoring guides can provide a gentler 
introduction to the topic."

The sort of author you're describing should be reading the official 
authoring guide and unofficial learning materials:

http://dev.w3.org/html5/html-author/

>     The DOM - the abstract document model - into which serialization is
>     parsed, separates syntax from vocabulary in HTML5. Consequently,
>     HTML5 has a text/html serialization, an XML serialization, and could
>     (if one wanted to design one) have additional serializations,
>     including a non-XML SGML serialization. It just wouldn't be
>     realistic to serve them as text/html.
>
> The fact is that the HTML5 parser needs to now about the elements, ie
> the DTD is hardcoded inside the algorithm.

Of course it does, in order to be able to parse the web corpus.
Modularization wouldn't change that.

>     Putting the processing rules for text/html in a separate
>     specification from the HTML5 DOM and vocabulary would not really
>     make it technically easier to reuse text/html processing for new
>     vocabularies.
>
> Why?

I can't see how it would make it any easier. If there's a reliable 
error-handling behavior that can be used as an extension point by future 
working groups, then the main specification can describe it just as 
precisely as a separate document. It's the actual existence of the 
reliable extension point - not its description in one document or 
another document - that would make reusing text/html processing for new 
vocabularies plausible.

>     So make smaller releases containing new features that are sensible,
>     settled, and implemented, rather than giant new releases with lots
>     of features. No need for separate specifications to unblock new
>     features.
>
> You need extensibility to add new features without replacing completely
> previous language.

If by "replacing completely previous language", you mean publishing a 
new Recommendation describing a new version of the language, why should 
we avoid publishing such Recommendations?

> HTML5 is by design not extensible

Extensible by whom?

Hixie has mentioned several extension points open to authors, and of 
course W3C is free to extend HTML further in future versions. The 
constraints placed upon W3C's extension of HTML take the form of the 
processing you need to parse the existing web corpus - HTML5 can't 
change that, it can only standardize it as best as it can.

>         Well maybe Image module or Table module needed a new version,
>         but I'm
>         sure that there are features just copied from HTML4 / XHTML1 /
>         DOM2HTML etc.
>
>
>     Leaving aside the algorithms for how to parse text/html streams into
>     a DOM, do you have any example of a module in XHTML1.x that is
>     totally unaltered - other than not being modularized - in HTML5?
>
> Text Module, Text Extension Modules, Form Modules (extended not
> replaced), Intrinsic Events Module, Object Module, Iframe Module,
> Metainformation Module, Scripting Module, Stylesheet Module, Link
> Module, Base Module, Name Indefication Module.

Have you actually looked at the draft for these features and compared 
them carefully with their specifications in XHTML 1.1 Modularization and 
related documents? It seems to me only two modules in your list could 
plausibly be described as "unaltered":

Intrinsic Events module:

HTML5 could reuse this perhaps, though you'd need to create a "Yet More 
Intrinsic Events" module for the additional event attributes it defines 
and additional elements it adds them to.

Style Sheet module:

HTML5 might be able to reuse this, although you'd need to add a new 
module for the "scoped" attribute.

I run through just some of the changes to the other modules below:

Text module:

HTML5 makes "acronym" non-conforming and changes how outline level is 
determined from heading elements.

Text Extension module:

HTML5 changes the proper use of "small" from a presentational effect to 
"small print", changes the proper use of "strong" from strong emphasis 
to importance, provides "semantic fig leaves" for "sup", "sub", "b", and 
"i", and makes "big" and "tt" non-conforming.

Forms module:

HTML5 changes the content model of the "form" element to allow inline 
children and makes "accesskey" non-conforming.

Object module:

HTML5 disallows the "classid", "codebase", "codetype", "declare" and 
"standby" attributes. It also changes the processing of these 
attributes, if present: rather than interpreting "data" as relative to 
"codebase", "codebase" is simply ignored.

Iframe module:

HTML5 disallows the "frameborder", "longdesc", "marginheight", 
"marginwidth", and "scrolling" attributes.

Metainformation module:

HTML5 disallows the "scheme" attribute.

Scripting module:

HTML5 disallows "noscript" in the XML serialization.

Link module:

HTML5 forbids the "rev" attribute. HTML5 requires "target, ping, rel, 
media, hreflang, and type attributes" to "be omitted if the href 
attribute is not present".

Base module:

HTML5 forbids multiple "base" elements in "head".

Name Identification module:

HTML5 disallows "name" on the "a", "iframe", and "img" elements.
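
To make a few of these differences concrete, here is a rough sketch of a checker for some of the disallowed attributes listed above. The DISALLOWED table is transcribed from this message, not from the draft, and covers only a subset of the points; AttributeChecker and check are hypothetical names.

```python
from html.parser import HTMLParser

# Assumption: transcribed from the lists in this message, not from the
# draft itself, and deliberately incomplete.
DISALLOWED = {
    "object": {"classid", "codebase", "codetype", "declare", "standby"},
    "iframe": {"frameborder", "longdesc", "marginheight", "marginwidth",
               "scrolling", "name"},
    "meta": {"scheme"},
    "a": {"name"},
    "img": {"name"},
}


class AttributeChecker(HTMLParser):
    """Collects (element, attribute) pairs that appear in DISALLOWED."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        banned = DISALLOWED.get(tag, set())
        for name, _value in attrs:
            if name in banned:
                self.problems.append((tag, name))


def check(markup):
    checker = AttributeChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems
```

Markup that is valid against the corresponding XHTML 1.1 modules can still be flagged here, which is why the modules cannot be described as "unaltered".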

> Also Table Module is very
> similar (I actually don't understand why there is a processing model:
> doesn't CSS21 includes the same things?)

The "processing model" is for things like semantically associating table 
cells with table headers. CSS 2.1, being a language for suggesting 
styling, doesn't tell you how to interpret HTML4 syntax, structure, or 
semantics.
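
As a deliberately naive illustration of what such an association step produces (this is not the draft's algorithm; it ignores colspan, rowspan, "scope" and "headers", and associate_headers is a hypothetical name):

```python
def associate_headers(rows):
    """Map each data cell to the header cells that apply to it.

    rows is a list of rows; each row is a list of (kind, text) pairs,
    where kind is "th" or "td". Assumption: column headers live in the
    first row and row headers are the "th" cells of the same row.
    """
    column_headers = {
        col: text for col, (kind, text) in enumerate(rows[0]) if kind == "th"
    }
    result = []
    for row in rows[1:]:
        row_headers = [text for kind, text in row if kind == "th"]
        for col, (kind, text) in enumerate(row):
            if kind != "td":
                continue
            headers = list(row_headers)
            if col in column_headers:
                headers.append(column_headers[col])
            result.append((text, headers))
    return result
```

A style sheet can draw the grid without ever computing this mapping; a tool that wants to announce "Ann, Age: 31" for a cell needs it.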

>     Do you have any example of a module in XHTML1.x that is unaltered in
>     XHTML2?
>
> Actually unaltered no. Mostly equivalent: Core Attributes, Core Modules
> (many features just moved, with same new features), Table Module (only
> summary added).

"Mostly equivalent" isn't good enough; it means you'd have to release a 
new REC for each module - which is precisely what XHTML 2 is doing.

> No, because, as I already answered to Benjamin, you need to know about
> HTML5 vocabulary in order to implement the HTML5 algorithm.

Hixie says HTML5 separates the two as much as possible, but ultimately 
it is _not_ possible to completely separate text/html processing from 
HTML5 vocabulary, for legacy reasons. What you can do is define HTML5 
vocabulary independently of text/html processing.

>     However, I don't think it makes sense to apply the HTML syntax to other
>     languages, any more than one should reuse the CSS, JavaScript, or
>     ISO8601
>     syntax for other languages. Do you think we should separate those
>     out for
>     reuse as well?
>
> I was actually thinking of SVG, SMIL, XForms: what the CDF WG is working on.
> For MathML, for example, you needed to modify the algorithm.

Again, for legacy reasons. text/html is not a blank slate technology; 
expecting it to be as flexible as XML is utterly unrealistic. Being able 
to parse the existing web corpus - preserving access to digital culture 
- is (at least arguably) the most important goal of any text/html 
specification.

--
Benjamin Hawkes-Lewis
Received on Monday, 26 January 2009 19:23:51 UTC