Re: HTML 5 and XHTML 2 combined

On 7/1/09 22:13, Mark Birbeck wrote:
> Many
> organisations choose to generate documents that are technically XHTML,
> but deliver them to browsers using an HTML MIME type, so as to 'force'
> the browser to use its HTML parser for rendering.

Yep. They send output in such a way that its processing has no detailed 
conformance requirements, save for those that HTML5 will hopefully provide.

> This gives them the best of both worlds; they can use one or more of
> the enormous number of XML tools around to generate their documents,

Since a serialization to HTML could be appended to any toolchain 
producing XHTML, I cannot agree that serving text/html gives them this 
option.
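
A minimal sketch of such an appended serialization step, using only 
Python's standard library. The helper name xhtml_to_html and the choice 
of the HTML 4.01 Strict doctype are assumptions made for illustration; a 
real toolchain would also have to cope with named entities and other 
input that a plain XML parser rejects.

```python
import xml.etree.ElementTree as ET

XHTML_PREFIX = "{http://www.w3.org/1999/xhtml}"
HTML4_DOCTYPE = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd">'
)


def xhtml_to_html(xhtml_source: str) -> str:
    """Re-serialize well-formed XHTML as HTML (hypothetical helper)."""
    root = ET.fromstring(xhtml_source)
    # Drop the XHTML namespace so tags serialize as plain "p", "br", ...
    for element in root.iter():
        if element.tag.startswith(XHTML_PREFIX):
            element.tag = element.tag[len(XHTML_PREFIX):]
    # method="html" writes void elements such as <br> without a closing
    # tag and expands empty non-void elements such as <script/>.
    body = ET.tostring(root, encoding="unicode", method="html")
    return HTML4_DOCTYPE + "\n" + body
```

Run as the last stage of an XML pipeline, this turns <br/> into <br> and 
<script src="a.js"/> into <script src="a.js"></script>, which are the 
forms a tag-soup parser expects.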

> but they can still have these documents rendered in existing browsers,
> without having to worry about whether the browser supports XHTML or
> not.

Instead, they need to worry about whether the browser processes their 
particular XHTML acceptably as tag soup. Authors don't have any 
conformance criteria for the subset of XHTML that's processable. And 
this also means they do need to forgo any benefits of having their 
markup processed by browsers as XML, such as mixing in other languages. 
And since HTML 4.01 can be parsed to roughly the same DOM as their XML - 
typically by the same tools - they're not making it substantially easier 
for third-parties to process their content in an automated fashion.

> The important thing here is that this technique also means that in
> principle, even if a 'new' language is created, it could still be
> processed by existing browsers, provided that the new language paid
> attention to HTML processing rules.

Yes, but I didn't mean HTML5 wasn't a new language; I was saying XHTML 2 
is moving beyond the constraints of the text/html serialization in 
devising a new language.

> So XHTML 2 could be delivered with an HTML MIME type, just as HTML5
> could be delivered with an XHTML MIME type -- in both cases the
> languages are distinct from the delivery mechanism.

Yes. You could deliver any byte stream as text/html.

>> HTML5 is premised on the constraints of supporting the existing web with the
>> same specification; XHTML 2 is premised on ignoring those constraints.
>
> I think this is a little misleading.
>
> First, HTML5 adds new features that are not backwards-compatible with
> HTML 4, but it just so happens that the close relationship between
> some of the browser implementers and the spec writers mean that
> features are being added quite quickly. In effect, the 'existing web'
> is changing, even as we discuss it.
>
> Second, XHTML 2 is not based on ignoring those constraints, although
> it would probably be true to say that it was at its inception.

While HTML5 adds new features with backwards-compatibility problems, it's 
a requirement that the new features are at least not incompatible with 
supporting the current web corpus. There's also a general attempt to 
ensure that with a bit of serverside processing you could provide an 
acceptable user experience to most existing user agents. There are some 
exceptions that would require publisher CSS or JS for an acceptable user 
experience (the "hidden" attribute springs to mind, though authors are 
already widely using display: none; to the same effect, as does 
"datagrid", though authors are already creating equivalent features 
using JS). But, as you note, the existing web is changing to implement 
these features (e.g. canvas and video) such that these graceful 
degradation problems will be substantially reduced when HTML5 becomes a 
recommendation.

AFAIK the feedback from browser vendors like Opera seems to be that 
implementing XHTML 2 even in text/html is not compatible with supporting 
the current web corpus. I would of course welcome a correction on this 
point from popular browser vendors. :)

Under "Backwards compatibility", the draft clearly states that XHTML 2's 
element set depends on XML parsing:

"Because earlier versions of HTML were special-purpose languages, it was 
necessary to ensure a level of backwards compatibility with new versions 
so that new documents would still be usable in older browsers. However, 
thanks to XML and style sheets, such strict element-wise backwards 
compatibility is no longer necessary, since an XML-based browser, of 
which at the time of writing means more than 95% of browsers in use, can 
process new markup languages without having to be updated."

http://www.w3.org/TR/2005/WD-xhtml2-20050527/introduction.html#backCompat

If XHTML 2 is not taking advantage of XML to break free of the past, 
perhaps this needs rephrasing?

> For a
> long time now XHTML 2 has had a modular architecture, which means that
> language designers can create languages that use one or more of the
> XHTML 2 modules, and implementers can provide support for whichever
> modules they deem appropriate. This makes XHTML 2 useful not just in
> browsers and constrained devices, but also for creating Docbook-style
> languages, news formats, and so on.

If you mean some XHTML 2 modules could be reconciled with text/html 
processing, that's probably true. The following seem like possible examples:

* XHTML Document Module
* XHTML Structural Module
* XHTML Text Module
* XHTML Hypertext Module
* XHTML I18N Attribute Module
* XHTML Bi-directional Text Attribute Module
* XHTML Role Attribute Module
* Ruby Module
* XHTML Style Attribute Module
* XHTML Tables Module

These basically reflect features in existing text/html implementations. 
Another group of modules introduce new features with (arguably) 
acceptable fallbacks in existing text/html browsers that could perhaps 
be implemented without breaking support for the text/html corpus:

* XHTML List Module
* XHTML Edit Attributes Module
* XHTML Image Map Attributes Module
* XHTML Metainformation Attributes Module
* XHTML Media Attribute Module
* XHTML Style Sheet Module

But there's a whole set of important modules that don't have acceptable 
text/html fallbacks and/or probably couldn't be implemented without 
breaking support for the text/html corpus:

* XHTML Embedding Attributes Module: Undisplayed images are not an 
acceptable fallback, and IIRC browser vendors say it's too hard to 
implement "src" on every element.
* XHTML Handler Module: Displaying raw script on the page is not an 
acceptable fallback.
* XHTML Image Module: Missing alternative text and alternative text 
displayed beside a visible image are not acceptable fallbacks, and 
treating text after an 'img' tag as alternative text is not going to be 
possible to implement alongside supporting the existing web corpus.
* XHTML Hypertext Attributes Module: Links that don't work are not an 
acceptable fallback, and IIRC browser vendors say it's too hard to 
implement "href" on every element.
* XHTML Metainformation Module: text following a "meta" tag is displayed 
in text/html; that's not an acceptable fallback for human-unfriendly 
content and there is likely to be existing content in the corpus that 
depends on this behavior.
* XForms Module: Existing assistive technology cannot associate labels 
with fields, select controls don't work; there are likely further 
practical problems that represent unacceptable failures.
* XHTML Object Module: You would need to use different markup to get 
this working in the most popular browser.
* XML Events Module: Event handling wouldn't work since it depends on 
the Handler Module.

(Doubtless other people's versions of these lists would be different; I 
certainly don't know enough about the implementation problems to provide 
any sort of strong opinion.)
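
To illustrate the Image Module and Metainformation Module points above, 
here is a rough sketch that contrasts where the text inside an XHTML 
2-style 'img' element ends up under XML parsing and under a simplified 
text/html-style reading. The TextParentRecorder class, the sample markup, 
and the short void-element set are assumptions for illustration; this is 
not a full HTML tree builder.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A few elements that text/html treats as having no content (partial list).
VOID_ELEMENTS = {"img", "meta", "br", "hr", "input", "link"}


class TextParentRecorder(HTMLParser):
    """Record which open element each run of text lands in (hypothetical)."""

    def __init__(self):
        super().__init__()
        self.open_elements = []
        self.text_parents = []

    def handle_starttag(self, tag, attrs):
        # Void elements are never left open, so they cannot contain text.
        if tag not in VOID_ELEMENTS:
            self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return  # a stray </img> has nothing to close
        if tag in self.open_elements:
            while self.open_elements.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            parent = self.open_elements[-1] if self.open_elements else None
            self.text_parents.append((data.strip(), parent))


MARKUP = '<p><img src="photo.png">A photo of a cat</img></p>'

recorder = TextParentRecorder()
recorder.feed(MARKUP)
recorder.close()

xml_img_text = ET.fromstring(MARKUP).find("img").text
print(xml_img_text)           # text is the content of 'img' under XML parsing
print(recorder.text_parents)  # text is a child of 'p' in the text/html-style reading
```

Under XML parsing the text belongs to 'img' and can act as its 
alternative content; in the text/html-style reading it becomes ordinary 
paragraph text that would be shown next to the image.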

I was talking about XHTML 2 in the round, not individual modules. I 
submit that if XHTML 2 had been designed with an eye towards text/html 
compatibility and implementability, (a) these modules would have been 
designed quite differently and (b) their specs would mention differences 
in text/html processing so that authors could avoid certain markup patterns.

These modules include some of the biggest changes from HTML4/XHTML1:

http://www.w3.org/TR/xhtml2/introduction.html#s_intro_differences

They are certainly important enough to say you cannot simply produce a 
straightforward "strictly conforming XHTML 2 document", serve it as 
text/html, and expect text/html browsers to provide an acceptable user 
experience for it now, or indeed ever.

I think unacceptably broken forms, scripts, machine data, links, and 
images are sufficient disincentive to trying to serve XHTML 2 as text/html.

On the other hand, the advantages provided by using the remaining XHTML 
2 modules instead of just using the XML serialization of HTML5, and then 
serving the result as text/html, are hard to understand. For one thing, 
the latter would give you much the same featureset but working better in 
existing browsers, and for another, the latter would include forms, 
scripts, machine data, links, and images that work in text/html.

If the basic idea of XHTML 2 has really stopped being about throwing 
text/html to the winds and taking advantage of XML to rationalize 
document markup and started being about merely paring down HTML into a 
document format and using external XML facilities wherever possible, 
I'd imagine subsetting the following HTML5 features into separate XML 
Schema modules would accomplish 90% of the same use cases with 10% of 
the work for spec-writers, implementors, and authors:

head, title, base, script, body, section, nav, article, aside, h1, h2, 
h3, h4, h5, h6, header, footer, p, pre, dialog, blockquote, ol, ul, li, 
dl, dt, dd, cite, q, em, strong, code, ruby, rt, rp, ins, del, figure, 
object, table, caption, colgroup, col, tbody, thead, tfoot, tr, td, th, 
div, span

Specifying, implementing, and learning another language just to use 
"blockcode" instead of "<pre><code>" really doesn't seem worth it. ;)

A "strictly conforming XHTML 2" conformance checker could simply verify 
that the document was using only those features from HTML5, plus any 
other non-HTML XML modules you want to include in XHTML2 (XForms, ARIA, 
RDFa, XLink, XML Events, SVG, MathML, SMIL, whatever).
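
A rough sketch of what such a checker could look like, using Python's 
standard library. The function name, the use of the 
http://www.w3.org/1999/xhtml namespace, the addition of the 'html' root 
element to the allowed set, and the particular foreign namespaces 
accepted are all assumptions for illustration; attribute checking and 
content models are left out entirely.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Element names from the list above; "html" is added here as an assumption
# so that a document root is accepted.
ALLOWED_ELEMENTS = frozenset("""
html head title base script body section nav article aside h1 h2 h3 h4 h5
h6 header footer p pre dialog blockquote ol ul li dl dt dd cite q em
strong code ruby rt rp ins del figure object table caption colgroup col
tbody thead tfoot tr td th div span
""".split())

# Example non-HTML vocabularies a profile might admit (assumed selection).
ALLOWED_FOREIGN_NAMESPACES = frozenset({
    "http://www.w3.org/2000/svg",
    "http://www.w3.org/1998/Math/MathML",
    "http://www.w3.org/2002/xforms",
})


def split_tag(tag):
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if tag.startswith("{"):
        namespace, _, local = tag[1:].partition("}")
        return namespace, local
    return "", tag


def find_disallowed_elements(xml_source):
    """Return the elements that fall outside the allowed subset."""
    problems = []
    for element in ET.fromstring(xml_source).iter():
        namespace, local = split_tag(element.tag)
        if namespace == XHTML_NS:
            if local not in ALLOWED_ELEMENTS:
                problems.append(local)
        elif namespace not in ALLOWED_FOREIGN_NAMESPACES:
            problems.append(element.tag)
    return problems
```

An empty result would mean the document stays within the subset; a 
document containing, say, an 'img' element in the XHTML namespace would 
have that name reported.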

--
Benjamin Hawkes-Lewis

Received on Thursday, 8 January 2009 13:02:03 UTC