RE: The failure of Appendix C as a transition technique (was: Re: Draft Minutes of 2013-02-14 TAG telcon)

The implementors of Appendix C failed to implement it correctly.
Documents delivered as text/html should be parsed as HTML.
Documents delivered as application/xhtml+xml should be parsed as XHTML/XML.
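To make the rule concrete, here is a minimal sketch (Python; the
function name and return values are my own illustration, not anyone's
actual implementation) of dispatching purely on the declared media
type, with no sniffing:

```python
def choose_parser(content_type: str) -> str:
    """Pick a parser from the declared Content-Type alone.

    Hypothetical helper: strips any parameters (e.g. charset) and
    dispatches on the media type, never on the document's bytes.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        return "html"
    if media_type in ("application/xhtml+xml", "application/xml", "text/xml"):
        return "xml"
    return "unknown"

print(choose_parser("text/html; charset=utf-8"))   # html
print(choose_parser("application/xhtml+xml"))      # xml
```

The point of the sketch is what it does not do: it never inspects the
content to second-guess the label.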

I suppose Appendix C didn't explain it well enough, and perhaps that's a risk for polyglot. 

Perhaps the TAG finding on "Authoritative Metadata" needs to be
re-reviewed and made into a consensus requirement (with sniffing
between XHTML and HTML disallowed).

> In December of 2000, before the release of Netscape 6, Gecko had an
> HTML parser mode called the Strict DTD. The "DTD" wasn't an SGML DTD.
> Instead, it was a C++ class that implemented the containment rules
> declared in the SGML DTD. Strict DTD threw away markup that violated
> the HTML 4 Strict containment rules but didn't stop parsing upon
> error.

This doesn't make sense. Why would they do such a thing? And what does
it have to do with XML anyway?

> By August, testing was indicating that the "Strict DTD" parser feature
> was not Web-compatible. But that's another story. The relevant part
> here is that the Strict DTD was being used for XHTML 1.0 Transitional
> served as text/html and was seen as a problem in that context. 

Yes, this makes no sense. The whole point of polyglot is to allow
existing parsers to parse documents in the intersection of the two
languages.

Of course parsing via "Strict DTD" isn't web-compatible. Why would
anyone do such a thing? If it says text/html, use an HTML parser.

> David
> Baron argued that this was a forward-compatibility problem with future
> XHTML DTDs/schemas.

Transitional technology should be transitional and not a gateway for
forward compatibility.

> See:

> U/discussion
> (The entire thread is an interesting read with the benefit of
> hindsight. You can see I was still an XHTML believer at that time.)
> The thread resulted in a telecon, where, among other things, it was decided:
> "- Parse XHTML delivered as text/html using the XML content sink with
> an HTML document. (Instead of using the Strict DTD, which we do
> today.)"

This was a serious mistake. Text delivered as text/html should be
parsed as HTML.

> That decision lasted for less than a month. IIRC, it was already too
> late to parse even the front page of O'Reilly's as XML. 

A publisher reacting to a widely distributed but mistaken browser
implementation isn't evidence of anything.

> Hixie
> and dbaron asked the HTML WG what to do:

> (Note: Member-confidential message, but the existence of the message
> is disclosed by the next link in this message.)

The message implies trying to make a decision that didn't make sense.
Things should be parsed as they are labeled. Perhaps there are a few
special circumstances where mislabeling is so widespread that you might
want to sniff otherwise, but that should not be the default rule.

> The HTML WG responded (in public) that text/html should be treated as
> HTML.

I agree wholeheartedly.

> And so it has been ever since. Appendix C content wasn't transitioning
> anywhere.

This wasn't the fault of Appendix C but of confusion about how to apply
it. I think polyglot is useful, but only if consumers don't try to
second-guess the publisher, whose responsibility it is to label content
with a content type appropriate for parsing it.


> --
> Henri Sivonen

Received on Friday, 22 February 2013 07:22:37 UTC