Legacy encodings vs polyglot (Was: Polyglot Markup Formal Objection Rationale) from Leif Halvard Silli on 2012-11-23 (public-html@w3.org from November 2012)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Fri, 23 Nov 2012 06:51:51 +0100
To: Lachlan Hunt <lachlan.hunt@lachy.id.au>
Cc: HTMLwg <public-html@w3.org>
Message-ID: <20121123065151730553.7b5fb41e@xn--mlform-iua.no>

Lachlan Hunt, Tue, 06 Nov 2012 14:37:17 +0100:

Lachlan, while I understand your angle, I strongly doubt you have
realized the implications of what you propose. Please see below.

> UTF-8 is not the only encoding that meets those requirements. A
> conforming HTML or XHTML document may use UTF-16 with a byte order
> mark, or any encoding which is declared outside the document (e.g. in
> the HTTP Content-Type header).

So you would, since it can be declared via HTTP, also allow UTF-16LE
and UTF-16BE, right?

But then: What about the fact that, per HTML5, it is not an error
to include the BOM in a document that is (externally) labelled as
UTF-16BE or UTF-16LE? (This is nailed down more directly in the
Encoding Standard but is present in HTML5 too.) Whereas in XML, it is a
fatal error to include the BOM if the document is UTF-16BE/UTF-16LE.
What should polyglot markup say about this? And why? Would you perhaps
say that, yes, one may use the UTF-16BE/UTF-16LE labels as long as one
*doesn't* include the BOM? If yes, then wouldn't that kind of go
against the work of HTML5 and the Encoding Standard on this subject?
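
To make the BOM question concrete, here is a minimal sketch in Python. The
function name `bom_status` and its return strings are hypothetical, and the
rules it encodes are only the ones described in this message, not checked
against the specs themselves.

```python
import codecs

# Hypothetical helper: classifies an (external label, bytes) pair according
# to the rules as described in this message. Not a real validator.
def bom_status(label: str, data: bytes) -> str:
    label = label.strip().upper()
    has_bom = data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))
    if label in ("UTF-16BE", "UTF-16LE"):
        # Per the message: HTML5 tolerates the BOM here, XML does not.
        return "conflict" if has_bom else "no-bom"
    if label == "UTF-16":
        # Without a BOM, the byte order has to be guessed.
        return "bom" if has_bom else "ambiguous"
    return "not-utf16"

with_bom = codecs.BOM_UTF16_BE + "<p>".encode("utf-16-be")

status_be_with_bom = bom_status("UTF-16BE", with_bom)
status_be_without_bom = bom_status("UTF-16BE", "<p>".encode("utf-16-be"))

# Python's explicit-endian codec keeps the BOM as a U+FEFF character,
# which shows why a BOM under such a label is not a neutral detail.
decoded_explicit = with_bom.decode("utf-16-be")
```

A polyglot rule for the UTF-16BE/UTF-16LE labels would have to decide what to
do with the "conflict" case, which is exactly the question asked above.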

And what about using "UTF-16" in an external protocol? Per XML and the
UTF-16 spec, such a document doesn't need to contain the BOM. But it
does need to be Big Endian. And HTML5, on its side, strongly agrees.
With the very important difference that such a document needs to be
Little Endian! So perhaps forbid the external 'UTF-16' label, except
when there is a BOM ... ?
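
The endianness disagreement can be illustrated with a short Python sketch.
It assumes, as described above, that one consumer reads BOM-less 'UTF-16' as
Big Endian and the other as Little Endian. The explicit codecs are used only
to model those two readings.

```python
# BOM-less bytes for "<p>", written in Big Endian order.
data = "<p>".encode("utf-16-be")

# Reading assumed for XML and the UTF-16 spec: Big Endian.
as_big_endian = data.decode("utf-16-be")

# Reading assumed for HTML5: Little Endian.
as_little_endian = data.decode("utf-16-le")

same_text = as_big_endian == as_little_endian
```

The same bytes yield "<p>" under one reading and three unrelated CJK
characters under the other, so no BOM-less 'UTF-16' document can satisfy both.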

By insisting on your very wide (sic) definition of "polyglot", we would
end up with 3 sets of rules: the current MIME rules, the Encoding Standard
rules and some polyglot subset rules which would send the message that
"the good old rules still kind of apply!" Thus the polyglot rules that
you propose would sort of do the work of undermining the Encoding
Standard, I would claim. You would make polyglot markup a stronghold
for "the old MIME definitions", so to speak. (Well, one could always
hope for a quick unification of MIME and Encoding Standard et cetera et
cetera.)

Also, I agree with Anne in that just writing about UTF-16 attracts
attention to it. [1]

Similar things can be said about the other legacy encodings that HTML5
(and the Encoding Standard) (re)defines. The simplest example is the
meaning of 'US-ASCII' in HTML5 vs XML. One would thus have to sit down
and figure out a list of polyglot encoding labels. Much work. And much
double/triple speccing.

To validate such a UTF-16 encoded document for polyglot conformance
would require more than is offered today. For instance, the NU
validator issues no warning or error if one declares 'UTF-16' via HTTP
but omits the BOM. (And it does not seem to matter whether I make the
document big or little endian.)

In conclusion: The claim that UTF-8 isn’t the only polyglot encoding is
in principle correct but in practice means much speccing work of largely
academic value. And it would please me to hear that you agree that I at
least have a point.

[1]
<http://www.w3.org/mid/CADnb78jtaK2mmimamVv+dHEzOizMPJjJpgju7SDavv3DnuHatw@mail.gmail.com>
--
leif halvard silli

Received on Friday, 23 November 2012 05:52:23 UTC