Legacy encodings vs polyglot (Was: Polyglot Markup Formal Objection Rationale)

Lachlan Hunt, Tue, 06 Nov 2012 14:37:17 +0100:

Lachlan, while I understand your angle, I strongly doubt you have 
realized the implications of what you propose. Please see below.

> UTF-8 is not the only encoding that meets those requirements.  A 
> conforming HTML or XHTML document may use UTF-16 with a byte order 
> mark, or any encoding which is declared outside the document (e.g. in 
> the HTTP Content-Type header).

So, since the encoding can be declared via HTTP, you would also allow 
UTF-16LE and UTF-16BE, right? 

But then: What about the fact that, per HTML5, it is not an error 
to include the BOM in a document that is (externally) labelled as 
UTF-16BE or UTF-16LE? (This is nailed down more directly in the 
Encoding Standard but is present in HTML5 too.) Whereas in XML, it is a 
fatal error to include the BOM if the document is UTF-16BE/UTF-16LE. 
What should polyglot markup say about this? And why? Would you perhaps 
say that, yes, one may use the UTF-16BE/UTF-16LE labels as long as one 
*doesn't* include the BOM? If yes, then wouldn't that kind of go 
against the work of HTML5 and the Encoding Standard on this subject?
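
To make the byte-level contrast concrete, here is a small Python 
sketch. It is purely my own illustration: Python's codecs stand in for 
the two readings, and the '<p>hi</p>' payload is arbitrary.

    # Illustration only: the same BOM-prefixed bytes, read label-faithfully
    # (as XML would, where the leading FE FF is a fatal error) and read by a
    # BOM-sniffing decoder of the kind HTML5/the Encoding Standard describe.
    data = "\ufeff<p>hi</p>".encode("utf-16-be")   # FE FF + big-endian code units
    print(data[:2].hex())                  # 'feff'
    print(repr(data.decode("utf-16-be")))  # keeps U+FEFF as a leading character
    print(repr(data.decode("utf-16")))     # consumes FE FF as a BOM: '<p>hi</p>'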

And what about using "UTF-16" in an external protocol? Per XML and the 
UTF-16 spec, such a document doesn't need to contain the BOM, but it 
does need to be big-endian. And HTML5, on its side, strongly agrees, 
with the very important difference that such a document needs to be 
little-endian! So perhaps forbid the external 'UTF-16' label, except 
when there is a BOM ... ?
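
Again, only as an illustration (Python codecs standing in for the two 
rules, arbitrary payload): the very same BOM-less bytes, served with an 
external charset=UTF-16 label, come out differently under the XML/RFC 
2781 default and under the Encoding Standard's reading of that label.

    # Illustration only: BOM-less bytes under an external 'UTF-16' label.
    stream = "<p>hi</p>".encode("utf-16-be")   # no BOM, big-endian code units
    print(stream.decode("utf-16-be"))  # XML / RFC 2781 default (big-endian): '<p>hi</p>'
    print(stream.decode("utf-16-le"))  # Encoding Standard mapping of the label
                                       # (little-endian): byte-swapped mojibake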

By insisting on your very wide (sic) definition of "polyglot", we would 
end up with three sets of rules: the current MIME rules, the Encoding 
Standard rules, and some polyglot subset rules which would send the 
message that "the good old rules still kind of apply!" Thus the 
polyglot rules that you propose would, I would claim, do the work of 
undermining the Encoding Standard. You would make polyglot markup a 
stronghold for "the old MIME definitions", so to speak. (Well, one 
could always hope for a quick unification of MIME and the Encoding 
Standard, et cetera, et cetera.)

Also, I agree with Anne that just writing about UTF-16 attracts 
attention to it. [1]

Similar things can be said about the other legacy encodings that HTML5 
(and the Encoding Standard) (re)defines. The simplest example is the 
meaning of 'US-ASCII' in HTML5 vs XML. One would thus have to sit down 
and figure out a list of polyglot encoding labels. Much work. And much 
double/triple speccing.
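
(For the record: HTML5, via the Encoding Standard, treats 'US-ASCII' as 
just another label for windows-1252, whereas XML takes the label at its 
IANA face value. A minimal Python sketch of my own, with cp1252 
standing in for windows-1252 and 0x94 picked as an arbitrary non-ASCII 
octet:)

    octet = b"\x94"
    print(octet.decode("cp1252"))      # '\u201d' under the windows-1252 reading
    try:
        octet.decode("ascii")          # the strict 7-bit reading rejects it
    except UnicodeDecodeError as err:
        print(err)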

To validate such a UTF-16 encoded document for polyglot conformance 
would require more than is offered today. For instance, the NU 
validator issues no warning or error if one declares 'UTF-16' via HTTP 
but omits the BOM. (And it does not seem to matter whether I make the 
document big- or little-endian.) 
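
Just to sketch the kind of extra check that would be needed (purely 
hypothetical code of my own, not anything the NU validator actually 
contains):

    # Hypothetical check: an external 'UTF-16' label with no BOM in the body.
    def check_utf16_polyglot(http_charset: str, body: bytes) -> str:
        if (http_charset.lower() == "utf-16"
                and not body.startswith((b"\xfe\xff", b"\xff\xfe"))):
            return "error: charset=UTF-16 via HTTP, but no BOM in the document"
        return "ok"

    print(check_utf16_polyglot("UTF-16", "<p>hi</p>".encode("utf-16-be")))  # error
    print(check_utf16_polyglot("UTF-16", "<p>hi</p>".encode("utf-16")))     # ok (BOM added)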

In conclusion: The claim that UTF-8 isn’t the only polyglot encoding is 
in principle correct, but in practice it amounts to much speccing work 
of largely academic value. And it would please me to hear that you 
agree that I at least have a point.

[1] 
<http://www.w3.org/mid/CADnb78jtaK2mmimamVv+dHEzOizMPJjJpgju7SDavv3DnuHatw@mail.gmail.com>
-- 
leif halvard silli

Received on Friday, 23 November 2012 05:52:23 UTC