- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Fri, 23 Nov 2012 06:51:51 +0100
- To: Lachlan Hunt <lachlan.hunt@lachy.id.au>
- Cc: HTMLwg <public-html@w3.org>
Lachlan Hunt, Tue, 06 Nov 2012 14:37:17 +0100:

Lachlan, while I understand your angle, I strongly doubt you have realized the implications of what you propose. Please see below.

> UTF-8 is not the only encoding that meets those requirements. A
> conforming HTML or XHTML document may use UTF-16 with a byte order
> mark, or any encoding which is declared outside the document (e.g. in
> the HTTP Content-Type header).

So you would, since it can be declared via HTTP, also allow UTF-16LE and UTF-16BE, right?

But then: what about the fact that, per HTML5, it is not an error to include the BOM in a document that is (externally) labelled as UTF-16BE or UTF-16LE? (This is nailed down more directly in the Encoding Standard, but it is present in HTML5 too.) Whereas in XML, it is a fatal error to include the BOM if the document is labelled UTF-16BE/UTF-16LE. What should polyglot markup say about this? And why? Would you perhaps say that, yes, one may use the UTF-16BE/UTF-16LE labels as long as one *doesn't* include the BOM? If yes, then wouldn't that kind of go against the work of HTML5 and the Encoding Standard on this subject?

And what about using "UTF-16" in an external protocol? Per XML and the UTF-16 spec, such a document does not need to contain the BOM, but it does need to be big-endian. And HTML5, on its side, strongly agrees. With the very important difference that such a document needs to be little-endian! So perhaps forbid the external 'UTF-16' label, except when there is a BOM?

By insisting on your very wide (sic) definition of "polyglot", we would end up with three sets of rules: the current MIME rules, the Encoding Standard rules, and some polyglot subset rules which would send the message that "the good old rules still kind of apply!" Thus the polyglot rules that you propose would, I would claim, sort of do the work of undermining the Encoding Standard. You would make polyglot markup a stronghold for "the old MIME definitions", so to speak. (Well, one could always hope for a quick unification of MIME and the Encoding Standard, et cetera.)

Also, I agree with Anne that just writing about UTF-16 attracts attention to it. [1]

Similar things can be said about the other legacy encodings that HTML5 (and the Encoding Standard) (re)defines. The simplest example is the meaning of 'US-ASCII' in HTML5 versus XML. One would thus have to sit down and figure out a list of polyglot encoding labels. Much work, and much double/triple speccing.

To validate such a UTF-16 encoded document for polyglot conformance would require more than is offered today. For instance, the NU validator issues no warning or error if one declares 'UTF-16' via HTTP but omits the BOM. (And it does not seem to matter whether I make the document big- or little-endian.)

In conclusion: the claim that UTF-8 isn't the only polyglot encoding is in principle correct, but in practice it means much speccing work of largely academic value. And it would please me to hear that you agree that I at least have a point.

[1] <http://www.w3.org/mid/CADnb78jtaK2mmimamVv+dHEzOizMPJjJpgju7SDavv3DnuHatw@mail.gmail.com>

-- 
leif halvard silli
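P.S. For concreteness, here is a minimal sketch of the byte-level difference behind the 'UTF-16' and 'US-ASCII' points above. It is only an illustration, using Python's standard codecs (which follow the Unicode/RFC 2781 definitions); the HTML5/Encoding Standard interpretations are noted in the comments.

```python
import codecs

# U+03A9 GREEK CAPITAL LETTER OMEGA, encoded without a BOM.
be = "\u03A9".encode("utf-16-be")   # b'\x03\xa9'
le = "\u03A9".encode("utf-16-le")   # b'\xa9\x03'

# The same BOM-less bytes read under the two readings of a bare
# external 'UTF-16' label:
print(be.decode("utf-16-be"))  # 'Ω'      -- XML / RFC 2781: treat as big-endian
print(be.decode("utf-16-le"))  # '\ua903' -- Encoding Standard: 'utf-16' label
                               #             decodes as little-endian, so a
                               #             different character comes out

# With a BOM, the byte order is self-describing and both worlds agree:
with_bom = codecs.BOM_UTF16_BE + be
print(with_bom.decode("utf-16"))  # 'Ω' (BOM detected and consumed)

# The 'US-ASCII' label: the Encoding Standard maps it to windows-1252,
# whereas XML takes it at face value, so a byte such as 0x93 differs:
print(bytes([0x93]).decode("windows-1252"))  # '“' (U+201C) under the HTML5 mapping
# bytes([0x93]).decode("ascii") raises UnicodeDecodeError under the XML reading.
```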
Received on Friday, 23 November 2012 05:52:23 UTC