- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Thu, 22 Nov 2012 02:27:31 +0100
- To: John Cowan <cowan@mercury.ccil.org>
- Cc: Anne van Kesteren <annevk@annevk.nl>, www-international@w3.org
John Cowan, Wed, 21 Nov 2012 18:33:53 -0500:
> Leif Halvard Silli scripsit:
>
>> UTF-16LE and UTF-16BE theoretical ability to let a leading FF FE or FE
>> FF represent a ZERO WIDTH NO-BREAK SPACE rather than a BOM, seems to be
>> without value for mark-up languages.
>
> Well, that's true of XML documents, because their content is always
> preceded and followed by markup. But this is not necessarily true of
> HTML documents,

For HTML documents, my statement of course took into account that in HTML (except for the BOM!) all illegal codes/characters are moved from their place in the code to where they belong in the HTML DOM. So there is definitely no use for a ZERO WIDTH NO-BREAK SPACE at the start of an HTML document. (And since it, in that location, is always interpreted as a BOM anyhow, it is a non-issue.)

> nor XML external entities,

An external entity that starts with a ZERO WIDTH NO-BREAK SPACE is not, itself, a (well-formed) XML document.

> nor LMNL documents.

An "other life form"?!

> Note also that an XML document in UTF16-BE or UTF16-LE must have an
> XML declaration saying so.

That requirement exists only when there is no external protocol: "In the absence of external character encoding information (such as MIME headers)".[1] And perhaps there is also a text declaration requirement whenever, quote, "the replacement text of an external entity is to begin with the character U+FEFF". Or did the spec forget to mention that, in this case too, the encoding info could come from the external protocol?

> If there is no XML declaration in a 16-bit
> format document, it is necessarily UTF-16, and XML requires a BOM in
> that case.

First: An external protocol could declare the LE/BE encoding. Second: When there is an external declaration which says "UTF-16", then the requirement to include a BOM is relaxed. The parser could e.g. default to UTF-16LE, as Unicode says.

[1] http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding
--
leif halvard silli
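
The BOM handling discussed in this message can be sketched in Python. This is only an illustration under stated assumptions: the function names `split_bom` and `decode_utf16` are hypothetical, a leading FF FE or FE FF is always consumed as a BOM (the behaviour the message attributes to HTML), and the fallback byte order for a bare "UTF-16" label without a BOM is a caller-chosen parameter rather than something taken from a specification.

```python
import codecs


def split_bom(data: bytes, label: str, default: str = "utf-16-le"):
    """Return (codec_name, payload) for UTF-16 family input.

    Hypothetical helper. A leading FF FE or FE FF is always treated as a
    BOM and removed, even when the external label says UTF-16LE/BE.
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le", data[2:]
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be", data[2:]

    # No BOM: fall back to the externally declared label.
    normalized = label.lower().replace("_", "-")
    if normalized in ("utf-16le", "utf-16-le"):
        return "utf-16-le", data
    if normalized in ("utf-16be", "utf-16-be"):
        return "utf-16-be", data

    # Bare "UTF-16" without a BOM: the byte order here is an assumption
    # supplied by the caller, not derived from the data.
    return default, data


def decode_utf16(data: bytes, label: str, default: str = "utf-16-le") -> str:
    """Decode UTF-16 bytes, never exposing a leading U+FEFF to the caller."""
    codec_name, payload = split_bom(data, label, default)
    return payload.decode(codec_name)


if __name__ == "__main__":
    doc = codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")
    # Python's strict "utf-16-le" codec keeps the leading U+FEFF as a character.
    print(repr(doc.decode("utf-16-le")))
    # The sketch above consumes it as a BOM instead.
    print(repr(decode_utf16(doc, "UTF-16LE")))
```

For contrast, Python's own `utf-16-le` and `utf-16-be` codecs follow the strict reading in which a leading U+FEFF is content, which is the theoretical ability the thread argues has no value for markup.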
Received on Thursday, 22 November 2012 01:28:01 UTC