- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Thu, 22 Nov 2012 02:27:31 +0100
- To: John Cowan <cowan@mercury.ccil.org>
- Cc: Anne van Kesteren <annevk@annevk.nl>, www-international@w3.org
John Cowan, Wed, 21 Nov 2012 18:33:53 -0500:
> Leif Halvard Silli scripsit:
>
>> UTF-16LE and UTF-16BE theoretical ability to let a leading FF FE or FE
>> FF represent a ZERO WIDTH NO-BREAK SPACE rather than a BOM, seems to be
>> without value for mark-up languages.
>
> Well, that's true of XML documents, because their content is always
> preceded and followed by markup. But this is not necessarily true of
> HTML documents,
For HTML documents, my statement of course took into account that in
HTML (except for the BOM!) all illegal codes/characters are moved
from their place in the code to where they belong in the HTML DOM. So
there is definitely no use for a ZERO WIDTH NO-BREAK SPACE at the start
of an HTML document. (And since it is, in that location, always
interpreted as a BOM anyhow, it is a non-issue.)
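To illustrate the theoretical distinction in question, here is a minimal Python sketch. It uses Python's own codec names ("utf-16" and "utf-16-le") as stand-ins for the encoding labels. It shows how Python's decoders treat a leading FF FE, not how an HTML parser behaves; the variable names are only illustrative.

```python
import codecs

# The bytes FF FE followed by "<p>" in little-endian UTF-16.
data = codecs.BOM_UTF16_LE + "<p>".encode("utf-16-le")

# Decoded as plain "utf-16": the leading FF FE is consumed as a BOM.
as_utf16 = data.decode("utf-16")

# Decoded as "utf-16-le": the same two bytes become U+FEFF
# (ZERO WIDTH NO-BREAK SPACE) and remain part of the text.
as_utf16le = data.decode("utf-16-le")

print(repr(as_utf16))
print(repr(as_utf16le))
```

With the "utf-16-le" label, the first decoded character is U+FEFF, which is the case argued above to have no value at the start of a mark-up document.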
> nor XML external entities,
An external entity that starts with a ZERO WIDTH NO-BREAK SPACE is not
itself a (well-formed) XML document.
> nor LMNL documents.
An "other life form"?!
> Note also that an XML document in UTF16-BE or UTF16-LE must have an
> XML declaration saying so.
That requirement exists only when there is no external protocol: "In
the absence of external character encoding information (such as MIME
headers)".[1] And perhaps there is also a text declaration requirement
whenever, quote, "the replacement text of an external entity is to
begin with the character U+FEFF". Or did the spec forget to mention
that, in this case too, the encoding info could come from the external
protocol?
> If there is no XML declaration in a 16-bit
> format document, it is necessarily UTF-16, and XML requires a BOM in
> that case.
First: An external protocol could declare the LE/BE encoding.
Second: When there is an external declaration which says "UTF-16",
then the requirement to include a BOM is relaxed. The parser
could, e.g., default to UTF-16BE, as Unicode says.
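The two points above can be sketched roughly as follows in Python. The helper name `resolve_utf16` and its parameters are hypothetical and come from no spec or library; the sketch ignores the XML declaration entirely and leaves the fallback byte order to the caller.

```python
import codecs
from typing import Optional, Tuple

# Leading byte pairs that act as a BOM, mapped to Python codec names.
BOMS = {codecs.BOM_UTF16_LE: "utf-16-le", codecs.BOM_UTF16_BE: "utf-16-be"}
# External labels that fix the byte order by themselves.
EXPLICIT = {"utf-16le": "utf-16-le", "utf-16be": "utf-16-be"}


def resolve_utf16(data: bytes, external_label: Optional[str],
                  fallback: str) -> Tuple[str, bytes]:
    """Return (codec name, payload bytes) for a 16-bit entity.

    Hypothetical helper: an assumption for illustration only.
    """
    label = (external_label or "").lower()
    # First: an external "UTF-16LE"/"UTF-16BE" label needs no BOM.
    # Assumption: a leading FF FE / FE FF is left in the payload here.
    if label in EXPLICIT:
        return EXPLICIT[label], data
    # Otherwise a leading BOM, if present, decides and is stripped.
    bom_codec = BOMS.get(data[:2])
    if bom_codec is not None:
        return bom_codec, data[2:]
    # Second: external "UTF-16" without a BOM; use the caller's fallback.
    if label == "utf-16":
        return fallback, data
    raise ValueError("no external label and no BOM: byte order unknown")


codec, payload = resolve_utf16("<a/>".encode("utf-16-be"), "UTF-16", "utf-16-be")
print(codec, payload.decode(codec))
```

Only the last branch, where there is neither external information nor a BOM, leaves the byte order undetermined in this sketch.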
[1] http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding
--
leif halvard silli
Received on Thursday, 22 November 2012 01:28:01 UTC