Re: byte order mark article from Leif Halvard Silli on 2012-11-22 (www-international@w3.org from October to December 2012)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Thu, 22 Nov 2012 02:27:31 +0100
To: John Cowan <cowan@mercury.ccil.org>
Cc: Anne van Kesteren <annevk@annevk.nl>, www-international@w3.org
Message-ID: <20121122022731415889.cb80de5c@xn--mlform-iua.no>

John Cowan, Wed, 21 Nov 2012 18:33:53 -0500:
> Leif Halvard Silli scripsit:
> 
>> The theoretical ability of UTF-16LE and UTF-16BE to let a leading FF FE
>> or FE FF represent a ZERO WIDTH NO-BREAK SPACE rather than a BOM seems
>> to be without value for markup languages.
> 
> Well, that's true of XML documents, because their content is always
> preceded and followed by markup.  But this is not necessarily true of
> HTML documents,

For HTML documents, my statement of course took into account that in 
HTML (except for the BOM!) all misplaced codes/characters are moved 
from their place in the source to where they belong in the HTML DOM. So 
there is definitely no use for a ZERO WIDTH NO-BREAK SPACE at the start 
of an HTML document. (And since, in that location, it is always 
interpreted as a BOM anyhow, it is a non-issue.)
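
To make the two readings of a leading FF FE concrete, here is a minimal 
Python sketch. The helper name sniff_bom is hypothetical, it only covers 
UTF-8 and UTF-16 signatures, and it is not the detection algorithm of any 
particular spec or browser:

```python
import codecs

# Hypothetical helper: map a leading BOM to (encoding name, BOM length).
# Only UTF-8 and UTF-16 signatures are handled in this sketch.
def sniff_bom(data: bytes):
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", 2
    return None, 0

# FF FE, then "<p>" in little-endian UTF-16.
data = b"\xff\xfe<\x00p\x00>\x00"
encoding, bom_len = sniff_bom(data)

# BOM reading: the first two bytes are a signature and are dropped.
as_bom = data[bom_len:].decode(encoding)

# ZWNBSP reading: decoded as plain UTF-16LE, the same bytes keep U+FEFF.
as_zwnbsp = data.decode("utf-16-le")

print(repr(as_bom), repr(as_zwnbsp))
```

Under the BOM reading the text is just the markup. Under the ZWNBSP 
reading an extra U+FEFF precedes it, which is the character an HTML 
parser would have no sensible place for.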

> nor XML external entities,

An external entity that starts with a ZERO WIDTH NO-BREAK SPACE is not - 
itself - a (well-formed) XML document.

> nor LMNL documents.

An "other life form"?!

> Note also that an XML document in UTF-16BE or UTF-16LE must have an
> XML declaration saying so.

That requirement exists only when there is no external protocol: "In 
the absence of external character encoding information (such as MIME 
headers)".[1] And perhaps there is also a text declaration requirement 
whenever, quote, "the replacement text of an external entity is to 
begin with the character U+FEFF". Or did the spec forget to include 
that, also in this case, the encoding info could come from the external 
protocol?

>  If there is no XML declaration in a 16-bit
> format document, it is necessarily UTF-16, and XML requires a BOM in
> that case.

 First: An external protocol could declare the LE/BE encoding.
Second: When there is an external declaration which says "UTF-16",
        then the requirement to include a BOM is relaxed. The parser
        could e.g. default to UTF-16BE, as Unicode says.
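
A minimal Python sketch of that second case, assuming the external 
protocol has already told us the label is "UTF-16". The function name 
decode_utf16_labelled and its fallback parameter are hypothetical. The 
fallback byte order is deliberately left to the caller, since which 
default applies is the point under discussion:

```python
import codecs

def decode_utf16_labelled(data: bytes, fallback: str) -> str:
    """Decode bytes labelled "UTF-16" by an external protocol.

    A leading BOM decides the byte order and is dropped. Without a BOM,
    the caller-supplied `fallback` codec name ("utf-16-be" or
    "utf-16-le") is used.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    return data.decode(fallback)

# With a BOM, the fallback is ignored.
with_bom = decode_utf16_labelled(b"\xfe\xff\x00A", "utf-16-le")

# Without a BOM, the fallback decides how the bytes are read.
no_bom_be = decode_utf16_labelled(b"\x00A", "utf-16-be")
no_bom_le = decode_utf16_labelled(b"A\x00", "utf-16-le")

print(with_bom, no_bom_be, no_bom_le)
```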

[1] http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding
-- 
leif halvard silli

Received on Thursday, 22 November 2012 01:28:01 UTC