Re: Comments on "The byte-order mark (BOM) in HTML" from Leif Halvard Silli on 2012-12-05 (www-international@w3.org from October to December 2012)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Wed, 05 Dec 2012 17:56:15 +0100
To: John Cowan <cowan@mercury.ccil.org>
Cc: Norbert Lindenberg <w3@norbertlindenberg.com>, www-international <www-international@w3.org>
Message-id: <20121205175615225627.34f4b0ae@xn--mlform-iua.no>

John Cowan, Wed, 5 Dec 2012 11:31:25 -0500:
> Norbert Lindenberg scripsit:
> 
>> - "no longer ASCII-compatible": What does this mean? Usually when UTF-8
>> is described as ASCII-compatible it means that all byte values that
>> look like ASCII actually are ASCII, and the BOM doesn't break this rule.
> 
> I take it to mean that UTF-8-encoded text containing only characters from
> the ASCII repertoire will be byte-for-byte the same as if it were
> ASCII-encoded text.  This is true iff the UTF-8 data doesn't have a BOM.

Usually the opposite argument is made, namely that the ASCII repertoire 
is fully UTF-8-compatible. It would be nice if the text clarified 
when it is a problem that ASCII + BOM is no longer ASCII. Perhaps 
it relates to Unix tools? The 'UTF-8 and Unicode FAQ for Unix/Linux' [1] 
says that a BOM "would break far too many existing ASCII syntax 
conventions (such as scripts starting with #!)".
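
To make the Unix point concrete, here is a minimal Python sketch. The 
sample script text is made up for illustration, and the sketch assumes 
Python's standard "utf-8-sig" codec, which prepends the BOM when 
encoding. It only shows the byte-level difference, not how any 
particular tool reacts to it.

```python
import codecs

# Hypothetical sample: a tiny shell script using only ASCII characters.
text = "#!/bin/sh\necho hello\n"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")          # UTF-8 without a BOM
utf8_bom_bytes = text.encode("utf-8-sig")  # UTF-8 with a leading BOM

# Without a BOM, ASCII-only text is byte-for-byte identical in both encodings.
same_without_bom = ascii_bytes == utf8_bytes

# With a BOM, three extra bytes (EF BB BF) come first, so the data is no
# longer plain ASCII and no longer begins with the "#!" bytes.
same_with_bom = ascii_bytes == utf8_bom_bytes
has_bom_prefix = utf8_bom_bytes.startswith(codecs.BOM_UTF8)
starts_with_shebang = utf8_bom_bytes.startswith(b"#!")

print(same_without_bom, same_with_bom, has_bom_prefix, starts_with_shebang)
```

A consumer that checks the first two bytes for "#!" would therefore 
not find them in the BOM-prefixed file.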

[1] http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux
-- 
leif halvard silli

Received on Wednesday, 5 December 2012 16:56:43 UTC