Re: Comments on "The byte-order mark (BOM) in HTML" from Martin J. Dürst on 2012-12-06 (www-international@w3.org from October to December 2012)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Thu, 06 Dec 2012 13:15:03 +0900
To: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
CC: John Cowan <cowan@mercury.ccil.org>, Norbert Lindenberg <w3@norbertlindenberg.com>, www-international <www-international@w3.org>
Message-ID: <50C01BC7.9000908@it.aoyama.ac.jp>

On 2012/12/06 1:56, Leif Halvard Silli wrote:
> John Cowan, Wed, 5 Dec 2012 11:31:25 -0500:
>> Norbert Lindenberg scripsit:
>>
>>> - "no longer ASCII-compatible": What does this mean? Usually when UTF-8
>>> is described as ASCII-compatible it means that all byte values that
>>> look like ASCII actually are ASCII, and the BOM doesn't break this rule.
>>
>> I take it to mean that UTF-8-encoded text containing only characters from
>> the ASCII repertoire will be byte-for-byte the same as if it were
>> ASCII-encoded text.  This is true iff the UTF-8 data doesn't have a BOM.
>
> Usually the opposite argument is made, namely that the ASCII repertoire
> is fully UTF-8-compatible.

There are many kinds of "ASCII-compatibility". A very weak one is that 
every codepoint in ASCII can be expressed in the encoding in question. 
Most (all?) EBCDIC variants would qualify, but some Japanese encoding 
variants would not (half-width backslash, anybody?).

A slightly stronger one is that the encoding uses the same codepoints 
for the ASCII characters as ASCII itself. That's true even for UTF-16 
and UTF-32.

A somewhat stronger one is that all ASCII codepoints are expressed as 
the same ASCII byte value as they are in ASCII itself. EBCDICs won't 
need to apply anymore, but some Japanese encoding variants, even 
variants of Shift_JIS and iso-2022-jp, are okay.

An even stronger condition is that in addition to the above, everything 
that looks like an ASCII byte has to represent the corresponding ASCII 
character. Now we get limited to UTF-8 (including the variant with the 
BOM), EUC-JP, and so on, but lots of single-byte encodings are still 
part of the club.

The strongest condition is that an ASCII-only file doesn't change 
when encoded with the encoding in question. This keeps in UTF-8 without 
the BOM, but kicks out UTF-8 with the BOM.
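
The two strongest conditions can be checked mechanically. What follows is 
only a rough sketch in Python, not a definitive test: the function names 
are made up for illustration, the codec names ("utf-8-sig", "shift_jis", 
"euc_jp", "cp037") are Python's own labels, and cp037 merely stands in for 
one EBCDIC variant. The katakana "ソ" is used because its second Shift_JIS 
byte is 0x5C, which looks like an ASCII backslash.

```python
def same_bytes_as_ascii(text: str, encoding: str) -> bool:
    """Strongest condition: ASCII-only text encodes to exactly its ASCII bytes."""
    return text.encode(encoding) == text.encode("ascii")


def non_ascii_char_uses_ascii_bytes(char: str, encoding: str) -> bool:
    """True if the encoding of a non-ASCII character contains a byte below
    0x80, i.e. a byte that looks like ASCII without being an ASCII character."""
    if ord(char) < 0x80:
        raise ValueError("expected a non-ASCII character")
    return any(byte < 0x80 for byte in char.encode(encoding))


if __name__ == "__main__":
    sample = "#!/bin/sh\n"  # ASCII-only text
    for enc in ("utf-8", "utf-8-sig", "shift_jis", "euc_jp", "cp037", "utf-16-be"):
        print(enc, "byte-identical to ASCII:", same_bytes_as_ascii(sample, enc))

    katakana_so = "\u30bd"  # "ソ", encoded as 0x83 0x5C in Shift_JIS
    for enc in ("utf-8", "euc_jp", "shift_jis"):
        print(enc, "reuses ASCII-range bytes:",
              non_ascii_char_uses_ascii_bytes(katakana_so, enc))
```

In this sketch "utf-8" passes the first check and "utf-8-sig" (UTF-8 with 
the BOM) fails it, while both pass the second one; "shift_jis" passes the 
first check for this sample but fails the second.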

> It would be nice if it was clarified in the
> text when it is a problem that ASCII + BOM is no longer ASCII. Perhaps
> it relates to Unix tools? The 'UTF-8 and Unicode FAQ for Unix/Linux'
> says that BOM: [1]  "would break far too many existing ASCII syntax
> conventions (such as scripts starting with #!)"
>
> [1] http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux

This is not limited to Unix/Linux. I'm quite sure there are tons of tools 
and scripts on Windows that were written assuming pure ASCII and 
that would get confused in one way or another if they met a UTF-8 BOM.
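
To illustrate the kind of confusion meant here, below is a hypothetical 
ASCII-assuming check written in Python. The names has_shebang and 
strip_utf8_bom are invented for this sketch and do not come from any real 
tool; the only fact relied on is that the UTF-8 BOM is the byte sequence 
EF BB BF.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded in UTF-8


def has_shebang(data: bytes) -> bool:
    """Naive check, as a tool written for pure ASCII might do it."""
    return data.startswith(b"#!")


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM, if there is one."""
    return data[len(UTF8_BOM):] if data.startswith(UTF8_BOM) else data


if __name__ == "__main__":
    script = "#!/bin/sh\necho hello\n"
    plain = script.encode("utf-8")         # no BOM
    with_bom = script.encode("utf-8-sig")  # Python's codec that prepends a BOM

    print(has_shebang(plain))
    print(has_shebang(with_bom))
    print(has_shebang(strip_utf8_bom(with_bom)))
```

The naive check recognizes the file without the BOM, misses the one with 
the BOM, and works again once the three BOM bytes are removed.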

Regards,   Martin.

Received on Thursday, 6 December 2012 04:16:02 UTC