Re: Comments on "The byte-order mark (BOM) in HTML"

"Martin J. Dürst", Thu, 06 Dec 2012 13:15:03 +0900:
> On 2012/12/06 1:56, Leif Halvard Silli wrote:
>> John Cowan, Wed, 5 Dec 2012 11:31:25 -0500:
>>> Norbert Lindenberg scripsit:
>>> 
>>>> - "no longer ASCII-compatible": What does this mean?

So what does ASCII-compatible mean? You, John and Norbert presented 6 
alternatives (I think I myself meant the same as Norbert):

   1, file's ASCII subset === ASCII:
>>>> that all byte values that look like ASCII actually are ASCII,
>>>> and the BOM doesn't break this rule.

   2, file === UTF-8’s ASCII subset:
>>> I take it to mean that UTF-8-encoded text containing only characters from
>>> the ASCII repertoire will be byte-for-byte the same as if it were
>>> ASCII-encoded text.  This is true iff the UTF-8 data doesn't have a BOM.

   3, file = same chars, code position may deviate:
> There are many ways of "ASCII-compatibility". A very weak one is that 
> every codepoint in ASCII can be expressed in the encoding in 
> question. Most (all?) EBCDIC variants would qualify, but some 
> Japanese encoding variants would not (half-width backslash, anybody?).

   4, file == the ASCII chars located at ASCII’s code points:
> A slightly stronger one is that the encoding uses the same codepoint 
> for ASCII as ASCII itself. That's true even for UTF-16 and UTF-32.

   5, same as 4, but excluding UTF-16 and UTF-32:
> A somewhat stronger one is that all ASCII codepoints are expressed as 
> the same ASCII byte value as they are in ASCII itself. EBCDICs won't 
> need to apply anymore, but some Japanese encoding variants, even 
> variants of Shift_JIS and iso-2022-jp, are okay.

   1, file's ASCII subset === ASCII (=== what Norbert said):
> An even stronger condition is that in addition to the above, 
> everything that looks like an ASCII byte has to represent the 
> corresponding ASCII character. Now we get limited to UTF-8 (including 
> the variant with the BOM), EUC-JP, and so on, but lots of single-byte 
> encodings are still part of the club.

   6, file's ASCII subset would have passed an ASCII conversion test 
      (see the sketch below):
> An even stronger condition is that an ASCII-only file doesn't change 
> when encoded with the encoding in question. This keeps in UTF-8 
> without BOM, but kicks out UTF-8 with the BOM.
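
For illustration, a rough Python sketch (my own, not from any of the 
quoted mails) of the byte-level difference behind options 2 and 6: 
ASCII-only text encoded as UTF-8 without a BOM is byte-for-byte the 
same as the ASCII encoding, and the three BOM bytes EF BB BF are 
exactly what breaks that identity:

    text = "Hello, world"                   # ASCII-only sample text

    as_ascii    = text.encode("ascii")
    as_utf8     = text.encode("utf-8")      # UTF-8 without BOM
    as_utf8_bom = text.encode("utf-8-sig")  # UTF-8 with the BOM prepended

    print(as_ascii == as_utf8)       # True:  passes the conversion test of option 6
    print(as_ascii == as_utf8_bom)   # False: the BOM breaks byte-for-byte identity
    print(as_utf8_bom[:3].hex())     # 'efbbbf'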

Comments: 

1. All these options mean that the document should clarify both WHAT 
it talks about and WHY it talks about it.

2. The WHYs (that is: why ASCII-compatibility is good): For option 6 
the goodness relates to compatibility with BOM-averse tools. For 
option 2, it would probably relate to compatibility with 
non-ASCII-averse consumers. But it is unclear to me why, when talking 
about UTF-8 - a Unicode format, and thus meant for a repertoire beyond 
ASCII - we should warn people that the BOM makes the file non-ASCII.

3. I think John (option 2) might be closest to what the document 
means: it tries to warn people who look at a page and see only ASCII, 
forgetting that there might even be a BOM. But as noted above, I don't 
know why the document would make this an issue/concern - this pure 
ASCII compatibility would only matter in some gotcha cases.

4. Also, authors need to be aware that pure ASCII on the Web doesn't 
exist: a pure ASCII file which says <meta charset="US-ASCII"/> would 
in practice be handled as the parser's default encoding (typically 
Windows-1252). This might also only be a concern in more gotcha cases, 
but still ...
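
To illustrate the practical effect - a sketch of my own, assuming the 
parser indeed handles the US-ASCII label as Windows-1252: a pure ASCII 
byte stream looks the same either way, but a stray non-ASCII byte is 
then silently rendered per Windows-1252 instead of being rejected as 
non-ASCII:

    pure = b"Hello, world"
    print(pure.decode("ascii") == pure.decode("cp1252"))  # True: identical for pure ASCII

    stray = b"Hello \x94world\x94"     # 0x94 is not an ASCII byte
    print(stray.decode("cp1252"))      # 'Hello ”world”' - silently shown as curly quotes
    try:
        stray.decode("ascii")          # a strict US-ASCII decoder would reject the byte
    except UnicodeDecodeError as err:
        print(err)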

>> It would be nice if it was clarified in the
>> text when it is a problem that ASCII + BOM is no longer ASCII. Perhaps
>> it relates to Unix tools? The 'UTF-8 and Unicode FAQ for Unix/Linux'
>> says that BOM: [1]  "would break far too many existing ASCII syntax
>> conventions (such as scripts starting with #!)"
>> 
>> [1] http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux

> 
> This is not limited to Unix/Linux. I'm very sure there's tons of 
> tools and scripts on Windows that have been written assuming pure 
> ASCII and that would get confused in one way or another if they met 
> a UTF-8 BOM.

So here you seem to be talking about option 2 and not option 6. But if 
they assume pure ASCII, then it is not enough to remove the BOM.
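
To make the '#!' case from the FAQ concrete - again only a sketch of 
my own: the interpreter lookup on Unix-like systems checks the first 
two bytes of the file, so a BOM in front of '#!' defeats it even 
though the visible text is pure ASCII:

    script = "#!/bin/sh\necho hello\n"

    with open("plain.sh", "w", encoding="utf-8") as f:     # no BOM
        f.write(script)
    with open("bom.sh", "w", encoding="utf-8-sig") as f:   # BOM prepended
        f.write(script)

    for name in ("plain.sh", "bom.sh"):
        with open(name, "rb") as f:
            first_two = f.read(2)
        print(name, first_two, first_two == b"#!")  # plain.sh: True; bom.sh: False
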
-- 
leif halvard silli

Received on Thursday, 6 December 2012 10:55:24 UTC