- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Thu, 06 Dec 2012 11:54:58 +0100
- To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Cc: John Cowan <cowan@mercury.ccil.org>, Norbert Lindenberg <w3@norbertlindenberg.com>, www-international <www-international@w3.org>
"Martin J. Dürst", Thu, 06 Dec 2012 13:15:03 +0900:
> On 2012/12/06 1:56, Leif Halvard Silli wrote:
>> John Cowan, Wed, 5 Dec 2012 11:31:25 -0500:
>>> Norbert Lindenberg scripsit:
>>>
>>>> - "no longer ASCII-compatible": What does this mean?

So what does ASCII-compatible mean? You, John and Norbert presented 6
alternatives (I think I myself meant the same as Norbert):

1, file's ASCII subset === ASCII:

>>>> that all byte values that look like ASCII actually are ASCII,
>>>> and the BOM doesn't break this rule.

2, file === UTF-8's ASCII subset:

>>> I take it to mean that UTF-8-encoded text containing only characters
>>> from the ASCII repertoire will be byte-for-byte the same as if it
>>> were ASCII-encoded text. This is true iff the UTF-8 data doesn't
>>> have a BOM.

3, file = same chars, code position may deviate:

> There are many ways of "ASCII-compatibility". A very weak one is that
> every codepoint in ASCII can be expressed in the encoding in
> question. Most (all?) EBCDIC variants would qualify, but some
> Japanese encoding variants would not (half-width backslash, anybody?).

4, file == the ASCII chars located at ASCII's code points:

> A slightly stronger one is that the encoding uses the same codepoint
> for ASCII as ASCII itself. That's true even for UTF-16 and UTF-32.

5, same as 4, but excluding UTF-16 and UTF-32:

> A somewhat stronger one is that all ASCII codepoints are expressed as
> the same ASCII byte value as they are in ASCII itself. EBCDICs won't
> need to apply anymore, but some Japanese encoding variants, even
> variants of Shift_JIS and iso-2022-jp, are okay.

1, file's ASCII subset === ASCII (=== what Norbert said):

> An even stronger condition is that in addition to the above,
> everything that looks like an ASCII byte has to represent the
> corresponding ASCII character. Now we get limited to UTF-8 (including
> the variant with the BOM), EUC-JP, and so on, but lots of single-byte
> encodings are still part of the club.
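[Editorial illustration, not part of the original mail.] The contrast between option 2 (an ASCII-only file is byte-for-byte identical to its ASCII encoding) and option 1 (every byte that looks like ASCII really is ASCII, a rule the BOM does not violate) can be sketched in Python, whose "utf-8-sig" codec is UTF-8 with a BOM:

```python
text = "hello"                       # ASCII-only content

plain = text.encode("utf-8")         # UTF-8 without a BOM
with_bom = text.encode("utf-8-sig")  # UTF-8 with a leading BOM

# Option 2: byte-for-byte identical to the ASCII encoding.
assert plain == text.encode("ascii")      # holds without a BOM
assert with_bom != text.encode("ascii")   # fails once a BOM is present

# Option 1: every byte below 0x80 still means the ASCII character.
# The BOM bytes EF BB BF are all >= 0x80, so they never masquerade
# as ASCII; this weaker rule survives even with the BOM.
assert with_bom[:3] == b"\xef\xbb\xbf"
assert all(b >= 0x80 for b in with_bom[:3])
```

So the same file passes or fails "ASCII compatibility" depending on which of the definitions above is meant.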
6, file's ASCII subset would have passed ASCII conversion test:

> An even stronger condition is that an ASCII-only file doesn't change
> when encoded with the encoding in question. This keeps in UTF-8
> without BOM, but kicks out UTF-8 with the BOM.

Comments:

1. All these options mean that the document should clarify both WHAT
it talks about and WHY it talks about it.

2. The WHYs (that is: why ASCII-compatibility is good): For option 6
the goodness is related to being compatible with BOM-adverse tools.
Whereas for option 2, it would probably be related to being compatible
with non-ASCII-adverse consumers. But it is unclear to me why, when
talking about UTF-8 - a Unicode format, thus meant for a non-ASCII
repertoire - we should warn people that the BOM makes the file
non-ASCII.

3. I think John (option 2) might be closest to what the document
means. Thus, that the document tries to warn people who look at a page
and see only ASCII - forgetting that there might even be a BOM. But as
told above, I don't know why the document would make this an
issue/concern - this pure ASCII compatibility would only be a concern
in some gotcha cases.

4. Also, authors need to be aware that pure ASCII on the Web doesn't
exist: A pure ASCII file which says <meta charset="US-ASCII"/> would
default to the parser's default encoding (typically Windows-1252).
This might also only be a concern in more gotcha cases, but still ...

>> It would be nice if it was clarified in the
>> text when it is a problem that ASCII + BOM is no longer ASCII. Perhaps
>> it relates to Unix tools? The 'UTF-8 and Unicode FAQ for Unix/Linux'
>> says that BOM: [1] "would break far too many existing ASCII syntax
>> conventions (such as scripts starting with #!)"
>>
>> [1] http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux
>
> This is not limited to Unix/Linux.
> I'm very sure there's tons of
> tools and scripts on Windows that have been written assuming pure
> ASCII and that would get confused in one way or another if they met
> an UTF-8 BOM.

So here you seem to be talking about option 2 and not option 6. But if
they assume pure ASCII, then it is not enough to remove the BOM.
-- 
leif halvard silli
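[Editorial illustration, not part of the original mail.] Both the shebang gotcha quoted from [1] and the closing point - that for a pure-ASCII consumer, stripping the BOM is not enough - can be sketched in Python:

```python
# The #! gotcha: a script prefixed with a UTF-8 BOM no longer begins
# with the two magic bytes "#!", so a Unix-style loader would not
# recognize the shebang line.
script = "#!/bin/sh\necho hi\n"

assert script.encode("utf-8").startswith(b"#!")          # shebang intact
assert not script.encode("utf-8-sig").startswith(b"#!")  # BOM hides "#!"

# A BOM-aware reader recovers the text by decoding with "utf-8-sig",
# which strips the BOM:
assert script.encode("utf-8-sig").decode("utf-8-sig") == script

# The closing point: if the content itself is non-ASCII, removing the
# BOM still leaves bytes >= 0x80, so a pure-ASCII tool is not helped.
data = "café".encode("utf-8-sig")
stripped = data[3:] if data.startswith(b"\xef\xbb\xbf") else data
assert not all(b < 0x80 for b in stripped)  # still not pure ASCII
```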
Received on Thursday, 6 December 2012 10:55:24 UTC