- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Thu, 06 Dec 2012 13:15:03 +0900
- To: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- CC: John Cowan <cowan@mercury.ccil.org>, Norbert Lindenberg <w3@norbertlindenberg.com>, www-international <www-international@w3.org>
On 2012/12/06 1:56, Leif Halvard Silli wrote:
> John Cowan, Wed, 5 Dec 2012 11:31:25 -0500:
>> Norbert Lindenberg scripsit:
>>
>>> - "no longer ASCII-compatible": What does this mean? Usually when UTF-8
>>> is described as ASCII-compatible it means that all byte values that
>>> look like ASCII actually are ASCII, and the BOM doesn't break this rule.
>>
>> I take it to mean that UTF-8-encoded text containing only characters from
>> the ASCII repertoire will be byte-for-byte the same as if it were
>> ASCII-encoded text. This is true iff the UTF-8 data doesn't have a BOM.
>
> Usually the opposite argument is made, namely that the ASCII repertoire
> is fully UTF-8-compatible.

There are many degrees of "ASCII compatibility".

A very weak one is that every codepoint in ASCII can be expressed in the
encoding in question. Most (all?) EBCDIC variants would qualify, but some
Japanese encoding variants would not (half-width backslash, anybody?).

A slightly stronger one is that the encoding uses the same codepoints for
ASCII as ASCII itself. That's true even for UTF-16 and UTF-32.

A somewhat stronger one is that all ASCII codepoints are expressed with the
same byte values as in ASCII itself. EBCDICs need not apply anymore, but
some Japanese encoding variants, even variants of Shift_JIS and
iso-2022-jp, are okay.

An even stronger condition is that, in addition to the above, everything
that looks like an ASCII byte has to represent the corresponding ASCII
character. Now we are limited to UTF-8 (including the variant with the
BOM), EUC-JP, and so on, but lots of single-byte encodings are still part
of the club.

An even stronger condition is that an ASCII-only file doesn't change when
encoded with the encoding in question. This keeps in UTF-8 without a BOM,
but kicks out UTF-8 with the BOM.

> It would be nice if it was clarified in the
> text when it is a problem that ASCII + BOM is no longer ASCII. Perhaps
> it relates to Unix tools? The 'UTF-8 and Unicode FAQ for Unix/Linux'
> says that BOM: [1] "would break far too many existing ASCII syntax
> conventions (such as scripts starting with #!)"
>
> [1] http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux

This is not limited to Unix/Linux. I'm very sure there are tons of tools
and scripts on Windows that have been written assuming pure ASCII and that
would get confused in one way or another if they met a UTF-8 BOM.

Regards,   Martin.
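P.S. A minimal Python 3 sketch of the byte-level distinctions above; the
codec names 'cp500' (an EBCDIC variant), 'utf-16-le', and 'utf-8-sig'
(UTF-8 with a BOM) are just Python's names and are not part of the
discussion itself:

    text = "#!/bin/sh\n"                    # pure ASCII, starts with a shebang
    ascii_bytes = text.encode("ascii")

    # EBCDIC ('cp500'): can express all of ASCII, but with different bytes.
    print(text.encode("cp500") == ascii_bytes)      # False

    # UTF-16: same codepoints as ASCII, but two bytes per character.
    print(text.encode("utf-16-le") == ascii_bytes)  # False

    # UTF-8 without a BOM: an ASCII-only file is unchanged.
    print(text.encode("utf-8") == ascii_bytes)      # True

    # UTF-8 with a BOM ('utf-8-sig'): EF BB BF is prepended, so
    # byte-for-byte identity with ASCII is lost ...
    bom_bytes = text.encode("utf-8-sig")
    print(bom_bytes == ascii_bytes)                 # False

    # ... and tools that look at the first two bytes, such as the
    # "#!" check mentioned in the FAQ, no longer see the shebang.
    print(ascii_bytes.startswith(b"#!"))            # True
    print(bom_bytes.startswith(b"#!"))              # False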
Received on Thursday, 6 December 2012 04:16:02 UTC