- From: <medavis2@us.ibm.com>
- Date: Tue, 02 Feb 1999 13:21:33 -0800
- To: Francois Yergeau <yergeau@alis.com>
- Cc: Larry Masinter <masinter@parc.xerox.com>, "Martin J. Duerst" <duerst@w3.org>, Paul Hoffman / IMC <phoffman@imc.org>, MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>, ietf-charsets@iana.org
- Message-id: <8725670C.007544F0.00@d53mta03h.boulder.ibm.com>
A few comments, marked with *** (since my mailer is deficient)!

Francois Yergeau <yergeau@alis.com> on 02/02/99 12:34:14 PM

To: Larry Masinter <masinter@parc.xerox.com>
cc: "Martin J. Duerst" <duerst@w3.org>, "Paul Hoffman / IMC" <phoffman@imc.org>, MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>, ietf-charsets@iana.org (bcc: Mark Davis/Cupertino/IBM)
Subject: RE: draft-hoffman-utf16-01.txt available

At 12:10 02/02/99 -0800, Larry Masinter wrote:
>I think this is the only position consistent with having
>three different charset registrations: "BOM should not
>be sent with UTF-16BE or UTF-16LE, only with UTF-16."

Labelling UTF-16BE (or LE) and then sending a BOM is not inconsistent, it's only redundant. And this redundancy can be useful. The explicit label lets the recipient of a MIME object know the endianness without looking inside, which is good. But if the object is then moved elsewhere by a non-MIME protocol (FTP, disk copy, etc.), there is a BOM that the recipient can look at.

Since the problem with BOMs is their ambiguity -- is it a real BOM or an intended ZWNBSP? -- I currently lean toward a "SHOULD NOT put a BOM" unless it's mandatory (such as in XML), in which case it is also unambiguous.

*** I disagree (if I understand you correctly). If we have the three labels, then as a sender my role is clear. If the text might come from a source that uses a BOM (XML file, Windows file), I send it as UTF-16. If it doesn't (any other Unicode string!), then I send UTF-16BE/LE (depending on the polarity).

As a receiver, my role is also clear. If I receive UTF-16BE/LE, then any initial <FE,FF> is a real ZWNBSP. If I receive UTF-16, then any initial <FE,FF> is a BE BOM, and any initial <FF,FE> is an LE BOM.

Let's face it--the BOM is a hack designed to work with systems where text streams are untagged. And unfortunately, it also has an equally valid other semantic (a price of the merger with 10646, since SC2 objected to having a character with only the semantic of the BOM).

Any proposed change to interpret a ZWNBSP as a BOM in UTF-16BE/LE just introduces an ambiguity that does not need to be there. The whole reason the Unicode Consortium defined the terms UTF-16BE and UTF-16LE was to eliminate ambiguity.
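
As an illustration only, the receiver rule described above might be sketched in Python as follows. The function name `decode_by_label` is hypothetical, and the fallback to big-endian when a "UTF-16" stream carries no BOM is an assumption that this message does not address.

```python
import codecs


def decode_by_label(data: bytes, label: str) -> str:
    """Decode octets under the three-label scheme (hypothetical helper)."""
    label = label.upper()
    if label == "UTF-16BE":
        # Byte order comes from the label, so an initial FE FF is kept
        # as a real ZWNBSP (U+FEFF) and not treated as a BOM.
        return data.decode("utf-16-be")
    if label == "UTF-16LE":
        # Same rule for little-endian: an initial FF FE stays in the text.
        return data.decode("utf-16-le")
    if label == "UTF-16":
        # Here an initial FE FF or FF FE is a BOM: consume it and use it
        # to pick the byte order.
        if data[:2] == codecs.BOM_UTF16_BE:
            return data[2:].decode("utf-16-be")
        if data[:2] == codecs.BOM_UTF16_LE:
            return data[2:].decode("utf-16-le")
        # ASSUMPTION (not stated in the message): no BOM means big-endian.
        return data.decode("utf-16-be")
    raise ValueError("unsupported charset label: " + label)


if __name__ == "__main__":
    octets = b"\xfe\xff\x00A"
    print(repr(decode_by_label(octets, "UTF-16BE")))  # ZWNBSP is preserved
    print(repr(decode_by_label(octets, "UTF-16")))    # BOM is consumed
```

The same four octets decode to different strings depending only on the label, which is the distinction the three registrations are meant to make explicit.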
***

Martin Dürst:
>> We wouldn't have to change XML, only to add a clarification to
>> say that "UTF-16" in the XML spec means only the case
>> charset="UTF-16", and not the others.

That doesn't work. The producer of an XML entity is not necessarily the MIME processor that will tag it, and may not know whether the entity will be tagged UTF-16 or UTF-16(BE|LE). Does it put a BOM?

And further, I happen to think that all XML entities (in UTF-16) having a BOM is a Good Thing. The XML spec is designed such that one can always determine the character encoding without external info, let's keep it that way.

*** Even if XML did not require a BOM, it would not be ambiguous! Look at Appendix F in http://www.xml.com/axml/target.html#sec-guessing. The file would just have to have the initial '<?xml' like all other encodings. To quote:

"Because each XML entity not in UTF-8 or UTF-16 format must begin with an XML encoding declaration, in which the first characters must be '<?xml', any conforming processor can detect, after two to four octets of input, which of the following cases apply. In reading this list, it may help to know that in UCS-4, '<' is "#x0000003C" and '?' is "#x0000003F", and the Byte Order Mark required of UTF-16 data streams is "#xFEFF".
...
00 3C 00 3F: UTF-16, big-endian, no Byte Order Mark (and thus, strictly speaking, in error)
3C 00 3F 00: UTF-16, little-endian, no Byte Order Mark (and thus, strictly speaking, in error)
..."
***

-- François
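
For illustration, the octet patterns quoted from Appendix F above could be checked with a short Python sketch like the one below. The function name `sniff_utf16_xml` and the returned strings are hypothetical. The two BOM checks are derived only from the quoted statement that the Byte Order Mark is #xFEFF; they do not reproduce the rows elided from the quotation, and every other encoding is reported as undetermined.

```python
from typing import Optional


def sniff_utf16_xml(prefix: bytes) -> Optional[str]:
    """Guess UTF-16 byte order from the first octets of an XML entity.

    Hypothetical helper: covers only the UTF-16 cases discussed above.
    """
    head = prefix[:4]
    # A serialized U+FEFF at the start reveals the byte order directly.
    if head[:2] == b"\xfe\xff":
        return "UTF-16, big-endian, with BOM"
    if head[:2] == b"\xff\xfe":
        return "UTF-16, little-endian, with BOM"
    # No BOM: '<?' encoded as two 16-bit code units, as in the quoted rows.
    if head == b"\x00\x3c\x00\x3f":
        return "UTF-16, big-endian, no BOM"
    if head == b"\x3c\x00\x3f\x00":
        return "UTF-16, little-endian, no BOM"
    # Anything else is outside the scope of this sketch.
    return None


if __name__ == "__main__":
    print(sniff_utf16_xml("<?xml version='1.0'?>".encode("utf-16-be")))
    print(sniff_utf16_xml("<?xml version='1.0'?>".encode("utf-16-le")))
```

Only the first four octets are inspected, which matches the "two to four octets of input" mentioned in the quotation.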
Received on Tuesday, 2 February 1999 16:23:32 UTC