RE: draft-hoffman-utf16-01.txt available from medavis2@us.ibm.com on 1999-02-02 (ietf-charsets@w3.org from January to March 1999)

From: <medavis2@us.ibm.com>
Date: Tue, 02 Feb 1999 13:21:33 -0800
To: Francois Yergeau <yergeau@alis.com>
Cc: Larry Masinter <masinter@parc.xerox.com>, "Martin J. Duerst" <duerst@w3.org>, Paul Hoffman / IMC <phoffman@imc.org>, MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>, ietf-charsets@iana.org
Message-id: <8725670C.007544F0.00@d53mta03h.boulder.ibm.com>

A few comments, marked with *** (since my mailer is deficient)!



Francois Yergeau <yergeau@alis.com> on 02/02/99 12:34:14 PM

To:   Larry Masinter <masinter@parc.xerox.com>
cc:   "Martin J. Duerst" <duerst@w3.org>, "Paul Hoffman / IMC"
      <phoffman@imc.org>, MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>,
      ietf-charsets@iana.org (bcc: Mark Davis/Cupertino/IBM)
Subject:  RE: draft-hoffman-utf16-01.txt available

At 12:10 02/02/99 -0800, Larry Masinter wrote:
>I think this is the only position consistent with having
>three different charset registrations: "BOM should not
>be sent with UTF-16BE or UTF-16LE, only with UTF-16."

Labelling UTF-16BE (or LE) and then sending a BOM is not inconsistent, it's
only redundant.

And this redundancy can be useful.  The explicit label lets the recipient
of a MIME object know the endianness without looking inside, which is good.
But if the object is then moved elsewhere by a non-MIME protocol (FTP,
disk copy, etc.), there is a BOM that the recipient can look at.

Since the problem with BOMs is their ambiguity -- is it a real BOM or
an intended ZWNBSP? -- I currently lean toward a "SHOULD NOT put a BOM"
unless it's mandatory (such as in XML), in which case it is also
unambiguous.

*** I disagree (if I understand you correctly).

If we have the three labels, then as a sender my role is clear. If the text
might come from a source that uses a BOM (XML file, Windows file), I send it
as UTF-16. If it doesn't (any other Unicode string!), then I will send
UTF-16BE/LE (depending on the polarity).
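
The sender rule above can be sketched as a small function. This is only an illustration: the name choose_charset_label and its two parameters are hypothetical, not taken from the draft or from any library.

```python
def choose_charset_label(may_carry_bom: bool, big_endian: bool) -> str:
    """Pick a charset label following the sender rule described above.

    may_carry_bom: True if the text may come from a BOM-using source
                   (for example an XML file or a Windows file).
    big_endian:    byte order ("polarity") of the serialized text.
    Both parameters are assumptions made for this sketch.
    """
    if may_carry_bom:
        # A leading FE FF / FF FE may be a BOM, so use the generic label.
        return "UTF-16"
    # No BOM is expected, so the label alone carries the byte order.
    return "UTF-16BE" if big_endian else "UTF-16LE"
```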

As a receiver, my role is also clear. If I receive UTF-16BE/LE, then any
initial <FE,FF> is a real ZWNBSP. If I receive UTF-16, then any initial
<FE,FF> is a BE BOM, any initial <FF,FE> is an LE BOM.
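
A minimal sketch of that receiver rule, assuming Python's standard codecs. The function name decode_labelled is hypothetical, and rejecting BOM-less data labelled "UTF-16" is a choice made for this sketch only; the rule above does not cover that case.

```python
def decode_labelled(data: bytes, charset: str) -> str:
    """Decode UTF-16 data according to its charset label."""
    label = charset.upper()
    if label == "UTF-16BE":
        # Python's utf-16-be codec does not strip a leading U+FEFF,
        # so an initial FE FF survives as a real ZWNBSP.
        return data.decode("utf-16-be")
    if label == "UTF-16LE":
        # Likewise, an initial FF FE stays in the text as U+FEFF.
        return data.decode("utf-16-le")
    if label == "UTF-16":
        if data[:2] == b"\xfe\xff":      # BE BOM: consume it
            return data[2:].decode("utf-16-be")
        if data[:2] == b"\xff\xfe":      # LE BOM: consume it
            return data[2:].decode("utf-16-le")
        # Assumption of this sketch: refuse to guess the byte order.
        raise ValueError("data labelled UTF-16 has no BOM")
    raise ValueError("unsupported charset label: " + charset)
```

For the bytes FE FF 00 41, the label "UTF-16BE" yields U+FEFF followed by "A", while the label "UTF-16" yields just "A".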

Let's face it--the BOM is a hack designed to work with systems where text
streams are untagged. And unfortunately, it also has an equally valid other
semantic (a price of the merger with 10646, since SC2 objected to having a
character with only the semantic of the BOM.)

Any proposed change to interpret a ZWNBSP as BOM in UTF-16BE/LE just
introduces an ambiguity that does not need to be there. The whole reason
the Unicode consortium defined the terms UTF-16BE and UTF-16LE was to
eliminate ambiguity.
***

Martin Dürst:
>> We wouldn't have to change XML, only to add a clarification to
>> say that "UTF-16" in the XML spec means only the case
>> charset="UTF-16", and not the others.

That doesn't work.  The producer of an XML entity is not necessarily the
MIME processor that will tag it, and may not know whether the entity will
be tagged UTF-16 or UTF-16(BE|LE).  Does it put a BOM?

And further, I happen to think that all XML entities (in UTF-16) having a
BOM is a Good Thing.  The XML spec is designed such that one can always
determine the character encoding without external info, let's keep it that
way.

*** Even if XML did not require a BOM, it would not be ambiguous! Look at
Appendix F in
http://www.xml.com/axml/target.html#sec-guessing. The file would just have
to have the initial '<?xml' like all other encodings. To quote:

"Because each XML entity not in UTF-8 or UTF-16 format must begin with an
XML encoding declaration, in which the first characters must be '<?xml',
any conforming processor can detect, after two to four octets of input,
which of the following cases apply. In reading this list, it may help to
know that in UCS-4, '<' is "#x0000003C" and '?' is "#x0000003F", and the
Byte Order Mark required of UTF-16 data streams is "#xFEFF".

...
00 3C 00 3F: UTF-16, big-endian, no Byte Order Mark (and thus, strictly
speaking, in error)
3C 00 3F 00: UTF-16, little-endian, no Byte Order Mark (and thus, strictly
speaking, in error)
..."
***
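
The detection described in the quoted passage can be sketched as follows. Only the two BOM-less UTF-16 cases quoted above and the two BOM cases are handled; the cases elided in the quote are deliberately left out. The function name sniff_utf16 and the returned strings are hypothetical.

```python
from typing import Optional

def sniff_utf16(head: bytes) -> Optional[str]:
    """Guess the UTF-16 variant of an XML entity from its first octets.

    Returns None when none of the four patterns below matches.
    """
    if head[:2] == b"\xfe\xff":
        return "UTF-16, big-endian, with BOM"
    if head[:2] == b"\xff\xfe":
        return "UTF-16, little-endian, with BOM"
    if head[:4] == bytes.fromhex("003c003f"):   # '<' '?' in big-endian order
        return "UTF-16, big-endian, no BOM"
    if head[:4] == bytes.fromhex("3c003f00"):   # '<' '?' in little-endian order
        return "UTF-16, little-endian, no BOM"
    return None
```

Passing the first four octets of "<?xml" encoded as big-endian UTF-16 without a BOM selects the third case, so the byte order is recoverable from the data even when no BOM is present.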


--
François

Received on Tuesday, 2 February 1999 16:23:32 UTC