RE: draft-hoffman-utf16-01.txt available from Francois Yergeau on 1999-02-03 (ietf-charsets@w3.org from January to March 1999)

From: Francois Yergeau <yergeau@alis.com>
Date: Tue, 02 Feb 1999 22:14:16 -0500
To: medavis2@us.ibm.com
Cc: Larry Masinter <masinter@parc.xerox.com>, "Martin J. Duerst" <duerst@w3.org>, Paul Hoffman / IMC <phoffman@imc.org>, MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>, ietf-charsets@iana.org
Message-id: <3.0.5.32.19990202221416.009ae330@www.alis.com>

À 13:21 02/02/99 -0800, medavis2@us.ibm.com a écrit :
>Francois Yergeau <yergeau@alis.com> on 02/02/99 12:34:14 PM
>>Since the problem with BOMs is their ambiguousness -- is it a real BOM or
>>an intended ZWNBSP? -- I currently lean toward a "SHOULD NOT put a BOM"
>>unless it's mandatory (such as in XML), in which case it is also
>>unambiguous.
>
>*** I disagree (if I understand you correctly).
>
>If we have the three labels, then as a sender my role is clear. If the text
>might come from a source that uses BOM (XML file, Windows file) send as
>UTF-16.

Two problems here: 1) use UTF-16 only if you *know* that the source has a
BOM or that it's BE, otherwise you might send LE without a BOM; "might
come" is not enough; not every Windows file will have a BOM, it depends on
the creating application.  2) Are we sure  that we want to drop the more
specific BE|LE tags, and thereby force the receiver to peek inside the
object, just because we know there is a BOM?  I'm not convinced yet.

> If it doesn't (any other Unicode string!), then I will send
>UTF-16BE/LE (depending on the polarity).

And perhaps generate illegal stuff, because there turns out to be a BOM
after all.  I'm questionning the enforceability of forbidding BOMs.  Let's
not make a rule that cannot be obeyed by reasonable implementations in most
cases.

>As a receiver, my role is also clear. If I receive UTF-16BE/LE, then any
>initial <FE,FF> is a real ZWNBSP.

...is purported to be a real ZWNBSP.

> If I receive UTF-16, then any initial
><FE,FF> is a BE BOM, any initial <FF,FE> is an LE BOM.

No argument here.

>Let's face it--the BOM is a hack designed to work with systems where text
>streams are untagged. And unfortunately, it also has an equally valid other
>semantic (a price of the merger with 10646, since SC2 objected to having a
>character with only the semantic of the BOM.)

Sigh.  Too true.  Things would be much simpler if a BOM could only be a
BOM.  Too late.

Facing this, there doesn't seem to be any ideal solution, guaranteed to
always work. We have to decide on the best compromise.  I think it's
slightly better not to forbid BOMs completely with the BE/LE tags, but I'm
willing to go for forbidding them if that's the consensus.  At this point,
the most important thing is to register those $%?&* tags in a workable
manner and make UTF-16 useable on the Internet.


One last thing:
>*** Even if XML did not require a BOM, it would not be unambiguous! Look at
>Appendix F in
>http://www.xml.com/axml/target.html#sec-guessing. The file would just have
>to have the initial '<?xml' like all other encodings.

Not necessarily.  The initial '<?xml' is not required for UTF-16 (and
UTF-8) XML entities.  Such entities may begin with a comment, white space
or an element start tag.  It's the mandated BOM that makes UTF-16 entities
auto-detectable.   Requiring the initial '<?xml' would be an incompatible
change to the XML spec, invalidating existing data and applications.



-- 
François

Received on Friday, 5 February 1999 19:25:09 UTC