RE: draft-hoffman-utf16-01.txt available from Martin J. Duerst on 1999-02-03 (ietf-charsets@w3.org from January to March 1999)

From: Martin J. Duerst <duerst@w3.org>
Date: Wed, 03 Feb 1999 22:11:12 +0900
To: medavis2@us.ibm.com
Cc: Francois Yergeau <yergeau@alis.com>, Larry Masinter <masinter@parc.xerox.com>, Paul Hoffman / IMC <phoffman@imc.org>, MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>, ietf-charsets@iana.org
Message-id: <199902031756.CAA11537@sh.w3.mag.keio.ac.jp>

At 13:21 99/02/02 -0800, medavis2@us.ibm.com wrote:

> Francois Yergeau <yergeau@alis.com> on 02/02/99 12:34:14 PM

> Since the problem with BOMs is their ambiguousness -- is it a real BOM or
> an intended ZWNBSP? -- I currently lean toward a "SHOULD NOT put a BOM"
> unless it's mandatory (such as in XML), in which case it is also
> unambiguous.

I lean more towards "MUST NOT". There is no requirement from XML on
"UTF-16BE" or "UTF-16LE".

> *** I disagree (if I understand you correctly).
> 
> If we have the three labels, then as a sender my role is clear. If the text
> might come from a source that uses BOM (XML file, Windows file) send as
> UTF-16. If it doesn't (any other Unicode string!), then I will send
> UTF-16BE/LE (depending on the polarity).
> 
> As a receiver, my role is also clear. If I receive UTF-16BE/LE, then any
> initial <FE,FF> is a real ZWNBSP. If I receive UTF-16, then any initial
> <FE,FF> is a BE BOM, any initial <FF,FE> is an LE BOM.

Exactly. If we have different labels, but they all more or less mean
the same, that doesn't make sense.
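
To make the receiver's rule quoted above concrete, here is a minimal
sketch in Python. It is only an illustration, not text from the draft:
the function name decode_labeled is made up for this example, and the
big-endian fallback for BOM-less "UTF-16" is an assumption, since the
discussion above does not settle that case.

```python
import codecs


def decode_labeled(data: bytes, label: str) -> str:
    """Decode UTF-16 octets according to the charset label (hypothetical helper)."""
    label = label.upper()
    if label == "UTF-16BE":
        # No BOM is expected here, so an initial FE FF stays in the text
        # as a real ZWNBSP (U+FEFF).
        return data.decode("utf-16-be")
    if label == "UTF-16LE":
        # Same rule for the little-endian label: nothing is stripped.
        return data.decode("utf-16-le")
    if label == "UTF-16":
        # Under the plain label, an initial FE FF or FF FE is a BOM:
        # it selects the byte order and is not part of the text.
        if data.startswith(codecs.BOM_UTF16_BE):
            return data[2:].decode("utf-16-be")
        if data.startswith(codecs.BOM_UTF16_LE):
            return data[2:].decode("utf-16-le")
        # ASSUMPTION: no BOM under "UTF-16" is treated as big-endian.
        return data.decode("utf-16-be")
    raise ValueError("not a UTF-16 label: " + label)
```

The same octets FE FF 00 41 therefore decode to ZWNBSP followed by "A"
under "UTF-16BE", but to just "A" under "UTF-16".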

> Martin Dürst:
> >> We wouldn't have to change XML, only to add a clarification to
> >> say that "UTF-16" in the XML spec means only the case
> >> charset="UTF-16", and not the others.
> 
> That doesn't work.  The producer of an XML entity is not necessarily the
> MIME processor that will tag it, and may not know whether the entity will
> be tagged UTF-16 or UTF-16(BE|LE).  Does it put a BOM?

It puts a BOM or not depending on the environment it is in. On a plain
file system, I personally would put a BOM.
The MIME processor that sends things out should know the environment,
and should either use the appropriate tag (i.e. just "UTF-16" if it's
the file system above), or apply its own policy and do the work needed
for that (e.g. stripping off the BOM and adding the appropriate tag
("UTF-16BE" or "UTF-16LE")).
The MIME processor has quite a few choices. What's important is that
it knows what it's dealing with, on both sides. That's the same
problem for all other charsets, isn't it?
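
The two choices open to the sending MIME processor could be sketched
like this in Python. This is a rough illustration under assumptions:
the function name label_for_sending and the strip_bom flag are
hypothetical, and the input is assumed to be UTF-16 data from an
environment (such as the file system above) that puts a BOM first.

```python
import codecs


def label_for_sending(data: bytes, strip_bom: bool):
    """Return (charset_label, payload) for BOM-prefixed UTF-16 data.

    Hypothetical helper: strip_bom stands in for the sender's policy.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        order = "BE"
    elif data.startswith(codecs.BOM_UTF16_LE):
        order = "LE"
    else:
        # Without a BOM the byte order has to come from knowledge of
        # the environment; this sketch does not try to guess it.
        raise ValueError("no BOM: byte order must be known some other way")
    if strip_bom:
        # Policy 2: remove the BOM and say the byte order in the label.
        return "UTF-16" + order, data[2:]
    # Policy 1: leave the data alone and use the plain label.
    return "UTF-16", data
```

Either way the label and the octets agree, which is all the receiver
needs.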

> *** Even if XML did not require a BOM, it would not be unambiguous! Look at
> Appendix F in
> http://www.xml.com/axml/target.html#sec-guessing. The file would just have
> to have the initial '<?xml' like all other encodings. To quote:
> 
> "Because each XML entity not in UTF-8 or UTF-16 format must begin with an
> XML encoding declaration, in which the first characters must be '<?xml',
> any conforming processor can detect, after two to four octets of input,
> which of the following cases apply. In reading this list, it may help to
> know that in UCS-4, '<' is "#x0000003C" and '?' is "#x0000003F", and the
> Byte Order Mark required of UTF-16 data streams is "#xFEFF".
> 
> ...
> 00 3C 00 3F: UTF-16, big-endian, no Byte Order Mark (and thus, strictly
> speaking, in error)
> 3C 00 3F 00: UTF-16, little-endian, no Byte Order Mark (and thus, strictly
> speaking, in error)
> ..."

Yes, but it's not in error if there is an external label,
because the external label has precedence.
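
As a rough sketch of that precedence in Python (the function name
classify_xml_start and its return values are invented for this
illustration, and only the UTF-16 cases from the quoted list are
covered):

```python
def classify_xml_start(prefix: bytes, external_label=None):
    """Return (description, in_error) for the first octets of an XML entity.

    Hypothetical helper covering only the UTF-16 cases quoted above.
    """
    if external_label is not None:
        # An external label (e.g. a MIME charset parameter) has
        # precedence, so a missing BOM is not an error in this case.
        return external_label, False
    if prefix[:2] == bytes.fromhex("FEFF"):
        return "UTF-16, big-endian, with BOM", False
    if prefix[:2] == bytes.fromhex("FFFE"):
        return "UTF-16, little-endian, with BOM", False
    if prefix[:4] == bytes.fromhex("003C003F"):
        # '<?' in big-endian UTF-16 with no BOM in front of it.
        return "UTF-16, big-endian, no BOM", True
    if prefix[:4] == bytes.fromhex("3C003F00"):
        # '<?' in little-endian UTF-16 with no BOM in front of it.
        return "UTF-16, little-endian, no BOM", True
    return "not recognized by this sketch", True
```

With no label, 00 3C 00 3F is flagged as "strictly speaking, in error";
with external_label="UTF-16BE" the same octets are accepted.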

Regards,   Martin.

#-#-#  Martin J. Du"rst, World Wide Web Consortium
#-#-#  mailto:duerst@w3.org   http://www.w3.org

Received on Wednesday, 3 February 1999 13:46:49 UTC