Re: Comments on draft-yergeau-rfc2279bis-00.txt from Mark Davis on 2002-10-18 (ietf-charsets@w3.org from October to December 2002)

From: Mark Davis <mark.davis@us.ibm.com>
Date: Thu, 17 Oct 2002 19:45:34 -0700
To: Kenneth Whistler <kenw@sybase.com>
Cc: duerst@w3.org, ietf-charsets@iana.org
Message-id: <OF586BA5DF.B747FF84-ON88256C56.000DFDFA@us.ibm.com>
I think what Markus was referring to was the following.

In the case of UTF-16, we clearly distinguish the tag "UTF-16" from the tag
"UTF-16BE". With "UTF-16", an initial FE FF means a BOM, and is not
logically part of the content, and can be stripped without affecting that
content. With "UTF-16BE", an initial FE FF is a ZWNBSP, *is* logically part
of the content, and *cannot* be stripped without affecting that content.
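
As a small illustration of that distinction, Python's standard codecs happen to behave the same way. This is only a sketch: the codec names "utf-16" and "utf-16-be" are Python's spellings, assumed here to stand in for the "UTF-16" and "UTF-16BE" tags, and the sample bytes are made up for the example.

```python
# Big-endian UTF-16 bytes: FE FF, then "A" (00 41).
data = b"\xfe\xff\x00A"

# Read as "UTF-16": the leading FE FF is taken as a BOM and consumed.
as_utf16 = data.decode("utf-16")

# Read as "UTF-16BE": the same bytes decode to U+FEFF followed by "A".
as_utf16be = data.decode("utf-16-be")

print(repr(as_utf16))    # 'A'
print(repr(as_utf16be))  # '\ufeffA'
```

The bytes are identical in both calls; only the label changes whether the first code unit counts as content.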

Very unfortunately, there is no tag difference that lets us know which
variety of UTF-8 we are dealing with. In well-formed text tagged with
"UTF-8", an initial EF BB BF could stand for a BOM, if the system is
interpreting it that way, or it could simply stand for the legitimate
character ZWNBSP. You don't know. You can, of course, guess. And your guess
is pretty likely to be correct. But you don't know for certain. And if you
simply strip a character willy-nilly, and yet purport not to be changing
the text, you are violating Unicode conformance clause C10.

The real way to solve this is to have two distinct tags, one where an
initial EF BB BF stands for a BOM, and one where it doesn't. That way any
protocol can be absolutely clear as to the meaning.
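
To make the two readings concrete, here is a minimal sketch using two Python codecs. Note the assumption: "utf-8-sig" is a Python-specific codec name, not one of the tags discussed in this thread; it is used only because it happens to implement the "initial EF BB BF is a signature" reading, while Python's plain "utf-8" codec keeps the character as content.

```python
data = b"\xef\xbb\xbfhi"

# Python's "utf-8" codec treats EF BB BF as content (U+FEFF).
kept = data.decode("utf-8")

# Python's "utf-8-sig" codec treats an initial EF BB BF as a signature
# and drops it.
stripped = data.decode("utf-8-sig")

print(repr(kept))      # '\ufeffhi'
print(repr(stripped))  # 'hi'

# Only the first reading round-trips to the original bytes.
assert kept.encode("utf-8") == data
assert stripped.encode("utf-8") != data
```

A receiver that sees only the label "UTF-8" cannot tell which of these two decoders the sender had in mind.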

Of course, this is just a guess as to what Markus meant ;-)

Mark
___
mark.davis@us.ibm.com
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799



                                                                                                                           
Kenneth Whistler <kenw@sybase.com>
2002.10.17 19:13
Please respond to Kenneth Whistler

To:      duerst@w3.org
cc:      ietf-charsets@iana.org
Subject: Re: Comments on draft-yergeau-rfc2279bis-00.txt
                                                                                                                           
                                                                                                                           



Martin asked:

> At 09:12 02/10/17 -0700, Markus Scherer wrote:
> >Patrik Fältström wrote:
> >
> >>What I hear on this list is that the consensus is that BOM SHOULD NOT be
> >>used. I would like it to be MUST NOT be used in Internet protocols, which
> >>leads to tagged UTF-8 text be illegal if the BOM exists in the text.
> >
> >
> >That would violate the Unicode standard.
>
> Hello Markus,
>
> Can you give the details of why and how (in terms e.g. of conformance
> clauses in the Unicode Standard)?

I think Markus may have overstated the case.

It is certainly possible for an Internet protocol specification
to make BOM-initial UTF-8 text illegal *for that protocol*, if the
relevant protocol definers deem it such. That does not violate
the Unicode Standard, any more than if such a protocol made the
use of the backslash character illegal *for that protocol*.
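
A protocol-level restriction of this kind could look roughly like the sketch below. The function name `check_payload` and the error message are hypothetical, invented for the example; no actual protocol is being described.

```python
import codecs


def check_payload(payload: bytes) -> str:
    """Hypothetical protocol rule: reject UTF-8 text that starts with EF BB BF."""
    if payload.startswith(codecs.BOM_UTF8):
        raise ValueError("initial EF BB BF is not allowed by this protocol")
    # Strict decoding still rejects ill-formed UTF-8.
    return payload.decode("utf-8")
```

The rule constrains what the protocol accepts; it says nothing about whether the rejected bytes are well-formed UTF-8, which they are.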

The Unicode Standard allows BOM-initial UTF-8. The reason it does
so is because encoding conversions should not drop data if
converting between UTF-16 (or UTF-32), which might have an initial
BOM, and UTF-8.
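
A minimal sketch of that lossless conversion, using Python's standard codecs; the sample bytes are made up for the example, and "utf-16-be" is chosen here so that the decoder does not interpret the first code unit as a signature.

```python
# UTF-16 (big-endian) bytes for U+FEFF followed by "A".
utf16be = b"\xfe\xff\x00A"

# Decode without treating the first code unit as a signature...
text = utf16be.decode("utf-16-be")

# ...and re-encode as UTF-8: U+FEFF becomes EF BB BF, nothing is dropped.
utf8 = text.encode("utf-8")
print(utf8.hex(" "))  # ef bb bf 41

# Converting back reproduces the original bytes exactly.
assert utf8.decode("utf-8").encode("utf-16-be") == utf16be
```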

However, the Unicode Standard does not require or recommend the use
of a BOM with UTF-8, since its use as a signature is superfluous
in that encoding form, and as y'all have discussed, if anything, it
is harmful in that context for many protocols and for the ASCII
compatibility of UTF-8 data streams. The exact wording being
considered for the Unicode 4.0 revision is:

  "When represented in UTF-8, the byte order mark turns into
   the byte sequence <EF BB BF>. Its usage at the beginning of
   a UTF-8 data stream is neither required nor recommended by
   the Unicode Standard, but its presence is not considered
   non-conformant for the UTF-8 encoding scheme."

And then there is a bunch more language a bit later about the
care that is necessary when handling BOMs when converting between
encoding schemes.

--Ken
Received on Thursday, 17 October 2002 22:46:56 UTC