Re: Comments on draft-yergeau-rfc2279bis-00.txt from Kenneth Whistler on 2002-10-18 (ietf-charsets@w3.org from October to December 2002)

From: Kenneth Whistler <kenw@sybase.com>
Date: Thu, 17 Oct 2002 19:13:43 -0700 (PDT)
To: duerst@w3.org
Cc: ietf-charsets@iana.org
Message-id: <200210180213.TAA23111@birdie.sybase.com>

Martin asked:

> At 09:12 02/10/17 -0700, Markus Scherer wrote:
> >Patrik Fältström wrote:
> >
> >>What I hear on this list is that the consensus is that BOM SHOULD NOT be 
> >>used. I would like it to be MUST NOT be used in Internet protocols, which 
> >>leads to tagged UTF-8 text be illegal if the BOM exists in the text.
> >
> >
> >That would violate the Unicode standard.
> 
> Hello Markus,
> 
> Can you give the details of why and how (in terms e.g. of conformance
> clauses in the Unicode Standard)?

I think Markus may have overstated the case.

It is certainly possible for an Internet protocol specification
to make BOM-initial UTF-8 text illegal *for that protocol*, if the
relevant protocol definers deem it such. That does not violate
the Unicode Standard, any more than if such a protocol made the
use of the backslash character illegal *for that protocol*.
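
As a rough illustration of such a protocol-level rule, here is a
minimal Python sketch. The names check_payload and ProtocolError are
hypothetical, invented for this example, and are not taken from any
actual protocol specification.

```python
import codecs


class ProtocolError(ValueError):
    """Hypothetical error type, used only for this illustration."""


def check_payload(data: bytes) -> str:
    """Decode a UTF-8 payload under an assumed 'no leading BOM' rule."""
    # codecs.BOM_UTF8 is the byte sequence EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        raise ProtocolError("leading BOM is not permitted by this protocol")
    return data.decode("utf-8")
```

The restriction lives entirely in the protocol layer: the bytes being
rejected are still well-formed UTF-8.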

The Unicode Standard allows BOM-initial UTF-8. It does so because
encoding conversions should not drop data when converting between
UTF-16 (or UTF-32), which might have an initial BOM, and UTF-8.
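
A small Python sketch of that kind of lossless conversion, assuming
the converter treats U+FEFF as ordinary data. It uses the explicit
"utf-16-le" codec for that reason; Python's generic "utf-16" codec
would instead consume the BOM while decoding. The variable names are
only illustrative.

```python
import codecs

# UTF-16LE bytes with an initial BOM (FF FE) followed by "hi".
utf16_bytes = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# Decode with the explicit-endianness codec so U+FEFF is kept as data.
text = utf16_bytes.decode("utf-16-le")

# Re-encode without dropping anything: the BOM becomes EF BB BF.
utf8_bytes = text.encode("utf-8")

print(utf8_bytes.hex(" "))
```

Whether to strip the leading U+FEFF afterwards is then a separate
decision for the receiving application or protocol.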

However, the Unicode Standard does not require or recommend the use
of a BOM with UTF-8, since its use as a signature is superfluous
in that encoding form, and as y'all have discussed, if anything, it
is harmful in that context for many protocols and for the ASCII
compatibility of UTF-8 data streams. The exact wording being
considered for the Unicode 4.0 revision is:

  "When represented in UTF-8, the byte order mark turns into
   the byte sequence <EF BB BF>. Its usage at the beginning of
   a UTF-8 data stream is neither required nor recommended by
   the Unicode Standard, but its presence is not considered
   non-conformant for the UTF-8 encoding scheme."

And then there is a bunch more language a bit later about the
care that is necessary when handling BOMs when converting between
encoding schemes.

--Ken

Received on Thursday, 17 October 2002 22:14:42 UTC