RE: Comments on draft-yergeau-rfc2279bis-00.txt

At 14:55 02/10/02 -0700, McDonald, Ira wrote:
>Hi,
>
>I can't find Martin Duerst's suggested revisions but...

See http://lists.w3.org/Archives/Public/ietf-charsets/2002AprJun/0042.html

where I wrote:

 >>>>
5. Byte order mark (BOM)

This section needs more work. The 'change log' says that it's
mostly taken from the UTF-16 RFC. But the BOM for UTF-8 is
much less necessary, and much more of a problem, than for UTF-16.
We should clearly say that with IETF protocols, character encodings
are always either labeled or fixed, and therefore the BOM SHOULD
(and MUST at least for small segments) never be used for UTF-8.
And we should clearly give the main argument, namely that it
breaks US-ASCII compatibility (US-ASCII encoded as UTF-8
(without a BOM) stays exactly the same, but US-ASCII encoded
as UTF-8 with a BOM is different).
 >>>>


This doesn't make for actual wording, so I propose to insert the
following at label <37> in
http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-01.txt:

IETF protocols are based on the principle that the character
encoding is either fixed (i.e. a certain field in a certain
protocol is always in the same encoding) or labeled (e.g.
with the 'charset' parameter in the Content-Type header of
a MIME entity). In both cases, the BOM as an encoding identifier
is unnecessary. Also, the compatibility of UTF-8 with US-ASCII
is important in many IETF protocols. However, while UTF-8 without
the BOM is compatible with US-ASCII (encoding a US-ASCII string
in UTF-8 results in exactly the same bytes), UTF-8 with a BOM is
not compatible with US-ASCII, because the BOM is not allowed in
US-ASCII. Also, while a BOM was always allowed for UTF-8 in
ISO 10646, there are many implementations that do not process
a BOM appropriately.

Therefore, senders SHOULD NOT use the BOM in larger, usually
labeled, pieces of text (e.g. MIME entities), and MUST NOT
use it in smaller protocol elements (usually with a fixed
encoding). Receivers SHOULD recognize and remove the BOM
in larger, usually labeled, pieces of text (e.g. MIME entities).
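
As a rough illustration of the two points above (not part of the
proposed wording), here is a minimal Python sketch. It assumes
Python's standard 'utf-8-sig' codec as the BOM-emitting sender; the
helper names is_ascii and strip_bom are made up for this example.

```python
import codecs

ascii_text = "Content-Type: text/plain"

# Sender side: the same US-ASCII string, encoded with and without a BOM.
plain = ascii_text.encode("utf-8")         # no BOM
with_bom = ascii_text.encode("utf-8-sig")  # prepends EF BB BF


def is_ascii(data: bytes) -> bool:
    """Return True if every byte is valid US-ASCII (hypothetical helper)."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


def strip_bom(data: bytes) -> bytes:
    """Receiver side: remove one leading UTF-8 BOM, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


print(is_ascii(plain))               # BOM-less UTF-8 is still US-ASCII
print(is_ascii(with_bom))            # the BOM bytes are outside US-ASCII
print(strip_bom(with_bom) == plain)  # a tolerant receiver recovers the text
```

A receiver written this way only handles a BOM at the very start of
the data, which matches the "recognize and remove" wording for larger
pieces of text.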


Any comments?      Regards,     Martin.





>This IETF standard should NOT encourage the use of leading BOM in
>streams of UTF-8 text.  The optional use of leading BOM in UTF-8 (as
>I know Martin said) destroys the crucial property that US-ASCII
>is a perfect subset of UTF-8 and that US-ASCII can pass _without
>harm_ through UTF-8 handling software libraries.
>
>Specifically, in the printer industry, the optional presence of
>leading BOM in UTF-8 attribute string values sent over-the-wire
>in the Internet Printing Protocol/1.1 (IPP/1.1, RFC 2910)
>has caused bugs, but has _never_ provided any utility.
>
>The use of detection of leading BOM by software that guesses the
>charset encoding of arbitrary text is pernicious and dangerous.
>
>UTF-8 never needs a 'byte-order' signature.  The concatenation and
>substring extraction bugs inherent in allowing/encouraging leading
>BOM in UTF-8 are serious issues.
>
>Cheers,
>- Ira McDonald (co-editor of Printer MIB v2)
>   High North Inc
>
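
The concatenation problem mentioned above can be sketched in a few
lines of Python. This is only an illustration under assumptions: the
two strings are invented, 'utf-8-sig' stands in for a sender that
always emits a BOM, and count_stray_boms is a hypothetical helper.

```python
import codecs

BOM = codecs.BOM_UTF8  # b"\xef\xbb\xbf"


def count_stray_boms(data: bytes) -> int:
    """Count BOM byte sequences that start anywhere after offset 0."""
    return data.count(BOM, 1)


# Two values, each produced by a sender that always prepends a BOM.
part_a = "job-".encode("utf-8-sig")
part_b = "name".encode("utf-8-sig")

# Naive byte-level concatenation leaves the second BOM in mid-string.
joined = part_a + part_b

# A BOM-aware decoder removes only the leading BOM; the embedded one
# survives as U+FEFF inside the text.
decoded = joined.decode("utf-8-sig")

print(count_stray_boms(joined))
print(decoded == "job-name")
print("\ufeff" in decoded)
```

If neither sender had emitted a BOM, the concatenation would simply
be the bytes of "job-name".
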
>
>-----Original Message-----
>From: Patrik Fältström [mailto:paf@cisco.com]
>Sent: Wednesday, October 02, 2002 5:35 PM
>To: Francois Yergeau
>Cc: ietf-charsets@iana.org; Bert Wijnen
>Subject: Re: Comments on draft-yergeau-rfc2279bis-00.txt
>
>
>On Thursday, September 19, 2002, at 06:49 AM, Francois Yergeau wrote:
>
> > I think I have covered most outstanding comments, with the notable
> > exception of the BOM issue raised by Martin Dürst. This one is neither
> > trivial nor uncontroversial, and I have not seen anything resembling a
> > consensus, so it remains open (no changes to the draft).
>
>[2 weeks have passed again, and I have not seen any comments on this
>list on this]
>
>If anyone agrees with Martin that changes and text about the BOM issue
>_ARE_ needed, let me know no later than one week from now (i.e. October 9).
>If I don't see anyone screaming, I declare consensus for this draft,
>and I'll take over from here.
>
>      Thanks to all of you for all help!
>
>          paf

Received on Thursday, 3 October 2002 04:03:25 UTC