RE: Comments on draft-yergeau-rfc2279bis-00.txt from Mark Davis on 2002-10-03 (ietf-charsets@w3.org from October to December 2002)

From: Mark Davis <mark.davis@us.ibm.com>
Date: Wed, 02 Oct 2002 17:17:06 -0700
To: "McDonald, Ira" <imcdonald@sharplabs.com>
Cc: Bert Wijnen <bwijnen@lucent.com>, Francois Yergeau <FYergeau@alis.com>, ietf-charsets@iana.org, "'Patrik Fältström'" <paf@cisco.com>
Message-id: <OF548AAAF4.CF4CEC99-ON88256C47.00012CDD@us.ibm.com>

I agree that it should not be encouraged, but it should be recognized.

The BOM is also not necessary in a 16-bit UTF either; one can explicitly
use UTF-16BE or UTF-16LE, and of course it complicates things. So ideally
the BOM would not be used there either. However, the BOM in either case is
in widespread usage, and is allowed in UTF-8.
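
As a rough illustration of the UTF-16 point, here is a minimal Python
sketch. It assumes only Python's standard codec names ("utf-16",
"utf-16-be", "utf-16-le"); the variable names are made up for the example.

```python
import codecs

text = "A"

# The generic "utf-16" codec writes a BOM, then the platform's native byte order.
with_bom = text.encode("utf-16")

# The explicit-endian codecs fix the byte order by label and write no BOM.
big_endian = text.encode("utf-16-be")
little_endian = text.encode("utf-16-le")

# When decoding, "utf-16" consumes a leading BOM as a byte-order signature,
# while "utf-16-be" keeps it as an ordinary U+FEFF character.
signed_be = codecs.BOM_UTF16_BE + big_endian
as_generic = signed_be.decode("utf-16")
as_explicit = signed_be.decode("utf-16-be")
```

With the explicit labels the byte order is carried out of band, so the BOM
adds nothing; with the generic label it is the only in-band hint.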

From my perspective, what *would* be very useful would be to have two
distinct tags for UTF-8 data. One that allowed the BOM and one (like
UTF-16BE) that specifically did not. (Of course, whenever you say 'does not
allow the BOM', that really means that an initial U+FEFF is interpreted as
a real character as part of the contents, and not stripped).
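
The two interpretations can be sketched in Python. This is only an
illustration: the function names are hypothetical, and Python's "utf-8-sig"
and "utf-8" codecs stand in for the two proposed tags (no such pair of
registered charset labels is implied).

```python
import codecs

def decode_bom_allowed(data: bytes) -> str:
    # Hypothetical "BOM allowed" tag: one leading U+FEFF is treated as a
    # signature and stripped from the decoded text.
    return data.decode("utf-8-sig")

def decode_bom_disallowed(data: bytes) -> str:
    # Hypothetical "no BOM" tag: a leading U+FEFF is a real character and
    # stays in the decoded text.
    return data.decode("utf-8")

data = codecs.BOM_UTF8 + b"abc"
stripped = decode_bom_allowed(data)   # three characters
kept = decode_bom_disallowed(data)    # four characters, starting with U+FEFF
```

The same bytes therefore yield different text depending on which label
accompanies them, which is why the distinction would need to live in the tag.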

Mark
___
mark.davis@us.ibm.com
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799

"McDonald, Ira" <imcdonald@sharplabs.com>
2002.10.02 14:55

To:       'Patrik Fältström' <paf@cisco.com>, Francois Yergeau <FYergeau@alis.com>
cc:       ietf-charsets@iana.org, Bert Wijnen <bwijnen@lucent.com>
Subject:  RE: Comments on draft-yergeau-rfc2279bis-00.txt

Hi,

I can't find Martin Duerst's suggested revisions but...

This IETF standard should NOT encourage the use of leading BOM in
streams of UTF-8 text.  The optional use of leading BOM in UTF-8 (as
I know Martin said) destroys the crucial property that US-ASCII
is a perfect subset of UTF-8 and that US-ASCII can pass _without
harm_ through UTF-8 handling software libraries.
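
The subset property can be checked with a short Python sketch. The helper
name is_us_ascii is made up for this example; codecs.BOM_UTF8 is the
standard-library constant for the three-byte UTF-8 signature EF BB BF.

```python
import codecs

def is_us_ascii(data: bytes) -> bool:
    # US-ASCII uses only the byte values 0x00 through 0x7F.
    return all(byte < 0x80 for byte in data)

plain = "hello".encode("utf-8")    # pure ASCII text encoded as UTF-8
signed = codecs.BOM_UTF8 + plain   # the same text with a leading BOM

unchanged = plain == b"hello"      # UTF-8 output is byte-identical to ASCII
plain_ok = is_us_ascii(plain)
signed_ok = is_us_ascii(signed)    # the BOM bytes are all above 0x7F
```

Without the BOM, an ASCII-only consumer sees exactly the bytes it expects;
with it, the stream begins with three bytes that are not US-ASCII at all.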

Specifically, in the printer industry, the optional presence of
leading BOM in UTF-8 attribute string values sent over-the-wire
in the Internet Printing Protocol/1.1 (IPP/1.1, RFC 2910)
has caused bugs, but has _never_ provided any utility.

Detection of a leading BOM by software that guesses the charset
encoding of arbitrary text is pernicious and dangerous.

UTF-8 never needs a 'byte-order' signature.  The concatenation and
substring extraction bugs inherent in allowing/encouraging leading
BOM in UTF-8 are serious issues.
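
A minimal Python sketch of the concatenation problem follows. The function
naive_concat and the sample values are hypothetical; it assumes a receiver
that strips at most one leading signature, as Python's "utf-8-sig" codec
does.

```python
import codecs

def naive_concat(*parts: bytes) -> bytes:
    # Byte-level concatenation, as done by code that is unaware of BOMs.
    return b"".join(parts)

first = codecs.BOM_UTF8 + b"foo"
second = codecs.BOM_UTF8 + b"bar"

# Only the first signature is removed; the second BOM survives in the
# middle of the text as a stray U+FEFF.
joined = naive_concat(first, second).decode("utf-8-sig")

# Byte-level prefix tests and substring extraction are affected as well.
has_prefix = first.startswith(b"foo")
tail = joined[3:]
```

Here joined is "foo", then U+FEFF, then "bar", and a plain prefix test on
the signed bytes fails even though the text itself starts with "foo".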

Cheers,
- Ira McDonald (co-editor of Printer MIB v2)
  High North Inc

-----Original Message-----
From: Patrik Fältström [mailto:paf@cisco.com]
Sent: Wednesday, October 02, 2002 5:35 PM
To: Francois Yergeau
Cc: ietf-charsets@iana.org; Bert Wijnen
Subject: Re: Comments on draft-yergeau-rfc2279bis-00.txt

On Thursday, September 19, 2002, at 06:49 AM, Francois Yergeau wrote:

> I think I have covered most outstanding comments, with the notable
> exception of the BOM issue raised by Martin Dürst. This one is neither
> trivial nor uncontroversial, and I have not seen anything resembling a
> consensus, so it remains open (no changes to the draft).

[2 weeks have passed again, and I have not seen any comments on this
list on this]

If anyone agrees with Martin's changes, and text about the BOM issue _IS_
needed, let me know no later than one week from now (i.e. October 9).
If I don't see anyone screaming, I declare consensus for this draft,
and I'll take over from here.

     Thanks to all of you for all help!

         paf

Received on Wednesday, 2 October 2002 20:18:48 UTC