RE: Comments on draft-yergeau-rfc2279bis-00.txt from Mark Davis on 2002-10-04 (ietf-charsets@w3.org from October to December 2002)

From: Mark Davis <mark.davis@us.ibm.com>
Date: Fri, 04 Oct 2002 13:30:59 -0700
To: Francois Yergeau <FYergeau@alis.com>
Cc: ietf-charsets@iana.org
Message-id: <OF1240553F.BCB87626-ON88256C48.00704EEC@us.ibm.com>

I agree with François on these matters. As with UTF-16, we can encourage
protocol writers to always include robust character-encoding declarations,
and in those circumstances discourage the use of a BOM. But as a practical
matter, we do have to recognize that BOMs will exist and will be
transmitted, so we have to give guidance on recognizing and properly
handling them.
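
As a rough illustration of what "recognition and proper handling" could
mean for a receiver, here is a minimal Python sketch. The function name
strip_utf8_bom is hypothetical, and the policy it applies (strip one
leading BOM and report that it was present) is only one possible choice,
not something the draft prescribes.

```python
import codecs
from typing import Tuple


def strip_utf8_bom(data: bytes) -> Tuple[bytes, bool]:
    """Return (payload, had_bom) for a byte string that claims to be UTF-8.

    Hypothetical helper: it removes at most one leading UTF-8 BOM
    (the bytes EF BB BF) and leaves everything else untouched.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):], True
    return data, False


payload, had_bom = strip_utf8_bom(b"\xef\xbb\xbfhello")
text = payload.decode("utf-8")
```

A protocol that bans the BOM in a given element could instead treat
had_bom as an error; one that tolerates it can simply discard it as above.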

Side issue: I want to change the address used for me on this list (to
mark.davis@jtcsv.com). Can someone point me to the instructions for doing
that?

Mark
___
mark.davis@us.ibm.com
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799



                                                                                                                                 
Francois Yergeau <FYergeau@alis.com>
2002.10.04 11:12

To:       ietf-charsets@iana.org
cc:
Subject:  RE: Comments on draft-yergeau-rfc2279bis-00.txt



Hi Ira,

McDonald, Ira wrote:
> PS - Martin - I liked your wording - but I think we're still being
> way too permissive of BOM in Internet protocols transferring UTF-8.
> There is no sensible reason in an Internet protocol for UTF-8 to
> be transferred except as: (1) fixed UTF-8; or (2) labelled UTF-8.

In a dreamworld, that would be true.  For a reality check, you just have to
look at the myriad HTTP servers where documents are deposited simply by
FTP or straight file system copy, with no indication of charset.  The
consequence is that a *huge* majority of all Web pages are served with
either no charset parameter or a wrong one, in flagrant contradiction of
the HTTP spec.  Welcome to the real world, where things actually work
because browsers sniff inside the pages looking for a non-robust <meta>
element that was inserted when there was proper charset information
available.

> In neither case is leading BOM needed or appropriate.  (And I do
> not credit strict XML compatibility as a "sensible reason").

XML compatibility does not require a BOM in UTF-8.

> > The optional use of leading BOM in UTF-8 (as
> > I know Martin said) destroys the crucial property that US-ASCII
> > is a perfect subset of UTF-8 and that US-ASCII can pass _without
> > harm_ through UTF-8 handling software libraries.
>
> This totally clashes with my understanding.  Can you please explain how
> the existence of the BOM convention in UTF-8 changes anything to the
> interpretation of US-ASCII strings that by definition never contain a BOM?
>
> <ira> If UTF-8 handling software in Internet protocol implementations
> (which should be the ONLY scope of RFC 2279bis) is allowed to gratuitously
> insert a leading BOM - for example, to make sure a (fragile) charset
> 'signature' is present in the message or protocol element - then the
> perfectly valid original US-ASCII string (for example an IDENTIFIER)
> is ruined - this is not hypothetical - I've seen it happen.
> </ira>

That much is true, but is not the same as what you said before.  US-ASCII
certainly is a proper subset of UTF-8 and UTF-8 handling software can
certainly deal with any US-ASCII string.  It should come as no surprise,
though, that *ASCII* software cannot deal properly with a BOM-bearing ASCII
string (or any other UTF-8 string, or a string in any other charset).  But
ASCII-only software is not what we're dealing with here; this is about
UTF-8.
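
The distinction can be checked with a short Python sketch. The byte
string b"IDENTIFIER" and the helper decodes_as are made up for
illustration; the codecs used are Python's standard "ascii", "utf-8" and
"utf-8-sig", and the example assumes nothing about any particular
protocol.

```python
import codecs


def decodes_as(data: bytes, encoding: str) -> bool:
    """Hypothetical helper: True if data is valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


plain = b"IDENTIFIER"               # a pure US-ASCII string
with_bom = codecs.BOM_UTF8 + plain  # the same string after a BOM was prepended

# US-ASCII is a subset of UTF-8: both decoders agree on the plain string.
same_text = plain.decode("ascii") == plain.decode("utf-8")

# ASCII-only software rejects the BOM-bearing string; UTF-8 software accepts it.
ascii_ok = decodes_as(with_bom, "ascii")
utf8_ok = decodes_as(with_bom, "utf-8")

# A plain UTF-8 decode keeps the BOM as U+FEFF; "utf-8-sig" drops it.
kept = with_bom.decode("utf-8")
dropped = with_bom.decode("utf-8-sig")
```

Here same_text and utf8_ok are true while ascii_ok is false, and only the
"utf-8-sig" result compares equal to the original identifier, which is the
kind of damage Ira describes when a BOM is inserted into a short protocol
element.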

> <ira> But the IETF doesn't care about filesystem metadata problems.
> Those are the domain of POSIX, or Linux, or somebody else.  RFC 2279bis
> should focus on acceptable usage of UTF-8 in Internet protocols - period.
> </ira>

The IETF does care, I believe, about interoperability.  Internet protocols
do not exist in a vacuum; they interact with people and their computers, and
their design should (and generally does) take that into account.

It would be unwise, IMHO, to ban the BOM outright everywhere.  Martin has
already suggested a distinction between larger pieces of text ('entities',
just the kind of thing that is likely to be saved in a file system) and
smaller protocol elements.  I have also suggested that individual protocols
should decide where to ban and where to allow BOMs; the spec on UTF-8 itself
should merely offer guidance on that.

Regards,

--
François
Received on Friday, 4 October 2002 16:32:26 UTC