RE: Comments on draft-yergeau-rfc2279bis-00.txt from Francois Yergeau on 2002-10-04 (ietf-charsets@w3.org from October to December 2002)

From: Francois Yergeau <FYergeau@alis.com>
Date: Fri, 04 Oct 2002 14:12:04 -0400
To: ietf-charsets@iana.org
Message-id: <F7D4BDA0E5A1D14B99D32C022AEB736680CB1B@alis-2k.alis.domain>

Hi Ira,

McDonald, Ira wrote:
> PS - Martin - I liked your wording - but I think we're still being
> way too permissive of BOM in Internet protocols transferring UTF-8.
> There is no sensible reason in an Internet protocol for UTF-8 to
> be transferred except as: (1) fixed UTF-8; or (2) labelled UTF-8.

In a dreamworld, that would be true.  For a reality check, you just have to
look at the myriads of HTTP servers where documents are deposited simply by
FTP or straight file system copy, with no indication of charset.  The
consequence is that a *huge* majority of all Web pages are served out with
either no charset parameter or a wrong one, in flagrant contradiction with
the HTTP spec.  Welcome to the real world, where things actually work
because browsers sniff inside the pages looking for a non-robust <meta>
element that was inserted when there was proper charset information
available.
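
As a rough illustration of the sniffing described above, here is a minimal
Python sketch. It is not taken from any browser or specification: the
function name guess_charset, the regular expression, the 1024-byte window
and the order of the checks are all assumptions made for this example, and
real browsers are considerably more elaborate.

```python
import codecs
import re
from typing import Optional

# Hypothetical, simplified pattern for a <meta ... charset=...> declaration.
# Real pages vary far more than this pattern allows for.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def guess_charset(body: bytes, header_charset: Optional[str] = None) -> Optional[str]:
    """Guess a charset label for an HTML body (illustrative order of checks)."""
    # 1. A charset parameter from the HTTP header, when the server sent one.
    if header_charset:
        return header_charset.lower()
    # 2. A leading UTF-8 signature (bytes EF BB BF).
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # 3. A <meta> declaration near the start of the page (assumed window size).
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Nothing found: the caller must fall back to some default.
    return None
```

The sketch trusts the header whenever it is present, which is exactly the
step that goes wrong when the server sends a wrong charset parameter.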

> In neither case is leading BOM needed or appropriate.  (And I do
> not credit strict XML compatibility as a "sensible reason").

XML compatibility does not require a BOM in UTF-8.

> > The optional use of leading BOM in UTF-8 (as
> > I know Martin said) destroys the crucial property that US-ASCII
> > is a perfect subset of UTF-8 and that US-ASCII can pass _without
> > harm_ through UTF-8 handling software libraries.
> 
> This totally clashes with my understanding.  Can you please explain how the
> existence of the BOM convention in UTF-8 changes anything to the
> interpretation of US-ASCII strings that by definition never contain a BOM?
> 
> <ira> If UTF-8 handling software in Internet protocol implementations
> (which should be the ONLY scope of RFC 2279bis) is allowed to gratuitously
> insert a leading BOM - for example, to make sure a (fragile) charset
> 'signature' is present in the message or protocol element - then the
> perfectly valid original US-ASCII string (for example an IDENTIFIER)
> is ruined - this is not hypothetical - I've seen it happen.
> </ira>

That much is true, but is not the same as what you said before.  US-ASCII
certainly is a proper subset of UTF-8 and UTF-8 handling software can
certainly deal with any US-ASCII string.  It should come as no surprise,
though, that *ASCII* software cannot deal properly with a BOM-bearing ASCII
string (or any other UTF-8 string, or a string in any other charset).  But
ASCII-only software is not what we're dealing with here; this is about
UTF-8.
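
A small Python sketch can make the distinction concrete. It is only an
illustration: the identifier value and the helper ascii_only_accepts are
hypothetical stand-ins, not anything from the draft.

```python
import codecs

identifier = "Content-Type"            # hypothetical ASCII protocol element
ascii_bytes = identifier.encode("ascii")

# Every US-ASCII string is already valid UTF-8, byte for byte.
assert ascii_bytes == identifier.encode("utf-8")
assert ascii_bytes.decode("utf-8") == identifier

# Prepending the UTF-8 signature (EF BB BF) still gives valid UTF-8,
# but the decoded text now starts with U+FEFF and no longer matches.
with_bom = codecs.BOM_UTF8 + ascii_bytes
decoded = with_bom.decode("utf-8")
assert decoded == "\ufeff" + identifier
assert decoded != identifier


def ascii_only_accepts(data: bytes) -> bool:
    """Stand-in for ASCII-only software: accept only pure 7-bit input."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False
```

Here ascii_only_accepts(ascii_bytes) is true and
ascii_only_accepts(with_bom) is false: the UTF-8 decoder copes with both
strings, while the ASCII-only consumer rejects the BOM-bearing one.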

> <ira> But the IETF doesn't care about filesystem metadata problems.  Those
> are the domain of POSIX, or Linux, or somebody else.  RFC 2279bis should
> focus on acceptable usage of UTF-8 in Internet protocols - period.
> </ira>

The IETF does care, I believe, about interoperability.  Internet protocols
do not exist in a vacuum; they interact with people and their computers, and
their design should (and generally does) take that into account.

It would be unwise, IMHO, to ban the BOM outright everywhere.  Martin has
already suggested a distinction between larger pieces of text ('entities',
just the kind of thing that is likely to be saved in a file system) and
smaller protocol elements.  I have also suggested that individual protocols
should decide where to ban and where to allow BOMs; the spec on UTF-8 itself
should merely offer guidance on that.
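
One way a protocol implementation might apply that distinction is sketched
below in Python. This is an assumed policy for illustration only, not text
from the draft: the function names are hypothetical, and each protocol
would decide for itself which elements fall on which side.

```python
import codecs


def decode_entity(data: bytes) -> str:
    """Larger text entity: tolerate and drop one leading UTF-8 signature."""
    # The "utf-8-sig" codec removes a single leading BOM if one is present.
    return data.decode("utf-8-sig")


def decode_protocol_element(data: bytes) -> str:
    """Short protocol element: treat a leading BOM as an error (assumed policy)."""
    if data.startswith(codecs.BOM_UTF8):
        raise ValueError("BOM not allowed in protocol element")
    return data.decode("utf-8")
```

With these helpers, decode_entity accepts a file saved with or without a
signature, while decode_protocol_element refuses an identifier that some
intermediate software has prefixed with a BOM.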

Regards,

-- 
François
Received on Friday, 4 October 2002 14:14:08 UTC