RE: Comments on draft-yergeau-rfc2279bis-00.txt from McDonald, Ira on 2002-10-04 (ietf-charsets@w3.org from October to December 2002)

From: McDonald, Ira <imcdonald@sharplabs.com>
Date: Thu, 03 Oct 2002 20:22:10 -0700
To: "'Francois Yergeau'" <FYergeau@alis.com>, ietf-charsets@iana.org
Message-id: <116DB56CD7DED511BC7800508B2CA53735CDCB@mailsrvnt02.enet.sharplabs.com>

Hi Francois,

Inline comments below.

Cheers,
- Ira McDonald
  High North Inc

PS - Martin - I liked your wording - but I think we're still being
way too permissive of BOM in Internet protocols transferring UTF-8.
There is no sensible reason in an Internet protocol for UTF-8 to
be transferred except as: (1) fixed UTF-8; or (2) labelled UTF-8.
In neither case is leading BOM needed or appropriate.  (And I do
not credit strict XML compatibility as a "sensible reason").

-----Original Message-----
From: Francois Yergeau [mailto:FYergeau@alis.com]
Sent: Thursday, October 03, 2002 1:32 PM
To: ietf-charsets@iana.org
Subject: RE: Comments on draft-yergeau-rfc2279bis-00.txt

McDonald, Ira wrote:
> This IETF standard should NOT encourage the use of leading BOM in
> streams of UTF-8 text.

The current text neither encourages nor discourages BOM usage; it only
points out the existence of the convention and gives some caveats (like the
uncertainty when stripping a BOM and the possible breakage of digital sigs
and the like).
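
To make the signature caveat concrete, here is a minimal Python sketch. It is not taken from the draft: the helper name strip_utf8_bom and the sample text are hypothetical, and a SHA-256 digest stands in for whatever a real signature scheme would compute over the bytes.

```python
import codecs
import hashlib


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF) if present; otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical payload: the sender signed the bytes *including* the BOM.
signed = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")
stripped = strip_utf8_bom(signed)

# A digest stands in for the signature input; it covers every byte.
digest_signed = hashlib.sha256(signed).hexdigest()
digest_stripped = hashlib.sha256(stripped).hexdigest()

# Removing the three BOM bytes changes the digest, so a signature
# computed over the original bytes would no longer verify.
print(digest_signed == digest_stripped)
```

The "uncertainty" half of the caveat is not modelled here: the sketch assumes a leading U+FEFF is always a signature and never intended content.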

> The optional use of leading BOM in UTF-8 (as
> I know Martin said) destroys the crucial property that US-ASCII
> is a perfect subset of UTF-8 and that US-ASCII can pass _without
> harm_ through UTF-8 handling software libraries.

This totally clashes with my understanding.  Can you please explain how the
existence of the BOM convention in UTF-8 changes anything about the
interpretation of US-ASCII strings that by definition never contain a BOM?

<ira> If UTF-8 handling software in Internet protocol implementations
(which should be the ONLY scope of RFC 2279bis) is allowed to gratuitously
insert a leading BOM - for example, to make sure a (fragile) charset 
'signature' is present in the message or protocol element - then the
perfectly valid original US-ASCII string (for example an IDENTIFIER)
is ruined - this is not hypothetical - I've seen it happen.
</ira>
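
The failure mode described here can be sketched in a few lines of Python. The identifier value and the is_us_ascii helper are hypothetical; Python's "utf-8-sig" codec is used only as a convenient stand-in for any library that prepends a BOM when encoding.

```python
import codecs

# Hypothetical protocol identifier consisting only of US-ASCII characters.
identifier = "Content-Type"

# Plain UTF-8 encoding of US-ASCII text is byte-for-byte identical to ASCII.
plain = identifier.encode("utf-8")
assert plain == identifier.encode("ascii")

# The "utf-8-sig" codec prepends the BOM (EF BB BF) when encoding.
with_bom = identifier.encode("utf-8-sig")
assert with_bom == codecs.BOM_UTF8 + plain


def is_us_ascii(data: bytes) -> bool:
    """True if every byte is in the 7-bit US-ASCII range."""
    return all(b < 0x80 for b in data)


# The original bytes are valid US-ASCII; the BOM-prefixed bytes are not,
# and a byte-wise comparison against the expected identifier fails.
print(is_us_ascii(plain), is_us_ascii(with_bom), with_bom == plain)
```

A receiver that compares identifiers byte by byte, or that accepts only US-ASCII in that field, would reject the second form even though the text it carries is unchanged.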

> UTF-8 never needs a 'byte-order' signature.

This is unfortunately not true, except in the limited realm of properly
internationalized protocols with proper implementations and no reliance on
humans to correctly label things.  Which leaves out quite a few things,
prominent among them file systems: my disks are full of text files in either
Latin-1, UTF-8 or UTF-16, and the BOM is the only thing that distinguishes
them.  Many of those files result from a "Save as" where the original was
properly labelled in some protocol, but the metadata simply gets lost.
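
For unlabelled files like these, BOM-based detection might look roughly like the Python sketch below. The function name sniff_bom is hypothetical, and the sketch makes simplifying assumptions: it checks only the UTF-8 and UTF-16 signatures, ignores UTF-32 (whose little-endian BOM begins with the same two bytes as UTF-16LE), and treats "no BOM" as undecidable rather than guessing Latin-1.

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Guess an encoding from a leading BOM; return None when no BOM is found."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    # No signature: could be Latin-1, BOM-less UTF-8, or anything else.
    return None


text = "h\u00e9llo"
print(sniff_bom(text.encode("utf-8-sig")))  # UTF-8 with signature
print(sniff_bom(text.encode("utf-16")))     # Python's utf-16 codec writes a BOM
print(sniff_bom(text.encode("latin-1")))    # no signature available
```

The heuristic is only as reliable as the BOM itself: a Latin-1 file that happens to begin with the bytes EF BB BF would be misidentified as UTF-8.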

<ira> But the IETF doesn't care about filesystem metadata problems.  Those
are the domain of POSIX, or Linux, or somebody else.  RFC 2279bis should
focus on acceptable usage of UTF-8 in Internet protocols - period.
</ira>

-- 
François

Received on Thursday, 3 October 2002 23:23:31 UTC