RE: Comments on draft-yergeau-rfc2279bis-00.txt from Martin Duerst on 2002-10-04 (ietf-charsets@w3.org from October to December 2002)

From: Martin Duerst <duerst@w3.org>
Date: Fri, 04 Oct 2002 18:56:34 +0900
To: Francois Yergeau <FYergeau@alis.com>, ietf-charsets@iana.org
Message-id: <4.2.0.58.J.20021004184656.05399290@localhost>

At 13:31 02/10/03 -0400, Francois Yergeau wrote:
>McDonald, Ira wrote:
> > This IETF standard should NOT encourage the use of leading BOM in
> > streams of UTF-8 text.
>
>The current text neither encourages nor discourages BOM usage; it only
>points out the existence of the convention and gives some caveats (like the
>uncertainty when stripping a BOM and the possible breakage of digital sigs
>and the like).

As far as I understand most contributions on the list in the past
day or so, the standard should discourage the BOM, but it currently
doesn't.

> > The optional use of leading BOM in UTF-8 (as
> > I know Martin said) destroys the crucial property that US-ASCII
> > is a perfect subset of UTF-8 and that US-ASCII can pass _without
> > harm_ through UTF-8 handling software libraries.
>
>This totally clashes with my understanding.  Can you please explain how the
>existence of the BOM convention in UTF-8 changes anything about the
>interpretation of US-ASCII strings that by definition never contain a BOM?

A sequence of characters taken from the US-ASCII repertoire, when
encoded in either the US-ASCII encoding or the UTF-8 (without BOM)
encoding, leads to exactly the same sequence of bytes. In this sense,
there is full and total compatibility between US-ASCII and UTF-8
(without a BOM). However, if you take a sequence of characters from
the US-ASCII repertoire and encode them in UTF-8 with a leading
BOM, then this, as you correctly point out, is no longer US-ASCII.
This leads to problems for processes that expect US-ASCII text
encoded as US-ASCII.
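
The byte-level point above can be checked with a minimal Python sketch. It is an illustration only, not something taken from the draft or the thread. The sample string and the helper name accepts_as_ascii are hypothetical, and "utf-8-sig" is Python's name for the codec that prepends the BOM when encoding.

```python
import codecs

# Sample text drawn only from the US-ASCII repertoire (hypothetical example).
text = "Hello, IETF"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")          # UTF-8 without a BOM
utf8_bom_bytes = text.encode("utf-8-sig")  # UTF-8 with a leading BOM

# Without a BOM, both encodings yield exactly the same bytes.
same_without_bom = ascii_bytes == utf8_bytes

# With a BOM, the three bytes EF BB BF precede the ASCII bytes.
has_bom_prefix = utf8_bom_bytes == codecs.BOM_UTF8 + ascii_bytes


def accepts_as_ascii(data: bytes) -> bool:
    """Stand-in for a strict consumer that accepts only US-ASCII bytes."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False
```

Under these assumptions, accepts_as_ascii(utf8_bytes) succeeds while accepts_as_ascii(utf8_bom_bytes) fails, because the BOM bytes lie outside the 7-bit range.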

> > UTF-8 never needs a 'byte-order' signature.
>
>This is unfortunately not true, except in the limited realm of properly
>internationalized protocols

As, for example, IETF protocols.

>with proper implementations

Same for the BOM, of course.

>and no reliance on
>humans to correctly label things.  Which leaves out quite a few things,
>prominent among them file systems: my disks are full of text files in either
>Latin-1, UTF-8 or UTF-16, and the BOM is the only thing that distinguishes
>them.

Wrong. It's very easy to distinguish these three with rather simple
checks (they are a bit more difficult than just checking for the BOM,
but not much).
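
The message does not say which checks are meant, so the following Python sketch is only one plausible set, with the function name guess_encoding and the ordering of the tests being assumptions. It treats NUL bytes as a sign of UTF-16 (reasonable for mostly Latin-script text, unreliable for text with few characters below U+0100), then tries a strict UTF-8 decode, and falls back to Latin-1.

```python
def guess_encoding(data: bytes) -> str:
    """Heuristically guess the encoding of BOM-less text.

    Returns one of 'utf-16-le', 'utf-16-be', 'utf-8', 'ascii', 'latin-1'.
    This is a sketch under stated assumptions, not a robust detector.
    """
    # Plain text in Latin-1 or UTF-8 rarely contains NUL bytes, whereas
    # UTF-16 text with Latin-script characters has one in every code unit.
    if len(data) >= 2 and len(data) % 2 == 0 and b"\x00" in data:
        even_nuls = data[0::2].count(0)  # high bytes if big-endian
        odd_nuls = data[1::2].count(0)   # high bytes if little-endian
        return "utf-16-be" if even_nuls > odd_nuls else "utf-16-le"

    # Non-ASCII Latin-1 text almost never forms valid UTF-8 sequences,
    # so a strict decode separates the two in practice.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "latin-1"

    # Pure 7-bit data is identical in US-ASCII, Latin-1 and UTF-8.
    return "ascii" if all(b < 0x80 for b in data) else "utf-8"
```

For example, guess_encoding("café".encode("latin-1")) falls through to the Latin-1 branch because the lone byte E9 is not a valid UTF-8 sequence. Short Latin-1 strings that happen to be valid UTF-8 would be misclassified, which is the kind of residual uncertainty a heuristic like this accepts.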

Regards, Martin.

Received on Friday, 4 October 2002 07:45:00 UTC