- From: Martin Duerst <duerst@w3.org>
- Date: Fri, 04 Oct 2002 18:56:34 +0900
- To: Francois Yergeau <FYergeau@alis.com>, ietf-charsets@iana.org
At 13:31 02/10/03 -0400, Francois Yergeau wrote:
>McDonald, Ira wrote:
> > This IETF standard should NOT encourage the use of leading BOM in
> > streams of UTF-8 text.
>
>The current text neither encourages nor discourages BOM usage, it only
>points out the existence of the convention and gives some caveats (like the
>uncertainty when stripping a BOM and the possible breakage of digital sigs
>and the like).

As far as I understand most contributions on the list in the past day
or so, the standard should discourage the BOM, but it currently doesn't.

> > The optional use of leading BOM in UTF-8 (as
> > I know Martin said) destroys the crucial property that US-ASCII
> > is a perfect subset of UTF-8 and that US-ASCII can pass _without
> > harm_ through UTF-8 handling software libraries.
>
>This totally clashes with my understanding. Can you please explain how the
>existence of the BOM convention in UTF-8 changes anything to the
>interpretation of US-ASCII strings that by definition never contain a BOM?

A sequence of characters taken from the US-ASCII repertoire, when
encoded in either the US-ASCII encoding or the UTF-8 (without BOM)
encoding, leads to exactly the same sequence of bytes. In this sense,
there is full and total compatibility between US-ASCII and UTF-8
(without a BOM).

However, if you take a sequence of characters from the US-ASCII
repertoire and encode them in UTF-8 with a leading BOM, then this,
as you correctly point out, is no longer US-ASCII. This leads to
problems for processes accepting things that are US-ASCII in US-ASCII.

> > UTF-8 never needs a 'byte-order' signature.
>
>This is unfortunately not true, except in the limited realm of properly
>internationalized protocols

As for example IETF protocols.

>with proper implementations

Same for the BOM, of course.

>and no reliance on
>humans to correctly label things. Which leaves out quite a few things,
>prominent among them file systems: my disks are full of text files in either
>Latin-1, UTF-8 or UTF-16, and the BOM is the only thing that distinguishes
>them.

Wrong. It's very easy to distinguish these three with rather simple
checks (they are a bit more difficult than just checking for the BOM,
but not much).

Regards, Martin.
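
The two technical points in this message can be illustrated with a short Python sketch. The message does not spell out which "rather simple checks" are meant, so the heuristic below is only one plausible reading, not the author's method: the function names `encode_variants` and `guess_encoding` are hypothetical, and the zero-byte test for BOM-less UTF-16 is an assumption that holds for typical Western-language text files but not for all data.

```python
import codecs


def encode_variants(text: str):
    """Encode US-ASCII-repertoire text three ways (illustrative helper).

    Returns (us-ascii bytes, utf-8 bytes without BOM, utf-8 bytes with BOM).
    """
    return (
        text.encode("ascii"),
        text.encode("utf-8"),
        text.encode("utf-8-sig"),  # Python's name for UTF-8 with a leading BOM
    )


def guess_encoding(data: bytes) -> str:
    """Guess 'utf-16', 'utf-8' or 'latin-1' (a heuristic sketch, not a standard)."""
    # UTF-16 files usually begin with a byte-order mark (FF FE or FE FF).
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Assumption: BOM-less UTF-16 of mostly Western text contains zero bytes,
    # which ordinary Latin-1 or UTF-8 text files practically never do.
    if b"\x00" in data:
        return "utf-16"
    # Bytes >= 0x80 must follow a strict pattern in UTF-8; Latin-1 text with
    # accented letters almost never happens to satisfy it.
    try:
        data.decode("utf-8")
        return "utf-8"  # also the answer for pure US-ASCII input
    except UnicodeDecodeError:
        return "latin-1"


if __name__ == "__main__":
    ascii_bytes, utf8_bytes, utf8_bom_bytes = encode_variants("plain text")
    print(ascii_bytes == utf8_bytes)       # same bytes without a BOM
    print(utf8_bom_bytes == ascii_bytes)   # differs once a BOM is prepended
    for enc in ("latin-1", "utf-8", "utf-16"):
        print(enc, "->", guess_encoding("café".encode(enc)))
```

Pure US-ASCII input is reported as "utf-8" here, which is harmless precisely because of the subset property discussed above. A file that prepends the three BOM bytes EF BB BF is what breaks that property for consumers expecting US-ASCII.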
Received on Friday, 4 October 2002 07:45:00 UTC