- From: McDonald, Ira <imcdonald@sharplabs.com>
- Date: Thu, 03 Oct 2002 20:22:10 -0700
- To: "'Francois Yergeau'" <FYergeau@alis.com>, ietf-charsets@iana.org
Hi Francois,

Inline comments below.

Cheers,
- Ira McDonald
  High North Inc

PS - Martin - I liked your wording - but I think we're still being way too permissive of BOM in Internet protocols transferring UTF-8. There is no sensible reason in an Internet protocol for UTF-8 to be transferred except as: (1) fixed UTF-8; or (2) labelled UTF-8. In neither case is leading BOM needed or appropriate. (And I do not credit strict XML compatibility as a "sensible reason").

-----Original Message-----
From: Francois Yergeau [mailto:FYergeau@alis.com]
Sent: Thursday, October 03, 2002 1:32 PM
To: ietf-charsets@iana.org
Subject: RE: Comments on draft-yergeau-rfc2279bis-00.txt

McDonald, Ira wrote:
> This IETF standard should NOT encourage the use of leading BOM in
> streams of UTF-8 text.

The current text neither encourages nor discourages BOM usage, it only points out the existence of the convention and gives some caveats (like the uncertainty when stripping a BOM and the possible breakage of digital sigs and the like).

> The optional use of leading BOM in UTF-8 (as
> I know Martin said) destroys the crucial property that US-ASCII
> is a perfect subset of UTF-8 and that US-ASCII can pass _without
> harm_ through UTF-8 handling software libraries.

This totally clashes with my understanding. Can you please explain how the existence of the BOM convention in UTF-8 changes anything to the interpretation of US-ASCII strings that by definition never contain a BOM?

<ira>
If UTF-8 handling software in Internet protocol implementations (which should be the ONLY scope of RFC 2279bis) is allowed to gratuitously insert a leading BOM - for example, to make sure a (fragile) charset 'signature' is present in the message or protocol element - then the perfectly valid original US-ASCII string (for example an IDENTIFIER) is ruined - this is not hypothetical - I've seen it happen.
</ira>

> UTF-8 never needs a 'byte-order' signature.

This is unfortunately not true, except in the limited realm of properly internationalized protocols with proper implementations and no reliance on humans to correctly label things. Which leaves out quite a few things, prominent among them file systems: my disks are full of text files in either Latin-1, UTF-8 or UTF-16, and the BOM is the only thing that distinguishes them. Many of those files result from a "Save as" where the original was properly labelled in some protocol, but the metadata simply gets lost.

<ira>
But the IETF doesn't care about filesystem metadata problems. Those are the domain of POSIX, or Linux, or somebody else. RFC 2279bis should focus on acceptable usage of UTF-8 in Internet protocols - period.
</ira>

-- 
François
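
The two positions in the exchange above can be made concrete with a small sketch. This is an illustration only, not code from the thread: `add_signature` and `sniff_bom` are hypothetical helpers, the identifier string is made up, and UTF-32 BOMs are deliberately ignored. It shows that an ASCII identifier has identical bytes in US-ASCII and UTF-8, that a gratuitously prepended BOM breaks byte-for-byte comparison and strict ASCII decoding, and that BOM sniffing can tell unlabelled data apart only when a BOM is actually present.

```python
import codecs


def add_signature(data: bytes) -> bytes:
    """Hypothetical 'helpful' layer that prepends a UTF-8 BOM if absent."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


def sniff_bom(data: bytes) -> str:
    """Guess an encoding from a leading BOM; 'unknown' if there is none.

    Assumption: only UTF-8 and UTF-16 signatures are considered here.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "unknown"


# A made-up protocol identifier consisting only of US-ASCII characters.
identifier = "example-identifier"
ascii_bytes = identifier.encode("ascii")
utf8_bytes = identifier.encode("utf-8")

# US-ASCII is a subset of UTF-8: the two encodings give the same bytes.
same_without_bom = ascii_bytes == utf8_bytes

# After the BOM is inserted, the bytes no longer match the original.
signed = add_signature(utf8_bytes)
matches_after_bom = signed == ascii_bytes

# A strict US-ASCII consumer cannot decode the signed bytes at all.
try:
    signed.decode("ascii")
    ascii_decodable = True
except UnicodeDecodeError:
    ascii_decodable = False

print(same_without_bom, matches_after_bom, ascii_decodable)
print(sniff_bom(signed), sniff_bom(ascii_bytes))
```

A receiver that already knows the data is UTF-8 can drop the signature, for instance with Python's `utf-8-sig` codec, but a consumer that compares raw octets or expects US-ASCII sees three extra bytes (EF BB BF) in front of the identifier.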
Received on Thursday, 3 October 2002 23:23:31 UTC