- From: Francois Yergeau <FYergeau@alis.com>
- Date: Fri, 04 Oct 2002 14:12:04 -0400
- To: ietf-charsets@iana.org
Hi Ira,

McDonald, Ira wrote:
> PS - Martin - I liked your wording - but I think we're still being
> way too permissive of BOM in Internet protocols transferring UTF-8.
> There is no sensible reason in an Internet protocol for UTF-8 to
> be transferred except as: (1) fixed UTF-8; or (2) labelled UTF-8.

In a dreamworld, that would be true. For a reality check, you just have to look at the myriad HTTP servers where documents are deposited simply by FTP or straight file system copy, with no indication of charset. The consequence is that a *huge* majority of all Web pages are served out with either no charset parameter or a wrong one, in flagrant contradiction with the HTTP spec. Welcome to the real world, where things actually work because browsers sniff inside the pages looking for a non-robust <meta> element that was inserted when there was proper charset information available.

> In neither case is leading BOM needed or appropriate. (And I do
> not credit strict XML compatibility as a "sensible reason").

XML compatibility does not require a BOM in UTF-8.

> > The optional use of leading BOM in UTF-8 (as
> > I know Martin said) destroys the crucial property that US-ASCII
> > is a perfect subset of UTF-8 and that US-ASCII can pass _without
> > harm_ through UTF-8 handling software libraries.
>
> This totally clashes with my understanding. Can you please
> explain how the
> existence of the BOM convention in UTF-8 changes anything to the
> interpretation of US-ASCII strings that by definition never
> contain a BOM?
>
> <ira> If UTF-8 handling software in Internet protocol implementations
> (which should be the ONLY scope of RFC 2279bis) is allowed to
> gratuitously
> insert a leading BOM - for example, to make sure a (fragile) charset
> 'signature' is present in the message or protocol element - then the
> perfectly valid original US-ASCII string (for example an IDENTIFIER)
> is ruined - this is not hypothetical - I've seen it happen.
> </ira>

That much is true, but it is not the same as what you said before. US-ASCII certainly is a proper subset of UTF-8, and UTF-8 handling software can certainly deal with any US-ASCII string. It should come as no surprise, though, that *ASCII* software cannot deal properly with a BOM-bearing ASCII string (or any other UTF-8 string, or a string in any other charset). But ASCII-only software is not what we're dealing with here; this is about UTF-8.

> <ira> But the IETF doesn't care about filesystem metadata
> problems. Those
> are the domain of POSIX, or Linux, or somebody else. RFC
> 2279bis should
> focus on acceptable usage of UTF-8 in Internet protocols - period.
> </ira>

The IETF does care, I believe, about interoperability. Internet protocols do not exist in a vacuum: they interact with people and their computers, and their design should (and generally does) take that into account.

It would be unwise, IMHO, to ban the BOM outright everywhere. Martin has already suggested a distinction between larger pieces of text ('entities', just the kind of thing that is likely to be saved in a file system) and smaller protocol elements. I have also suggested that individual protocols should decide where to ban and where to allow BOMs; the spec on UTF-8 itself should merely offer guidance on that.

Regards,

-- 
François
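The distinction drawn above between UTF-8 software and ASCII-only software can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifier value and the helper names `ascii_only_decode` and `strip_bom` are hypothetical, and Python's standard `ascii`, `utf-8` and `utf-8-sig` codecs stand in for "ASCII-only" and "UTF-8 handling" software.

```python
import codecs

# Hypothetical protocol element made only of US-ASCII characters.
identifier = "IDENTIFIER"

# The same bytes are valid US-ASCII and valid UTF-8.
plain = identifier.encode("ascii")

# What a library produces if it prepends the UTF-8 BOM (EF BB BF).
with_bom = codecs.BOM_UTF8 + plain


def ascii_only_decode(data):
    """Stand-in for ASCII-only software: returns None on non-ASCII bytes."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError:
        return None


def strip_bom(data):
    """Remove one leading UTF-8 BOM, if present, and return the rest."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A BOM-free ASCII string passes unharmed through a UTF-8 decoder.
print(plain.decode("utf-8") == identifier)

# ASCII-only software rejects the BOM-bearing bytes.
print(ascii_only_decode(with_bom))

# A plain UTF-8 decoder accepts them but keeps U+FEFF as a character,
# so a naive comparison against the identifier no longer matches.
print(with_bom.decode("utf-8") == identifier)

# A BOM-aware decoder recovers the original string.
print(with_bom.decode("utf-8-sig") == identifier)
```

Whether a receiver should strip, reject or preserve a leading BOM is the kind of decision that would be left to each individual protocol; the sketch only shows what the bytes look like in each case.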
Received on Friday, 4 October 2002 14:14:08 UTC