- From: Martin Duerst <duerst@w3.org>
- Date: Sat, 05 Oct 2002 15:05:55 +0900
- To: Francois Yergeau <FYergeau@alis.com>, ietf-charsets@iana.org
At 14:12 02/10/04 -0400, Francois Yergeau wrote:

>Hi Ira,
>
>McDonald, Ira wrote:
> > PS - Martin - I liked your wording - but I think we're still being
> > way too permissive of BOM in Internet protocols transferring UTF-8.
> > There is no sensible reason in an Internet protocol for UTF-8 to
> > be transferred except as: (1) fixed UTF-8; or (2) labelled UTF-8.
>
>In a dreamworld, that would be true. For a reality check, you just have
>to look at the myriads of HTTP servers where documents are deposited
>simply by FTP or straight file system copy, with no indication of
>charset. The consequence is that a *huge* majority of all Web pages are
>served out with either no charset parameter or a wrong one, in flagrant
>contradiction with the HTTP spec. Welcome to the real world, where
>things actually work because browsers sniff inside the pages looking
>for a non-robust <meta> element that was inserted when there was proper
>charset information available.

The BOM for UTF-8 obviously doesn't improve the situation for HTML or
XML, where we have <meta> and encoding=''. And it doesn't help
distinguish between all the many legacy encodings, which is the real
problem. Identifying UTF-8 is extremely easy compared to that.

> > > The optional use of leading BOM in UTF-8 (as
> > > I know Martin said) destroys the crucial property that US-ASCII
> > > is a perfect subset of UTF-8 and that US-ASCII can pass _without
> > > harm_ through UTF-8 handling software libraries.
> >
> > This totally clashes with my understanding. Can you please explain
> > how the existence of the BOM convention in UTF-8 changes anything to
> > the interpretation of US-ASCII strings that by definition never
> > contain a BOM?
> >
> > <ira> If UTF-8 handling software in Internet protocol implementations
> > (which should be the ONLY scope of RFC 2279bis) is allowed to
> > gratuitously insert a leading BOM - for example, to make sure a
> > (fragile) charset 'signature' is present in the message or protocol
> > element - then the perfectly valid original US-ASCII string (for
> > example an IDENTIFIER) is ruined - this is not hypothetical - I've
> > seen it happen.
> > </ira>
>
>That much is true, but is not the same as what you said before.
>US-ASCII certainly is a proper subset of UTF-8 and UTF-8 handling
>software can certainly deal with any US-ASCII string. It should come as
>no surprise, though, that *ASCII* software cannot deal properly with a
>BOM-bearing ASCII string (or any other UTF-8 string, or a string in any
>other charset). But ASCII-only software is not what we're dealing with
>here, this is about UTF-8.

In the real world, there is not only 'UTF-8 handling software' and
'US-ASCII-only software'. First, before we forget it, there is a lot of
software that handles UTF-8 perfectly but cannot deal with a BOM.
Second, there is a lot of software that works with 8-bit data, in
particular with UTF-8, but where US-ASCII characters are given special
meanings. RFC 2640, among other things, mentions this:

   Some of UTF-8's benefits are that it is compatible with 7 bit
   ASCII, so it doesn't affect programs that give special meanings
   to various ASCII characters;

Such software ranges from the very simple cat through all kinds of
scripts, compilers, and so on. For example, a perl script written to do
some simple processing on an HTML file, even if written only with
US-ASCII in mind, will in many cases do the intended thing for UTF-8
(as well as for many, but not all, legacy encodings). But if you add a
BOM at the start of the input file, or even worse, at the start of the
perl script, the chances are extremely high that things will just blow
up.
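To make the byte-level claims above concrete, here is a minimal, purely
illustrative sketch in Python (the function names and sample strings
are invented for illustration, not taken from any spec or protocol):
ASCII-minded byte processing keeps working on UTF-8 input, a prepended
BOM breaks it, and identifying UTF-8 itself needs no marker at all.

    UTF8_BOM = b"\xef\xbb\xbf"

    def first_word(line: bytes) -> bytes:
        # ASCII-minded processing: split on the ASCII space byte 0x20.
        return line.split(b" ")[0]

    ascii_line = b"HELO example.org"
    utf8_line = "HELO b\u00e9b\u00e9.example".encode("utf-8")

    # UTF-8 passes through unharmed: every byte of a multibyte
    # sequence is >= 0x80, so ASCII delimiters are never faked.
    assert first_word(ascii_line) == b"HELO"
    assert first_word(utf8_line) == b"HELO"

    # A leading BOM glues itself onto the first token, so a perfectly
    # valid US-ASCII identifier no longer compares equal.
    assert first_word(UTF8_BOM + ascii_line) != b"HELO"

    def looks_like_utf8(data: bytes) -> bool:
        # Identifying UTF-8 needs no signature: just validate the bytes.
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    assert looks_like_utf8(utf8_line)
    assert not looks_like_utf8(b"caf\xe9 au lait")  # Latin-1 bytes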
UTF-8 has been chosen over other contenders (UTF-16, UTF-7, ...) by the
IETF exactly because of these benefits. Introducing a BOM removes a lot
of them. This is the real compatibility issue, not compatibility with
7-bit-only programs.

Another point is of course that if you have both Latin-1 and UTF-8
files, why should there be a special marker for UTF-8, the more general
encoding, and not for Latin-1? Why should we have to change each and
every program and script to take the BOM into account?

> > <ira> But the IETF doesn't care about filesystem metadata problems.
> > Those are the domain of POSIX, or Linux, or somebody else. RFC
> > 2279bis should focus on acceptable usage of UTF-8 in Internet
> > protocols - period.
> > </ira>
>
>The IETF does care, I believe, about interoperability. Internet
>protocols do not exist in a vacuum, they interact with people and their
>computers and their design should (and generally does) take that into
>account.

Yes indeed. To let UTF-8 interact easily with as much software as
possible, the BOM has to be avoided.

Regards,    Martin.
Received on Saturday, 5 October 2002 02:15:49 UTC