RE: Comments on draft-yergeau-rfc2279bis-00.txt

At 14:12 02/10/04 -0400, Francois Yergeau wrote:
>Hi Ira,
>
>McDonald, Ira wrote:
> > PS - Martin - I liked your wording - but I think we're still being
> > way too permissive of BOM in Internet protocols transferring UTF-8.
> > There is no sensible reason in an Internet protocol for UTF-8 to
> > be transferred except as: (1) fixed UTF-8; or (2) labelled UTF-8.
>
>In a dreamworld, that would be true.  For a reality check, you just have to
>look at the myriads of HTTP servers where documents are deposited simply by
>FTP or straight file system copy, with no indication of charset.  The
>consequence is that a *huge* majority of all Web pages are served out with
>either no charset parameter or a wrong one, in flagrant contradiction with
>the HTTP spec.  Welcome to the real world, where things actually work
>because browsers sniff inside the pages looking for a non-robust <meta>
>element that was inserted when there was proper charset information
>available.

The BOM for UTF-8 obviously doesn't improve the situation for HTML or XML,
where we have <meta> and encoding=''.
And it doesn't help distinguish between all the many legacy encodings,
which is the real problem. Identifying UTF-8 is extremely easy
compared to that.
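
To make that concrete, here is a minimal sketch (in Python, purely for
illustration; the helper name and sample strings are made up) of why UTF-8
can be recognized from the bytes alone, while legacy encodings cannot be
told apart:

    # UTF-8 is self-identifying: its multi-byte sequences follow a strict
    # pattern, so text in a legacy encoding almost never validates by accident.
    def looks_like_utf8(data: bytes) -> bool:
        try:
            data.decode('utf-8')
            return True
        except UnicodeDecodeError:
            return False

    latin1_bytes = 'café'.encode('latin-1')   # b'caf\xe9'
    utf8_bytes   = 'café'.encode('utf-8')     # b'caf\xc3\xa9'

    print(looks_like_utf8(latin1_bytes))  # False -> not UTF-8
    print(looks_like_utf8(utf8_bytes))    # True  -> UTF-8, no BOM needed

    # Legacy encodings give no such signal: the same Latin-1 bytes also
    # decode without error as ISO 8859-2, 8859-5, windows-1252, and so on,
    # each to different characters, so sniffing among them is guesswork.
    print(latin1_bytes.decode('iso8859_5'))   # "succeeds", but wrong text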


> > > The optional use of leading BOM in UTF-8 (as
> > > I know Martin said) destroys the crucial property that US-ASCII
> > > is a perfect subset of UTF-8 and that US-ASCII can pass _without
> > > harm_ through UTF-8 handling software libraries.
> >
> > This totally clashes with my understanding.  Can you please
> > explain how the
> > existence of the BOM convention in UTF-8 changes anything to the
> > interpretation of US-ASCII strings that by definition never
> > contain a BOM?
> >
> > <ira> If UTF-8 handling software in Internet protocol implementations
> > (which should be the ONLY scope of RFC 2279bis) is allowed to
> > gratuitously
> > insert a leading BOM - for example, to make sure a (fragile) charset
> > 'signature' is present in the message or protocol element - then the
> > perfectly valid original US-ASCII string (for example an IDENTIFIER)
> > is ruined - this is not hypothetical - I've seen it happen.
> > </ira>
>
>That much is true, but is not the same as what you said before.  US-ASCII
>certainly is a proper subset of UTF-8 and UTF-8 handling software can
>certainly deal with any US-ASCII string.  It should come as no surprise,
>though, that *ASCII* software cannot deal properly with a BOM-bearing ASCII
>string (or any other UTF-8 string, or a string in any other charset).  But
>ASCII-only software is not what we're dealing with here, this is about
>UTF-8.

In the real world, there is not only 'UTF-8 handling software' and
'US-ASCII-only software'. First, before we forget it, there is a lot
of software that handles UTF-8 perfectly well but cannot deal
with a BOM. Second, there is a lot of software that works with 8-bit
data, in particular with UTF-8, but where US-ASCII characters are
given special meanings. RFC 2640, among others, mentions this:

    Some of UTF-8's benefits are that it is
    compatible with 7 bit ASCII, so it doesn't affect programs that give
    special meanings to various ASCII characters;

Such software ranges from the very simple cat through all kinds
of scripts, compilers, and so on. For example, a perl script
written to do some simple processing on an HTML file, even if
written only with US-ASCII in mind, will in many cases do the
intended thing for UTF-8 (as well as for many, but not all, legacy
encodings). But if you add a BOM at the start of the input file,
or even worse, at the start of the perl script itself, the chances
are extremely high that things will just blow up. UTF-8 was
chosen over other contenders (UTF-16, UTF-7, ...) by the IETF
exactly because of these benefits. Introducing a BOM removes
a lot of them. This is the real compatibility issue,
not compatibility with 7-bit-only programs.
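
As a small sketch of this (in Python rather than perl, and with invented
data): an ASCII-minded filter does the intended thing for plain UTF-8,
but silently fails as soon as a BOM is prepended.

    # Hypothetical ASCII-minded filter: keep only the lines that start with
    # "Subject:", the kind of thing a perl one-liner or grep would also do.
    def subject_lines(raw: bytes) -> list:
        return [line for line in raw.splitlines()
                if line.startswith(b"Subject:")]

    body = "Subject: Grüße\nDate: today\n".encode("utf-8")

    # Plain UTF-8: the non-ASCII bytes ride along untouched, and the
    # ASCII-based test still works.
    print(subject_lines(body))                    # one matching line

    # Same data with a "signature": the three BOM bytes EF BB BF now sit
    # in front of the first field, and the comparison silently fails.
    print(subject_lines(b"\xef\xbb\xbf" + body))  # []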

Another point is of course: if you have both Latin-1 and UTF-8
files, why should there be a special marker for UTF-8, which
is the more general encoding, and not for Latin-1? And why should
we have to change each and every program and script to take
the BOM into account?


> > <ira> But the IETF doesn't care about filesystem metadata
> > problems.  Those
> > are the domain of POSIX, or Linux, or somebody else.  RFC
> > 2279bis should
> > focus on acceptable usage of UTF-8 in Internet protocols - period.
> > </ira>
>
>The IETF does care, I believe, about interoperability.  Internet protocols
>do not exist in a vacuum, they interact with people and their computers and
>their design should (and generally do) take that into account.

Yes indeed. To keep UTF-8 interacting easily with as much software as
possible, the BOM has to be avoided.
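
In practice that is trivial for senders to do. A minimal sketch, using
Python's standard codecs (the strip_bom helper is only an illustration of
what a defensive receiver might do, not something the draft prescribes):

    # The plain 'utf-8' codec never emits a signature, so encoded US-ASCII
    # text stays byte-for-byte identical to the ASCII original:
    print("Hello".encode("utf-8"))      # b'Hello'

    # The 'utf-8-sig' variant prepends the BOM, which is exactly what breaks
    # ASCII-based processing downstream:
    print("Hello".encode("utf-8-sig"))  # b'\xef\xbb\xbfHello'

    # A receiver that must interoperate with BOM-emitting senders can strip
    # a single leading signature before handing the text on:
    def strip_bom(data: bytes) -> bytes:
        return data[3:] if data.startswith(b"\xef\xbb\xbf") else data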


Regards,    Martin.

Received on Saturday, 5 October 2002 02:15:49 UTC