RE: internationalization/ISO10646 question from ned.freed@mrochek.com on 2002-12-06 (ietf-charsets@w3.org from October to December 2002)

From: <ned.freed@mrochek.com>
Date: Thu, 05 Dec 2002 22:06:35 -0800 (PST)
To: Marcin Hanclik <mhanclik@poczta.onet.pl>
Cc: ned.freed@mrochek.com, Martin Duerst <duerst@w3.org>, ietf-charsets@iana.org
Message-id: <01KPOTM5GMEA004VR6@mauve.mrochek.com>

> Hi, Ned!

> Thanks a lot for the mail exchange. I have learned a lot.

> I would like to sum it up since I need a conclusion.

> I am trying to incorporate what you and Martin wrote in your emails.
> The situation then looks like this:
> I have to send the UCS-2 encoded data. The headers will look like:

> Content-Type: application/x-my-text-subtype; charset="iso-10646-ucs-2"
> Content-Transfer-Encoding: base64

> data

Martin recommended, and I strongly agree, that you use a UTF-16 variant instead.
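
As a rough illustration of that recommendation, here is a minimal Python sketch
that labels the data as utf-16be rather than iso-10646-ucs-2. The function name
build_part is hypothetical, and the media type is simply copied from the headers
quoted above. It assumes big-endian byte order and emits no BOM, since the
utf-16be label (registered by RFC 2781) already states the byte order.

```python
import base64


def build_part(text: str) -> str:
    """Return a MIME-style part carrying `text` as base64-encoded UTF-16BE.

    Hypothetical helper for illustration only; the media type is taken
    from the quoted message, not from any standard.
    """
    # "utf-16-be" is Python's codec name; it writes no byte order mark.
    payload = text.encode("utf-16-be")
    # encodebytes wraps lines at 76 characters, as MIME base64 expects.
    body = base64.encodebytes(payload).decode("ascii")
    return (
        'Content-Type: application/x-my-text-subtype; charset="utf-16be"\n'
        "Content-Transfer-Encoding: base64\n"
        "\n" + body
    )
```

A receiver that honours the label can decode the body without having to guess
at byte order or look for a BOM.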

> My question was:
> Can the data marked as "iso-10646-ucs-2" contain BOM?

> Your answer was:

> > > I don't know if there are specific rules for handling revisions to
> > > iso-10646-ucs-2 or not. I suspect not. However, the general rule is that
> > > additions to a charset repertoire are expected and allowed. See RFC 2279
> > > section 3. However, the BOM is something of a special case.
> > > ....

> > > For material that isn't labelled with a top level content type of text I don't
> > > think the situation is clear, but the intent has always been to allow additions
> > > to charsets subsequent to registration. So I think BOM should be supported in
> > > this context.

> What is wrong in this whole case is that the top level content has a text type,
> and that the WAP/MMS standards have produced a bug in their specs. But we have
> to live with them.

OK, assuming this is true, then the reason for your continued pursuit of the
BOM issue entirely eludes me. Suppose I were able to state definitively that BOM
can or cannot appear in this context. So what? You say you have to live with
invalid use of text types and you have to live with an inconsistent definition
of the actual charset in use. Isn't support or non-support of BOM also going to
be something you have to live with? And if so, why do you care for a reading
as to what the standards say about it?

> Since Your answer is NOT CLEAR to me (I hope you agree that it can be...) I
> have to derive an answer from the above suggestions.
> But this is still not what I wanted. I would like to have:
> "New standard overrides the old one"
>   or
> "BOM was not defined in ISO10646:1993 and although new versions of ISO10646
> support BOM in UCS-2, data marked as iso-10646-ucs-2 cannot contain BOM"
>   instead of
> "BOM should be supported in this context"

That may be what you want but it isn't something I can provide. Again,
the situation here is:

(1) iso-10646-ucs-2 is not well defined, making it impossible to
    state with authority what the rules for it are
(2) when something isn't well defined, serious consideration should be given
    to avoiding it entirely
(3) our general rule is that compatible additions to charsets are allowed,
    but since BOM potentially changes the interpretation of the entire data
    stream it may not be considered as such an addition
(4) independent of whether BOM is "allowed", robust implementations capable
    of handling whatever is thrown at them are usually a good idea.
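
To make point (4) concrete, here is a minimal Python sketch of a receiver that
accepts 16-bit data with or without a leading BOM. The function name decode_ucs2
and the default_order parameter are hypothetical. Falling back to big-endian when
no BOM is present is an assumption, not something the iso-10646-ucs-2
registration settles. Python has no strict UCS-2 codec, so the sketch uses the
UTF-16 codecs, which also accept surrogate pairs.

```python
import codecs


def decode_ucs2(data: bytes, default_order: str = "big") -> str:
    """Decode 16-bit data, tolerating an optional leading BOM.

    Hypothetical helper: `default_order` ("big" or "little") is used only
    when no BOM is present, and the "big" default is an assumption.
    """
    # A BOM, if present, decides the byte order and is not part of the text.
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No BOM: fall back to the order the caller asked for.
    codec = "utf-16-be" if default_order == "big" else "utf-16-le"
    return data.decode(codec)
```

With this approach the question of whether BOM is formally "allowed" does not
affect whether the data can be read.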

This is as good as it is going to get.

				Ned

Received on Friday, 6 December 2002 01:33:18 UTC