Re: Character encodings in headers [i74][was: Straw-man charter for http-bis] from John C Klensin on 2007-08-20 (ietf-http-wg@w3.org from July to September 2007)

From: John C Klensin <john-ietf@jck.com>
Date: Mon, 20 Aug 2007 07:22:30 +0000
To: Mark Nottingham <mnot@mnot.net>, Martin Duerst <duerst@it.aoyama.ac.jp>
Cc: Richard Ishida <ishida@w3.org>, Apps Discuss <discuss@apps.ietf.org>, Felix Sasaki <fsasaki@w3.org>, "ietf-http-wg@w3.org Group" <ietf-http-wg@w3.org>, Paul Hoffman <phoffman@imc.org>
Message-ID: <157F4F253535B9C73F8EDC75@p3.JCK.COM>

--On Monday, 20 August, 2007 13:40 +1000 Mark Nottingham
<mnot@mnot.net> wrote:

> On 10/06/2007, at 6:05 PM, Martin Duerst wrote:
>> - RFC 2616 prescribes that headers containing non-ASCII have
>>   to use either iso-8859-1 or RFC 2047. This is unnecessarily
>>   complex and not necessarily followed. At the least, new
>>   extensions should be allowed to specify that UTF-8 is used.
> 
> My .02;
> 
> I'm concerned about allowing UTF-8; it may break existing
> implementations.

And whatever is done about it should be consistent with the EAI
work.  Otherwise, we are likely to find ourselves in big trouble
going down the line.

> I'd like to see the text just require that the actual
> character set be 8859-1, but to allow individual extensions to
> nominate encodings *like* 2047, without being restricted to it.
> For example, the encoding specified in 3987 is appropriate for
> URIs. However, it *has* to be explicit; I've heard some people
> read this requirement and think that they need to check
> *every* header for 2047 encoding.

Sigh.  My own sense is that, going forward, we need to lose
8859-N, not make it the default (or only) character set for more
protocols.  It is, to put it mildly, a little Euro-centric (and
not even completely suitable for Europe).  Much of the advantage
of Unicode is that one does not need to designate/nominate a
particular CCS or encoding and then maintain state for it... and
that is a fairly large advantage.  See also
draft-klensin-unicode-escapes-03.txt (probably expired, but you
should be able to find a copy somewhere -- I'll get back to it
sometime soon) for a discussion of issues in ASCII encoding of
multioctet character sets.  The IRI spec may constrain things
to encoding of octets, but that doesn't make it a good idea.
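
To illustrate the point above about designating a charset and then
maintaining state for it, here is a minimal Python sketch.  The sample
value and the variable names are assumptions chosen for illustration
only; nothing here is taken from RFC 2616, the EAI drafts, or any HTTP
implementation.  It shows the same text producing different octets
under iso-8859-1 and UTF-8, a silent misread when the receiver guesses
wrong, and an RFC 2047 encoded-word carrying its charset label inline.

```python
from email.header import Header, decode_header

# Hypothetical header value containing one non-ASCII character.
name = "D\u00fcrst"

# The same text yields different octets under each charset.
latin1_octets = name.encode("iso-8859-1")
utf8_octets = name.encode("utf-8")

# A receiver that assumes the wrong charset gets wrong text, not an error:
# every octet is a valid iso-8859-1 character, so nothing signals the mistake.
misread = utf8_octets.decode("iso-8859-1")

# An RFC 2047 encoded-word is pure ASCII and names its charset inline,
# so the receiver does not have to guess or track it separately.
encoded_word = Header(name, charset="utf-8").encode()
decoded_octets, charset = decode_header(encoded_word)[0]
round_trip = decoded_octets.decode(charset)

print(latin1_octets, utf8_octets)
print(repr(misread))
print(encoded_word, "->", round_trip)
```

If UTF-8 were the only encoding permitted, the receiver would have no
charset to guess or track, which is the advantage argued for above.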

If we are going to consider changes in this area, let's make
them improvements.  Locking in 8859-1 is not an improvement: it
would, IMO, be better to deprecate its use and require explicit
charset designation always if that is the only choice.

     john

Received on Monday, 20 August 2007 15:38:53 UTC