Re: Character encodings in headers [i74][was: Straw-man charter for http-bis] from Martin Duerst on 2007-08-20 (ietf-http-wg@w3.org from July to September 2007)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Mon, 20 Aug 2007 16:54:20 +0900
To: Mark Nottingham <mnot@mnot.net>
Cc: Julian Reschke <julian.reschke@gmx.de>, Paul Hoffman <phoffman@imc.org>, Apps Discuss <discuss@apps.ietf.org>, "ietf-http-wg@w3.org Group" <ietf-http-wg@w3.org>, "Richard Ishida" <ishida@w3.org>, Felix Sasaki <fsasaki@w3.org>
Message-Id: <6.0.0.20.2.20070820162657.08bf55a0@localhost>

Hello Mark,

Thanks for giving this an issue number.

At 12:40 07/08/20, Mark Nottingham wrote:
>On 10/06/2007, at 6:05 PM, Martin Duerst wrote:
>> - RFC 2616 prescribes that headers containing non-ASCII have to use
>>   either iso-8859-1 or RFC 2047. This is unnecessarily complex and
>>   not necessarily followed. At the least, new extensions should be
>>   allowed to specify that UTF-8 is used.
>
>My .02;
>
>I'm concerned about allowing UTF-8; it may break existing  
>implementations.
>
>I'd like to see the text just require that the actual character set  
>be 8859-1, but to allow individual extensions to nominate encodings  
>*like* 2047, without being restricted to it.

What do you mean by "encodings *like* 2047"? And why do you think
that UTF-8 may break existing implementations? UTF-8 has virtually
the same footprint in terms of bytes as ISO-8859-1: All bytes
above 0x7F may be used. Implementations that have to deal with
ISO-8859-1 usually do this by just being 8-bit-transparent;
that works for UTF-8, too.
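
To illustrate the 8-bit-transparency point, here is a minimal Python sketch.
It assumes an intermediary that reads every header byte as one ISO-8859-1
character and writes it back unchanged. The function name is hypothetical and
chosen only for this example.

```python
def survives_8bit_transparent_hop(value: str) -> bool:
    """Check that UTF-8 bytes pass unchanged through an ISO-8859-1 view."""
    utf8_bytes = value.encode("utf-8")
    # Every byte 0x00-0xFF is a valid ISO-8859-1 character, so this cannot fail.
    as_latin1 = utf8_bytes.decode("iso-8859-1")
    # Writing the characters back out yields exactly the original bytes.
    forwarded = as_latin1.encode("iso-8859-1")
    return forwarded == utf8_bytes and forwarded.decode("utf-8") == value


print(survives_8bit_transparent_hop("Dürst"))
print(survives_8bit_transparent_hop("日本語"))
```

This only shows that the bytes survive. An implementation that displays or
compares the value as ISO-8859-1 text would still see different characters.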

If your opinion is that UTF-8 cannot be allowed at all, then
that's going to be a problem for cases where it's already in
use, see e.g. earlier posts in the http list.

It's easy to say "may break existing implementations",
but in over 10 years of being involved in the Web, I haven't
heard about that happening. If you or anybody have, please
speak up.

>For example, the encoding  
>specified in 3987 is appropriate for URIs.

As one of the authors of RFC 3987, I know what you mean, but
"the encoding specified in 3987" wouldn't be enough for a spec.
Also, it's not really very well suited to the job, because
%hh-encoding is used to escape any bytes, not only UTF-8,
and there is a considerable length increase for some scripts.
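
Both points can be checked with a small Python sketch. It uses the standard
urllib.parse functions; the helper name and the sample strings are only
assumptions for illustration.

```python
from urllib.parse import quote, unquote_to_bytes


def utf8_and_escaped_lengths(text: str) -> tuple[int, int]:
    """Return (raw UTF-8 byte count, %hh-escaped character count)."""
    raw = text.encode("utf-8")
    # safe="" escapes everything except unreserved ASCII characters.
    escaped = quote(text, safe="")
    return len(raw), len(escaped)


for sample in ("duerst", "Dürst", "日本語"):
    print(sample, utf8_and_escaped_lengths(sample))

# %hh can carry any byte, including sequences that are not valid UTF-8.
stray = unquote_to_bytes("%FF")
print(stray)
```

Each non-ASCII byte becomes three characters, so a script whose characters
take three bytes in UTF-8 grows ninefold per character compared with one
character, and threefold compared with raw UTF-8.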

>However, it *has* to be  
>explicit; I've heard some people read this requirement and think that  
>they need to check *every* header for 2047 encoding.

I have read it that way, too. If it can be safely argued that
it was never intended that way, and that no harm is produced
if this is restricted, then I'd be fine with restricting it,
because checking everything for 2047 is indeed tough, but
I'd really like to make sure that this isn't creating problems.
(i.e. that a careful examination of the various headers makes
some reasonably conservative assumptions).
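
For readers who want to see what checking a value for 2047 encoding involves,
here is a rough Python sketch using the standard email.header module. The
helper name and the sample value are assumptions; nothing here says which
HTTP headers should be treated this way.

```python
from email.header import decode_header


def decode_2047(value: str) -> str:
    """Decode any RFC 2047 encoded-words found in a header value."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Unencoded chunks come back with charset None; assume ASCII here.
            parts.append(chunk.decode(charset or "ascii"))
        else:
            # Values without any encoded-word are returned as str unchanged.
            parts.append(chunk)
    return "".join(parts)


print(decode_2047("=?UTF-8?B?RMO8cnN0?="))
print(decode_2047("plain text"))
```

Running this scan over every header is exactly the cost at issue, which is
why restricting it to explicitly named headers is attractive.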

>So, I think this means;
>
>1) Change
>   "Words of *TEXT MAY contain characters from character sets other  
>than ISO-8859-1 [22] only when encoded according to the rules of RFC  
>2047 [14]."
>to
>   "Words of *TEXT MUST NOT contain characters from character sets  
>other than ISO-8859-1 [22]."
>and,
>
>2) Identify headers that may have non-8859

There are many parts to ISO-8859, not just ISO-8859-1.

>content and explicitly say  
>how to encode them (IRI, 2047, whatever; the existing ones will have  
>to be 2047, I believe), modifying their BNF to suit.
>
>3) When we document extensibility, require new headers to nominate  
>any encoding explicitly.

If that includes UTF-8, I'd be fine with it. If it excludes
UTF-8, I think that would be a problem.

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp

Received on Monday, 20 August 2007 07:56:15 UTC