Re: Character encodings in headers [i74][was: Straw-man charter for http-bis] from Keith Moore on 2007-08-20 (ietf-http-wg@w3.org from July to September 2007)

From: Keith Moore <moore@cs.utk.edu>
Date: Mon, 20 Aug 2007 15:29:46 -0400
To: der Mouse <mouse@Rodents.Montreal.QC.CA>
CC: discuss@apps.ietf.org, Felix Sasaki <fsasaki@w3.org>, ietf-http-wg@w3.org, Richard Ishida <ishida@w3.org>
Message-ID: <46C9EBAA.8030608@cs.utk.edu>

der Mouse wrote:
>> I think you present a valid scenario.  However, storing headers as
>> iso-8859-1 essentially means storing (and resending) them as bytes.
>>
>
> Depends on how much checking is done.  The C0 and C1 ranges are not
> valid 8859-x text (except for a few codes in C0, like HT), but, as
> Clive points out, C1 does, in general, occur in UTF-8-encoded text.
>
> I recognize there's a "who would bother to check" tendency.  While I
> share it, I also believe the number of distinct implementations out
> there is large enough that anything permitted by the spec has probably
> been done (and, of course, a great many things not permitted by the
> spec, but I see no reason to care about compatibility with them).  In
> particular, any implementation whose native text encoding is not 8859-1
> may be recoding headers into its native encoding for storage and back
> again on output, and that is almost certain to corrupt C1 octets.
I suspect that the problem is not so much transparency as
presentation.  The larger set of things broken by allowing UTF-8 in
existing header fields (and, to a lesser extent, in new fields) will not
be things that forbid C1 octet values, but rather things that try to
display those fields as if they were 8859-1.  Translation of the
presumed 8859-1 into other charsets is another version of the same problem.
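
To make the distinction concrete, here is a minimal sketch in Python.  The
field value and the C1-stripping step are hypothetical illustrations, not
taken from any particular implementation.  It shows that a UTF-8 value can
contain octets in the C1 range, that reading it as 8859-1 is transparent
but displays the wrong characters, and that an implementation which does
check for C1 damages the value.

```python
# Hypothetical field value, chosen because "ß" (U+00DF) is C3 9F in UTF-8.
value = "Straße"
wire = value.encode("utf-8")                    # b'Stra\xc3\x9fe'

# Octets that land in the C1 range (0x80-0x9F) when read as 8859-x.
c1_octets = [b for b in wire if 0x80 <= b <= 0x9F]

# Presentation: decoding as 8859-1 never fails, it just shows the wrong text.
shown = wire.decode("iso-8859-1")               # 'StraÃ\x9fe'

# Transparency: an untouched 8859-1 round trip returns the same octets.
round_trip = shown.encode("iso-8859-1")

# A stricter (hypothetical) implementation that drops C1 controls
# leaves a lone lead byte behind, which is no longer valid UTF-8.
stripped = "".join(ch for ch in shown if not 0x80 <= ord(ch) <= 0x9F)
damaged = stripped.encode("iso-8859-1")         # b'Stra\xc3e'

print(c1_octets, repr(shown), round_trip == wire, damaged)
```

The round trip succeeds only because nothing looked at the characters in
between; the display step and the stripping step are where things go wrong.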

Received on Monday, 20 August 2007 19:30:28 UTC