Re: Character encodings in headers [i74][was: Straw-man charter for http-bis]

> I think you present a valid scenario.  However, storing headers as
> iso-8859-1 essentially means storing (and resending) them as bytes.

Depends on how much checking is done.  The C0 and C1 ranges are not
valid 8859-x text (except for a few codes in C0, like HT), but, as
Clive points out, octets in the C1 range (0x80-0x9F) do, in general,
occur in UTF-8-encoded text, as continuation octets.

I recognize there's a "who would bother to check" tendency.  While I
share it, I also believe the number of distinct implementations out
there is large enough that anything permitted by the spec has probably
been done (and, of course, a great many things not permitted by the
spec, but I see no reason to care about compatibility with them).  In
particular, any implementation whose native text encoding is not 8859-1
may be recoding headers into its native encoding for storage and back
again on output, and that is almost certain to corrupt C1 octets.
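
To make the failure concrete, here is a minimal sketch of such a round
trip.  It assumes a hypothetical implementation whose native encoding is
Windows-1252 and which substitutes "?" for anything it cannot represent;
the function names and the X-Name header are made up for illustration,
and real implementations may fail in other ways.

```python
def c1_octets(data):
    """Return the octets of data that fall in the C1 range 0x80-0x9F."""
    return [b for b in data if 0x80 <= b <= 0x9F]


def recode_round_trip(header, native="cp1252"):
    """Simulate storing a header in a native encoding and re-emitting it.

    Assumption: the implementation reads the octets as ISO-8859-1,
    converts to its native encoding with replacement on failure, and
    converts back to ISO-8859-1 on output.
    """
    text = header.decode("iso-8859-1")
    stored = text.encode(native, errors="replace")
    return stored.decode(native).encode("iso-8859-1", errors="replace")


# U+0141 (L with stroke) is C5 81 in UTF-8; 0x81 is a C1 octet.
original = "X-Name: \u0141".encode("utf-8")
resent = recode_round_trip(original)
```

Here c1_octets(original) reports the single octet 0x81, and resent
differs from original: the 0x81 comes back as "?", so the UTF-8 sequence
is no longer recoverable.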

The only fix I can see for that is to do something like UTF-8, but
tweaked to keep all octets in the ISO-8859-x printable space.  I've been
unable to come up with a way of doing this by just changing the fixed
bits in UTF-8; it seems to me to require putting only five (rather than
six) bits of data in the second and later octets.  (I suspect this
wouldn't fly, simply because UTF-8 is too entrenched, but it's the only
way I can see to be strictly compatible.  It also has the disadvantage
that part of the BMP needs four octets rather than the three that UTF-8
needs.)
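
A rough sketch of one layout along those lines follows.  The exact bit
assignment is an assumption made for illustration, not a worked-out
proposal: ASCII stays as single octets, lead octets use the UTF-8-style
prefixes starting at 0xC0, and continuation octets are 0xA0-0xBF, each
carrying five bits.  The names encode_cp, decode and LAYOUT are
hypothetical.

```python
# (octet count, lead-octet marker, payload bits carried by the lead octet)
LAYOUT = ((2, 0xC0, 5), (3, 0xE0, 4), (4, 0xF0, 3), (5, 0xF8, 2))


def encode_cp(cp):
    """Encode one code point; every non-ASCII octet is >= 0xA0."""
    if cp < 0x80:
        return bytes([cp])
    for n, marker, lead_bits in LAYOUT:
        if cp < (1 << (lead_bits + 5 * (n - 1))):
            out = [marker | (cp >> (5 * (n - 1)))]
            # Continuation octets: 101xxxxx, five payload bits each.
            for i in range(n - 2, -1, -1):
                out.append(0xA0 | ((cp >> (5 * i)) & 0x1F))
            return bytes(out)
    raise ValueError("code point too large for this sketch")


def decode(data):
    """Inverse of encode_cp over a whole byte string."""
    chars, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            cp, n = b, 1
        else:
            for n, marker, lead_bits in LAYOUT:
                if b >> lead_bits == marker >> lead_bits:
                    cp = b & ((1 << lead_bits) - 1)
                    break
            else:
                raise ValueError("invalid lead octet")
            tail = data[i + 1:i + n]
            if len(tail) != n - 1 or any(not 0xA0 <= t <= 0xBF for t in tail):
                raise ValueError("invalid continuation octet")
            for t in tail:
                cp = (cp << 5) | (t & 0x1F)
        chars.append(chr(cp))
        i += n
    return "".join(chars)
```

With this layout three octets carry only 14 bits, so U+4000 through
U+FFFF take four octets, and code points above U+3FFFF take five.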

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

Received on Monday, 20 August 2007 15:39:26 UTC