Re: PROPOSAL: i74: Encoding for non-ASCII headers from Jamie Lokier on 2008-03-27 (ietf-http-wg@w3.org from January to March 2008)

From: Jamie Lokier <jamie@shareable.org>
Date: Thu, 27 Mar 2008 16:25:37 +0000
To: Martin Duerst <duerst@it.aoyama.ac.jp>
Cc: "Roy T. Fielding" <fielding@gbiv.com>, Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>
Message-ID: <20080327162537.GA22889@shareable.org>

Martin Duerst wrote:
> we are supposed to be most interested in is the protocol as implemented.
> Just a few days ago, we have had what's probably the first
> report of some actual iso-8859-1 data (some Spanish with
> a wrong accent). But how much is that in terms of implementations?
> 
> In other words, how much is actually going to break if we allow
> new headers to use UTF-8, and they go ahead and use it? 

I'm in favour of allowing UTF-8.  But we should probably also consider
what recipients should do on receiving _invalid_ UTF-8 in that case.

RFC3629 says:

   Implementations of the decoding algorithm above MUST protect against
   decoding invalid sequences.  For instance, a naive implementation may
   decode the overlong UTF-8 sequence C0 80 into the character U+0000,
   or the surrogate pair ED A1 8C ED BE B4 into U+233B4.  Decoding
   invalid sequences may have security consequences or cause other
   problems.  See Security Considerations (Section 10) below.
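
To make the RFC's two examples concrete, here is a minimal sketch in
Python.  It assumes Python 3's built-in strict "utf-8" codec as the
decoder; the helper name is_valid_utf8 is made up for illustration.
The byte values are the ones from the quoted passage.

```python
# Byte sequences taken from the RFC 3629 passage quoted above.
OVERLONG_NUL = bytes([0xC0, 0x80])                            # overlong form of U+0000
ENCODED_SURROGATES = bytes([0xED, 0xA1, 0x8C, 0xED, 0xBE, 0xB4])  # UTF-8-encoded surrogate pair


def is_valid_utf8(data: bytes) -> bool:
    """Return True only if the strict decoder accepts data as UTF-8.

    Hypothetical helper: it relies on Python's strict "utf-8" codec,
    which refuses overlong forms and encoded surrogates.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for name, sample in [("overlong NUL", OVERLONG_NUL),
                         ("encoded surrogates", ENCODED_SURROGATES),
                         ("plain ASCII", b"OK")]:
        print(name, is_valid_utf8(sample))
```

A naive decoder that only masks and shifts bits would accept the first
two samples; a strict one should reject both.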

Firstly, which part of a recipient is expected to process the UTF-8,
or to pass it through?  And should receipt of invalid UTF-8 be an
error condition, or something which doesn't matter?  From a pure
protocol perspective, the question is what proxies should do on
receiving invalid UTF-8.

Secondly, what is invalid UTF-8?  It depends on whether you're looking
at the ISO-10646 or Unicode definitions.  What about Java's modified
UTF-8?  Are some characters disallowed?  (Similar questions apply to
IRIs.)

Anything currently sending non-ASCII ISO-8859-1 would almost certainly
be invalid UTF-8.  This is in fact useful.  It is quite common to test
whether a byte sequence is valid UTF-8, and if not, treat it as
ISO-8859-1, because the test is quite effective at distinguishing them
in practice.

So, for informative text (i.e. non-protocol) such as text after a
status code, it might be appropriate to recommend that TEXT be parsed
as UTF-8 when valid, and ISO-8859-1 otherwise.
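
The "UTF-8 when valid, ISO-8859-1 otherwise" rule could look roughly
like this.  It is only a sketch: the function name decode_text is
hypothetical, and it assumes the whole field value is available as
bytes and that Python 3's strict "utf-8" codec is an acceptable
definition of "valid".

```python
def decode_text(raw: bytes) -> str:
    """Decode an informative TEXT value (hypothetical helper).

    Try strict UTF-8 first; if the bytes are not valid UTF-8, fall back
    to ISO-8859-1, which maps every byte value to a code point and so
    never raises a decoding error.
    """
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


if __name__ == "__main__":
    # The same word sent in each encoding decodes to the same string.
    print(decode_text("se\u00f1or".encode("utf-8")))
    print(decode_text("se\u00f1or".encode("iso-8859-1")))
```

The heuristic is not perfect: a few ISO-8859-1 byte sequences also
happen to be valid UTF-8 (0xC3 0xA9, for instance), and those would be
read as UTF-8.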

-- Jamie

Received on Thursday, 27 March 2008 16:26:39 UTC