Re: PROPOSAL: i74: Encoding for non-ASCII headers

Martin Duerst wrote:
> we are supposed to be most interested in is the protocol as implemented.
> Just a few days ago, we have had what's probably the first
> report of some actual iso-8859-1 data (some Spanish with
> a wrong accent). But how much is that in terms of implementations?
> 
> In other words, how much is actually going to break if we allow
> new headers to use UTF-8, and they go ahead and use it? 

I'm in favour of allowing UTF-8.  But we should probably also consider
what recipients should do on receiving _invalid_ UTF-8 in that case.

RFC 3629 says:

   Implementations of the decoding algorithm above MUST protect against
   decoding invalid sequences.  For instance, a naive implementation may
   decode the overlong UTF-8 sequence C0 80 into the character U+0000,
   or the surrogate pair ED A1 8C ED BE B4 into U+233B4.  Decoding
   invalid sequences may have security consequences or cause other
   problems.  See Security Considerations (Section 10) below.
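
To make the two examples in that passage concrete, here is a rough
sketch in Python, assuming only the standard library's strict "utf-8"
codec.  The helper name is_strict_utf8 is mine, purely for
illustration, and is not taken from any specification:

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True only if data decodes as well-formed UTF-8."""
    try:
        # Python's strict decoder rejects overlong forms and
        # UTF-8-encoded surrogate code points.
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# The two invalid sequences named in the RFC text.
overlong_nul = bytes([0xC0, 0x80])
encoded_surrogates = bytes([0xED, 0xA1, 0x8C, 0xED, 0xBE, 0xB4])

# The well-formed encoding of U+233B4, for comparison.
well_formed = "\U000233B4".encode("utf-8")

for label, sample in [
    ("overlong U+0000", overlong_nul),
    ("encoded surrogates", encoded_surrogates),
    ("well-formed U+233B4", well_formed),
]:
    print(label, sample.hex(" "), is_strict_utf8(sample))
```

A decoder that accepted the first two sequences would be the "naive
implementation" the RFC warns about.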

Firstly, which part of a recipient is expected to process the UTF-8,
or to pass it through?  And should receipt of invalid UTF-8 be an
error condition, or something which doesn't matter?  From a pure
protocol perspective, the question is what proxies should do on
receiving invalid UTF-8.

Secondly, what is invalid UTF-8?  It depends on whether you're looking
at the ISO-10646 or Unicode definitions.  What about Java's modified
UTF-8?  Are some characters disallowed?  (Similar questions apply to
IRIs.)

Anything currently sending non-ASCII ISO-8859-1 would almost certainly
be invalid UTF-8.  This is in fact useful.  It is quite common to test
whether a byte sequence is valid UTF-8, and if not, treat it as
ISO-8859-1, because the test is quite effective at distinguishing them
in practice.

So, for informative text (i.e. non-protocol) such as text after a
status code, it might be appropriate to recommend that TEXT be parsed
as UTF-8 when valid, and ISO-8859-1 otherwise.
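
As a minimal sketch of that recommendation, assuming Python's standard
codecs (the function name decode_text is hypothetical, not something
from the HTTP drafts):

```python
def decode_text(raw: bytes) -> str:
    """Decode informative header text: UTF-8 if valid, else ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte value maps to a character in ISO-8859-1,
        # so this fallback cannot fail.
        return raw.decode("iso-8859-1")


# "España" as sent by a legacy ISO-8859-1 sender and by a UTF-8 sender.
legacy = b"Espa\xf1a"
modern = b"Espa\xc3\xb1a"

print(decode_text(legacy) == decode_text(modern))
```

The test is a heuristic, not a guarantee: ISO-8859-1 text which
happens to form a valid UTF-8 sequence (the two bytes C3 B1, say)
would be read as UTF-8.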

-- Jamie

Received on Thursday, 27 March 2008 16:26:39 UTC