Re: PROPOSAL: i74: Encoding for non-ASCII headers from Martin Duerst on 2008-03-28 (ietf-http-wg@w3.org from January to March 2008)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Fri, 28 Mar 2008 15:14:11 +0900
To: Jamie Lokier <jamie@shareable.org>
Cc: "Roy T. Fielding" <fielding@gbiv.com>, Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>
Message-Id: <6.0.0.20.2.20080328150509.0662b2e0@localhost>

At 01:25 08/03/28, Jamie Lokier wrote:

>I'm in favour of allowing UTF-8.  But we should probably also consider
>what recipients should do on receiving _invalid_ UTF-8 in that case.
>
>RFC3629 says:
>
>   Implementations of the decoding algorithm above MUST protect against
>   decoding invalid sequences.  For instance, a naive implementation may
>   decode the overlong UTF-8 sequence C0 80 into the character U+0000,
>   or the surrogate pair ED A1 8C ED BE B4 into U+233B4.  Decoding
>   invalid sequences may have security consequences or cause other
>   problems.  See Security Considerations (Section 10) below.
>
>Firstly, which part of a recipient is expected to process the UTF-8,
>or to pass it through?  And should receipt of invalid UTF-8 be an
>error condition, or something which doesn't matter?  From a pure
>protocol perspective, that means what should proxies do on receiving
>invalid UTF-8.

Very clear: Pass things through. As RFC 2616 allows RFC 2047 encoding,
and that already allows UTF-8, we already have that problem, and I
don't think any proxy is doing anything other than passing things
through. Checking is only necessary upon interpretation (e.g.
conversion to another encoding, use as a filename, ...).
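
A minimal sketch of that split, in Python. The two function names are
hypothetical and are not taken from any proxy or library; the only
assumption is that the interpreting side wants strict UTF-8.

```python
def forward_header_value(raw: bytes) -> bytes:
    """Proxy role (hypothetical helper): pass the octets through unchanged.

    No decoding and no validation happen here.
    """
    return raw


def interpret_header_value(raw: bytes) -> str:
    """Interpreting role (hypothetical helper): decode strictly.

    Raises UnicodeDecodeError if the octets are not valid UTF-8, so the
    caller can decide what to do before, say, building a filename.
    """
    return raw.decode("utf-8", errors="strict")
```

Here forward_header_value(b"D\xfcrst") returns the same octets, while
interpret_header_value(b"D\xfcrst") raises UnicodeDecodeError, because
the lone octet FC is not valid UTF-8.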

>Secondly, what is invalid UTF-8?  It depends if you're looking at the
>ISO-10646 or Unicode definitions.

Different definitions in different versions of these standards
(and STD 63/RFC 3629) may be slightly out of sync at a certain time
because the different processes move at different speeds, but they
are carefully being synchronized. Also, because many of these issues
are security-related, it's not really so much an issue of what's
allowed by the standards; if you know about a security issue,
you make sure you deal with it independent of whether a given
standard has already been updated to include that issue or not.

>What about the Java modified UTF-8?

That's not UTF-8.
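
As an illustration, here is a small Python check. It assumes the usual
description of Java's modified UTF-8 (U+0000 written as C0 80, and
supplementary characters written as two three-octet surrogates); the
helper name is_strict_utf8 is made up for this sketch.

```python
def is_strict_utf8(raw: bytes) -> bool:
    """Return True only if raw is valid UTF-8 per Python's strict decoder."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Standard UTF-8 forms.
NUL_UTF8 = "\x00".encode("utf-8")             # single octet 00
U233B4_UTF8 = "\U000233B4".encode("utf-8")    # four octets F0 A3 8E B4

# Forms as described for Java's modified UTF-8 (assumption, see above).
NUL_MODIFIED = b"\xc0\x80"                    # overlong encoding of U+0000
U233B4_MODIFIED = b"\xed\xa1\x8c\xed\xbe\xb4"  # encoded surrogate pair
```

A strict decoder accepts the first two byte strings and rejects the
last two, which are exactly the examples RFC 3629 warns about.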

>Are some characters disallowed?  (Similar questions apply to IRIs.)

The IRI spec defines this quite clearly for its purposes.

>Anything currently sending ISO-8859-1 would almost certainly be
>invalid UTF-8.  This is in fact useful.  It is quite common to test
>whether a byte sequence is valid UTF-8, and if not, treat it as
>ISO-8859-1, because the test is quite effective at distinguishing them
>in practice.

Yes indeed.
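
The test described in the quoted text can be sketched in Python as
follows. The function name is hypothetical, and this only illustrates
the heuristic itself, not a recommendation for how TEXT should be
parsed.

```python
from typing import Tuple


def decode_utf8_or_latin1(raw: bytes) -> Tuple[str, str]:
    """Decode raw as UTF-8 if valid, otherwise as ISO-8859-1.

    Returns the decoded text and the name of the encoding that was used.
    ISO-8859-1 maps every octet to a character, so the fallback never fails.
    """
    try:
        return raw.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1"), "iso-8859-1"
```

For example, the ISO-8859-1 octets 44 FC 72 73 74 are rejected as UTF-8
and fall back to ISO-8859-1. The heuristic is not perfect: the octets
C3 BC are valid UTF-8 for U+00FC, so text that really was the two
ISO-8859-1 characters C3 and BC would be read as a single u-umlaut.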

>So, for informative text (i.e. non protocol) such as text after a
>status code, it might be appropriate to recommend that TEXT be parsed
>as UTF-8 when valid, and ISO-8859-1 otherwise.

I agree with Albert that this would be a bad idea. I'm trying to
propose UTF-8 for new headers to get away from iso-8859-1, not to
perpetuate iso-8859-1.

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp

Received on Friday, 28 March 2008 06:17:13 UTC