- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Fri, 28 Mar 2008 15:14:11 +0900
- To: Jamie Lokier <jamie@shareable.org>
- Cc: "Roy T. Fielding" <fielding@gbiv.com>, Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>
At 01:25 08/03/28, Jamie Lokier wrote:

>I'm in favour of allowing UTF-8. But we should probably also consider
>what recipients should do on receiving _invalid_ UTF-8 in that case.
>
>RFC3629 says:
>
>   Implementations of the decoding algorithm above MUST protect against
>   decoding invalid sequences. For instance, a naive implementation may
>   decode the overlong UTF-8 sequence C0 80 into the character U+0000,
>   or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding
>   invalid sequences may have security consequences or cause other
>   problems. See Security Considerations (Section 10) below.
>
>Firstly, which part of a recipient is expected to process the UTF-8,
>or to pass it through? And should receipt of invalid UTF-8 be an
>error condition, or something which doesn't matter? From a pure
>protocol perspective, that means what should proxies do on receiving
>invalid UTF-8.

Very clear: pass things through. As RFC 2616 allows RFC 2047 encoding,
and that already allows UTF-8, we already have that problem, and I don't
think any proxy is doing anything other than passing things through.
Checking is only necessary upon interpretation (e.g. conversion to
another encoding, use as a filename, ...).

>Secondly, what is invalid UTF-8? It depends if you're looking at the
>ISO-10646 or Unicode definitions.

Different definitions in different versions of these standards (and
STD 63/RFC 3629) may be slightly out of sync at any given time because
the different processes move at different speeds, but they are carefully
being synchronized.

Also, because many of these issues are security-related, it's not really
so much a question of what the standards allow; if you know about a
security issue, you make sure you deal with it, independent of whether a
given standard has already been updated to cover that issue or not.

>What about the Java modified UTF-8?

That's not UTF-8.

>Are some characters disallowed? (Similar questions apply to IRIs.)

The IRI spec defines this quite clearly for its purposes.

>Anything currently sending ISO-8859-1 would almost certainly be
>invalid UTF-8. This is in fact useful. It is quite common to test
>whether a byte sequence is valid UTF-8, and if not, treat it as
>ISO-8859-1, because the test is quite effective at distinguishing them
>in practice.

Yes indeed.

>So, for informative text (i.e. non protocol) such as text after a
>status code, it might be appropriate to recommend that TEXT be parsed
>as UTF-8 when valid, and ISO-8859-1 otherwise.

I agree with Albert that this would be a bad idea. I'm trying to propose
UTF-8 for new headers to get away from iso-8859-1, not to perpetuate
iso-8859-1.

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
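(A minimal sketch of the "valid UTF-8, else ISO-8859-1" test discussed above, assuming Python's strict utf-8 codec as the validity check; the helper name decode_header_text is hypothetical, not from RFC 2616 or any spec.)

    # Sketch: prefer UTF-8 when a byte sequence decodes strictly, otherwise
    # fall back to ISO-8859-1. Python's strict "utf-8" codec already rejects
    # the invalid sequences RFC 3629 warns about, e.g. the overlong C0 80.

    def decode_header_text(raw: bytes) -> str:
        """Decode header bytes, preferring UTF-8, falling back to Latin-1."""
        try:
            return raw.decode("utf-8")       # strict: invalid sequences raise
        except UnicodeDecodeError:
            return raw.decode("iso-8859-1")  # every byte sequence is valid here

    if __name__ == "__main__":
        print(decode_header_text("Zürich".encode("utf-8")))       # valid UTF-8
        print(decode_header_text("Zürich".encode("iso-8859-1")))  # falls back
        print(repr(decode_header_text(b"\xc0\x80")))              # overlong NUL rejected, read as Latin-1

Because almost no real ISO-8859-1 text happens to form valid UTF-8, the fallback in this sketch fires in practice exactly as Jamie describes; whether recipients should apply it to existing TEXT fields is the point Martin argues against.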
Received on Friday, 28 March 2008 06:17:13 UTC