- From: Jamie Lokier <jamie@shareable.org>
- Date: Thu, 27 Mar 2008 16:25:37 +0000
- To: Martin Duerst <duerst@it.aoyama.ac.jp>
- Cc: "Roy T. Fielding" <fielding@gbiv.com>, Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>
Martin Duerst wrote:
> we are supposed to be most interested in is the protocol as implemented.
> Just a few days ago, we have had what's probably the first
> report of some actual iso-8859-1 data (some Spanish with
> a wrong accent). But how much is that in terms of implementations?
>
> In other words, how much is actually going to break if we allow
> new headers to use UTF-8, and they go ahead and use it?

I'm in favour of allowing UTF-8. But we should probably also consider
what recipients should do on receiving _invalid_ UTF-8 in that case.

RFC3629 says:

    Implementations of the decoding algorithm above MUST protect
    against decoding invalid sequences. For instance, a naive
    implementation may decode the overlong UTF-8 sequence C0 80 into
    the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into
    U+233B4. Decoding invalid sequences may have security
    consequences or cause other problems. See Security Considerations
    (Section 10) below.

Firstly, which part of a recipient is expected to process the UTF-8,
or to pass it through? And should receipt of invalid UTF-8 be an
error condition, or something which doesn't matter?

From a pure protocol perspective, that means: what should proxies do
on receiving invalid UTF-8?

Secondly, what is invalid UTF-8? It depends on whether you're looking
at the ISO-10646 or Unicode definitions. What about the Java modified
UTF-8? Are some characters disallowed? (Similar questions apply to
IRIs.)

Anything currently sending ISO-8859-1 with non-ASCII characters would
almost certainly be invalid UTF-8. This is in fact useful. It is
quite common to test whether a byte sequence is valid UTF-8, and if
not, treat it as ISO-8859-1, because the test is quite effective at
distinguishing them in practice.

So, for informative text (i.e. non-protocol) such as text after a
status code, it might be appropriate to recommend that TEXT be parsed
as UTF-8 when valid, and ISO-8859-1 otherwise.

-- Jamie
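The "UTF-8 when valid, ISO-8859-1 otherwise" rule for informative TEXT can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name decode_text is hypothetical, and "valid" here means whatever Python's strict UTF-8 decoder accepts (which rejects overlong forms such as C0 80 and encoded surrogates such as ED A1 8C ED BE B4), not a definition the working group has agreed on.

```python
from typing import Tuple


def decode_text(raw: bytes) -> Tuple[str, str]:
    """Decode informative TEXT bytes; return (text, charset actually used).

    Hypothetical helper: tries strict UTF-8 first, then falls back to
    ISO-8859-1, as suggested for text such as the reason phrase.
    """
    try:
        # Strict decoding raises on overlong sequences, encoded
        # surrogates, truncated sequences and stray continuation bytes.
        return raw.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 assigns a code point to every byte value, so this
        # fallback cannot fail.
        return raw.decode("iso-8859-1"), "iso-8859-1"
```

Pure ASCII input is valid UTF-8, so it always takes the first branch; only bytes above 0x7F can trigger the fallback. A proxy that merely forwards the bytes would not need any of this.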
Received on Thursday, 27 March 2008 16:26:39 UTC