- From: Jamie Lokier <jamie@shareable.org>
- Date: Thu, 27 Mar 2008 21:07:43 +0000
- To: Albert Lunde <atlunde@panix.com>
- Cc: HTTP Working Group <ietf-http-wg@w3.org>

Albert Lunde wrote:
> > Anything currently sending ISO-8859-1 would almost certainly be
> > invalid UTF-8. This is in fact useful. It is quite common to test
> > whether a byte sequence is valid UTF-8, and if not, treat it as
> > ISO-8859-1, because the test is quite effective at distinguishing them
> > in practice.
> >
> > So, for informative text (i.e. non protocol) such as text after a
> > status code, it might be appropriate to recommend that TEXT be parsed
> > as UTF-8 when valid, and ISO-8859-1 otherwise.
>
> Alternate character encodings have been fruitful ground for security
> attacks in the past. So I'd worry about adding too many alternate
> ways to interpret header bytes, even if there is a way to distinguish
> them in most cases.

That's a fair point, but:

    The TEXT rule is only used for descriptive field contents and values
    that are not intended to be interpreted by the message parser.

And:

    TEXT = <any OCTET except CTLs, but including LWS>

As a sender can choose any TEXT to send, and it has no
protocol-specified meaning, it is already allowed to transmit UTF-8
encoded characters (excluding CTLs) in any TEXT production, from the
perspective of what octet sequences implementations must accept.

It's also allowed to transmit ISO-8859-1 (excluding CTLs).

Of course, if the recipient displays the TEXT in a message, it might
not show the characters intended by the sender in either case.

There are a _lot_ of programs which interpret TEXT in reality: the
comment in a User-Agent header is often matched for known substrings.

Imho, it makes sense for that sort of matching to operate on the
octets without attempting character decoding, rather than after
character decoding. One reason is to avoid security holes, much as
you suggest. Even decoding RFC2047 has the potential to introduce
surprises there.

-- Jamie
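
For illustration only, here is a minimal Python sketch of the two ideas discussed above: decoding informative TEXT as UTF-8 when it is valid and as ISO-8859-1 otherwise, and matching substrings on the raw octets with no decoding at all. The function names `decode_text` and `comment_mentions` are hypothetical and not taken from any specification or library; the sketch assumes the header octets are already available as `bytes`.

```python
def decode_text(octets: bytes) -> str:
    """Decode informative TEXT for display only (hypothetical helper).

    Try strict UTF-8 first; if the octets are not valid UTF-8,
    fall back to ISO-8859-1, which maps every octet to a character.
    """
    try:
        return octets.decode("utf-8")  # strict: raises on invalid sequences
    except UnicodeDecodeError:
        return octets.decode("iso-8859-1")


def comment_mentions(header_value: bytes, token: bytes) -> bool:
    """Match a known substring on the raw octets (hypothetical helper).

    No character decoding is attempted, so the result cannot depend on
    which encoding the sender had in mind.
    """
    return token in header_value


if __name__ == "__main__":
    print(decode_text("café".encode("utf-8")))        # valid UTF-8
    print(decode_text("café".encode("iso-8859-1")))   # invalid UTF-8, falls back
    print(comment_mentions(b"Mozilla/5.0 (X11; Linux) Gecko", b"Gecko"))
```

The decoded string is meant for display only; any decision with protocol or security consequences would stay on the octet path.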
Received on Thursday, 27 March 2008 21:08:18 UTC