- From: Jamie Lokier <jamie@shareable.org>
- Date: Sat, 29 Mar 2008 02:09:39 +0000
- To: Frank Ellermann <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>
- Cc: ietf-http-wg@w3.org
Frank Ellermann wrote:
> RFC 2616 unfortunately says that it is Latin-1. Jumping from
> ASCII to UTF-8 could work if existing implementations have no
> issues with non-ASCII octets. Jumping from Latin-1 to UTF-8
> or maybe not (legacy) isn't secure, implementations won't know
> what it really is, UTF-8, Latin-1, windows-1252, they'd pass it
> on with "guess" to applications, and I'm not confident that it
> cannot cause havoc.
>
> For ASCII gibberish with odd =?...?.?...?= words I'm ready to
> bet on "no problem", but random octets go against my instincts.
> That is why I proposed to exclude 0x80..0x9F from Mark's 2(b).

I see why your instincts go that way, but I think it's a mistake to assume "no problem". Those sequences, when decoded differently by different applications, can hide a lot of nasty things.

You mentioned passing strings labelled "guess" to applications. Another comment said that HTTP implementations don't have to handle every character set they receive in =?...?=. If they don't understand one, they simply leave the encoding as-is.

What do you think an HTTP client/server is going to do when it receives a string containing =?utf8?q?...?= _and_ =?koi8-r?q?....?=, and the HTTP module doesn't handle koi8-r, and passes the string on to the application? How will it label that? Is there a danger of "double-decoding"? (Answer: yes.) Is double-decoding a security issue? (Answer: yes, if any of these text strings can cause security issues.)

What about =?iso-8859-1?q?=00?=, =?iso-8859-1?q?=0a?=, =?obscure?=00?=, other sequences which hide significant characters, and also sequences which hide ordinary text which various code is searching for (e.g. in User-Agent), and which different parsers decode differently?

These are analogous to the URL security bugs which viruses took advantage of a few years ago.

It's only because nobody decodes =?...?= at the moment (in HTTP) that their existence isn't a problem. If the recommendation to use them for i18n text is taken seriously, and they are decoded, there are ambiguity (and thus potential security) issues. This is amplified by different implementations decoding =?...?= strings differently, which is guaranteed by the open-ended set of character-set names.

At least UTF-8, or "UTF-8 with ISO-8859-1 fallback for non-parseable sequences", or indeed _anything_ :-) does not have as much scope for different implementations to decode the text differently.

-- Jamie
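[As a minimal sketch of the point about hidden characters (not part of the original message, and assuming Python's standard-library RFC 2047 decoder), the snippet below shows how encoded-words can smuggle a NUL, a line feed, or an ordinary token such as "Mozilla" past code that only inspects the raw header text before any decoding happens.]

    # Hypothetical illustration: RFC 2047 encoded-words vs. raw-header matching.
    from email.header import decode_header

    samples = [
        "=?iso-8859-1?q?=00?=",  # decodes to a NUL byte
        "=?iso-8859-1?q?=0a?=",  # decodes to a line feed
        "=?utf-8?q?Mozilla?=",   # hides the literal text "Mozilla" from raw matching
    ]

    for raw in samples:
        decoded = b"".join(
            part if isinstance(part, bytes) else part.encode("ascii")
            for part, charset in decode_header(raw)
        )
        # A filter that only inspects `raw` never sees the bytes in `decoded`;
        # a second decoder downstream may see something different again.
        print(f"{raw!r:32} -> {decoded!r}")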
Received on Saturday, 29 March 2008 02:10:17 UTC