Re: PROPOSAL: i74: Encoding for non-ASCII headers

Frank Ellermann wrote:
> RFC 2616 unfortunately says that it is Latin-1.  Jumping from
> ASCII to UTF-8 could work if existing implementations have no
> issues with non-ASCII octets.   Jumping from Latin-1 to UTF-8
> or maybe not (legacy) isn't secure, implementations won't know
> what it really is, UTF-8, Latin-1, windows-1252, they'd pass it
> on with "guess" to applications, and I'm not confident that it
> cannot cause havoc.  
> 
> For ASCII gibberish with odd =?...?.?...?= words I'm ready to
> bet on "no problem", but random octets go against my instincts.
> That is why I proposed to exclude 0x80..0x9F from Mark's 2(b).

I see why your instincts go that way, but I think it's a mistake to
assume "no problem".  Those sequences, when decoded differently by
different applications, can hide a lot of nasty things.

You mentioned passing strings labelled "guess" to applications.

Another comment said that HTTP implementations don't have to handle
every character set they receive in =?...?=.  If they don't understand
one, they simply leave the encoding as-is.

What do you think an HTTP client/server is going to do when it
receives a string containing =?utf8?q?...?= _and_ =?koi8-r?q?....?=,
and the HTTP module doesn't handle koi8-r, and passes the string on to
the application?  How will it label that?  Is there a danger of
"double-decoding"?  (Answer: yes).  Is double-decoding a security
issue?  (Answer: yes, if any of these text strings cause security
issues.)  What about =?iso-8859-1?q?=00?=, =?iso-8859-1?q?=0a?=,
=?obscure?=00?=, other sequences which hide significant characters,
and also sequences which hide ordinary text which various code is
searching for (e.g. in User-Agent), and which different parsers decode
differently?  These are analogous to the URL security bugs which
viruses took advantage of a few years ago.
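
To make the double-decoding case concrete, here is a rough sketch in
Python.  It is not any real HTTP stack: decode_words, decode_q and the
"known_charsets" sets are hypothetical, and the decoder is a deliberately
simplified take on RFC 2047 "Q" encoding.  It assumes an HTTP layer that
only knows utf-8 and leaves other encoded-words as-is, and an application
that later runs its own decoder over the same string.

```python
import re

# Hypothetical, simplified matcher for =?charset?q?payload?= words.
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?[qQ]\?([^?\s]*)\?=")


def decode_q(payload: str) -> bytes:
    """Undo "Q" encoding: "_" is a space, "=XX" is one hex-encoded octet."""
    payload = payload.replace("_", " ")
    text = re.sub(
        r"=([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        payload,
    )
    return text.encode("latin-1")


def decode_words(value: str, known_charsets: set[str]) -> str:
    """Decode encoded-words whose charset is known; leave the rest as-is."""
    def repl(m: re.Match) -> str:
        charset = m.group(1).lower()
        if charset not in known_charsets:
            return m.group(0)  # unknown charset: pass through untouched
        return decode_q(m.group(2)).decode(charset)
    return ENCODED_WORD.sub(repl, value)


raw = "=?utf-8?q?caf=C3=A9?= =?koi8-r?q?=0AX-Evil:_1?="

# Pass 1: the HTTP module only understands utf-8.
once = decode_words(raw, {"utf-8"})

# Pass 2: the application decodes again and also understands koi8-r.
twice = decode_words(once, {"utf-8", "koi8-r"})

print(repr(once))   # the koi8-r word is still encoded here
print(repr(twice))  # now contains a raw line feed followed by "X-Evil: 1"
```

Any check for a line feed made between the two passes sees nothing wrong,
because the octet only appears once the second decoder has run.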

It's only because nobody decodes =?...?= at the moment (in HTTP) that
their existence isn't a problem.  If the recommendation to use them
for i18n text is taken seriously, and they are decoded, there are
ambiguity (and thus potential security) issues.  This is amplified by
different implementations decoding =?...?= strings differently, which
is guaranteed due to the open-ended character set names.

At least UTF-8, or "UTF-8 with ISO-8859-1 fallback for non-parseable
sequences", or indeed _anything_ :-) does not have as much scope for
different implementations to decode the text differently.
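
For comparison, a sketch of one possible reading of "UTF-8 with
ISO-8859-1 fallback", again in Python.  The function name is
hypothetical, and it assumes the fallback applies to the whole field
value rather than to individual sequences; the point is only that the
rule is fixed, so every implementation following it gets the same
result.

```python
def decode_header_octets(raw: bytes) -> str:
    """Decode header octets as UTF-8, else as ISO-8859-1 (assumed rule)."""
    try:
        # Strict decoding: any malformed sequence raises UnicodeDecodeError.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every octet to a character, so this cannot fail.
        return raw.decode("latin-1")


print(decode_header_octets(b"caf\xc3\xa9"))  # valid UTF-8 input
print(decode_header_octets(b"caf\xe9"))      # not valid UTF-8, falls back
```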

-- Jamie

Received on Saturday, 29 March 2008 02:10:17 UTC