Re: PROPOSAL: i74: Encoding for non-ASCII headers from Jamie Lokier on 2008-03-27 (ietf-http-wg@w3.org from January to March 2008)

From: Jamie Lokier <jamie@shareable.org>
Date: Thu, 27 Mar 2008 21:07:43 +0000
To: Albert Lunde <atlunde@panix.com>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Message-ID: <20080327210743.GA29492@shareable.org>

Albert Lunde wrote:
> > Anything currently sending ISO-8859-1 would almost certainly be
> > invalid UTF-8.  This is in fact useful.  It is quite common to test
> > whether a byte sequence is valid UTF-8, and if not, treat it as
> > ISO-8859-1, because the test is quite effective at distinguishing them
> > in practice.
> > 
> > So, for informative text (i.e. non-protocol) such as text after a
> > status code, it might be appropriate to recommend that TEXT be parsed
> > as UTF-8 when valid, and ISO-8859-1 otherwise.
> 
> Alternate character encodings have been fruitful ground for security
> attacks in the past. So I'd worry about adding too many alternate
> ways to interpret header bytes, even if there is a way to distinguish
> them in most cases.

That's a fair point, but:

   The TEXT rule is only used for descriptive field contents and values
   that are not intended to be interpreted by the message parser.

And:

   TEXT           = <any OCTET except CTLs, but including LWS>

As a sender can choose any TEXT to send, and it has no
protocol-specified meaning, it is already allowed to transmit
UTF-8-encoded characters (excluding CTLs) in any TEXT production, as
far as which octet sequences implementations must accept.  It's also
allowed to transmit ISO-8859-1 (excluding CTLs).  Of course, if the
recipient displays the TEXT in a message, it might not show the
characters intended by the sender in either case.
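
For a recipient that does want to display TEXT, the "UTF-8 when valid,
ISO-8859-1 otherwise" rule quoted earlier could look roughly like the
sketch below.  The function name decode_text is made up for
illustration, and the sketch assumes the value has already been
extracted from the message as raw octets; it does not check for CTLs.

```python
def decode_text(octets: bytes) -> str:
    """Decode a TEXT value for display only (hypothetical helper).

    Try strict UTF-8 first; if the octets are not valid UTF-8,
    fall back to ISO-8859-1, which maps every octet to a character
    and therefore cannot fail.
    """
    try:
        return octets.decode("utf-8")       # strict error handling by default
    except UnicodeDecodeError:
        return octets.decode("iso-8859-1")


# The same text sent in either encoding displays identically.
as_utf8 = "D\u00e9j\u00e0 vu".encode("utf-8")
as_latin1 = "D\u00e9j\u00e0 vu".encode("iso-8859-1")
assert decode_text(as_utf8) == "D\u00e9j\u00e0 vu"
assert decode_text(as_latin1) == "D\u00e9j\u00e0 vu"
```

The result is only suitable for showing to a person.  A short
ISO-8859-1 string can still happen to be valid UTF-8, so the guess is
a heuristic, not a guarantee.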

There are a _lot_ of programs which interpret TEXT in reality: the
comment in a User-Agent header is often matched for known substrings.

Imho, it makes sense for that sort of matching to operate on the raw
octets rather than on decoded characters.  One reason is to avoid
security holes, much as you suggest.  Even decoding RFC 2047
encoded-words has the potential to introduce surprises there.
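
A minimal sketch of matching on octets, with no decoding step at all.
The helper name ua_mentions and the sample header values are invented
for illustration.  The last few lines show one kind of surprise that
decoding could introduce: an RFC 2047 style base64 encoded-word hides a
token from the octet matcher, but a matcher that decoded first would
see it.

```python
import base64


def ua_mentions(header_value: bytes, token: bytes) -> bool:
    """Look for a known substring in the raw octets of a header value.

    No character decoding is attempted, so the result depends only on
    the bytes that were actually on the wire.  (Hypothetical helper.)
    """
    return token in header_value


ua = b"Mozilla/5.0 (X11; U; Linux i686; en-US) Gecko/2008032620 Firefox/2.0.0.13"
assert ua_mentions(ua, b"Firefox/")
assert not ua_mentions(ua, b"Googlebot")

# A comment carrying a base64 encoded-word.  On the wire it does not
# contain the octets "Googlebot" ...
sneaky = b"Example/1.0 (=?utf-8?b?R29vZ2xlYm90?=)"
assert not ua_mentions(sneaky, b"Googlebot")

# ... but the encoded payload decodes to exactly that token, so the
# answer would change if matching ran after decoding.
assert base64.b64decode(b"R29vZ2xlYm90") == b"Googlebot"
```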

-- Jamie

Received on Thursday, 27 March 2008 21:08:18 UTC