From: Jamie Lokier <jamie@shareable.org>
Date: Fri, 28 Mar 2008 09:45:06 +0000
To: Stefan Eissing <stefan.eissing@greenbytes.de>
Cc: Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>
Stefan Eissing wrote:

> > 1) Change the character encoding on the wire to UTF-8
>
> -1
>
> Some people proposed to handle headers by just looking at the octets
> alone. While that is good advice to detect line ends and such, most
> HTTP client implementations *have to* convert the header names and
> values to "characters" since those are used in their own APIs and
> host languages.

Header names must be tokens anyway, which are a subset of US-ASCII. Nobody is proposing to change that.

Header values: HTTP implementations which must use "character strings" tend to interpret octets as though they are ISO-8859-1, since that is mentioned in the RFC. So there is no actual problem with receiving UTF-8, only that the character strings don't contain the right characters for display and such (but they can be converted to do so; the information is present).

This is actually no different from when they receive RFC2047 encodings, as most HTTP implementations will not decode those to the intended characters either, and those also have validation issues.

So, in the case of receiving RFC2047 _or_ binary UTF-8, HTTP implementations using character strings internally will actually pass character sequences which aren't the intended "meaningful" characters, except for those in the US-ASCII subset. In that respect, binary UTF-8 on the wire doesn't change anything from the present situation with RFC2047 :-)

If the wire encoding were officially UTF-8, those implementations would tend to validate what they receive as UTF-8, with issues if they receive something which doesn't validate. But the same applies in the present situation, in theory, as RFC2047 decoding can also fail to validate.

It seems to me exactly the same issues apply with binary UTF-8 as with RFC2047 encodings, in terms of where decoding and validation occur (or don't), and the way the mystical "I18n encoded into ISO-8859-1" is what is really handled within implementations.

-- Jamie
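[Editor's note: the round-trip argument in the message — that a UTF-8 header value read as ISO-8859-1 shows the wrong characters but loses no information, and that RFC 2047 encoded-words pass through undecoded clients as raw ASCII — can be sketched in Python. The header value "Grüße" is an invented example, not from the original exchange.]

```python
from email.header import decode_header

# A UTF-8 header value interpreted as ISO-8859-1: the character string is
# wrong for display, but every octet maps to exactly one Latin-1 character,
# so re-encoding recovers the intended text.
wire = "Grüße".encode("utf-8")              # octets on the wire
as_latin1 = wire.decode("iso-8859-1")       # what a Latin-1 client "sees"
assert as_latin1 != "Grüße"                 # mojibake, not the intended text
recovered = as_latin1.encode("iso-8859-1").decode("utf-8")
assert recovered == "Grüße"                 # the information was present

# The same value as an RFC 2047 encoded-word.  A client that never runs the
# decoder just passes the raw ASCII encoded-word through; one that does
# decode can still hit validation failures, exactly as with raw UTF-8.
encoded_word = "=?utf-8?b?R3LDvMOfZQ==?="
raw, charset = decode_header(encoded_word)[0]
assert raw.decode(charset) == "Grüße"
```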
Received on Friday, 28 March 2008 09:45:40 UTC