From: Jamie Lokier <jamie@shareable.org>
Date: Fri, 28 Mar 2008 09:45:06 +0000
To: Stefan Eissing <stefan.eissing@greenbytes.de>
Cc: Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>
Stefan Eissing wrote:

> > 1) Change the character encoding on the wire to UTF-8
>
> -1
>
> Some people proposed to handle headers by just looking at the octets
> alone. While that is good advice to detect line ends and such, most
> HTTP client implementations *have to* convert the header names and
> values to "characters" since those are used in their own APIs and
> host languages.

Header names must be tokens anyway, which are a subset of US-ASCII. Nobody is proposing to change that.

Header values: HTTP implementations which must use "character strings" tend to interpret octets as though they are ISO-8859-1, since that is mentioned in the RFC. So there is no actual problem with receiving UTF-8, only that the character strings don't contain the right characters for display and such (but they can be converted to do so; the information is present).

This is actually no different from when they receive RFC2047 encodings, as most HTTP implementations will not decode those to the intended characters either, and those also have validation issues.

So, in the case of receiving RFC2047 _or_ binary UTF-8, HTTP implementations using character strings internally will actually pass character sequences which aren't the intended "meaningful" characters, except for those in the US-ASCII subset. In that respect, binary UTF-8 on the wire doesn't change anything from the present situation with RFC2047 :-)

If the wire encoding were officially UTF-8, those implementations would tend to validate what they receive as UTF-8, with issues if they receive something which doesn't validate. But the same applies in the present situation, in theory, as RFC2047 decoding can also fail to validate.

It seems to me exactly the same issues apply with binary UTF-8 as with RFC2047 encodings, in terms of where decoding and validation occur (or don't), and the way the mystical "I18n encoded into ISO-8859-1" is what is really handled within implementations.

-- Jamie
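[Editor's note: the round-trip argument in the message — that a UTF-8 header value read as ISO-8859-1 shows the wrong characters but loses no information, and that RFC 2047 encoded-words pass through undecoded clients as raw ASCII — can be sketched in Python. The header value "Grüße" is an invented example, not from the original exchange.]

```python
from email.header import decode_header

# A UTF-8 header value interpreted as ISO-8859-1: the character string is
# wrong for display, but every octet maps to exactly one Latin-1 character,
# so re-encoding recovers the intended text.
wire = "Grüße".encode("utf-8")              # octets on the wire
as_latin1 = wire.decode("iso-8859-1")       # what a Latin-1 client "sees"
assert as_latin1 != "Grüße"                 # mojibake, not the intended text
recovered = as_latin1.encode("iso-8859-1").decode("utf-8")
assert recovered == "Grüße"                 # the information was present

# The same value as an RFC 2047 encoded-word.  A client that never runs the
# decoder just passes the raw ASCII encoded-word through; one that does
# decode can still hit validation failures, exactly as with raw UTF-8.
encoded_word = "=?utf-8?b?R3LDvMOfZQ==?="
raw, charset = decode_header(encoded_word)[0]
assert raw.decode(charset) == "Grüße"
```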
Received on Friday, 28 March 2008 09:45:40 UTC