RE: PROPOSAL: i74: Encoding for non-ASCII headers

Mark Nottingham wrote:
> >> My intent was not to disallow RFC2047, but rather to allow other
> >> encodings into iso-8859-1 where appropriate.
> ...
> 
> Roy said (to paraphrase) that IRIs do not show up in HTTP -- that
> they're just URIs. I agree with that, but only as far as you can view
> IRIs as an encoding into ASCII (albeit an imperfect one, because you
> can't round-trip them, since there's a bit of ambiguity).
> 
> RFC2047 is also an encoding into ASCII; it is not a character encoding
> in its own right. In that sense, it's a peer of BCP137 and other
> schemes that do similar things. They all end up taking characters from
> a set greater than that available to iso-8859-1 and encoding them into
> a subset of it (usually ASCII) using escape sequences.
> 
> That being the case, my question is this: is it realistic to require
> all headers to use RFC2047 encoding, to the exclusion of BCP137, etc?

BCP137 itself says "...this specification does not recommend one
specific syntax." That is, I don't see them as peers. RFC2047 is how
HTTP "defines the syntax" for TEXT already, which means any compliant
HTTP/1.1 implementation already has code for this. How wide are we going
to open the floodgates for other encodings? As a server author, I'd
rather not have to add large chunks of code in 2008 to become "http-bis
compliant". I'm pretty happy with RFC2047.

> I could understand such a requirement if we had a blanket requirement
> that RFC2047 encoding could occur anywhere, so that implementations
> could blindly decode/encode headers as necessary, whether they
> recognised them or not. However, we're not going in that direction,
> because it's not reasonable to implement...

I don't understand. From where I sit, that sounds not only like a snap to
write from scratch, but like something with the potential to simplify a
lot of codebases.
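
As a rough illustration only (not taken from any real server), a blanket
decode pass could look something like the Python sketch below. The
function name decode_all and the dict-of-strings header representation
are assumptions made for the example, and it deliberately ignores the
question of where TEXT may legally appear in a given header.

```python
from email.header import decode_header


def decode_all(headers):
    """Hypothetical helper: RFC 2047-decode every header value blindly.

    `headers` is assumed to be a dict mapping header names to raw string
    values; the header names are never inspected.
    """
    decoded = {}
    for name, raw in headers.items():
        pieces = []
        for chunk, charset in decode_header(raw):
            if isinstance(chunk, bytes):
                # Unlabelled bytes fall back to iso-8859-1, the HTTP/1.1
                # default for TEXT.
                pieces.append(chunk.decode(charset or "iso-8859-1"))
            else:
                # No encoded-word found: the value comes back unchanged.
                pieces.append(chunk)
        decoded[name] = "".join(pieces)
    return decoded


example = decode_all({
    "If-Match": "=?utf-8?q?=E2=84=ABngstr=C3=B6m?=",
    "Host": "example.org",
})
```

Values without an encoded-word pass through untouched, so the same loop
can run over headers the implementation does not recognise.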

> ...and in any case the encoding
> is already tied to the semantics of the headers somewhat, since you
> have to recognise the header to understand its structure enough to
> know where TEXT may appear (i.e., it's not a complete blanket, just an
> uneven one over TEXT).
> 
> That being the case, I can't help but see the RFC2047 requirement as
> spurious, and the most straightforward thing to do would seem to be to
> ditch the spurious requirement and move on -- without disallowing
> RFC2047 encoding from being specified in a particular header if that
> makes sense, but not disallowing other encodings either.

Hrm. I'm not sure what "other encodings" includes. When Jamie Lokier
says "I'm in favour of allowing UTF-8," does that mean the Unicode
string u'\u212bngstr\xf6m' would be emitted as:

    If-Match: %E2%84%ABngstr%C3%B6m

...and how is the server supposed to know how to decode that? There's
one thing that RFC2047 provides that other, more minimal, encoding
schemes do not provide: the small bit of metadata that actually declares
which encoding is being used. If you want to encode your non-ASCII
header as UTF-8, fine, that's not in opposition to RFC2047:

    If-Match: =?utf-8?q?=E2=84=ABngstr=C3=B6m?=

It's not only utf-8 but the server knows it's utf-8 without having to
sniff anything.
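
To make the difference concrete, here is a small Python sketch that
produces both forms from the same string and then decodes them. The
q_encode helper is hypothetical and minimal (it escapes every byte that
is not an ASCII letter or digit and does not enforce the RFC2047 length
limit on encoded-words); the rest uses the standard library.

```python
from email.header import decode_header
from urllib.parse import quote, unquote

value = "\u212bngstr\xf6m"  # the string from the example above


def q_encode(text, charset="utf-8"):
    """Hypothetical helper: wrap text as one RFC 2047 'Q' encoded-word."""
    out = []
    for byte in text.encode(charset):
        ch = chr(byte)
        # Keep ASCII letters and digits; escape every other byte as =XX.
        out.append(ch if ch.isascii() and ch.isalnum() else "=%02X" % byte)
    return "=?%s?q?%s?=" % (charset, "".join(out))


percent = quote(value.encode("utf-8"))
encoded_word = q_encode(value)

# Percent-escapes carry no charset label, so the receiver has to guess.
right_guess = unquote(percent, encoding="utf-8")
wrong_guess = unquote(percent, encoding="iso-8859-1")

# The encoded-word names its own charset, so no guessing is needed.
raw_bytes, charset = decode_header(encoded_word)[0]
labelled = raw_bytes.decode(charset)
```

Here percent and encoded_word match the two If-Match values shown above.
Decoding the percent-escaped form only recovers the original string if
the receiver happens to guess utf-8, whereas the encoded-word hands the
charset back alongside the bytes.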


Robert Brewer
fumanchu@aminus.org

Received on Thursday, 27 March 2008 17:28:24 UTC