Re: PROPOSAL: i74: Encoding for non-ASCII headers

Robert Brewer wrote:
> Hrm. I'm not sure what "other encodings" includes. When Jamie Lokier
> says "I'm in favour of allowing UTF-8," does that mean the unicode
> string u'\u212bngstr\xf6m' would emit as:
> 
>     If-Match: %E2%84%ABngstr%C3%B6m

Quotation marks are needed (see grammar for entity-tag and
quoted-string) and the UTF-8 octets are presented directly on the
wire, not percent-encoded as you've done.
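
As a rough illustration of the difference (a sketch only; the variable
names are made up for this example), the header would be built from
the raw UTF-8 octets inside a quoted-string:

```python
# The example string from the quoted message: U+212B, "ngstr", U+00F6, "m".
tag = "\u212bngstr\u00f6m"

# Quoted-string containing the UTF-8 octets directly, with no
# percent-encoding. The names `tag` and `header` are illustrative only.
header = b'If-Match: "' + tag.encode("utf-8") + b'"'

print(header)
```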

> ...and how is the server supposed to know how to decode that?

In this case, it should match the octets within the quoted string,
which correspond with the octets it sent.

_Allowing_ UTF-8 does not mean that the client is expected to validate
or interpret UTF-8 in every place where TEXT occurs (e.g. an
entity-tag).  It simply means that UTF-8 is allowed, and that where
the TEXT is meaningfully treated as _characters_, it might be decoded
that way.

It isn't meaningful for a client to decode the octets of an entity-tag
into characters, though, and much better if it doesn't.
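
A small sketch of why opaque comparison is safer.  The Unicode
normalization step here is my own assumption about what a
character-aware client might do, not something any spec requires, and
etag_matches is a hypothetical helper:

```python
import unicodedata

# Entity-tag exactly as the server sent it (quotes included), as octets.
sent = '"\u212bngstr\u00f6m"'.encode("utf-8")


def etag_matches(stored: bytes, received: bytes) -> bool:
    """Compare entity-tags octet for octet, without decoding either side."""
    return stored == received


# Hypothetical hazard: a client decodes the tag to characters, applies NFC
# normalization, and re-encodes. U+212B (ANGSTROM SIGN) normalizes to U+00C5,
# so the octets change and the tag no longer matches.
roundtrip = unicodedata.normalize("NFC", sent.decode("utf-8")).encode("utf-8")

print(etag_matches(sent, sent))
print(etag_matches(sent, roundtrip))
```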

It _is_ meaningful for a client to decode the TEXT of Reason-Phrase,
e.g. for display purposes, but _not_ for protocol reasons.

RFC2616 implies an agent displaying a Reason-Phrase should decode it
as ISO-8859-1 and also apply RFC2047 decoding.  Clearly, some agents
don't do this.

It seems quite unlikely that any interop problems would be introduced
if that is respecified to say an agent should decode it as UTF-8 (and
be lenient), or anything similar.
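
One possible reading of "lenient", sketched below for display purposes
only.  The fallback to ISO-8859-1 is an assumption of this example, and
decode_reason_phrase is a hypothetical helper name:

```python
def decode_reason_phrase(raw: bytes) -> str:
    """Decode a Reason-Phrase for display: try UTF-8, else ISO-8859-1.

    The result is for showing to a user only; it should not feed back
    into any protocol decision.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every octet to a character, so this cannot fail.
        return raw.decode("iso-8859-1")


print(decode_reason_phrase("Nicht gefunden: \u00f6".encode("utf-8")))
print(decode_reason_phrase(b"caf\xe9"))
```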

Decoding to characters should _not_ happen for protocol purposes, as
"the TEXT rule is only used for descriptive field contents and values
that are not intended to be interpreted by the message parser".  So
anything which simply passes around TEXT (e.g. quoted-string,
entity-tag) should pass on the octets it receives without interpreting
or modifying them (except to scan for quotation marks etc. as required
for header parsing).
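
As a sketch of what "scan but don't interpret" could look like
(read_quoted_string is a hypothetical helper, and the backslash
handling follows my reading of the quoted-pair rule), the parser only
looks for quotation marks and backslashes and returns the octets
untouched:

```python
def read_quoted_string(data: bytes, start: int = 0) -> bytes:
    """Return the quoted-string at data[start], quotes included, as raw octets.

    Only the quotation mark (0x22) and backslash (0x5C) are examined; all
    other octets pass through unmodified. Octets of multi-byte UTF-8
    sequences are all >= 0x80, so they cannot be mistaken for either.
    """
    if data[start:start + 1] != b'"':
        raise ValueError("expected opening quotation mark")
    i = start + 1
    while i < len(data):
        octet = data[i]
        if octet == 0x5C:        # backslash: skip the escaped octet as well
            i += 2
        elif octet == 0x22:      # closing quotation mark
            return data[start:i + 1]
        else:
            i += 1
    raise ValueError("unterminated quoted-string")


print(read_quoted_string(b'"\xe2\x84\xabngstr\xc3\xb6m", "other"'))
```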

I suggest this includes _not_ attempting to decode & re-encode
protocol elements containing RFC2047 sequences, as in practice _that_
seems likely to cause interop and security issues: most current
software using HTTP would see it as an unexpected mangling of the
text.

-- Jamie

Received on Thursday, 27 March 2008 21:29:02 UTC