- From: Jamie Lokier <jamie@shareable.org>
- Date: Thu, 27 Mar 2008 21:28:18 +0000
- To: Robert Brewer <fumanchu@aminus.org>
- Cc: Mark Nottingham <mnot@mnot.net>, Martin Duerst <duerst@it.aoyama.ac.jp>, "Roy T. Fielding" <fielding@gbiv.com>, HTTP Working Group <ietf-http-wg@w3.org>
Robert Brewer wrote:
> Hrm. I'm not sure what "other encodings" includes. When Jamie Lokier
> says "I'm in favour of allowing UTF-8," does that mean the unicode
> string u'\u212bngstr\xf6m' would emit as:
>
> If-Match: %E2%84%ABngstr%C3%B6m

Quotation marks are needed (see the grammar for entity-tag and quoted-string), and the UTF-8 octets are presented directly on the wire, not percent-encoded as you've done.

> ...and how is the server supposed to know how to decode that?

In this case, it should match the octets within the quoted string, which correspond with the octets it sent.

_Allowing_ UTF-8 does not mean that the client is expected to validate or interpret UTF-8 in every place where TEXT occurs (e.g. an entity-tag). It means simply that it's allowed, and where the text is meaningfully treated as _characters_, it might be decoded that way.

It isn't meaningful for a client to decode the octets of an entity-tag into characters, though, and it is much better if it doesn't.

It _is_ meaningful for a client to decode the TEXT of Reason-Phrase, e.g. for display purposes, but _not_ for protocol reasons. RFC2616 implies an agent displaying a Reason-Phrase should decode it as ISO-8859-1 characters and also apply RFC2047 decoding. Clearly, some agents don't do this. It seems quite unlikely to introduce any interop problems if that is respecified to say an agent should decode it as UTF-8 (and be lenient), or anything else similar.

Decoding to characters should _not_ happen for protocol purposes, as "the TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser".

So anything which simply passes around TEXT (e.g. quoted-string, entity-tag) should pass on the octets it receives without interpreting or modifying them (except to scan for quotation marks etc. as required for header parsing).
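To make the distinction concrete, here is a minimal Python sketch of the behaviour described above. The helper names (`if_match_header`, `entity_tag_octets`, `display_reason_phrase`) are made up for illustration and belong to no library. The parsing is deliberately simplified: it ignores weak tags (`W/"..."`), backslash escapes and multiple tags. The ISO-8859-1 fallback is only one possible reading of "lenient".

```python
from urllib.parse import quote

def if_match_header(tag_octets: bytes) -> bytes:
    """Put the tag octets on the wire verbatim, inside a quoted-string."""
    return b'If-Match: "' + tag_octets + b'"'

def entity_tag_octets(header_line: bytes) -> bytes:
    """Return the octets between the outer quotation marks, undecoded.

    Simplification (assumption): one strong tag, no backslash escapes.
    """
    value = header_line.split(b":", 1)[1].strip()
    if len(value) < 2 or not (value.startswith(b'"') and value.endswith(b'"')):
        raise ValueError("entity-tag must be a quoted-string")
    return value[1:-1]

def display_reason_phrase(raw: bytes) -> str:
    """Decode a Reason-Phrase for display only, never for protocol decisions.

    Tries UTF-8 first and falls back to ISO-8859-1, which accepts any octet.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")

# The server originally sent these octets as its entity-tag.
sent_tag = "\u212bngstr\xf6m".encode("utf-8")

# Raw UTF-8 octets inside quotation marks, as the reply describes.
wire = if_match_header(sent_tag)

# The percent-encoded form from the quoted message is a different octet string.
percent_form = quote("\u212bngstr\xf6m").encode("ascii")

# Matching is a plain octet comparison; nothing is decoded to characters.
matches = entity_tag_octets(wire) == sent_tag
percent_matches = entity_tag_octets(if_match_header(percent_form)) == sent_tag
```

Here `matches` is true and `percent_matches` is false: the server never needs to know an encoding for the tag, it only compares what it receives with what it sent.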
I suggest this includes _not_ attempting to decode & re-encode protocol elements containing RFC2047 sequences, as in practice _that_ seems like it would cause interop and security issues, as most current software using HTTP would see that as an unexpected mangling of the text.

-- Jamie
Received on Thursday, 27 March 2008 21:29:02 UTC