RE: IRIs, IDNAbis, and HTTP from Brian Smith on 2008-03-14 (ietf-http-wg@w3.org from January to March 2008)

From: Brian Smith <brian@briansmith.org>
Date: Fri, 14 Mar 2008 05:31:22 -0700
To: "'HTTP Working Group'" <ietf-http-wg@w3.org>
Message-ID: <004601c885cf$50356460$4001a8c0@T60>

Julian Reschke wrote:
> Brian Smith wrote:
> > ...
> > It is not clear whether or not the RFC 2047 mechanism can be used in
> > quoted-string, because quoted-string is not defined in terms of
> > "*TEXT", but rather a similar construct. Given all the places that
> > quoted-string
> 
> ???
> 
> <http://greenbytes.de/tech/webdav/rfc2616.html#basic.rules.quoted-string>:
> 
>      quoted-string  = ( <"> *(qdtext | quoted-pair ) <"> )
>      qdtext         = <any TEXT except <">>

<any TEXT except <">> is not equivalent to *TEXT.

> I think this is the intent.

Then you run into the question "How are media-ranges and media-types
compared? Are they to be decoded into Unicode and then compared?" When
the specification specifies that ETags must match exactly, is the
comparison character-by-character or octet-by-octet?

> > Also, the Reason-phrase of the status line is defined as:
> > 
> > 	*<TEXT, excluding CR, LF>
> > 
> > But, is the RFC 2047 mechanism allowed in the Reason-phrase?
> 
> I would think so.

Again, the grammar for Reason-Phrase is not *TEXT; that is why it isn't
clear.

> > And, if it is read liberally, then it is
> 
> I disagree.
> 
> > allowed in way too many places. And, if it is allowed 
> > anywhere, there should be some advice as to what
> > encodings should be supported.
> 
>  From the headers above, where do you think it shouldn't be allowed?

Consider:

  Content-Type: text/plain;charset="=?utf-8?q?utf-8?="
  (how do you compare this against 'text/plain;charset="utf-8"'?)

  ETag: "=?utf-8?q?asdf?="
  (how do you compare this against "asdf"?)

  ETag: "=?"
  (Is this a lexical error?)
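
A minimal sketch of the comparison problem in the examples above, assuming
Python's standard email.header.decode_header as a stand-in RFC 2047 decoder.
Nothing in RFC 2616 mandates this decoder, and the helper names (decode_2047,
octet_equal, decoded_equal) are hypothetical. The surrounding double quotes
are left out so that only the opaque tag value is compared.

```python
from email.header import decode_header


def decode_2047(value: str) -> str:
    """Decode any RFC 2047 encoded-words in value; other text passes through."""
    parts = []
    for chunk, charset in decode_header(value):
        # decode_header yields bytes for decoded words and str for plain input.
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii", errors="replace")
        parts.append(chunk)
    return "".join(parts)


def octet_equal(a: str, b: str) -> bool:
    """Compare the values exactly as they appear on the wire."""
    return a.encode("ascii") == b.encode("ascii")


def decoded_equal(a: str, b: str) -> bool:
    """Compare the values after RFC 2047 decoding (an assumed policy)."""
    return decode_2047(a) == decode_2047(b)


encoded_tag = "=?utf-8?q?asdf?="
plain_tag = "asdf"

same_octets = octet_equal(encoded_tag, plain_tag)
same_decoded = decoded_equal(encoded_tag, plain_tag)

# This particular decoder treats the incomplete "=?" as literal text rather
# than rejecting it; another implementation could reasonably differ.
incomplete = decode_2047("=?")
```

The two policies disagree on the same pair of tags, which is why the
specification would have to say which comparison is meant.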

> I do agree that if we rely on RFC2047, we may also have to 
> spend some time improving that document.

Keep in mind that RFC2047 has a limit of 75 characters per encoded-word.
And, the grammar seems to allow encoded-words to be mixed with unencoded
words. And, Base-64 encoding to be mixed with quoted-printable. And,
multiple encodings (e.g. UTF-8 and UTF-7) to be mixed. All in the same
*TEXT segment. There is definitely a lot to be improved, but each
improvement would be an incompatible change.
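
A rough sketch of what such mixing looks like in one field value. The regular
expression is a simplified approximation of the RFC 2047 encoded-word syntax,
not the full grammar, and the sample string and the survey helper are
hypothetical. ISO-8859-1 is used as the second charset here purely for
illustration.

```python
import re

# Simplified pattern for an encoded-word: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")

# RFC 2047 limits each encoded-word to 75 characters.
MAX_ENCODED_WORD_LENGTH = 75


def survey(text: str):
    """Return (charset, encoding, length) for each encoded-word in text."""
    return [
        (m.group(1).lower(), m.group(2).upper(), len(m.group(0)))
        for m in ENCODED_WORD.finditer(text)
    ]


# One Base64 word in UTF-8, one unencoded word, one quoted-printable word
# in ISO-8859-1, all in the same value.
sample = "=?utf-8?b?w6k=?= plain =?iso-8859-1?q?=E9?="

found = survey(sample)
charsets = {charset for charset, _, _ in found}
encodings = {encoding for _, encoding, _ in found}
too_long = [n for _, _, n in found if n > MAX_ENCODED_WORD_LENGTH]
```

A recipient that wants to compare or display such a value has to handle every
charset and both encodings it finds, plus the unencoded text in between.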

- Brian

Received on Friday, 14 March 2008 12:31:58 UTC