Re: IRIs, IDNAbis, and HTTP from Julian Reschke on 2008-03-14 (ietf-http-wg@w3.org from January to March 2008)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Fri, 14 Mar 2008 13:57:38 +0100
To: Brian Smith <brian@briansmith.org>
CC: 'HTTP Working Group' <ietf-http-wg@w3.org>
Message-ID: <47DA7642.40504@gmx.de>

Brian Smith wrote:
>> ???
>>
>> <http://greenbytes.de/tech/webdav/rfc2616.html#basic.rules.quoted-string>:
>>
>>      quoted-string  = ( <"> *(qdtext | quoted-pair ) <"> )
>>      qdtext         = <any TEXT except <">>
> 
> <any TEXT except <">> is not equivalent to *TEXT.

I would argue that the intent of that production is clearly to inherit 
the rules for TEXT.

Funnily enough, this issue is one of the remaining blockers for the 
conversion to ABNF; we really need to clarify TEXT, and all productions 
based on TEXT.
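
To make the lexical question concrete, here is a rough sketch of the 
quoted-string production as a Python regular expression over octets. It 
assumes header folding has already been removed, and it assumes that 
backslash is excluded from qdtext so that quoted-pair is unambiguous 
(RFC 2616 does not say this explicitly). The names QDTEXT, QUOTED_PAIR 
and unquote are made up for illustration.

```python
import re

# qdtext: any TEXT octet except <">. TEXT excludes CTLs (0-31, 127) but
# allows HT via LWS. Assumption: backslash (0x5C) is also excluded here.
QDTEXT = rb'[\t\x20\x21\x23-\x5b\x5d-\x7e\x80-\xff]'
# quoted-pair: backslash followed by any US-ASCII CHAR (0-127).
QUOTED_PAIR = rb'\\[\x00-\x7f]'
QUOTED_STRING = re.compile(rb'"(?:' + QDTEXT + rb'|' + QUOTED_PAIR + rb')*"')


def unquote(value: bytes) -> bytes:
    """Check value against the sketch grammar and strip the quoting."""
    if not QUOTED_STRING.fullmatch(value):
        raise ValueError("not a quoted-string")
    # Drop the surrounding quotes, then resolve each quoted-pair.
    return re.sub(rb'\\([\x00-\x7f])', rb'\1', value[1:-1])
```

Note that this only answers the lexical question. Whether the resulting 
octets are then subject to the RFC 2047 rule that TEXT carries is exactly 
the part that needs clarifying.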

>> I think this is the intent.
> 
> Then you run into the question "How are media-ranges and media-types
> compared? Are they to be decoded into Unicode and then compared?" When

Yes.

> the specification specifies that ETags must match exactly, is the
> comparison character-by-character or octet-by-octet?

That's really not relevant as long as the producer of ETags always uses 
the same representation.

But I do agree that we probably need to look at each case where 
quoted-string is used and decide whether it requires I18N or not.

>>> Also, the Reason-phrase of the status line is defined as:
>>>
>>> 	*<TEXT, excluding CR, LF>
>>>
>>> But, is the RFC 2047 mechanism allowed in the Reason-phrase?
>> I would think so.
> 
> Again, the grammar for reason-phrase is not *TEXT, that is why it isn't
> clear

But this is indeed one of the cases where I18N makes sense.

Any chance that some of the original authors can explain the history here?

>>> And, if it is read liberally, then it is
>> I disagree.
>>
>>> allowed in way too many places. And, if it is allowed 
>>> anywhere, there should be some advice as to what
>>> encodings should be supported.
>> From the headers above, where do you think it shouldn't be allowed?
> 
> Consider:
> 
>   Content-Type: text/plain;charset="=?utf-8?q?utf-8?="
>   (how do you compare this against 'text/plain;charset="utf-8"'?)

I would have hoped that RFC2045 answers this, but it doesn't seem to 
include a definition of quoted-string.
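
For illustration, a minimal sketch of what "decode, then compare" could 
look like for the charset parameter, using decode_header from Python's 
standard email.header module as the RFC 2047 decoder. The helper names 
are hypothetical, and the case-insensitive match reflects the fact that 
HTTP charset names are case-insensitive tokens. This is not a claim about 
what any existing implementation does.

```python
from email.header import decode_header


def decode_rfc2047(value: str) -> str:
    """Decode any RFC 2047 encoded-words in value; plain text passes through."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Encoded-words come back as bytes plus the declared charset.
            parts.append(chunk.decode(charset or "ascii"))
        else:
            # No encoded-word present: decode_header returns the str as-is.
            parts.append(chunk)
    return "".join(parts)


def charset_equal(a: str, b: str, *, decode: bool) -> bool:
    """Compare two charset parameter values (already unquoted)."""
    if decode:
        a, b = decode_rfc2047(a), decode_rfc2047(b)
    return a.lower() == b.lower()
```

With decode=True the value from the example above matches "utf-8"; with 
decode=False it does not. A recipient has to know which of the two the 
spec intends.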

>   ETag: "=?utf-8?q?asdf?="
>   (how do you compare this against "asdf"?)
>
>   ETag: "=?"
>   (Is this a lexical error?)

For ETag, I'd say it's not a problem. If the server producing the ETags 
wants to cause problems, let it do so.
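
A small sketch of why I think so, assuming ETags are treated as opaque 
and never RFC 2047-decoded (function names are made up): comparing 
octets, and comparing characters after decoding every octet with one 
fixed single-byte charset, cannot give different answers.

```python
def etag_equal_octets(a: bytes, b: bytes) -> bool:
    """Opaque comparison: the entity tags match only if the octets match."""
    return a == b


def etag_equal_chars(a: bytes, b: bytes) -> bool:
    """Character comparison under ISO-8859-1.

    ISO-8859-1 maps every octet to exactly one character, so this always
    agrees with etag_equal_octets.
    """
    return a.decode("iso-8859-1") == b.decode("iso-8859-1")


# The two example ETags from above are simply different opaque values.
first = b'"=?utf-8?q?asdf?="'
second = b'"asdf"'
same_by_octets = etag_equal_octets(first, second)
same_by_chars = etag_equal_chars(first, second)
```

Both results are False here. The two tags would only be considered equal 
if the recipient applied RFC 2047 decoding before comparing.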

>> I do agree that if we rely on RFC2047, we may also have to 
>> spend some time improving that document.
> 
> Keep in mind that RFC2047 has a limit of 75 characters per encoded-word.
> And, the grammar seems to allow encoded-words to be mixed with unencoded
> words. And, Base-64 encoding to be mixed with quoted-printable. And,
> multiple encodings (e.g. UTF-8 and UTF-7) to be mixed. All in the same
> *TEXT segment. There is definitely a lot to be improved, but each
> improvement would be an incompatible change.

It seems the only way to improve on RFC 2047 would be to introduce a new 
encoding that is sane, such as:

"Any octet sequence starting with EF BB BF (the UTF-8 BOM) is to be 
interpreted as Unicode, encoded in UTF-8."
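
As a rough sketch of how a recipient might apply such a rule (the 
function name is hypothetical, and the ISO-8859-1 fallback is my 
assumption, taken from the RFC 2616 default for TEXT):

```python
import codecs


def decode_field_text(raw: bytes) -> str:
    """Decode a TEXT octet sequence under the BOM-prefix rule sketched above."""
    if raw.startswith(codecs.BOM_UTF8):
        # Leading EF BB BF: the remainder is UTF-8.
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    # Assumption: anything else keeps the RFC 2616 default, ISO-8859-1.
    return raw.decode("iso-8859-1")
```

A value that happens to start with the ISO-8859-1 characters for those 
three octets would be misread, so this would still be an incompatible 
change, just a much smaller one.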

BR, Julian

Received on Friday, 14 March 2008 12:58:22 UTC