Re: PROPOSAL: i74: Encoding for non-ASCII headers from Roy T. Fielding on 2008-03-26 (ietf-http-wg@w3.org from January to March 2008)

From: Roy T. Fielding <fielding@gbiv.com>
Date: Tue, 25 Mar 2008 17:40:20 -0700
To: Mark Nottingham <mnot@mnot.net>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Message-Id: <429253C5-5B12-49EB-862A-39C5D2106E9E@gbiv.com>

On Mar 25, 2008, at 4:51 PM, Mark Nottingham wrote:
> Apologies, I omitted some critical aspects. TEXT is defined as:
>
>>    The TEXT rule is only used for descriptive field contents and values
>>    that are not intended to be interpreted by the message parser. Words
>>    of *TEXT MAY contain characters from character sets other than ISO-
>>    8859-1 [22] only when encoded according to the rules of RFC 2047
>>    [14].
>>
>>        TEXT           = <any OCTET except CTLs,
>>                         but including LWS>
>
> And this text defines header field-values:
>
>>        field-content  = <the OCTETs making up the field-value
>>                         and consisting of either *TEXT or combinations
>>                         of token, separators, and quoted-string>
>
> This has caused confusion, because one of the possible readings is  
> that *all* header field-values have the potential for carrying  
> RFC2047-encoded TEXT. This needs to be clarified.

It may need to be clarified, but that is not a possible reading.
The ABNF rule field-content is only used in field-value, which is only
used in message-header, which in turn is only used to represent the
superset of all header fields in the generic parser. In other words,
it is not a defining element for header generation.
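
To illustrate what "generic parser" means here, a minimal sketch in
Python (not taken from any real implementation; the function name
split_header is hypothetical). It only separates the field name from
the field value and leaves the value as uninterpreted octets, which is
all that message-header describes:

```python
from typing import Tuple


def split_header(line: bytes) -> Tuple[str, bytes]:
    """Split one header line into (lowercased name, raw value octets).

    Hypothetical helper: the value is not decoded or validated, because
    the generic grammar says nothing about what a given field allows.
    """
    name, sep, value = line.partition(b":")
    if not sep:
        raise ValueError("not a header line: missing ':'")
    # Field names are case-insensitive, so normalize them for lookup.
    # Leading and trailing SP/HT around the value are not significant.
    return name.decode("ascii").lower(), value.strip(b" \t")


name, value = split_header(b'ETag: "xyzzy"')
```

Whether a value may carry RFC 2047 encoded-words is then a question
for the individual field definition, not for this step.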

> Furthermore, quoted-string inherits TEXT:
>
>>    A string of text is parsed as a single word if it is quoted using
>>    double-quote marks.
>>
>>        quoted-string  = ( <"> *(qdtext | quoted-pair ) <"> )
>>        qdtext         = <any TEXT except <">>
>
>
> ...but is used in places that clearly have the potential for being  
> interpreted by the message parser (e.g., parameter values, etags,  
> accept-, cache- and expectation-extensions).

Right, we should fix those cases that do not allow non-ASCII encodings.

> E.g., a reasonable reading of the specification is that two ETags,  
> one using RFC2047 encoding, and one not, are equal; I haven't  
> checked, but I doubt that anyone has implemented this when they do  
> comparison.
>
> One way to fix both of these problems is to state that encoding is
> possible in specific use cases, rather than having a blanket
> statement about it in TEXT that is easily missed and not
> well-implemented.

+1
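
For concreteness, a rough sketch of the ETag ambiguity described above,
using Python's standard email.header.decode_header to undo RFC 2047
encoding. The helper names and the sample values are assumptions made
up for illustration, and the opaque values are shown without their
surrounding double quotes:

```python
from email.header import decode_header


def decode_text_word(word: str) -> str:
    """Decode an RFC 2047 encoded-word; plain text passes through."""
    decoded = []
    for data, charset in decode_header(word):
        if isinstance(data, bytes):
            # Assumption: fall back to ISO-8859-1, the default for TEXT.
            decoded.append(data.decode(charset or "iso-8859-1"))
        else:
            decoded.append(data)
    return "".join(decoded)


def octets_equal(a: str, b: str) -> bool:
    """Comparison as implementations most likely do it today."""
    return a == b


def decoded_equal(a: str, b: str) -> bool:
    """Comparison under the reading that RFC 2047 applies to ETags."""
    return decode_text_word(a) == decode_text_word(b)


plain = "caf\u00e9"                    # raw ISO-8859-1 character
encoded = "=?ISO-8859-1?Q?caf=E9?="    # same text as an encoded-word

same_octets = octets_equal(plain, encoded)
same_decoded = decoded_equal(plain, encoded)
```

The two comparisons disagree for this pair, which is the mismatch at
issue: a comparison on the octets as sent sees two different tags.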

> A secondary issue is what encoding should be used in those cases  
> where it is reasonable to allow it. I'm not sure what the value of  
> requiring that it be the same everywhere is; some payloads (e.g.,  
> IRIs, e-mail addresses) have well-defined "natural" encodings into  
> ASCII that are more appropriate.

Unless we are going to change the protocol, the answer to that question
is ISO-8859-1 or RFC2047.  If we are going to change the protocol, then
the answer would be raw UTF-8 (HTTP doesn't care about the content of
TEXT as long as the encoding is a superset of ASCII, so the only
compatibility issue here is understanding the intent of the sender).
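
A small Python sketch of that compatibility issue, with sample strings
chosen purely for illustration: the ASCII part of a value survives
either interpretation, while the non-ASCII part depends on the
recipient guessing the charset the sender intended.

```python
# Octets a sender might emit if it used raw UTF-8 in a TEXT value.
sent = "caf\u00e9".encode("utf-8")

# A recipient that assumes UTF-8 recovers the intended text.
as_utf8 = sent.decode("utf-8")

# A recipient that assumes ISO-8859-1 (the HTTP/1.1 default for TEXT)
# decodes without error, but gets two characters for the final one.
as_latin1 = sent.decode("iso-8859-1")

# Pure ASCII is identical under both, since each is an ASCII superset.
ascii_only = b"text/html"
ascii_agrees = ascii_only.decode("utf-8") == ascii_only.decode("iso-8859-1")
```

Neither decode fails, so the parser itself is unaffected; only the
meaning seen by the recipient changes.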

> Mind you, personally I'm not religious about this; I just think  
> that if we mandate RFC2047 encoding be used in new headers that  
> need an encoding, we're going to be ignored, for potentially good  
> reasons.

What good reasons?  In this case, we are not mandating anything.
We are simply passing through the one and only defined i18n solution
for HTTP/1.1 because it was the only solution available in 1994.
If email clients can (and do) implement it, then so can WWW clients.

People who want to fix that should start queueing for HTTP/1.2.

> So, revised proposal:
>
>  1) Remove "Words of *TEXT MAY contain characters from character  
> sets other than ISO-8859-1 [22] only when encoded according to the  
> rules of RFC 2047 [14]."

We can just introduce a new production for quoted-tokens.

>  2) Constrain TEXT to contain only characters from iso-8859-1.

No, that breaks compliant senders.

>  3) Add advice that, for a particular context of use, other  
> characters MAY be encoded (whether that's strictly RFC2047, or more  
> fine-grained advice TBD) by specifying it in that context.
>  4) Add new issues for dealing with specific circumstances (e.g.,  
> From, Content-Disposition, Warning) as necessary. If the outcome of  
> #3 is to require RFC2047, this is relatively straightforward.

There is no great need that has been established to support any
changes to the allowed TEXT encoding other than to separate the
rules that don't actually allow that encoding.  IMO, changes to
HTTP/1.1 must be motivated by actual implementations.

....Roy
Received on Wednesday, 26 March 2008 00:40:50 UTC