Re: PROPOSAL: i74: Encoding for non-ASCII headers from Mark Nottingham on 2008-03-25 (ietf-http-wg@w3.org from January to March 2008)

From: Mark Nottingham <mnot@mnot.net>
Date: Wed, 26 Mar 2008 10:51:07 +1100
To: Roy T. Fielding <fielding@gbiv.com>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Message-Id: <C73F6D80-8A65-40C7-B064-B8E0B56E49C2@mnot.net>

Apologies, I omitted some critical aspects. TEXT is defined as:

>    The TEXT rule is only used for descriptive field contents and values
>    that are not intended to be interpreted by the message parser. Words
>    of *TEXT MAY contain characters from character sets other than
>    ISO-8859-1 [22] only when encoded according to the rules of RFC 2047
>    [14].
>
>        TEXT           = <any OCTET except CTLs,
>                         but including LWS>

And this text defines header field-values:

>        field-content  = <the OCTETs making up the field-value
>                         and consisting of either *TEXT or combinations
>                         of token, separators, and quoted-string>

This has caused confusion, because one of the possible readings is  
that *all* header field-values have the potential for carrying RFC2047- 
encoded TEXT. This needs to be clarified.

Furthermore, quoted-string inherits TEXT:

>    A string of text is parsed as a single word if it is quoted using
>    double-quote marks.
>
>        quoted-string  = ( <"> *(qdtext | quoted-pair ) <"> )
>        qdtext         = <any TEXT except <">>


...but is used in places that clearly have the potential for being  
interpreted by the message parser (e.g., parameter values, etags,  
accept-, cache- and expectation-extensions).

E.g., a reasonable reading of the specification is that two ETags, one  
using RFC2047 encoding, and one not, are equal; I haven't checked, but  
I doubt that anyone has implemented this when they do comparison.
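
To make the ETag concern concrete, here is a minimal Python sketch. It is an illustration only, not something the specification prescribes: the helper name decode_rfc2047 and the sample tag values are hypothetical, the surrounding double quotes of the entity tag are left out for simplicity, and the ISO-8859-1 fallback for undeclared charsets is an assumption.

```python
from email.header import decode_header


def decode_rfc2047(value: str) -> str:
    """Decode any RFC 2047 encoded-words in value; plain text passes through."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Assumption: fall back to ISO-8859-1 when no charset is declared.
            parts.append(chunk.decode(charset or "iso-8859-1"))
        else:
            parts.append(chunk)
    return "".join(parts)


# Hypothetical tag contents (without the enclosing double quotes).
raw_tag = "caf\u00e9"                      # sent directly as ISO-8859-1
encoded_tag = "=?ISO-8859-1?Q?caf=E9?="    # same text as an RFC 2047 encoded-word

# What a typical implementation is likely to do: compare the values as-is.
octet_equal = raw_tag == encoded_tag

# What the "reasonable reading" would require: decode first, then compare.
decoded_equal = decode_rfc2047(raw_tag) == decode_rfc2047(encoded_tag)
```

Under these assumptions the as-is comparison treats the two tags as different, while the decode-then-compare version treats them as equal, which is the mismatch described above.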

One way to fix both of these problems is to state that encoding is  
possible in specific use cases, rather than having a blanket statement  
about it in TEXT that is easily missed and not well-implemented.

A secondary issue is what encoding should be used in those cases where  
it is reasonable to allow it. I'm not sure what the value of requiring  
that it be the same everywhere is; some payloads (e.g., IRIs, e-mail  
addresses) have well-defined "natural" encodings into ASCII that are  
more appropriate.
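
As a rough illustration of what "natural" encodings can look like, the Python sketch below maps an IRI path to ASCII by percent-encoding its UTF-8 octets and maps an internationalized host name to its ASCII form. The sample values are hypothetical, and Python's built-in "idna" codec is used as a stand-in for whichever IDNA rules a real header definition would reference.

```python
from urllib.parse import quote

# Hypothetical IRI path segment containing a non-ASCII character.
iri_path = "/caf\u00e9"
# Percent-encode the UTF-8 octets of the non-ASCII characters; "/" is kept.
uri_path = quote(iri_path)

# Hypothetical internationalized host name, as found in an IRI authority
# or in the domain part of an e-mail address.
idn_host = "b\u00fccher.example"
# Convert each label to its ASCII-compatible encoding.
ascii_host = idn_host.encode("idna").decode("ascii")
```

Neither result needs RFC 2047 encoded-words to travel in an ASCII-only header field.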

Mind you, personally I'm not religious about this; I just think that  
if we mandate RFC2047 encoding be used in new headers that need an  
encoding, we're going to be ignored, for potentially good reasons.

So, revised proposal:

  1) Remove "Words of *TEXT MAY contain characters from character sets  
other than ISO-8859-1 [22] only when encoded according to the rules of  
RFC 2047 [14]."
  2) Constrain TEXT to contain only characters from iso-8859-1.
  3) Add advice that, for a particular context of use, other  
characters MAY be encoded (whether that's strictly RFC2047, or more  
fine-grained advice TBD) by specifying it in that context.
  4) Add new issues for dealing with specific circumstances (e.g.,  
From, Content-Disposition, Warning) as necessary. If the outcome of #3  
is to require RFC2047, this is relatively straightforward.
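
A minimal Python sketch of the check implied by point 2, under stated assumptions: the function name is_latin1_text is hypothetical, the value is assumed to be already unfolded (so CR and LF are rejected and only SP and HT count as linear white space), and CTL is taken to mean octets 0-31 and 127.

```python
def is_latin1_text(value: str) -> bool:
    """Return True if value fits TEXT constrained to ISO-8859-1.

    Assumes an already-unfolded field value: SP and HT are allowed,
    every other control octet (0-31 and 127) is rejected.
    """
    try:
        octets = value.encode("iso-8859-1")
    except UnicodeEncodeError:
        # At least one character lies outside ISO-8859-1.
        return False
    for octet in octets:
        is_ctl = octet < 32 or octet == 127
        is_whitespace = octet in (9, 32)  # HT or SP
        if is_ctl and not is_whitespace:
            return False
    return True
```

For example, is_latin1_text("caf\u00e9") is true, while a value containing characters outside ISO-8859-1 would have to rely on whatever context-specific encoding point 3 ends up allowing.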


On 26/03/2008, at 5:21 AM, Roy T. Fielding wrote:

> On Mar 25, 2008, at 4:06 AM, Mark Nottingham wrote:
>>
>> Based upon discussion, a proposal for closing i74:
>>
>> * p1, section 2.2 -
>>
>>> The TEXT rule is only used for descriptive field contents and  
>>> values that are not intended to be interpreted by the message  
>>> parser. Words of *TEXT MAY contain characters from character sets  
>>> other than ISO-8859-1 [ISO-8859-1] only when encoded according to  
>>> the rules of [RFC2047].
>>
>>  - remove the requirement that only RFC2047 encoding be used;  
>> instead, recommend that context-specific encoding rules be used  
>> (giving examples), and failing that, the \u'nnnnnn' form from BCP137.
>>  - add new issues for dealing with specific circumstances (e.g.,  
>> From, Content-Disposition, Warning) as necessary.
>
> I see no reason to change the existing encoding requirement unless
> we are to allow raw UTF-8 in headers.  Anything else would just make
> the implementations worse.  BCP137 is not mature enough to use in  
> HTTP.
>
> ....Roy
>


--
Mark Nottingham     http://www.mnot.net/
Received on Tuesday, 25 March 2008 23:51:58 UTC