Re: i74: Encoding for non-ASCII headers from Mark Nottingham on 2008-03-18 (ietf-http-wg@w3.org from January to March 2008)

From: Mark Nottingham <mnot@mnot.net>
Date: Tue, 18 Mar 2008 14:18:23 +1100
To: Julian Reschke <julian.reschke@gmx.de>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Message-Id: <C6597EF7-1380-4D7F-A754-8971B3C4BACF@mnot.net>

Given what's below, I wonder whether specifying a single encoding for
new headers is a practical thing to do; we may already end up
referring to RFCs 3987, 2822, 2047, and 2231 for existing headers,
because each of those domains (URIs, e-mail addresses, and so on)
already has an appropriate encoding.

If that's the case, maybe we should just recommend that new headers
use the encoding scheme most appropriate to their domain, list a few
examples (see above), and fall back to recommending (say) \u'nnnnnn'
from BCP137 if nothing more specific applies.
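
To make that fallback concrete, here is a minimal Python sketch of the
\u'nnnnnn' form from BCP137 section 5 (four to six hex digits between
the quotes). The function names are made up for illustration, and the
sketch assumes the input contains no literal \u'...' sequences of its
own; a real header definition would also have to say how those are
escaped.

```python
import re

# Hypothetical helper names; not taken from any spec or library.

def bcp137_escape(value: str) -> str:
    # Replace every non-ASCII character with \u'XXXX', using 4 to 6
    # uppercase hex digits for the code point.
    out = []
    for ch in value:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        else:
            out.append("\\u'%04X'" % cp)
    return "".join(out)

# Matches \u' followed by 4 to 6 hex digits and a closing quote.
_ESCAPE = re.compile(r"\\u'([0-9A-Fa-f]{4,6})'")

def _decode(match):
    cp = int(match.group(1), 16)
    # Leave out-of-range values untouched rather than raising.
    return chr(cp) if cp <= 0x10FFFF else match.group(0)

def bcp137_unescape(value: str) -> str:
    # Inverse of bcp137_escape under the assumption stated above.
    return _ESCAPE.sub(_decode, value)
```

The result is pure ASCII, so it fits the RFC2616 field-value grammar
without touching the Latin-1 question at all.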

On 17/03/2008, at 11:48 PM, Julian Reschke wrote:

>> If so, the next step would be to craft recommendations /  
>> requirements about what that mechanism will be. Possibilities  
>> discussed:
>> a) RFC2047
>
> I haven't seen any evidence of this being implemented.
>
>> b) UTF-8
>
> Unfortunately, RFC2616, Section 4.2 currently states:
>
>    message-header = field-name ":" [ field-value ]
>    field-name     = token
>    field-value    = *( field-content | LWS )
>    field-content  = <the OCTETs making up the field-value
>                     and consisting of either *TEXT or combinations
>                     of token, separators, and quoted-string>
>
> Thus, if we take that as the final word, we can't use anything but
> Latin-1, and thus need to encode non-Latin-1 characters.
>
>> c) Something from BCP137 section 5
>
> ...which would be \u'nnnnnn' or &#xnnnnnn;...
>
>> d) IRI->URI
>
>> Separately, we'd need to open new issues for specifying these  
>> encodings for the field-values of:
>>  - From
>
> ...this one is currently defined in terms of RFC2822, Section 3.4...
>
>>  - Warning
>
> Currently explicitly refers to RFC2047.
>
>>  - Content-Location
>>  - Location
>>  - Referer
>
> These are URI references. No non-ASCII characters anyway.
>
>>  - Content-Disposition (?)
>
> Content-Disposition uses I18N *inside* the parameters, for which  
> there already is RFC2231.
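
For reference, the RFC2231 parameter form mentioned above looks like
name*=charset'language'percent-encoded-value. A rough Python sketch,
with hypothetical function names and using only urllib.parse from the
standard library; it ignores RFC2231 continuations (name*0*, name*1*)
and is not meant as a complete implementation:

```python
from urllib.parse import quote, unquote

# Hypothetical helper names; only quote/unquote are real library calls.

def rfc2231_param(name, value, charset="UTF-8", language=""):
    # Percent-encode the value's bytes in the given charset, then
    # prefix it with charset'language' as RFC2231 describes.
    encoded = quote(value, safe="", encoding=charset)
    return "%s*=%s'%s'%s" % (name, charset, language, encoded)

def rfc2231_decode(ext_value):
    # Split charset'language'encoded and undo the percent-encoding.
    charset, _language, encoded = ext_value.split("'", 2)
    return unquote(encoded, encoding=charset)
```

For example, rfc2231_param("filename", "naïve.txt") yields
filename*=UTF-8''na%C3%AFve.txt, which again is ASCII on the wire.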

--
Mark Nottingham     http://www.mnot.net/

Received on Tuesday, 18 March 2008 03:19:01 UTC