Re: Content-Location: character set

On 10/02/2011, at 12:19 PM, Martin J. Dürst wrote:

> Hello Mark,
> 
> On 2011/02/10 9:49, Mark Nottingham wrote:
>> 
>> I think we should add an explicit statement to the specification regarding the character set; e.g.,
>> 
>> """
>> Note that field-values containing characters outside of the ISO-8859-1 character set [ref] are invalid.
>> """
>> 
>> Probably near the grammar.
>> 
>> Thoughts?
> 
> I don't see the point of this, on two levels:
> 
> - On a higher level, it's hopelessly outdated and against long-standing
>  IETF policy (UTF-8), and useless in wide parts of the world.

This is rehashing a very old argument. The WG's current consensus is reflected in this text in p1:

"""
Historically, HTTP has allowed field content with text in the ISO-8859-1 [ISO-8859-1] character encoding and supported other character sets only through use of [RFC2047] encoding.  In practice, most HTTP header field values use only a subset of the US-ASCII character encoding [USASCII].  Newly defined header fields SHOULD limit their field values to US-ASCII characters.  Recipients SHOULD treat other (obs-text) octets in field content as opaque data.
"""

I.e., the door is left open for adventurous new headers to be defined in UTF-8. However, we're not talking about a new header here; we're talking about an established header with several implementations. 

Breaking previously correct implementations with such a change violates our charter.

I suggest explicitly declaring such field-values invalid so that receivers who choose to implement error handling (as per the thread I started recently) have a "hook" to do so; i.e., if a string fails to decode as ISO-8859-1, they can fall back to trying it as UTF-8. It's not particularly elegant to do it this way, but it is workable given the constraints we have.
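
A minimal sketch of what such receiver-side error handling might look like, in Python. The function name `decode_field_value` is hypothetical and not taken from any implementation. One caveat: Python's `iso-8859-1` codec maps every byte value, so that decode never fails; the sketch therefore assumes the receiver tries strict UTF-8 first and falls back to ISO-8859-1, which is an ordering assumption rather than anything the spec text prescribes.

```python
from typing import Tuple


def decode_field_value(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used) for a raw header field value.

    Hypothetical helper: the decode order below is an assumption.
    """
    # Common case: the value is plain US-ASCII.
    try:
        return raw.decode("ascii"), "us-ascii"
    except UnicodeDecodeError:
        pass

    # Non-ASCII (obs-text) octets are present. Strict UTF-8 decoding
    # rejects most byte sequences that were really ISO-8859-1.
    try:
        return raw.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps all 256 byte values, so this cannot fail.
        return raw.decode("iso-8859-1"), "iso-8859-1"


if __name__ == "__main__":
    for value in (b"/docs/index.html", b"/caf\xc3\xa9", b"/caf\xe9"):
        print(value, decode_field_value(value))
```

Short ISO-8859-1 strings can occasionally also be valid UTF-8, so a fallback like this is a heuristic, not a guarantee.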


> - On a lower level, it's wrong to talk about character *set* if you mean
>  the encoding, and if you mean it in that sense, it's irrelevant if
>  you put that near the grammar because ISO-8859-1 can contain any bytes
>  whatever, which means it's difficult to check this.

Yes, I always get mixed up on this, thanks.


--
Mark Nottingham   http://www.mnot.net/

Received on Thursday, 10 February 2011 01:31:20 UTC