Re: Content-Location: character set from Martin J. Dürst on 2011-02-10 (ietf-http-wg@w3.org from January to March 2011)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Thu, 10 Feb 2011 14:01:06 +0900
To: Mark Nottingham <mnot@mnot.net>
CC: httpbis Group <ietf-http-wg@w3.org>, "Julian F. Reschke" <julian.reschke@gmx.de>
Message-ID: <4D537112.7070205@it.aoyama.ac.jp>

Hello Mark,

On 2011/02/10 10:30, Mark Nottingham wrote:
>
> On 10/02/2011, at 12:19 PM, Martin J. Dürst wrote:
>
>> Hello Mark,
>>
>> On 2011/02/10 9:49, Mark Nottingham wrote:
>>>
>>> I think we should add an explicit statement to the specification regarding the character set; e.g.,
>>>
>>> """
>>> Note that field-values containing characters outside of the ISO-8859-1 character set [ref] are invalid.
>>> """
>>>
>>> Probably near the grammar.
>>>
>>> Thoughts?
>>
>> I don't see the point of this, on two levels:
>>
>> - On a higher level, it's hopelessly outdated and against long-standing
>>   IETF policy (UTF-8), and useless in wide parts of the world.
>
> This is re-hashing a very old argument.

I know. Sorry, just couldn't stop. (I started over 10 years ago.)

> The current consensus in the WG is on this text in p1:
>
> """
> Historically, HTTP has allowed field content with text in the ISO-8859-1 [ISO-8859-1] character encoding and supported other character sets only through use of [RFC2047] encoding.  In practice, most HTTP header field values use only a subset of the US-ASCII character encoding [USASCII].  Newly defined header fields SHOULD limit their field values to US-ASCII characters.  Recipients SHOULD treat other (obs-text) octets in field content as opaque data.
> """
>
> I.e., the door is left open for adventurous new headers to be defined in UTF-8.

That's good to know.

> However, we're not talking about a new header here; we're talking about an established header with several implementations.

Of course.

> Breaking previously correct implementations with such a change violates our charter.
>
> I suggest calling such headers explicitly invalid so that receivers who choose to implement error handling (as per the previous thread I started recently) have a "hook" to do so; i.e., if a string fails to decode as 8859-1, they can implement error handling to try it as UTF-8.

Well, as I wrote earlier below, ISO-8859-1 can contain any bytes 
whatever, in whatever order (because all characters are one byte). 
Although not strictly part of the definitions, the bytes in the range 
0x80-0x9F map to the C1 controls in the transcoding implementations I 
know. So detecting 'invalid ISO-8859-1' is pretty much a non-starter. 
(It's much different when you start with UTF-8, because UTF-8 requires 
very particular byte sequences and therefore is very easy to 
pick out and distinguish from other encodings.)
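
To illustrate the asymmetry, here is a minimal Python sketch. It assumes
Python's built-in "iso-8859-1" and "utf-8" codecs; the helper name
is_valid_utf8 is made up for this example and is not from any
specification.

```python
# Every possible byte value decodes under ISO-8859-1, so a decoder
# for it can never report an error.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")

# In Python's codec, bytes 0x80-0x9F come out as the C1 controls
# U+0080-U+009F.
c1 = bytes(range(0x80, 0xA0)).decode("iso-8859-1")


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data is a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1_bytes = "Dürst".encode("iso-8859-1")  # 'ü' is the single byte 0xFC
utf8_bytes = "Dürst".encode("utf-8")         # 'ü' is the two bytes 0xC3 0xBC

# UTF-8 validation rejects the ISO-8859-1 bytes ...
latin1_looks_like_utf8 = is_valid_utf8(latin1_bytes)
# ... but ISO-8859-1 silently accepts the UTF-8 bytes and produces mojibake.
mojibake = utf8_bytes.decode("iso-8859-1")
```

The last two lines are the point: the UTF-8 check gives a usable signal,
while the ISO-8859-1 decode succeeds on anything and so gives none.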

So it's easy and highly reliable to try with UTF-8 and fall back to 
ISO-8859-1 or whatever, but it's very difficult to try with ISO-8859-1 
and fall back to UTF-8; even if you exclude 0x80-0x9F, you get a lot 
more false classifications.
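
As a rough sketch of the "try UTF-8 first" order, assuming Python and a
made-up function name decode_field_value (no HTTP library is implied):

```python
from typing import Tuple


def decode_field_value(raw: bytes) -> Tuple[str, str]:
    """Hypothetical helper: decode raw field-value octets.

    Tries strict UTF-8 first and falls back to ISO-8859-1 only when the
    octets are not well-formed UTF-8. Returns (text, encoding_used).
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 accepts every byte, so this branch cannot fail.
        return raw.decode("iso-8859-1"), "iso-8859-1"


from_utf8 = decode_field_value("Dürst".encode("utf-8"))
from_latin1 = decode_field_value("Dürst".encode("iso-8859-1"))
from_ascii = decode_field_value(b"index.html")
```

Pure US-ASCII values are reported as "utf-8" here, which is harmless
because both encodings agree on that range. Reversing the order would
make the fallback unreachable, since the ISO-8859-1 decode never fails.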

> It's not particularly elegant to do it this way, but it is workable given the constraints we have.
>
>
>> - On a lower level, it's wrong to talk about character *set* if you mean
>>   the encoding, and if you mean it in that sense, it's irrelevant if
>>   you put that near the grammar because ISO-8859-1 can contain any bytes
>>   whatever, which means it's difficult to check this.
>
> Yes, I always get mixed up on this, thanks.

I think you mean the "character set" vs. "character encoding" thing.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp

Received on Thursday, 10 February 2011 18:12:54 UTC