- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Thu, 10 Feb 2011 14:01:06 +0900
- To: Mark Nottingham <mnot@mnot.net>
- CC: httpbis Group <ietf-http-wg@w3.org>, "Julian F. Reschke" <julian.reschke@gmx.de>
Hello Mark,

On 2011/02/10 10:30, Mark Nottingham wrote:
>
> On 10/02/2011, at 12:19 PM, Martin J. Dürst wrote:
>
>> Hello Mark,
>>
>> On 2011/02/10 9:49, Mark Nottingham wrote:
>>>
>>> I think we should add an explicit statement to the specification regarding the character set; e.g.,
>>>
>>> """
>>> Note that field-values containing characters outside of the ISO-8859-1 character set [ref] are invalid.
>>> """
>>>
>>> Probably near the grammar.
>>>
>>> Thoughts?
>>
>> I don't see the point of this, on two levels:
>>
>> - On a higher level, it's hopelessly outdated and against long-standing
>>   IETF policy (UTF-8), and useless in wide parts of the world.
>
> This is re-hashing a very old argument.

I know. Sorry, just couldn't stop. (I started over 10 years ago.)

> The current consensus in the WG is on this text in p1:
>
> """
> Historically, HTTP has allowed field content with text in the ISO-8859-1 [ISO-8859-1] character encoding and supported other character sets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII character encoding [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII characters. Recipients SHOULD treat other (obs-text) octets in field content as opaque data.
> """
>
> I.e., the door is left open for adventurous new headers to be defined in UTF-8.

That's good to know.

> However, we're not talking about a new header here; we're talking about an established header with several implementations.

Of course.

> Breaking previously correct implementations with such a change violates our charter.
>
> I suggest calling such headers explicitly invalid so that receivers who choose to implement error handling (as per the previous thread I started recently) have a "hook" to do so; i.e., if a string fails to decode as 8859-1, they can implement error handling to try it as UTF-8.

Well, as I wrote earlier below, ISO-8859-1 can contain any bytes whatever, in whatever order (because all characters are one byte). Although not strictly part of the definition, the bytes in the range 0x80-0x9F map to the C1 controls in the transcoding implementations I know. So detecting 'invalid ISO-8859-1' is pretty much a non-starter.

(It's much different when you start with UTF-8, because UTF-8 requires a very particular kind of byte sequence and is therefore very easy to pick out and distinguish from other encodings. So it's easy and highly reliable to try UTF-8 and fall back to ISO-8859-1 or whatever, but it's very difficult to try ISO-8859-1 and fall back to UTF-8; even if you exclude 0x80-0x9F, you get a lot more false classifications.)

> It's not particularly elegant to do it this way, but it is workable given the constraints we have.
>
>
>> - On a lower level, it's wrong to talk about character *set* if you mean
>>   the encoding, and if you mean it in that sense, it's irrelevant if
>>   you put that near the grammar because ISO-8859-1 can contain any bytes
>>   whatever, which means it's difficult to check this.
>
> Yes, I always get mixed up on this, thanks.

I think you mean the "character set" vs. "character encoding" thing.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
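A minimal sketch of the decode order Martin describes (try UTF-8 first, since invalid UTF-8 is easy to detect, and fall back to ISO-8859-1, which accepts every byte value and so never fails). This is not from the message itself; the function name is hypothetical and the snippet assumes the receiver has the raw header bytes in hand:

```python
def decode_field_value(raw: bytes) -> str:
    """Decode an HTTP field value, preferring UTF-8 over ISO-8859-1."""
    try:
        # UTF-8 requires very particular byte sequences, so a failed
        # decode reliably signals that the value is not UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte 0x00-0xFF to a code point of the
        # same value, so this fallback always succeeds.
        return raw.decode("iso-8859-1")
```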
Received on Thursday, 10 February 2011 18:12:54 UTC