- From: Mark Nottingham <mnot@mnot.net>
- Date: Thu, 10 Feb 2011 12:30:47 +1100
- To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Cc: httpbis Group <ietf-http-wg@w3.org>, "Julian F. Reschke" <julian.reschke@gmx.de>
On 10/02/2011, at 12:19 PM, Martin J. Dürst wrote:

> Hello Mark,
>
> On 2011/02/10 9:49, Mark Nottingham wrote:
>>
>> I think we should add an explicit statement to the specification regarding the character set; e.g.,
>>
>> """
>> Note that field-values containing characters outside of the ISO-8859-1 character set [ref] are invalid.
>> """
>>
>> Probably near the grammar.
>>
>> Thoughts?
>
> I don't see the point of this, on two levels:
>
> - On a higher level, it's hopelessly outdated and against long-standing
>   IETF policy (UTF-8), and useless in wide parts of the world.

This is re-hashing a very old argument. The current consensus in the WG is on this text in p1:

"""
Historically, HTTP has allowed field content with text in the ISO-8859-1 [ISO-8859-1] character encoding and supported other character sets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII character encoding [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII characters. Recipients SHOULD treat other (obs-text) octets in field content as opaque data.
"""

I.e., the door is left open for adventurous new headers to be defined in UTF-8.

However, we're not talking about a new header here; we're talking about an established header with several implementations. Breaking previously correct implementations with such a change violates our charter.

I suggest calling such headers explicitly invalid so that receivers who choose to implement error handling (as per the previous thread I started recently) have a "hook" to do so; i.e., if a string fails to decode as 8859-1, they can implement error handling to try it as UTF-8. It's not particularly elegant to do it this way, but it is workable given the constraints we have.
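As a rough illustration of the kind of receiver-side error handling described above, here is a minimal Python sketch. It is not taken from the specification or from any implementation discussed in this thread. The function name `decode_field_value` is hypothetical, and so is the ordering of the checks: Python's `iso-8859-1` codec maps all 256 byte values, so a decode with it never fails, and this sketch therefore tests for strict UTF-8 validity first and treats ISO-8859-1 as the fallback.

```python
from typing import Tuple


def decode_field_value(raw: bytes) -> Tuple[str, str]:
    """Hypothetical helper: return (text, encoding_used) for field-value octets.

    Assumption of this sketch (not stated in the thread): UTF-8 is tried
    before ISO-8859-1, because an ISO-8859-1 decode cannot fail in Python.
    """
    # Pure US-ASCII is unambiguous and is what most header fields use.
    if all(b < 0x80 for b in raw):
        return raw.decode("ascii"), "us-ascii"
    try:
        # Strict decoding rejects octet sequences that are not valid UTF-8.
        return raw.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # Every byte value maps to a code point here, so this always succeeds.
        return raw.decode("iso-8859-1"), "iso-8859-1"


if __name__ == "__main__":
    print(decode_field_value(b"plain"))
    print(decode_field_value(b"na\xc3\xafve"))  # UTF-8 octets for "naïve"
    print(decode_field_value(b"na\xefve"))      # ISO-8859-1 octets for "naïve"
```

A heuristic like this can misjudge ISO-8859-1 text whose octets happen to form valid UTF-8, which is one reason the approach is workable rather than elegant.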
> - On a lower level, it's wrong to talk about character *set* if you mean
>   the encoding, and if you mean it in that sense, it's irrelevant if
>   you put that near the grammar because ISO-8859-1 can contain any bytes
>   whatever, which means it's difficult to check this.

Yes, I always get mixed up on this, thanks.

--
Mark Nottingham   http://www.mnot.net/
Received on Thursday, 10 February 2011 01:31:20 UTC