- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Mon, 20 Aug 2007 16:54:20 +0900
- To: Mark Nottingham <mnot@mnot.net>
- Cc: Julian Reschke <julian.reschke@gmx.de>, Paul Hoffman <phoffman@imc.org>, Apps Discuss <discuss@apps.ietf.org>, "ietf-http-wg@w3.org Group" <ietf-http-wg@w3.org>, "Richard Ishida" <ishida@w3.org>, Felix Sasaki <fsasaki@w3.org>
Hello Mark,

Thanks for giving this an issue number.

At 12:40 07/08/20, Mark Nottingham wrote:
>On 10/06/2007, at 6:05 PM, Martin Duerst wrote:
>> - RFC 2616 prescribes that headers containing non-ASCII have to use
>>   either iso-8859-1 or RFC 2047. This is unnecessarily complex and
>>   not necessarily followed. At the least, new extensions should be
>>   allowed to specify that UTF-8 is used.
>
>My .02;
>
>I'm concerned about allowing UTF-8; it may break existing
>implementations.
>
>I'd like to see the text just require that the actual character set
>be 8859-1, but to allow individual extensions to nominate encodings
>*like* 2047, without being restricted to it.

What do you mean by "encodings *like* 2047"?

And why do you think that UTF-8 may break existing implementations?
UTF-8 has virtually the same footprint in terms of bytes as
ISO-8859-1: all bytes above 0x7F may be used. Implementations that
have to deal with ISO-8859-1 usually do this by just being
8-bit-transparent; that works for UTF-8, too.

If your opinion is that UTF-8 cannot be allowed at all, then that's
going to be a problem for cases where it's already in use; see e.g.
earlier posts on the http list. It's easy to say "may break existing
implementations", but in over 10 years of being involved in the Web,
I haven't heard about that happening. If you or anybody else have,
please speak up.

>For example, the encoding
>specified in 3987 is appropriate for URIs.

As one of the authors of RFC 3987, I know what you mean, but "the
encoding specified in 3987" wouldn't be enough for a spec. Also, it's
not really very well suited to the job, because %hh-encoding is used
to escape any bytes, not only UTF-8, and there is a considerable
length increase for some scripts.

>However, it *has* to be
>explicit; I've heard some people read this requirement and think that
>they need to check *every* header for 2047 encoding.

I have read it that way, too. If it can be safely argued that it was
never intended that way, and that no harm is produced if this is
restricted, then I'd be fine with restricting it, because checking
everything for 2047 is indeed tough. But I'd really like to make sure
that this isn't creating problems (i.e. that a careful examination of
the various headers is done, with some reasonably conservative
assumptions).

>So, I think this means;
>
>1) Change
>   "Words of *TEXT MAY contain characters from character sets other
>than ISO-8859-1 [22] only when encoded according to the rules of RFC
>2047 [14]."
>to
>   "Words of *TEXT MUST NOT contain characters from character sets
>other than ISO-8859-1 [22]."
>and,
>
>2) Identify headers that may have non-8859

There are many parts to ISO-8859, not just ISO-8859-1.

>content and explicitly say
>how to encode them (IRI, 2047, whatever; the existing ones will have
>to be 2047, I believe), modifying their BNF to suit.
>
>3) When we document extensibility, require new headers to nominate
>any encoding explicitly.

If that includes UTF-8, I'd be fine with it. If it excludes UTF-8,
I think that would be a problem.

Regards,    Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
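
P.S.: A minimal sketch (Python 3, purely illustrative; the sample
strings are my own) of the byte-footprint point and of the length
increase that %hh-encoding of UTF-8 causes for some scripts:

    # ISO-8859-1 and UTF-8 both only introduce bytes above 0x7F for
    # non-ASCII characters, so an 8-bit-transparent header parser
    # passes either through unchanged.
    s = "Dürst"
    print(s.encode("iso-8859-1"))   # b'D\xfcrst'      - one byte > 0x7F
    print(s.encode("utf-8"))        # b'D\xc3\xbcrst'  - two bytes > 0x7F

    # %hh-escaping the UTF-8 bytes (as the IRI-to-URI mapping does)
    # roughly triples the length for scripts such as Japanese.
    from urllib.parse import quote
    jp = "あいう"
    print(len(jp.encode("utf-8")))     # 9 raw UTF-8 bytes
    print(quote(jp), len(quote(jp)))   # %E3%81%82%E3%81%84%E3%81%86  27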
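
And, for completeness, a similar sketch of what an RFC 2047
encoded-word in a header value looks like, using Python's standard
email.header module (again only an illustration, not proposed spec
text), which also shows the kind of decoding a receiver would have to
run over every *TEXT field if it read the current requirement
literally:

    from email.header import Header, decode_header

    # Encoding a non-ASCII value as an RFC 2047 encoded-word:
    h = Header("Dürst", charset="iso-8859-1")
    print(h.encode())                 # =?iso-8859-1?q?D=FCrst?=

    # Decoding it back on the receiving side:
    print(decode_header("=?iso-8859-1?q?D=FCrst?="))
    # [(b'D\xfcrst', 'iso-8859-1')]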
Received on Monday, 20 August 2007 07:56:15 UTC