- From: Mark Nottingham <mnot@mnot.net>
- Date: Mon, 20 Aug 2007 20:03:26 +1000
- To: Martin Duerst <duerst@it.aoyama.ac.jp>
- Cc: Richard Ishida <ishida@w3.org>, Apps Discuss <discuss@apps.ietf.org>, Felix Sasaki <fsasaki@w3.org>, "ietf-http-wg@w3.org Group" <ietf-http-wg@w3.org>, Paul Hoffman <phoffman@imc.org>
On 20/08/2007, at 5:54 PM, Martin Duerst wrote:

> What do you mean by "encodings *like* 2047"? And why do you think
> that UTF-8 may break existing implementations?

See previous message.

> UTF-8 has virtually
> the same footprint in terms of bytes as ISO-8859-1: All bytes
> above 0x7F may be used. Implementations that have to deal with
> ISO-8859-1 usually do this by just being 8-bit-transparent;
> that works for UTF-8, too.

If utf-8 is a subset of iso-8859-1, it would work; but I don't think that's the case (not that I'm an expert in this area, by any means).

> If your opinion is that UTF-8 cannot be allowed at all, then
> that's going to be a problem for cases where it's already in
> use, see e.g. earlier posts in the http list.

Sorry, I looked and didn't find anything. Can you give references?

> It's easy to say "may break existing implementations",
> but in over 10 years of being involved in the Web, I haven't
> heard about that happening. If you or anybody have, please
> speak up.

I'm wary of changing fundamental things in HTTP like the charset for headers because nobody's heard of it breaking things. My experience has been that the people who show up in this forum and the others that we circulate in are the tip of a much larger iceberg of implementations, and we make such changes at our peril.

In particular, performance-sensitive implementations like proxies often rely on particular data structures that are mandated by the specs for optimisation. As you point out, they'll probably store them as bytes, but they may not. Client and server APIs have been built that make (again, reasonable) assumptions about headers' character sets. In both cases, changing the spec may break those implementations. Yes, it's just "may", but the burden of proof should be on those who want to make the change.

To that end, if UTF-8 were widely used in HTTP headers already, I wouldn't be so uncomfortable, but AFAIK they're few and far between.
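To make the subset-versus-transparency question concrete, here is a minimal Python sketch. The sample string and variable names are assumptions chosen for illustration and are not taken from any real implementation. It suggests that UTF-8 is not a subset of ISO-8859-1 (the same text yields different bytes), while a hop that only passes bytes through leaves UTF-8 intact; an API that interprets those bytes as ISO-8859-1 text, however, would see different characters.

```python
# Illustrative sketch only; the sample text is an assumption.
text = "café"

utf8_bytes = text.encode("utf-8")          # 'é' becomes two bytes: 0xC3 0xA9
latin1_bytes = text.encode("iso-8859-1")   # 'é' is the single byte 0xE9

# Same text, different byte sequences: UTF-8 is not a subset of ISO-8859-1.
same_encoding = utf8_bytes == latin1_bytes

# Model an 8-bit-transparent hop that nominally treats header bytes as
# ISO-8859-1. Every byte 0x00-0xFF maps to exactly one code point, so the
# decode/encode round trip is lossless at the byte level.
seen_as_latin1 = utf8_bytes.decode("iso-8859-1")
forwarded = seen_as_latin1.encode("iso-8859-1")

print(same_encoding)            # False
print(forwarded == utf8_bytes)  # True
print(seen_as_latin1)           # cafÃ©
```

The last line is the API concern in miniature: the bytes survive, but code that presents them as ISO-8859-1 text shows something other than what the sender meant.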
>> For example, the encoding
>> specified in 3987 is appropriate for URIs.
>
> As one of the authors of RFC 3987, I know what you mean, but
> "the encoding specified in 3987" wouldn't be enough for a spec.
> Also, it's not really very well suited to the job, because
> %hh-encoding is used to escape any bytes, not only UTF-8,
> and there is a considerable length increase for some scripts.

I know. I didn't say it was a great solution, just one that works.

>> However, it *has* to be
>> explicit; I've heard some people read this requirement and think that
>> they need to check *every* header for 2047 encoding.
>
> I have read it that way, too. If it can be safely argued that
> it was never intended that way, and that no harm is produced
> if this is restricted, then I'd be fine with restricting it,
> because checking everything for 2047 is indeed tough, but
> I'd really like to make sure that this isn't creating problems.
> (i.e. that a careful examination of the various headers makes
> some reasonably conservative assumptions).

I think that's a great first step.

Taking a glance at the header registry <http://www.iana.org/assignments/message-headers/perm-headers.html>, there are about 30 headers that might conceivably need non-ASCII content. Of those, the vast majority have URIs as a payload, which means that if you want to change the encoding, you'll need to change their syntax pretty fundamentally -- an incompatible change.
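The length increase of %hh-encoding mentioned earlier can be estimated with a short Python sketch. The helper name `sizes` and the sample strings are assumptions for illustration; `urllib.parse.quote` is used here as a stand-in for percent-encoding UTF-8 bytes and is not a full RFC 3987 IRI-to-URI mapping.

```python
from urllib.parse import quote


def sizes(text: str) -> tuple[int, int]:
    """Return (raw UTF-8 byte count, percent-encoded length) for text."""
    raw = text.encode("utf-8")
    # safe="" escapes everything outside the unreserved set, so each
    # non-ASCII byte becomes a three-character %hh sequence.
    escaped = quote(text, safe="")
    return len(raw), len(escaped)


# Hypothetical sample values: plain ASCII, accented Latin, and Japanese.
for sample in ["resume", "résumé", "日本語"]:
    raw_len, escaped_len = sizes(sample)
    print(sample, raw_len, escaped_len)
```

Under these assumptions, ASCII text does not grow, while text whose characters all encode to multi-byte UTF-8 sequences roughly triples in size.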
Of the remaining ones, I see (with ones that may IMO be a problem marked with a *):

* WWW-Authenticate, Proxy-Authenticate -- as discussed elsewhere
* Authorization, Proxy-Authorization -- as discussed elsewhere
* Content-Disposition -- can contain a filename
- Cookie, Cookie2, Set-Cookie -- can contain user data, but it's opaque, so servers can choose their own encoding strategy
- Server, Via, User-Agent -- not usually presented to users
- Warning -- can contain warning text, not usually presented to users
- From -- an e-mail address (and possibly a candidate for deprecation)

My earlier proposal would have the effect of clarifying the use of RFC2047 encoding in those marked with "*" above. It sounds like you want to change the syntax of those headers to allow UTF-8, no?

If so, the problem that I have with that is that no existing conformance-minded implementer could have looked at RFC2616 and said to themselves, "oh, I should use/expect UTF-8 there," while such a statement could quite easily have been made about RFC2047.

Now, if everybody has gone off and used UTF-8 despite the RFC, I could see an argument for changing the spec to match reality, but I don't think that's the case here. If there were a huge win here -- e.g., lots of improvements in i18n and efficiency -- again, it might be persuasive. But, looking at the list above, the benefits of such a change don't seem overwhelming. Again, I'm not an i18n expert, so please educate me.

>> So, I think this means;
>>
>> 1) Change
>> "Words of *TEXT MAY contain characters from character sets other
>> than ISO-8859-1 [22] only when encoded according to the rules of RFC
>> 2047 [14]."
>> to
>> "Words of *TEXT MUST NOT contain characters from character sets
>> other than ISO-8859-1 [22]."
>> and,
>>
>> 2) Identify headers that may have non-8859
>
> There are many parts to ISO-8859, not just ISO-8859-1.

Sorry, just using shorthand.
>> content and explicitly say
>> how to encode them (IRI, 2047, whatever; the existing ones will have
>> to be 2047, I believe), modifying their BNF to suit.
>>
>> 3) When we document extensibility, require new headers to nominate
>> any encoding explicitly.
>
> If that includes UTF-8, I'd be fine with it. If it excludes
> UTF-8, I think that would be a problem.

In my view, the proxy case and the API case above both rule that out; we can't assume that deployed software is set up to produce, accept or process UTF-8.

> Regards, Martin.

Cheers,

--
Mark Nottingham http://www.mnot.net/
Received on Monday, 20 August 2007 10:03:55 UTC