- From: Martin J. Duerst <duerst@w3.org>
- Date: Fri, 24 Jul 1998 14:51:17 +0900
- To: erik@netscape.com
- Cc: Dan Kegel <dank@alumni.caltech.edu>, MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>, Harald Alvestrand <Harald.Alvestrand@maxware.no>, Chris Newman <Chris.Newman@INNOSOFT.COM>, ietf-charsets@ISI.EDU, murata@fxis.fujixerox.co.jp, Tatsuo_Kobayashi@justsystem.co.jp
At 08:33 98/05/31 -0700, Erik van der Poel wrote:
> Dan Kegel wrote:
> >
> > In the case of HTTP headers, we can probably consider the
> > entire HTTP header stream as a single message, and only require
> > the BOM at the beginning of the stream, e.g. the client and server
> > would each send the BOM as the first two bytes after opening the
> > socket.
>
> No, HTTP headers are always encoded with one octet per character, even
> if the body is UCS-2 or UCS-4 (or UTF-16). You would have
> interoperability problems if you tried to send the headers themselves in
> UTF-16. A client could only send UTF-16 headers if it knew beforehand
> that the server could deal with it.

This is not exactly true. HTTP 1.1, for a very rare case (warnings), allows MIME-encoded headers (the (in)famous =? ? ? ?= syntax). Other protocols, in particular email, allow this, too.

I don't think that we should worry about the general problem of what a hypothetical new protocol will do with its headers and other protocol elements. The correct way to design such a protocol is to take only one, UCS-based, character encoding. The "charset" parameter and the MIME tag "UTF-16" then become irrelevant, even if the protocol should choose to use UTF-16. It will be the protocol designers' business to make sure they get around the big/little-endian issue, and we have to hope that they do so based on past experience.

I also don't think we should worry about UTF-16 being used raw in the headers of traditional protocols. UTF-8 provides a much easier upgrade path for this case, and doesn't have endian problems.

What I think we should worry about is whether and how UTF-16 should be used in traditional protocol headers, based on MIME encoded words. Several solutions are possible:

- Discourage or disallow UTF-16 in such headers (there are other cases, in particular Korean email, where the encoding used in the header differs from the one used in the body).

- Use a different specification for these headers (headers would probably be in big-endian without a BOM, and nothing else; bodies could tolerate little-endian and/or recommend/mandate the BOM). The difference is justified because headers need additional encoding/decoding anyway, and the user expectations for their legibility are somewhat lower.

- Use exactly the same specifications for both headers and bodies.

Regards, Martin.
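As an illustration of the header options listed in the message above, here is a minimal Python sketch of a MIME encoded word carrying UTF-8 versus big-endian UTF-16 without a BOM. The helper names `encode_word` and `decode_word` are hypothetical, the use of the "B" (base64) encoding and of the label "UTF-16" for BOM-less big-endian data are assumptions made for the example, and nothing here is taken from a specification discussed in the thread.

```python
import base64


def encode_word(text, charset_label, codec):
    """Build a 'B' (base64) encoded word: =?charset?B?payload?=

    charset_label is what goes on the wire; codec is the Python codec
    used to produce the bytes. Both are passed in so that the sketch can
    show BOM-less big-endian data under the label "UTF-16" (an assumption).
    """
    payload = base64.b64encode(text.encode(codec)).decode("ascii")
    return "=?%s?B?%s?=" % (charset_label, payload)


def decode_word(word, codec):
    """Decode a 'B' encoded word, trusting the caller's choice of codec."""
    # "=?charset?B?payload?=" splits into exactly five parts on "?",
    # because "?" is not part of the base64 alphabet.
    _, _charset_label, encoding, payload, _ = word.split("?")
    if encoding.upper() != "B":
        raise ValueError("this sketch only handles the B encoding")
    return base64.b64decode(payload).decode(codec)


sample = "\u65e5\u672c"  # two CJK characters, used only as sample data

# UTF-8: no byte-order question arises.
utf8_word = encode_word(sample, "UTF-8", "utf-8")

# Second option from the list: big-endian, no BOM, nothing else.
be_word = encode_word(sample, "UTF-16", "utf-16-be")

# What goes wrong if a sender uses little-endian and the receiver
# assumes big-endian: the bytes decode, but to the wrong characters.
misread = sample.encode("utf-16-le").decode("utf-16-be")

print(utf8_word)
print(be_word)
print(decode_word(be_word, "utf-16-be") == sample)
print(misread == sample)
```

The last two lines contrast the agreed big-endian case, which round-trips, with the mismatched byte order, which silently yields different characters rather than an error. That silent failure is the reason a header rule would have to fix the byte order or require a BOM.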
Received on Thursday, 23 July 1998 23:33:01 UTC