- From: Larry Masinter <masinter@parc.xerox.com>
- Date: Fri, 24 Jul 1998 06:32:38 -0700 (PDT)
- To: "Martin J. Duerst" <duerst@w3.org>, erik@netscape.com
- Cc: Dan Kegel <dank@alumni.caltech.edu>, MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>, Harald Alvestrand <Harald.Alvestrand@maxware.no>, Chris Newman <Chris.Newman@INNOSOFT.COM>, ietf-charsets@ISI.EDU, murata@fxis.fujixerox.co.jp, Tatsuo_Kobayashi@justsystem.co.jp
I think we're getting into trouble in this case because we're trying
to examine all of the possible senders and receivers of UTF-16 and
then define when they should or shouldn't include a BOM.

However, if you had a registered charset, call it "marked-utf-16",
with the definition:

    Either big-endian UTF-16, or a single BOM followed by
    little-endian UTF-16

then it would seem to be clear what a sender should send and what a
receiver should receive, without all of this complex case analysis.

The problem is that we let the "utf-16" charset be registered with
ambiguous semantics; if you don't feel confident enough to patch it,
then register something else.

Larry
--
http://www.parc.xerox.com/masinter

> -----Original Message-----
> From: Martin J. Duerst [mailto:duerst@w3.org]
> Sent: Thursday, July 23, 1998 10:51 PM
> To: erik@netscape.com
> Cc: Dan Kegel; MURATA Makoto; Harald Alvestrand; Chris Newman;
> ietf-charsets@ISI.EDU; murata@fxis.fujixerox.co.jp;
> Tatsuo_Kobayashi@justsystem.co.jp
> Subject: Re: Revised proposal for UTF-16
>
>
> At 08:33 98/05/31 -0700, Erik van der Poel wrote:
> > Dan Kegel wrote:
> > >
> > > In the case of HTTP headers, we can probably consider the
> > > entire HTTP header stream as a single message, and only require
> > > the BOM at the beginning of the stream, e.g. the client and server
> > > would each send the BOM as the first two bytes after opening the
> > > socket.
> >
> > No, HTTP headers are always encoded with one octet per character,
> > even if the body is UCS-2 or UCS-4 (or UTF-16). You would have
> > interoperability problems if you tried to send the headers
> > themselves in UTF-16. A client could only send UTF-16 headers if it
> > knew beforehand that the server could deal with them.
>
> This is not exactly true. HTTP 1.1 allows MIME-encoded headers (the
> (in)famous =? ? ? ?= syntax) in one very rare case (warnings). Other
> protocols, in particular email, allow this too.
>
> I don't think that we should worry about the general problem of what
> a hypothetical new protocol will do with its headers and other
> protocol elements. The correct way to design such a protocol is to
> take only one, UCS-based, character encoding. The "charset" parameter
> and the MIME tag "UTF-16" then become irrelevant, even if the
> protocol should choose to use UTF-16. It will be the protocol's
> business to make sure it gets around the big/little-endian issue, and
> we have to hope that it does so based on past experience.
>
> I also don't think we should worry about UTF-16 being used raw in the
> headers of traditional protocols. UTF-8 provides a much easier
> upgrade path for this case, and doesn't have endianness problems.
>
> What I think we should worry about is whether and how UTF-16 should
> be used in traditional protocol headers, based on MIME encoded words.
> Several solutions are possible:
>
> - Discourage or disallow UTF-16 in such headers (there are other
>   cases, in particular Korean email, where there are differences
>   between the encoding used in the header and in the body).
>
> - Use a different specification for these headers (headers would
>   probably be big-endian without a BOM, and nothing else; bodies
>   could tolerate little-endian and/or recommend/mandate the BOM).
>   The difference is justified because headers need additional
>   encoding/decoding anyway, and the user expectations for their
>   legibility are somewhat lower.
>
> - Use exactly the same specifications for both headers and bodies.
>
>
> Regards,   Martin.
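For concreteness, here is a minimal sketch of how a receiver might
decode the proposed "marked-utf-16" charset under the definition
above. The charset name and this decoder are hypothetical; nothing of
the sort is registered:

    # Hypothetical decoder for the proposed "marked-utf-16" charset:
    # a stream is either big-endian UTF-16 (no byte-order mark), or a
    # single BOM (octets 0xFF 0xFE) followed by little-endian UTF-16.
    def decode_marked_utf16(octets: bytes) -> str:
        if octets[:2] == b'\xff\xfe':
            # Little-endian BOM present: strip it, decode the rest as LE.
            return octets[2:].decode('utf-16-le')
        # No BOM: by definition the stream is big-endian. A leading
        # 0xFE 0xFF simply decodes as U+FEFF and needs no special case.
        return octets.decode('utf-16-be')

    assert decode_marked_utf16(b'\x00A\x00B') == 'AB'          # BE, no BOM
    assert decode_marked_utf16(b'\xff\xfeA\x00B\x00') == 'AB'  # BOM + LE

Note that the sender's side needs no case analysis at all: it either
emits big-endian octets directly, or emits 0xFF 0xFE and then
little-endian octets.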
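As an illustration of the MIME encoded-word mechanism Martin refers
to (the =? ? ? ?= syntax of RFC 2047, "B" encoding = base64 over the
charset's octets), here is a sketch of what a UTF-16 header word
could look like if the big-endian-without-BOM option for headers were
adopted. Whether "UTF-16" is a legitimate charset label in an encoded
word is exactly the open question here, so treat this as hypothetical:

    import base64

    # Hypothetical RFC 2047 encoded word using big-endian UTF-16
    # without a BOM, per one of the header options discussed above.
    def encode_word_utf16be(text: str) -> str:
        octets = text.encode('utf-16-be')
        return '=?UTF-16?B?%s?=' % base64.b64encode(octets).decode('ascii')

    print(encode_word_utf16be('Hi'))   # =?UTF-16?B?AEgAaQ==?=

Because the charset octets are base64-armored, the endianness and BOM
questions for headers are independent of whatever is decided for raw
UTF-16 bodies, which is what makes a separate header specification
plausible.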
Received on Friday, 24 July 1998 06:35:13 UTC