RE: Revised proposal for UTF-16

I think we're getting into trouble in this case because we're trying
to examine all of the possible senders and receivers of UTF-16 and then
define when they should or shouldn't include a BOM. However, if you
had a registered charset, call it "marked-utf-16", with the definition:

     Either big-endian UTF-16
     or a single BOM followed by little-endian UTF-16

then it would seem to be clear what a sender should send and what
a receiver should receive, without all of this complex case analysis.
The problem is that we let the "utf-16" charset be registered with
ambiguous semantics; if you don't feel confident enough to patch it,
then register something else.
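
A minimal sketch of what a receiver of such a charset could do
(Python; the name "decode_marked_utf16" is mine, not part of any
registration):

    import codecs

    def decode_marked_utf16(data: bytes) -> str:
        # Per the definition above: a little-endian BOM (FF FE)
        # marks the one allowed little-endian form; anything else
        # is read as big-endian UTF-16.
        if data[:2] == codecs.BOM_UTF16_LE:
            return data[2:].decode('utf-16-le')
        return data.decode('utf-16-be')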

Larry
--
http://www.parc.xerox.com/masinter
 

> -----Original Message-----
> From: Martin J. Duerst [mailto:duerst@w3.org]
> Sent: Thursday, July 23, 1998 10:51 PM
> To: erik@netscape.com
> Cc: Dan Kegel; MURATA Makoto; Harald Alvestrand; Chris Newman;
> ietf-charsets@ISI.EDU; murata@fxis.fujixerox.co.jp;
> Tatsuo_Kobayashi@justsystem.co.jp
> Subject: Re: Revised proposal for UTF-16
> 
> 
> At 08:33 98/05/31 -0700, Erik van der Poel wrote:
> > Dan Kegel wrote:
> > > 
> > > In the case of HTTP headers, we can probably consider the
> > > entire HTTP header stream as a single message, and only require
> > > the BOM at the beginning of the stream, e.g. the client and server
> > > would each send the BOM as the first two bytes after opening the
> > > socket.
> > 
> > No, HTTP headers are always encoded with one octet per character, even
> > if the body is UCS-2 or UCS-4 (or UTF-16). You would have
> > interoperability problems if you tried to send the headers themselves in
> > UTF-16. A client could only send UTF-16 headers if it knew beforehand
> > that the server could deal with them.
> 
> This is not exactly true. For one very rare case (the Warning header),
> HTTP 1.1 allows MIME encoded words (the (in)famous
> =?charset?encoding?text?= syntax) in headers. Other protocols, in
> particular email, allow this, too.
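> 
> For illustration only, an encoded word carrying UTF-16 text could be
> built by hand like this (a Python sketch; the "UTF-16BE" charset
> label is an assumption on my part, since what exactly to register is
> the open question):
> 
>     import base64
> 
>     def encoded_word_utf16be(text: str) -> str:
>         # RFC 2047 encoded word: =?charset?encoding?encoded-text?=
>         payload = base64.b64encode(text.encode('utf-16-be'))
>         return '=?UTF-16BE?B?' + payload.decode('ascii') + '?='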
> 
> I don't think that we should worry about the general problem of what a
> hypothetical new protocol will do with its headers and other protocol
> elements. The correct way to design such a protocol is to use only
> one, UCS-based, character encoding. The "charset" parameter and the
> MIME tag "UTF-16" then become irrelevant, even if the protocol should
> choose to use UTF-16. It will be the protocol's business to make sure
> it gets around the big/little-endian issue, and we have to hope that
> it does so based on past experience.
> 
> I also don't think we should worry about UTF-16 being used raw in the
> headers of traditional protocols. UTF-8 provides a much easier upgrade
> path for this case, and doesn't have endian problems.
> 
> What I think we should worry about is whether and how UTF-16 should
> be used in traditional protocol headers, based on MIME encoded words.
> Several solutions are possible:
> 
> - Discourage or disallow UTF-16 in such headers (there are other
>   cases, in particular Korean email, where there are differences
>   between the encoding used in the header and in the body).
> 
> - Use a different specification for these headers (headers would
>   probably be big-endian without a BOM, and nothing else; bodies
>   could tolerate little-endian and/or recommend or mandate the
>   BOM); see the sketch after this list. The difference is justified
>   because headers need additional encoding/decoding anyway, and
>   the user expectations for their legibility are somewhat lower.
> 
> - Use exactly the same specifications for both headers and bodies.
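> 
> A sketch of what the second option's header/body asymmetry could
> look like (Python; the function names are mine, and the exact rules
> are what remains to be decided):
> 
>     import codecs
> 
>     def decode_utf16_header(data: bytes) -> str:
>         # Headers: big-endian only, never a BOM.
>         if data[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE):
>             raise ValueError('no BOM allowed in header text')
>         return data.decode('utf-16-be')
> 
>     def decode_utf16_body(data: bytes) -> str:
>         # Bodies: tolerate a BOM of either endianness; unmarked
>         # text is read as big-endian.
>         if data[:2] == codecs.BOM_UTF16_LE:
>             return data[2:].decode('utf-16-le')
>         if data[:2] == codecs.BOM_UTF16_BE:
>             return data[2:].decode('utf-16-be')
>         return data.decode('utf-16-be')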
> 
> 
> Regards,   Martin.
