Re: Revised proposal for UTF-16 from Harald Alvestrand on 1998-05-24 (ietf-charsets@w3.org from April to June 1998)

From: Harald Alvestrand <Harald.Alvestrand@maxware.no>
Date: Sun, 24 May 1998 23:35:14 +0200
To: Dan Kegel <dank@alumni.caltech.edu>, Chris Newman <Chris.Newman@INNOSOFT.COM>, "Martin J. Duerst" <duerst@w3.org>
Cc: MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>, ietf-charsets@ISI.EDU, murata@fxis.fujixerox.co.jp, Tatsuo_Kobayashi@justsystem.co.jp
Message-id: <3.0.2.32.19980524233514.0148d100@127.0.0.1>

At 13:56 24.05.98 -0700, Dan Kegel wrote:
>Perhaps a middle ground, here?  How about this (suitably reworded):
>   UTF-16 generators SHOULD [MUST?] NOT send in little-endian byte order, but
>   if they do, they MUST prefix the stream with a little-endian BOM.
>   UTF-16 consumers MUST assume the default byte-order is big-endian,
>   but MUST also accept little-endian if prefixed with a little-endian BOM.
>
>That way, big-endian is preferred, yet interoperability is preserved.

Hmmm.... everyone MUST do A, but if they don't, they MUST....

Suggested alternative:

 UTF-16 generators MUST send in big-endian byte order.

 NOTE: Some implementations that do not conform to this specification
 have occasionally sent data in little-endian byte order. When they do
 this, they commonly precede the data with a ZERO WIDTH NO-BREAK
 SPACE (0xFEFF, also called Byte Order Mark or BOM). Read in
 big-endian order, such a BOM appears as 0xFFFE.
 Thus, a UTF-16 parser encountering the code 0xFFFE as the first
 character of a purported UTF-16 stream may safely assume that it
 has encountered a nonconformant data source.

The info about what is right is there; the info about how to tell if
you encounter someone doing the Wrong Thing is there too.
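
For illustration only, here is a minimal Python sketch of the check the
NOTE describes. The function names sniff_utf16 and decode_utf16 are
hypothetical, and the decision to go ahead and decode a nonconformant
stream as little-endian is an assumption of the sketch; the suggested
wording only says how to recognize such a source, not what to do next.

```python
import codecs


def sniff_utf16(data: bytes) -> str:
    """Classify the first two bytes of a purported UTF-16 stream."""
    if data[:2] == codecs.BOM_UTF16_BE:
        # Bytes FE FF: the code 0xFEFF read in big-endian order.
        return "big-endian with BOM"
    if data[:2] == codecs.BOM_UTF16_LE:
        # Bytes FF FE: a big-endian reader sees the code 0xFFFE here,
        # which signals a little-endian (nonconformant) source.
        return "nonconformant little-endian"
    return "big-endian without BOM"


def decode_utf16(data: bytes) -> str:
    """Decode, defaulting to big-endian (assumed recovery policy)."""
    kind = sniff_utf16(data)
    if kind == "nonconformant little-endian":
        # Assumption: recover by skipping the BOM and byte-swapping.
        return data[2:].decode("utf-16-le")
    if kind == "big-endian with BOM":
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")
```

A stricter parser could just as well reject the stream at the point
where the sketch falls back to little-endian decoding.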

                   Harald A

-- 
Harald Tveit Alvestrand, Maxware, Norway
Harald.Alvestrand@maxware.no

Received on Sunday, 24 May 1998 14:38:27 UTC