Re: Revised proposal for UTF-16 from Dan Kegel on 1998-05-31 (ietf-charsets@w3.org from April to June 1998)

From: Dan Kegel <dank@alumni.caltech.edu>
Date: Sun, 31 May 1998 08:01:53 -0700
To: MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>, Harald Alvestrand <Harald.Alvestrand@maxware.no>, Chris Newman <Chris.Newman@INNOSOFT.COM>, "Martin J. Duerst" <duerst@w3.org>, ietf-charsets@ISI.EDU
Cc: murata@fxis.fujixerox.co.jp, Tatsuo_Kobayashi@justsystem.co.jp
Message-id: <3.0.5.32.19980531080153.00a3b5a0@alumni.caltech.edu>

At 07:51 PM 5/31/98 +0900, MURATA Makoto wrote:
>I think we are converging but minor differences exist.  Little endian: 
>should not or must not?  Is the BOM mandatory or recommended?
>...
>3. My proposal
>
>I would like to reduce useless options.  Little endian is fine, but it 
>should be used only in local environments.  UTF-16 without the BOM is fine, 
>but it should be used only in local environments.
>
>Here is my proposal.
>
> UTF-16 generators MUST send data in big-endian byte order, and the data MUST
> begin with the zero width no-break space (also called Byte Order Mark or BOM)
> (0xFEFF).
>
> NOTE: Some implementations that do not conform to this specification
> have occasionally sent data in little-endian byte order. When they do
> this, they commonly precede the data with the BOM.
> Thus, a UTF-16 parser encountering the code 0xFFFE as the first
> character of a purported UTF-16 stream may safely assume that it
> has encountered a nonconformant data source.  If the BOM is absent,
> there is no fully reliable way to detect little-endian data.

I like this language!
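
To make the rule concrete, here is a rough Python sketch of how a
generator and a parser might apply it. The function names and the
three result labels are my own, not part of the proposal; treat it as
an illustration under those assumptions rather than a reference
implementation.

```python
BOM_BE = b"\xfe\xff"  # U+FEFF serialized in big-endian byte order
BOM_LE = b"\xff\xfe"  # U+FEFF serialized little-endian; a big-endian
                      # parser reads these two bytes as 0xFFFE


def encode_utf16_conformant(text: str) -> bytes:
    """Generator side: big-endian data, always preceded by the BOM."""
    return BOM_BE + text.encode("utf-16-be")


def classify_utf16(data: bytes) -> str:
    """Parser side: inspect the first two bytes of a purported UTF-16 stream.

    The return labels are hypothetical names chosen for this sketch.
    """
    if data[:2] == BOM_BE:
        return "conformant"
    if data[:2] == BOM_LE:
        # First code reads as 0xFFFE: a nonconformant little-endian source.
        return "nonconformant-little-endian"
    # No BOM: the byte order cannot be determined reliably from the data.
    return "no-bom"
```

Note that the "no-bom" case is exactly the one the NOTE warns about:
the parser has nothing trustworthy to go on.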

There was one other issue raised: for protocols that send many
small text messages, should the BOM be sent in each string?
Examples given were HTTP headers and database protocols.

In the case of HTTP headers, we can probably consider the
entire HTTP header stream as a single message, and only require
the BOM at the beginning of the stream, e.g. the client and server
would each send the BOM as the first two bytes after opening the
socket.
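
A receiver along those lines might look like the Python sketch below.
The class name Utf16StreamReader is hypothetical, and tolerating a
little-endian BOM instead of rejecting it is my own assumption about
what a forgiving receiver would do. The point is only that the BOM is
consumed once, and everything after it on the socket is decoded in the
byte order it announced.

```python
import codecs


class Utf16StreamReader:
    """Decode a UTF-16 byte stream whose BOM appears once, at the very start."""

    def __init__(self):
        self._head = b""       # bytes buffered until the 2-byte BOM is complete
        self._decoder = None   # incremental decoder, chosen after the BOM is seen

    def feed(self, chunk: bytes) -> str:
        """Accept the next chunk read from the socket and return decoded text."""
        if self._decoder is None:
            self._head += chunk
            if len(self._head) < 2:
                return ""      # BOM not complete yet
            bom, chunk = self._head[:2], self._head[2:]
            if bom == b"\xfe\xff":
                name = "utf-16-be"
            elif bom == b"\xff\xfe":
                # Nonconformant sender; decoding anyway is an assumption here.
                name = "utf-16-le"
            else:
                raise ValueError("stream does not begin with a BOM")
            self._decoder = codecs.getincrementaldecoder(name)()
        # The incremental decoder keeps any odd trailing byte for the next call.
        return self._decoder.decode(chunk)
```

Chunk boundaries from the network need not fall on 16-bit boundaries,
which is why the sketch uses an incremental decoder rather than
decoding each chunk on its own.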

In the case of database protocols, which send many short
strings, we might want to leave it up to the protocol spec
to say whether the byte order is specified globally
or included in each text string.
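
To show what is at stake, here is a small Python comparison of the two
choices. The framing (a 2-byte big-endian length prefix per string) is
invented purely for this illustration and does not come from any real
database protocol.

```python
import struct

BOM_BE = b"\xfe\xff"  # U+FEFF in big-endian byte order


def frame(payload: bytes) -> bytes:
    """Hypothetical framing: 2-byte big-endian length, then the payload."""
    return struct.pack(">H", len(payload)) + payload


def encode_global(strings):
    """Byte order stated once for the whole session, then bare strings."""
    return BOM_BE + b"".join(frame(s.encode("utf-16-be")) for s in strings)


def encode_per_string(strings):
    """Byte order restated in every string by prefixing each with the BOM."""
    return b"".join(frame(BOM_BE + s.encode("utf-16-be")) for s in strings)
```

With n strings, the per-string form costs 2n bytes of BOM against 2
for the global form, which adds up when the strings are only a few
characters long. On the other hand, each per-string message can be
interpreted without knowing anything about the session it came from.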

Examples and suggestions like the two above should probably
be included in the proposal.
- Dan


Received on Sunday, 31 May 1998 08:07:12 UTC