- From: MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>
- Date: Sun, 31 May 1998 19:51:35 +0900
- To: Dan Kegel <dank@alumni.caltech.edu>, Harald Alvestrand <Harald.Alvestrand@maxware.no>, Chris Newman <Chris.Newman@INNOSOFT.COM>, "Martin J. Duerst" <duerst@w3.org>, ietf-charsets@ISI.EDU
- Cc: murata@fxis.fujixerox.co.jp, Tatsuo_Kobayashi@justsystem.co.jp
I think we are converging, but minor differences remain. Little-endian: SHOULD NOT or MUST NOT? Is the BOM mandatory or recommended?

1. Harald Alvestrand

UTF-16 generators MUST send in big-endian byte order.

NOTE: Some implementations that do not conform to this specification have occasionally sent data in little-endian byte order. When they do this, they commonly precede the data with a zero width non breaking space (also called Byte Order Mark or BOM) (0xFEFF). Thus, a UTF-16 parser encountering the code 0xFFFE as the first character of a purported UTF-16 stream may safely assume that it has encountered a nonconformant data source. There is no way to 100% reliably detect little-endian data that does not use the BOM.

2. Dan Kegel (in my interpretation)

UTF-16 generators must begin with the BOM. They SHOULD [MUST?] NOT send in little-endian byte order, but if they do, they MUST prefix the stream with a little-endian BOM. UTF-16 consumers MUST assume the default byte-order is big-endian, but MUST also accept little-endian if prefixed with a little-endian BOM.

3. My proposal

I would like to reduce useless options. Little-endian is fine, but it should be used only in local environments. UTF-16 without the BOM is fine, but it should be used only in local environments. Here is my proposal.

UTF-16 generators MUST send in big-endian byte order and must begin with the zero width non breaking space (also called Byte Order Mark or BOM) (0xFEFF).

NOTE: Some implementations that do not conform to this specification have occasionally sent data in little-endian byte order. When they do this, they commonly precede the data with the BOM. Thus, a UTF-16 parser encountering the code 0xFFFE as the first character of a purported UTF-16 stream may safely assume that it has encountered a nonconformant data source. If the BOM is absent, there is no way to detect little-endian data with 100% reliability.
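To make the consumer side of these proposals concrete, here is a minimal sketch in Python. It is only an illustration, not text from any draft: the function names encode_utf16 and decode_utf16 are hypothetical, and the choice to accept little-endian data behind a BOM follows the rule in proposal 2, which the stricter proposals would instead treat as a nonconformant source.

```python
import codecs


def encode_utf16(text: str) -> bytes:
    """Generator side: big-endian byte order, always prefixed with the BOM."""
    # codecs.BOM_UTF16_BE is the two bytes FE FF (U+FEFF in big-endian order).
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


def decode_utf16(data: bytes) -> tuple[str, bool]:
    """Consumer side: return (text, conformant).

    'conformant' is False when the stream starts with FF FE, that is, when
    the first code unit reads as 0xFFFE in big-endian order and the source
    therefore sent little-endian data (an assumption of this sketch).
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be"), True
    if data.startswith(codecs.BOM_UTF16_LE):
        # Little-endian BOM: accept the data but flag the source.
        return data[2:].decode("utf-16-le"), False
    # No BOM: the byte order cannot be detected reliably,
    # so fall back to the big-endian default.
    return data.decode("utf-16-be"), True
```

A consumer built this way never guesses: it reads the first two bytes, and in the absence of a BOM it simply applies the big-endian default rather than trying to infer the byte order from the content.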
Makoto

Fuji Xerox Information Systems
Tel: +81-44-812-7230  Fax: +81-44-812-7231
E-mail: murata@apsdc.ksp.fujixerox.co.jp
Received on Sunday, 31 May 1998 08:21:34 UTC