From: MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>
Date: Fri, 06 Nov 1998 10:29:06 +0900
Message-Id: <199811060129.AA02629@murata.apsdc.ksp.fujixerox.co.jp>
To: xml-editor@w3.org
Cc: w3c-xml-syntax-wg@w3.org, duerst@w3.org, mimasa@w3.mag.keio.ac.jp
Ishikawa-san at Keio W3C pointed out that the BOM in little-endian 
UTF-16 (0xFFFE) and the BOM in little-endian UCS-4 (0xFFFE0000) 
cannot be distinguished by examining the first two bytes.  Thus, 
Appendix F of XML 1.0 has to be modified.
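
A minimal sketch of the ambiguity, for illustration only. It assumes
Python's UTF-32 codec constants can stand in for UCS-4, and the helper
name looks_like_ucs4_le is hypothetical. It shows that the two
signatures share their first two bytes and that four bytes are needed
to tell them apart.

```python
import codecs

# Standard-library byte order marks; UTF-32 is used here as a
# stand-in for UCS-4 (an assumption, not part of the original text).
utf16_le_bom = codecs.BOM_UTF16_LE   # FF FE
ucs4_le_bom = codecs.BOM_UTF32_LE    # FF FE 00 00

# The first two bytes are identical, so they cannot separate the two.
same_first_two = utf16_le_bom[:2] == ucs4_le_bom[:2]


def looks_like_ucs4_le(data: bytes) -> bool:
    """Return True if the stream starts with the four bytes FF FE 00 00."""
    return data[:4] == ucs4_le_bom
```

Note that a check on four bytes relies on the first character after a
UTF-16 BOM not being U+0000, which holds for XML documents.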

FYI: The text attached below is quoted from A2 of N1396 ISO/IEC 10646-1 
Corrigendum no. 2 (First draft - revised to 30 April 1996), which is 
available at http://osiris.dkuug.dk/JTC1/SC2/WG2/docs/N1396.doc



P.S.  It appears that RFC 2279 "UTF-8, a transformation format of ISO 10646" 
does not mention the encoding signature.

Annex F
The use of "signatures" to identify UCS

 This annex describes a convention for the identification of features
of the UCS, by the use of "signatures" within data streams of coded
characters. The convention makes use of the character ZERO WIDTH
NO-BREAK SPACE, and is applied by a certain class of applications.

When this convention is used, a signature at the beginning of a stream
of coded characters indicates that the characters following are
encoded in the UCS-2 or UCS-4 coded representation, and indicates the
ordering of the octets within the coded representation of each
character (see 6.3). It is typical of the class of applications
mentioned above, that some make use of the signatures when receiving
data, while others do not. The signatures are therefore designed in a
way that makes it easy to ignore them. In this convention, the ZERO
WIDTH NO-BREAK SPACE character has the following significance when it
is present at the beginning of a stream of coded characters:

UCS-2 signature: FEFF

UCS-4 signature: 0000 FEFF

UTF-8 signature: EF BB BF

UTF-16 signature: FEFF
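
The signature values listed above are the coded representations of
ZERO WIDTH NO-BREAK SPACE (U+FEFF) in each form. As a rough check,
assuming Python's big-endian UTF-16 and UTF-32 codecs correspond to
UCS-2/UTF-16 and UCS-4 in their default octet order, the octets can be
printed like this (the dictionary labels are illustrative only):

```python
# U+FEFF, ZERO WIDTH NO-BREAK SPACE, used as the signature character.
ZWNBSP = "\ufeff"

# Encode the character in each form; big-endian codecs are assumed to
# match the octet order shown in the annex.
signatures = {
    "UCS-2 / UTF-16": ZWNBSP.encode("utf-16-be"),
    "UCS-4": ZWNBSP.encode("utf-32-be"),
    "UTF-8": ZWNBSP.encode("utf-8"),
}

for name, octets in signatures.items():
    # bytes.hex with a separator requires Python 3.8 or later.
    print(f"{name}: {octets.hex(' ').upper()}")
```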

An application receiving data may either use these signatures to
identify the coded representation form, or may ignore them and treat
FEFF as the ZERO WIDTH NO-BREAK SPACE character.

If an application which uses one of these signatures recognises its
coded representation in reverse sequence (e.g. hexadecimal FFFE), the
application can identify that the coded representations of the
following characters use the opposite octet sequence to the sequence
expected, and may take the necessary action to recognise the
characters correctly.

NOTE - The hexadecimal value FFFE does not correspond to any coded
character within ISO/IEC 10646.
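
The convention quoted above, including recognition of the reversed
octet sequence, could be sketched as follows. This is not taken from
the annex or from XML 1.0; the function name sniff_signature and the
returned labels are hypothetical. Longer signatures are tested first
so that FF FE 00 00 is not mistaken for FF FE.

```python
from typing import Optional, Tuple

# Candidate signatures, longest first. The little-endian entries are
# the "reverse sequence" forms described in the annex.
_SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UCS-4, big-endian"),
    (b"\xff\xfe\x00\x00", "UCS-4, little-endian"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UCS-2/UTF-16, big-endian"),
    (b"\xff\xfe", "UCS-2/UTF-16, little-endian"),
]


def sniff_signature(data: bytes) -> Optional[Tuple[str, int]]:
    """Return (label, signature length) or None if no signature is found."""
    for signature, label in _SIGNATURES:
        if data.startswith(signature):
            return label, len(signature)
    return None
```

For example, sniff_signature(b"\xff\xfe<\x00") reports the UCS-2/UTF-16
little-endian form with a two-byte signature, while a stream starting
with FF FE 00 00 is reported as little-endian UCS-4. A receiver that
prefers to ignore signatures can simply skip the reported length.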

Fuji Xerox Information Systems
Tel: +81-44-812-7230   Fax: +81-44-812-7231
E-mail: murata@apsdc.ksp.fujixerox.co.jp
Received on Thursday, 5 November 1998 20:48:58 UTC
