XML spec., annex F

Regarding Annex F. Autodetection of Character Encodings (Non-Normative):

Please consider replacing the text (abbreviated here):
	"Because each XML entity not in UTF-8 or UTF-16 format must
	begin with an XML encoding declaration, in which the first
	characters must be '<?xml', any conforming processor can
	detect, after two to four octets of input, which of
	[...]
	[...]
	other: UTF-8 without an encoding declaration, or else the data
	stream is corrupt, fragmentary, or enclosed in a wrapper of some
	kind"
==================
with the following text:
==================
	"In general, Unicode/10646 text may optionally be preceded
	by start octets (sometimes referred to as 'signature' (10646),
	or 'byte order mark' (Unicode 2.0)). These are:
		00 00 FE FF: UCS-4, big-endian, network octet order.
		FF FE 00 00: UCS-4, little-endian (strictly speaking,
			not conforming to 10646).
		FE FF: UTF-16, big-endian, network octet order.
		FF FE: UTF-16, little-endian (strictly speaking,
			not conforming to 10646).
		EF BB BF: UTF-8 (no byte order issue).  Note that this
			is FEFF encoded in UTF-8.

	Start octets should not be regarded as part of the text data
	(but if they are, they encode a single zero-width no-break
	space character).

	Start octets (byte order mark) are required by XML 1.0 for
	UTF-16 encoded XML text, and are required by XML 1.0 not to be
	part of the text data.

	XML processors can use start octets to detect in which encoding
	an entity is given, if the input is in Unicode and start octets
	are used. Further, because each XML entity not in UTF-8 or
	UTF-16 format must begin with an XML encoding declaration, in
	which the first characters must be '<?xml', any conforming
	processor can detect, after two to four octets of input, which
	of the following cases applies in the absence of start octets.
	In reading this list, it may help to know that in UCS-4, '<' is
	"#x0000003C" and '?' is "#x0000003F".
		00 00 00 3C: UCS-4, big-endian (1234 order).
		3C 00 00 00: UCS-4, little-endian (4321 order)
			(and thus, strictly speaking, not conforming to
			10646).
		00 3C 00 3F: UTF-16, big-endian, no start octets (and
			thus, strictly speaking, not conforming to the
			XML 1.0 specification).
		3C 00 3F 00: UTF-16, little-endian, no start octets
			(and thus, strictly speaking, not conforming
			to the XML 1.0 specification, nor to 10646).
		3C 3F 78 6D: UTF-8, ISO 646, ASCII, some part
			of ISO 8859, Shift-JIS, EUC, or any other 7-bit,
			8-bit, or mixed-width encoding which ensures
			that the characters of ASCII have their normal
			positions, width, and values; the actual
			encoding declaration must be read to detect
			which of these applies, but since all of these
			encodings use the same bit patterns for the
			ASCII characters, the encoding declaration
			itself may be read reliably.
		4C 6F A7 94: EBCDIC in some flavor; the full encoding
			declaration must be read to tell which code page
			is in use.
		other: UTF-8 without an encoding declaration, or else
			the data stream is corrupt, fragmentary, or
			enclosed in a wrapper of some kind."
================
(The second half is essentially unchanged, and I haven't double-checked
the EBCDIC bit.)
================
The reason is that the suggested text is a bit clearer, and more in line
with what the 10646 and Unicode specifications say.  I have taken the
liberty of removing the example on "very unusual byte/octet orders". Do
they ever occur in practice?
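
To illustrate the detection order in the suggested text, here is a rough
Python sketch. It is only an illustration under stated assumptions, not
part of the proposed wording: the function name guess_encoding, the table
name SIGNATURES and the returned labels are all made up for this example.
Start octets are tested first, and the four-octet UCS-4 forms are tested
before the two-octet UTF-16 forms, since FF FE 00 00 also begins with
FF FE.

```python
# Hypothetical helper: names and labels are illustrative assumptions only.
# Each entry pairs a leading octet pattern with a description of the case.
# Order matters: longer start-octet patterns come before shorter ones that
# share the same prefix (FF FE 00 00 before FF FE).
SIGNATURES = [
    # Cases with start octets (signature / byte order mark)
    (b"\x00\x00\xfe\xff", "UCS-4, big-endian (start octets)"),
    (b"\xff\xfe\x00\x00", "UCS-4, little-endian (start octets)"),
    (b"\xfe\xff", "UTF-16, big-endian (start octets)"),
    (b"\xff\xfe", "UTF-16, little-endian (start octets)"),
    (b"\xef\xbb\xbf", "UTF-8 (start octets)"),
    # Cases without start octets, recognised from the octets of '<?xm'
    (b"\x00\x00\x00\x3c", "UCS-4, big-endian"),
    (b"\x3c\x00\x00\x00", "UCS-4, little-endian"),
    (b"\x00\x3c\x00\x3f", "UTF-16, big-endian, no start octets"),
    (b"\x3c\x00\x3f\x00", "UTF-16, little-endian, no start octets"),
    (b"\x3c\x3f\x78\x6d", "ASCII-compatible; read the encoding declaration"),
    (b"\x4c\x6f\xa7\x94", "EBCDIC; read the encoding declaration"),
]

OTHER = "other: UTF-8 without declaration, or corrupt/fragmentary/wrapped"


def guess_encoding(data: bytes) -> str:
    """Classify an entity by looking at no more than its first four octets."""
    head = data[:4]
    for pattern, label in SIGNATURES:
        if head.startswith(pattern):
            return label
    return OTHER
```

For example, guess_encoding(b"<?xml version='1.0'?>") falls into the
ASCII-compatible case, after which the encoding declaration itself would
still have to be read to find the actual encoding.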

			Kind regards
			/kent k

Received on Monday, 27 July 1998 10:52:26 UTC