- From: Karlsson Kent - keka <keka@im.se>
- Date: Mon, 27 Jul 1998 16:52:17 +0200
- To: "'xml-editor@w3.org'" <xml-editor@w3.org>
Regarding Annex F. Autodetection of Character Encodings (Non-Normative):

Please consider replacing the text (abbreviated here):

"Because each XML entity not in UTF-8 or UTF-16 format must begin with an XML encoding declaration, in which the first characters must be '<?xml', any conforming processor can detect, after two to four octets of input, which of [...] [...]

other: UTF-8 without an encoding declaration, or else the data stream is corrupt, fragmentary, or enclosed in a wrapper of some kind"

==================
with the following text:
==================

"In general, Unicode/10646 text may optionally be preceded by start octets (sometimes referred to as 'signature' (10646), or 'byte order mark' (Unicode 2.0)). These are:

00 00 FE FF: UCS-4, big-endian, network octet order.
FF FE 00 00: UCS-4, little-endian (strictly speaking, not conforming to 10646).
FE FF: UTF-16, big-endian, network octet order.
FF FE: UTF-16, little-endian (strictly speaking, not conforming to 10646).
EF BB BF: UTF-8 (no byte order issue). Note that this is FEFF encoded in UTF-8.

Start octets should not be regarded as part of the text data (but if they are, they encode a single zero-width no-break space character). Start octets (the Byte Order Mark) are required by XML 1.0 for UTF-16 encoded XML text, and XML 1.0 requires that they not be treated as part of the text data.

XML processors can use start octets to detect in which encoding an entity is given, if the input is in Unicode and start octets are used.

Further, because each XML entity not in UTF-8 or UTF-16 format must begin with an XML encoding declaration, in which the first characters must be '<?xml', any conforming processor can detect, after two to four octets of input, which of the following cases applies in the absence of start octets. In reading this list, it may help to know that in UCS-4, '<' is "#x0000003C" and '?' is "#x0000003F".

00 00 00 3C: UCS-4, big-endian (1234 order).
3C 00 00 00: UCS-4, little-endian (4321 order) (and thus, strictly speaking, not conforming to 10646).
00 3C 00 3F: UTF-16, big-endian, no start octets (and thus, strictly speaking, not conforming to the XML 1.0 specification).
3C 00 3F 00: UTF-16, little-endian, no start octets (and thus, strictly speaking, not conforming to the XML 1.0 specification, nor to 10646).
3C 3F 78 6D: UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, and values; the actual encoding declaration must be read to detect which of these applies, but since all of these encodings use the same bit patterns for the ASCII characters, the encoding declaration itself may be read reliably.
4C 6F A7 94: EBCDIC in some flavor; the full encoding declaration must be read to tell which code page is in use.
other: UTF-8 without an encoding declaration, or else the data stream is corrupt, fragmentary, or enclosed in a wrapper of some kind."

================
(The second half is essentially unchanged, and I haven't double-checked the EBCDIC bit.)
================

The reason is that the suggested text is a bit clearer, and more in line with what the 10646 and Unicode specifications say. I have taken the liberty of removing the example on "very unusual byte/octet orders". Do they ever occur in practice?

Kind regards
/kent k
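For illustration only, the detection order described in the proposed text could be sketched roughly as below. This is a minimal sketch, not part of the XML specification or of the proposal itself: the function name `detect_encoding`, the two table names, and the returned label strings are all hypothetical. It assumes the four-octet UCS-4 signatures are tested before the two-octet UTF-16 ones, since FF FE 00 00 would otherwise be taken for UTF-16 little-endian.

```python
# Hypothetical helper: names and label strings are illustrative assumptions.

# Start octets (signature / byte order mark). The four-octet UCS-4 patterns
# come first so that FF FE 00 00 is not mistaken for UTF-16 little-endian.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UCS-4 big-endian (signature)"),
    (b"\xff\xfe\x00\x00", "UCS-4 little-endian (signature)"),
    (b"\xef\xbb\xbf", "UTF-8 (signature)"),
    (b"\xfe\xff", "UTF-16 big-endian (signature)"),
    (b"\xff\xfe", "UTF-16 little-endian (signature)"),
]

# Patterns for '<?xm' (or '<' alone in UCS-4) when no start octets are present.
NO_SIGNATURE = [
    (b"\x00\x00\x00\x3c", "UCS-4 big-endian"),
    (b"\x3c\x00\x00\x00", "UCS-4 little-endian"),
    (b"\x00\x3c\x00\x3f", "UTF-16 big-endian"),
    (b"\x3c\x00\x3f\x00", "UTF-16 little-endian"),
    (b"\x3c\x3f\x78\x6d", "ASCII-compatible; read the encoding declaration"),
    (b"\x4c\x6f\xa7\x94", "EBCDIC; read the encoding declaration"),
]

FALLBACK = "UTF-8 without declaration, or corrupt/fragmentary/wrapped data"


def detect_encoding(data: bytes) -> str:
    """Classify an entity from its first two to four octets."""
    head = data[:4]
    for prefix, label in SIGNATURES + NO_SIGNATURE:
        if head.startswith(prefix):
            return label
    return FALLBACK
```

A caller would pass the first octets of the entity, for example `detect_encoding(b"<?xml version='1.0'?>")`, and then go on to read the encoding declaration where the label says so.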
Received on Monday, 27 July 1998 10:52:26 UTC