- From: Michael Sperberg-McQueen <U35395@UICVM.CC.UIC.EDU>
- Date: Sun, 15 Sep 96 12:22:37 CDT
- To: W3C SGML Working Group <w3c-sgml-wg@w3.org>
Working through the postings about character sets, I have come to believe we may be passing some things over in silence, because we think them too obvious to merit discussion.

We all seem to agree that XML documents should be able to contain any character in ISO 10646. We all seem to agree that characters other than A-Z, a-z, 0-9, -, and . should be legal name- and name-start characters. (This will change a trivial task in parser development into one most of us have never performed, but everyone in the discussion seems willing to make a leap of faith here and believe Gavin Nicol when he says it's not very hard.)

If I understand the postings correctly, the remaining differences of opinion start here. There are several positions staked out, which I attempt to summarize below in ways that make clear how they agree and differ, or else expose my misunderstanding of people's views.

1. Hard Minimalism (Tim Bray):

- all XML data streams must be in UTF-8 form
- all XML systems must accept UTF-8 data
- when data on disk is in non-UTF-8 form, responsibility for conversion rests outside the XML system
- it's not said whether XML systems may accept data streams in other formats (e.g. Shift-JIS); XML parsers which feed data to applications must, however, feed them UTF-8

2. The Dual-Track Approach (James Clark):

- all XML data streams must be in UTF-8 or UTF-16
- all XML systems must accept either UTF-8 or UTF-16 data, telling the difference by means of the xFEFF character conventionally used as a byte-order label in UTF-16 data streams
- when data on disk is in non-Unicode form, responsibility for conversion rests outside the XML system (? I'm not sure JC was explicit about this)
- it's not said whether XML systems may accept data streams in other formats (e.g. Shift-JIS)

3. Let 100 Flowers Bloom:

Gavin Nicol and Todd Bauman have argued for a third position, which I understand to have the following salient points:

- XML data streams can be in any known or documentable encoding
- XML systems may accept data streams in any format(s) they choose to support; they are encouraged but not required to accept UTF-8
- all XML systems must implement and rely on external specification of the coded character set / encoding, such as MIME or attributes on an FSI
- each XML system must support content negotiation so clients and servers can avoid sending or receiving XML data in unsupported encodings

This position seems, in some ways, to be even more minimalist than Tim Bray's, since there is *no* coded character set or encoding which *all* XML systems are required to support. ("XML browsers would not need to support any encodings other than those deemed important by the companies producing them" was Gavin Nicol's way of putting it.) A conforming XML system could legitimately restrict itself to handling ASCII, or ISO 8859-1, or 96-character EBCDIC. For this reason, I propose naming this the Let-100-Flowers-Bloom position. Those uncomfortable with the allusion to Mao might prefer to call it the Laissez-Faire approach.

4. The Hard Maximalist Position:

This is what I originally understood Nicol and Bauman to be arguing for; it's not wholly unlike the apparent intent of ISO 8879, as I understand it, though there are some obvious differences of detail.
- XML data streams can be in any known or documentable encoding
- all XML systems to implement and rely on external specification of the coded character set / encoding, such as MIME or attributes on an FSI
- all XML systems to support parse-time specification of arbitrary 7- or 8-bit coded character sets, or any known Unicode encoding
- each XML system to support content negotiation so clients and servers can know when to send a parse-time character-set specification and/or font
- when data on disk is in a form not built in to the XML system, responsibility for declaring it rests with the user, and responsibility for using the declaration to convert the data into an appropriate internal form rests with the XML system

5. The Eclectic Compromise (DeRose):

A slight extension of the Dual-Track approach:

- XML data streams may be in any known or documentable encoding
- all XML systems must accept UTF-8 data but may reject other formats
- XML systems are encouraged to accept UTF-16 data, telling the difference by means of the xFEFF character conventionally used as a byte-order label in UTF-16 data streams
- XML systems may at their option accept data in other formats; how they recognize the format (autodetection, external labels, internal label) is not specified
- XML systems must be able to emit a normalized form of any document they can accept; the normalized form is in UTF-8 (and thus can be read by any XML system)
- when data on disk is in a non-supported form, responsibility for conversion rests outside the XML system

It seems to me the differences among these proposals pose several questions, some of them surprising to me since I hadn't expected any differences of opinion:

Q1 should there be any minimal function required of all conforming XML systems, any coded character set or character encoding they are all required to accept as input, whether across the net or from disk?

Q2 should conforming XML systems be prohibited from accepting any input format they are not required to accept?

Q3 if XML systems may accept different sets of input formats (whether or not these sets overlap), can we ensure interoperability in some way, or is that a lost cause?

Q4 if XML systems may *only* accept Unicode (whether just UTF-8 or also UTF-16), is there anything that can be done to make life easier for users of current systems which rely on ASCII, ISO 8859-1 or 8859-*, JIS, Shift-JIS, EUC, etc.?

It seems to me that there must be at least one encoding accepted by all XML systems; a parser that accepts ASCII only may be XML-like, but it should not be XML. Period.

It seems to me that requiring all users to fit filters on the front and back ends of all XML tools, to accomplish their local-to-UTF-8 and UTF-8-to-local conversions, raises an unnecessary barrier to acceptance; to avoid this, it seems essential to allow an XML parser, at least one for use on local files, to read the native coded character set without prostheses. On the other hand, to ensure interoperability, we don't want such variations to be globally visible.

Here's yet another proposal.

6. Limited Modified Eclecticism:

A compromise between the Eclectic Compromise and 100 Flowers:

- XML data streams may be in any of a number of supported encodings: UTF-8, UTF-16, UCS-4, ISO 8859
- XML data streams must label themselves as to which supported encoding they are using, by means of a PI which must be the first data in each XML entity
- all XML systems must accept XML data in any supported encoding, detecting the encoding in use from the internal label; they may reject data in other encodings (see the note on autodetection, below)
- XML systems may optionally check the internal labeling for consistency with external labels (MIME, FSI, ...) and warn about inconsistencies or errors
- if the encoding of a data stream is not supported, the data stream is strictly in error; an XML system may however optionally recover from that error, e.g. to support a well-known encoding in local use. At the user's option, warning messages for this error may be suppressed. Conforming XML systems must however allow a user option to have such errors reported (e.g. for the use of users about to send data to other sites which may not handle unsupported encodings)
- XML systems must be able to emit a normalized form of any document they can accept; the normalized form is in UTF-8 (and thus can be read by any XML system); a sketch of such a conversion follows this list
- when data on disk is in a non-supported form, responsibility for conversion rests outside the XML system
- when data on disk is in a supported form, responsibility for conversion to the XML system's internal form rests with the XML system
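Since more than one of the proposals above requires emitting a normalized form in UTF-8, it may be worth noting how mechanical that conversion is. Purely for illustration -- the function name is mine, and characters above xFFFF and all error handling are omitted -- a UTF-8-based system might emit each ISO 10646 character roughly like this:

   #include <stdio.h>

   /* Write one ISO 10646 character (up to xFFFF) in UTF-8 form.
      Illustrative sketch only; characters above xFFFF and all
      error handling omitted. */
   void put_utf8(unsigned long c, FILE *out)
   {
       if (c < 0x80) {                 /* 1 octet:  0xxxxxxx           */
           putc((int) c, out);
       } else if (c < 0x800) {         /* 2 octets: 110xxxxx 10xxxxxx  */
           putc((int) (0xC0 | (c >> 6)), out);
           putc((int) (0x80 | (c & 0x3F)), out);
       } else {                        /* 3 octets: 1110xxxx 10xxxxxx 10xxxxxx */
           putc((int) (0xE0 | (c >> 12)), out);
           putc((int) (0x80 | ((c >> 6) & 0x3F)), out);
           putc((int) (0x80 | (c & 0x3F)), out);
       }
   }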
What this boils down to is an attempt to allow XML systems to accept data in commonly used formats, without impeding interoperability. Systems are allowed to accept commonly used character encodings, just not to hide from their users the fact that XML strictly speaking requires one of the supported encodings.

If we restrict the supported character encodings to UTF-8 and UTF-16, I think this proposal is only trivially different from the Dual-Track proposal or the Eclectic Compromise. If we add 8859 to the list, then the implementation burdens are only trivially increased (a UTF-8-based system has to autodetect the character encoding and translate to UTF-8 before actually reading the data), but the users' burdens seem substantially lighter. (Very few users want to have to deal with character-set problems, even if we put the filters on their disk for them.) I hesitate to add Shift-JIS etc. to the list of supported formats, mostly because few programmers outside of Japan understand JIS, Shift-JIS, and EUC, and only a few more would be willing to learn enough to understand them if they had the opportunity.

Note on autodetection of character sets. Before a parser can read the internal label, it has to know what character set is in use -- which is what the internal label is trying to tell it. This is why the SGML declaration doesn't provide fully automatic handling of foreign data. But if we limit ourselves to a finite set of supported formats, and give ourselves some clear text to begin with, then autodetection is a soluble problem. If each XML entity begins with a PI looking something like this:

   <?XML charset='...'>

then the first part of the entity *must* be the characters '<?XML', and any conforming processor can detect, after four octets of input, which of the following cases applies (it may help to know that in Unicode, '<' is 0000 003C and '?' is 0000 003F):

1  x00 00 00 3C  - UCS-4, big-endian machine (1234)
   x3C 00 00 00  - UCS-4, little-endian machine (4321)
   x00 00 3C 00  - UCS-4, weird machine (2143)
   x00 3C 00 00  - UCS-4, weird machine (3412)
2  x00 3C 00 3F  - UCS-2, big-endian
   x3C 00 3F 00  - UCS-2, little-endian
3  x3C 3F 58 4D  - ASCII, some part of 8859, UTF-8, or any other ISO-flavor 7- or 8-bit set
4  x4C 6F E7 D4  - EBCDIC (in some flavor)
5  other         - the data are corrupt, fragmentary, or enclosed in a wrapper of some kind (e.g. a MIME wrapper)

Knowing that, it ought to be possible to handle things properly -- whether by invoking a separate lexical scanner for each case, or by calling the proper conversion function on each character of input. Tim and Gavin have already shown code fragments for this.
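For concreteness, here is another rough sketch of the same idea, in C. It is purely illustrative -- the names and the shape of the interface are mine, not proposed wording -- and it classifies only the encoding family; the charset PI must still be read afterwards to distinguish UTF-8 from the parts of 8859, and so on:

   #include <string.h>

   /* Encoding families distinguishable from the first four octets
      of an entity that begins with '<?XML charset=...'. */
   enum enc_family {
       ENC_UCS4,       /* case 1: 32-bit characters, some byte order       */
       ENC_UCS2,       /* case 2: 16-bit characters, big- or little-endian */
       ENC_EIGHTBIT,   /* case 3: ASCII, 8859-x, UTF-8, or similar         */
       ENC_EBCDIC,     /* case 4: some flavor of EBCDIC                    */
       ENC_UNKNOWN     /* case 5: corrupt, fragmentary, or wrapped         */
   };

   enum enc_family detect_family(const unsigned char b[4])
   {
       static const unsigned char ucs4[4][4] = {
           { 0x00, 0x00, 0x00, 0x3C },   /* big-endian (1234)    */
           { 0x3C, 0x00, 0x00, 0x00 },   /* little-endian (4321) */
           { 0x00, 0x00, 0x3C, 0x00 },   /* weird order (2143)   */
           { 0x00, 0x3C, 0x00, 0x00 }    /* weird order (3412)   */
       };
       int i;

       /* case 1: '<' as a single 32-bit character, in some byte order */
       for (i = 0; i < 4; i++)
           if (memcmp(b, ucs4[i], 4) == 0)
               return ENC_UCS4;

       /* case 2: '<?' as two 16-bit characters */
       if ((b[0] == 0x00 && b[1] == 0x3C && b[2] == 0x00 && b[3] == 0x3F) ||
           (b[0] == 0x3C && b[1] == 0x00 && b[2] == 0x3F && b[3] == 0x00))
           return ENC_UCS2;

       /* case 3: '<?XM' in ASCII, 8859, or UTF-8 */
       if (b[0] == 0x3C && b[1] == 0x3F && b[2] == 0x58 && b[3] == 0x4D)
           return ENC_EIGHTBIT;

       /* case 4: '<?XM' in EBCDIC */
       if (b[0] == 0x4C && b[1] == 0x6F && b[2] == 0xE7 && b[3] == 0xD4)
           return ENC_EBCDIC;

       /* case 5 */
       return ENC_UNKNOWN;
   }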
This level of autodetection is enough to read the initial processing instruction with the pointer to the declarations and the character set identifier, which is still necessary to distinguish UTF-8 from 8859, and the parts of 8859 from each other (as well as the varieties of EBCDIC and so on).

Like any self-labeling system, this can break if software changes the character set or encoding and doesn't update the label. I get lots of mail with MIME labels saying (in EBCDIC) that the data are in ASCII, and I don't think ASCII-EBCDIC gateways are the only places where such translations occur. So I still think that we need clear rules about network transfer, and about what to do if you don't control the gateways (e.g. if you are going through someone else's ftp server or client, or via email). Perhaps we should say that network transmissions (or HTTP transmissions) should always be in UTF-8, and that the other supported formats are only for local use on disk ...

Is this compromise workable?

-C. M. Sperberg-McQueen
 ACH / ACL / ALLC Text Encoding Initiative
 University of Illinois at Chicago
 tei@uic.edu
Received on Sunday, 15 September 1996 15:49:51 UTC