- From: Michael Sperberg-McQueen <U35395@UICVM.CC.UIC.EDU>
- Date: Mon, 16 Sep 96 09:44:09 CDT
- To: Gavin Nicol <gtn@ebt.com>, W3C SGML Working Group <w3c-sgml-wg@w3.org>
On Mon, 16 Sep 1996 14:40:16 GMT Gavin Nicol said:

>>Q1 should there be any minimal function required of all conforming
>>XML systems, any coded character set or character encoding they are
>>all required to accept as input, whether across the net or from disk?
>
>The coded character set should be ISO 10646. I am willing to accept
>UTF-8 as required (I argued exactly that position on HTML-WG a long
>time ago).

I'm confused.  Where I come from, 'coded character set' is a mapping
between a set of characters and a set of bit patterns, not necessarily
all of the same length.  (That is, I use the terms 'character set' and
'coded character set' as typically defined in the ISO character set
standards.  I try, unsuccessfully, to avoid the bare term 'character
set' precisely because SC 18 uses it to mean what SC 2 means by 'coded
character set'.)

Under that definition, if 'the coded character set should be ISO
10646', then we should not accept JIS 0208, Shift-JIS, EUC, ISO 8859,
etc., because they are different coded character sets.  Their
character sets, a.k.a. character repertoires, happen to be subsets of
that of ISO 10646 and of Unicode, but that does not make them the same
coded character set, nor encodings of it.  It just means that
translation into 10646, or into its encodings, is not inherently
lossy.

So there seems to me to be an inherent contradiction in saying 'the
coded character set should be ISO 10646' and 'we should allow XML
documents to be in Shift-JIS or EUC or ...', which I thought you said
elsewhere in your posting.  I doubt that this contradiction is real,
but I don't know how to resolve it; that is, I don't understand what
your position is.
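To make the distinction concrete: a single abstract character receives
a different bit-pattern mapping under each coded character set, even
though the repertoires overlap.  A rough sketch (Python, purely for
illustration; the codec names are Python's labels, nothing normative):

```python
# One abstract character, KATAKANA LETTER A (U+30A2 in ISO 10646),
# mapped to bit patterns by four different coded character sets /
# encodings.
ch = "\u30a2"

for codec in ("shift_jis", "euc_jp", "utf-8", "utf-16-be"):
    print(f"{codec:10s} -> {ch.encode(codec).hex(' ')}")

# shift_jis  -> 83 41
# euc_jp     -> a5 a2
# utf-8      -> e3 82 a2
# utf-16-be  -> 30 a2
#
# Same repertoire entry, four different bit patterns: translation into
# 10646 is lossless, but that does not make Shift-JIS or EUC
# encodings of 10646.
```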
>>Q3 if XML systems may accept different sets of input formats (whether
>>or not these sets overlap), can we ensure interoperability in some
>>way, or is that a lost cause?
>
>Interoperability is something to be greatly desired, and in fact, the
>primary reason I got involved in HTML I18N was precisely that.
>However, I do not believe that at this time, we can get to a point
>where all XML systems will be able to process all XML documents. At
>some point in the future (3-5 years), perhaps. Now, no.

This scares the pants off me.  In 3-5 years, if XML is widely adopted,
it will be *impossible* to impose interoperability in the form of
required support for data streams in UTF-8 or UTF-16 or whatever,
because by then there will be legacy systems and legacy data to be
protected.  The only way to achieve such uniformity is to impose it at
the outset, when there is no XML legacy data and we have a free hand.
Failing to ensure interoperability while we have a free hand is not a
good sign for our ability to achieve it later, when our hands will be
tied by systems that have made use of whatever freedom the spec gives
them now.

>>Note on autodetection of character sets.
>
>Autodetection fails abysmally as soon as you get more than a few
>encodings.

Correct; that's why the proposal (a) limits itself to the cases of
UCS-4, UTF-16 / UCS-2, ISO 646 and compatible encodings (including
UTF-8), and EBCDIC, and (b) requires an explicit label in the entity
regardless.  Autodetection in general fails when there are more than a
few encodings; on that we agree.  The question in my mind is whether
this particular proposal for autodetection fails for the limited set
of encodings just described.
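For concreteness, here is roughly the decision table I have in mind,
keyed on the first four bytes of the entity, which must begin with the
required, explicitly labelled declaration -- i.e. with '<?xm' in
whatever coding is in use.  A sketch only, in Python; the function
name sniff_family, and the exact spelling and case of the declaration,
are illustrative assumptions, not settled spec text:

```python
def sniff_family(first4: bytes) -> str:
    """Guess the encoding family of an XML entity from its first four
    bytes.  This works only because the entity is required to begin
    with an explicit declaration ('<?xm...' is assumed here); the
    declaration itself then names the exact encoding."""
    if first4[:2] in (b"\xfe\xff", b"\xff\xfe"):
        return "UTF-16 / UCS-2 (byte order mark)"
    if first4 == b"\x00\x00\x00\x3c":
        return "UCS-4, big-endian"
    if first4 == b"\x3c\x00\x00\x00":
        return "UCS-4, little-endian"
    if first4 == b"\x00\x3c\x00\x3f":
        return "UTF-16 / UCS-2, big-endian, no byte order mark"
    if first4 == b"\x3c\x00\x3f\x00":
        return "UTF-16 / UCS-2, little-endian, no byte order mark"
    if first4 == b"\x3c\x3f\x78\x6d":   # '<?xm' in ISO 646 / UTF-8
        return "ISO 646-compatible (incl. UTF-8); now read the label"
    if first4 == b"\x4c\x6f\xa7\x94":   # '<?xm' in EBCDIC
        return "EBCDIC; now read the label"
    return "detection fails -- exactly the general case we exclude"
```

Something like sniff_family(stream.read(4)) narrows the field just
enough to read the label; the label, not the sniffing, settles the
encoding.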
> 5) Client/server interaction will initially be primarily in the native
>    encoding.
> 6) Over time, a transition will be made to UTF-8/UTF-16 (ie. as more
>    and better tools become available).
>
>We should recognise, and accept this.

Hmm.  By analogy with this, HTML could have started by allowing any
existing 7-bit national character set, as well as proprietary 8-bit
character sets, in the hope that eventually, with time, all HTML users
would migrate to ISO 8859-1.  As it is, it started by prescribing
8859-1, which is a blessing, because it allows users to protest
against servers which serve data in native PC or native Mac format.
Things are bad enough as they are; would they really be better if
sending data in the Mac's proprietary coded character set were legal
HTML?

On the whole, I think HTML did the right thing in pressing for a move
from 7 to 8 bits, and I think XML should do the same in pressing for a
move to 16 bits.  It is much easier to loosen XML's restrictions in
later revisions than to tighten them and break legacy data.

-C. M. Sperberg-McQueen

Received on Monday, 16 September 1996 11:13:11 UTC