- From: James Clark <jjc@jclark.com>
- Date: Thu, 12 Sep 1996 11:21:32 +0000
- To: w3c-sgml-wg@w3.org
> From: Gavin Nicol <gtn@ebt.com>
> Date: Tue, 10 Sep 1996 19:55:42 GMT
>
> >I think there are many good reasons - overwhelmingly good - for the
> >SGML case. For XML, we should think about sweeping all the
> >charset/encoding subtleties under the rug: it's UTF8 and that's all
> >there is to it.
>
> UTF8 doesn't solve the world's problems. I think we can fix the
> character repertoire, but fixing the encoding is arbitrary, and
> prescribes certain implementation details. It also complicates usage.

I agree that fixing the character repertoire on ISO 10646 is the right
way to go. (There appears to be consensus on this.)

How much flexibility to allow with encodings is a difficult issue.
Going with UTF-8 makes a lot of sense in the Unix world; within a few
years I expect to see most Unix platforms supporting UTF-8. But the
Microsoft Windows world is going with UCS-2 (or UTF-16, which is an
extension of UCS-2): system tools such as Notepad on Windows NT already
support UCS-2, and I would expect Windows 9x to follow suit eventually.
Although many platforms do not currently support either UTF-8 or
UCS-2/UTF-16, my impression is that most of them are planning support
for one or the other. This suggests the possibility of allowing just
these two encodings for XML.

If we do this, we then have to face the issue of how an XML system will
know which encoding is being used. I think there's an easy solution to
this. The Unicode standard recommends that files start with a byte
order mark (0xFEFF); since 0xFFFE is not a legal character, this allows
a Unicode system to determine whether big-endian or little-endian byte
order is being used. Fortunately, neither 0xFF nor 0xFE can start the
UTF-8 representation of a character (in fact, I don't think they are
legal anywhere in UTF-8). So XML could require that a byte order mark
be used whenever UCS-2/UTF-16 is being used, and assume UTF-8 when the
mark is not present.

Restricting XML to these two encodings would, as Gavin says, be
somewhat arbitrary, and it would indeed complicate usage, at least in
the short term, for the Japanese and European markets. I'm very
sympathetic to Gavin's point of view here, but what's the alternative?

- Allow any encoding at all? That's an implementation nightmare, and
  users lose the guarantee that any XML system will be able to handle
  any XML document.

- If not, which encodings do we allow? If we allow SJIS and EUC,
  shouldn't we also support the Chinese, Taiwanese, and Korean
  encodings, as well as all of ISO 8859-[123456789] for the Europeans?
  There are many other encodings with active user communities. And if
  we have lots of possible encodings, we certainly can't auto-detect
  which one is being used. The possibilities I can think of are:

  a) rely on some external mechanism to specify the encoding
  b) use ISO 2022 escape sequences
  c) require each XML file to have a MIME header with a charset
     parameter
  d) use an attribute in a formal system identifier

Whichever way we go, there's an enormous step up in complexity over the
UTF-8/UCS-2 route.

James
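The auto-detection rule James describes fits in a few lines of C. This
is a minimal sketch for illustration only (the function name and enum
are made up, not from any actual parser); it assumes the first bytes of
the entity are already in a buffer:

    #include <stddef.h>

    typedef enum { ENC_UTF8, ENC_UCS2_BE, ENC_UCS2_LE } Encoding;

    /* Sniff the leading bytes of an XML entity.  A byte order mark
       (0xFEFF) signals UCS-2/UTF-16, and the order of its two bytes
       reveals the endianness; anything else is taken to be UTF-8,
       which is safe because 0xFE and 0xFF never occur in UTF-8. */
    Encoding sniff_encoding(const unsigned char *buf, size_t n)
    {
        if (n >= 2) {
            if (buf[0] == 0xFE && buf[1] == 0xFF)
                return ENC_UCS2_BE;  /* BOM in order: big-endian */
            if (buf[0] == 0xFF && buf[1] == 0xFE)
                return ENC_UCS2_LE;  /* BOM swapped: little-endian */
        }
        return ENC_UTF8;             /* no BOM: assume UTF-8 */
    }

The key design point is that the two-encoding rule makes detection
unambiguous with no external metadata at all, which none of the
alternatives (a)-(d) can offer.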
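To make option (c) concrete, each XML file would have to begin with a
header line along these lines (a hypothetical example: "text/xml" is
used here purely for illustration, as no MIME media type for XML had
been registered at the time):

    Content-Type: text/xml; charset=ISO-2022-JP

Unlike the BOM rule, this requires every producer and consumer to agree
on carrying and parsing the header, and it breaks down for files that
pass through channels that strip or ignore MIME headers.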
Received on Thursday, 12 September 1996 06:27:06 UTC