- From: Tim Bray <tbray@textuality.com>
- Date: Tue, 10 Sep 1996 13:51:16 -0700
- To: w3c-sgml-wg@w3.org
Based on Gavin's experience that standard parsing tools can do the right thing with 10646 encodings, it seems that a very strong candidate for the best balance between flexibility, generality, and ease of implementation is this:

  All XML documents will be encoded entirely in UTF-8, data and markup.
  An XML processor will not perform any conversions on the data or
  markup, but will pass the data and markup to applications as they
  appear in the document.

It seems to me this tells implementors and users of tools *exactly* what they have to do, leaves no wriggle room, makes us language-independent to the extent that 10646 does (hard to beat), and supports implementation with standard tools.

Obviously there are ways in which this could usefully be generalized; do any of these generalizations confer sufficient benefits to users that they are worth the extra implementation complexity?

Gavin writes:

>UTF8 doesn't solve the worlds problems. I think we can fix the
>character repertoire, but fixing the encoding is arbitrary, and
>prescribes certain implementation details. It also complicates usage.

This is true; but I don't think the UTF-8 solution complicates usage, it just offloads the content conversion/interpretation problem from the parser. And the benefit - that anyone, anywhere, can write a simple program that will read *any XML document in the world* and, without recourse to any metadata, know what the bits mean - seems pretty large to me.

Obviously, it would be of substantial public benefit to distribute, along with XML, a library of routines that convert between UTF-8 and {UCS*, ISO-8859-*, *JIS, etc...}. In fact, since the XML spec should include the API to the parser, we might even consider making at least some of these compulsory. But that's orthogonal to what the parser does.

Cheers, Tim Bray
tbray@textuality.com http://www.textuality.com/ +1-604-488-1167
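To make the conversion-library idea concrete, here is a minimal sketch of one routine such a library might contain: an ISO-8859-1 (Latin-1) to UTF-8 filter. The function name and the byte-stream framing are illustrative assumptions, not a proposed API; the point is only that every Latin-1 byte maps to one or two UTF-8 bytes.

    #include <stdio.h>

    /*
     * Sketch of a Latin-1 -> UTF-8 converter.  ISO-8859-1 code points
     * are U+0000..U+00FF, so each input byte becomes either one byte
     * (US-ASCII range) or a two-byte UTF-8 sequence.
     */
    void latin1_to_utf8(FILE *in, FILE *out)
    {
        int c;
        while ((c = getc(in)) != EOF) {
            if (c < 0x80) {
                /* US-ASCII: identical in UTF-8 */
                putc(c, out);
            } else {
                /* 0x80..0xFF: two-byte sequence 110xxxxx 10xxxxxx */
                putc(0xC0 | (c >> 6), out);
                putc(0x80 | (c & 0x3F), out);
            }
        }
    }

    int main(void)
    {
        latin1_to_utf8(stdin, stdout);
        return 0;
    }

Going the other direction (UTF-8 back to Latin-1) needs only the inverse two-byte decoding, plus a policy for code points above U+00FF that Latin-1 cannot represent.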