Re: Concrete syntax, character sets

Based on Gavin's experience that standard parsing tools can do the right thing 
with 10646 encodings, it seems that a very strong candidate for the best 
balance among flexibility, generality, and ease of implementation is this:

 All XML documents will be encoded entirely in UTF-8, both data and markup.
 An XML processor will perform no conversions on the data or markup; it will 
 pass the data and markup to applications exactly as they appear in the document.

This tells implementors and users of tools *exactly* what they have to do,
leaves no wriggle room, makes us language-independent to the degree that
10646 allows (hard to beat), and supports implementation with standard tools.

Obviously there are ways in which this could usefully be generalized; do any
of these generalizations confer enough benefit on users to be worth the extra
implementation complexity?

Gavin writes:
>UTF8 doesn't solve the worlds problems. I think we can fix the
>character repertoire, but fixing the encoding is arbitrary, and
>prescribes certain implementation details. It also complicates usage.

This is true, but I don't think the UTF-8 solution complicates usage; it just 
offloads the content conversion/interpretation problem from the parser.  And 
the benefit - that anyone, anywhere, can write a simple program that will
read *any XML document in the world* and, without recourse to any metadata,
know what the bits mean - seems pretty large to me.
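
By way of illustration - this is a rough sketch of mine, not part of the
proposal, and the function name is invented - here is essentially all the
code such a reader needs once the encoding is fixed, written in Python; no
charset detection or declaration is involved:

    # Sketch only: read any XML document under the fixed-UTF-8 rule.
    # No metadata is consulted; the bytes are decoded as UTF-8 directly,
    # markup and data alike.
    def read_xml_text(path):
        with open(path, "rb") as f:
            raw = f.read()
        return raw.decode("utf-8")  # unambiguous: the encoding is known a priori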

Obviously, it would be of substantial public benefit to distribute, along
with XML, a library of routines that convert stuff between UTF-8 and 
{UCS*, ISO-8859-*, *JIS, etc.}.  In fact, since the XML spec should include
the API to the parser, we might even consider making at least some of these
routines compulsory.  But that's orthogonal to what the parser does.
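
For concreteness, a couple of such routines might look like the following
(again a sketch, in Python; the function names are illustrative, not a
proposed API):

    # Illustrative converters between UTF-8 and ISO-8859-1 (Latin-1).
    def latin1_to_utf8(data):
        # Re-encode Latin-1 bytes as UTF-8 before handing them to an XML parser.
        return data.decode("iso-8859-1").encode("utf-8")

    def utf8_to_latin1(data):
        # Convert UTF-8 output back to Latin-1 where an application needs it;
        # characters outside the Latin-1 repertoire raise an error.
        return data.decode("utf-8").encode("iso-8859-1")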

Cheers, Tim Bray
tbray@textuality.com http://www.textuality.com/ +1-604-488-1167

Received on Tuesday, 10 September 1996 16:48:08 UTC