Re: character sets - a summary and a proposal

At 12:22 PM 9/15/96 CDT, Michael Sperberg-McQueen wrote:

[after an *excellent* summary-of-the-position]

>Here's yet another proposal.
>6.  Limited Modified Eclecticism:  compromise between Eclectic
>Compromise and 100 Flowers:
>  - XML data streams may be in any of a number of supported encodings:
>    UTF-8, UTF-16, UCS-4, ISO 8859
>  - XML data streams must label themselves as to which supported
>    encoding they are using, by means of a PI which must be the first
>    data in each XML entity.
>  - all XML systems must accept XML data in any supported encoding,
>    detecting the encoding in use from the internal label;
>    they may reject data in other encodings.
>    (See note on autodetection, below.)

... other good stuff

Your point about "if it just reads ASCII, it's not really XML" is well-taken,
but setting the bar for basic acceptance at a point which includes 8859 *and*
UTF *and* UCS is, I think, a serious infringement on our design goal #4 that
says XML shall be easy to program.  Also, including 8859 but not JIS is
disturbingly Eurocentric.

Are there grounds for compromise between minimalism and eclecticism by saying
that (a) here is a list of encodings which should be supported, (b) entities
have to self-label with leading PIs, and (c) all XML implementations *must*
be able to read UTF-8 as well as generate it?

Second, might it be clever, for UTF-8-encoded entities, to relax the requirement
that all XML entities self-label?  Not that UTF-8 is morally superior or
anything, but this would have the desirable side effect of turning a large
proportion of the SGML objects in the world into XML.
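
To make the cost concrete, here is a rough sketch in Python of what (b), (c)
and the relaxation would ask of a reader.  Everything specific in it is an
assumption for illustration: the label syntax (a PI spelled
<?XML encoding="..."?>), the function name decode_entity, and the small table
of supported encodings are all made up, since nothing in this thread fixes
them.  It also handles only ASCII-compatible encodings, where the label can be
read byte-by-byte; UTF-16 or UCS-4 would need the autodetection step first.

```python
import re

# Hypothetical label syntax: a PI at the very start of the entity.
# The thread does not define one, so this pattern is an assumption.
LABEL = re.compile(rb'^<\?XML\s+encoding\s*=\s*"([A-Za-z0-9._-]+)"\s*\?>')

# Assumed list of supported encodings, mapped to Python codec names.
SUPPORTED = {
    "UTF-8": "utf-8",
    "ISO-8859-1": "latin-1",
}


def decode_entity(data: bytes) -> str:
    """Decode one entity, honoring a leading encoding label if present."""
    match = LABEL.match(data)
    if match is None:
        # Relaxed rule: an unlabeled entity is taken to be UTF-8.
        return data.decode("utf-8")
    name = match.group(1).decode("ascii").upper()
    codec = SUPPORTED.get(name)
    if codec is None:
        # Implementations may reject encodings outside the list.
        raise ValueError(f"unsupported encoding: {name}")
    # Decode everything after the label with the declared encoding.
    return data[match.end():].decode(codec)


if __name__ == "__main__":
    print(decode_entity(b"<doc>plain old unlabeled text</doc>"))
    print(decode_entity(b'<?XML encoding="ISO-8859-1"?><doc>caf\xe9</doc>'))
```

The unlabeled branch is the whole point of the second suggestion: an existing
plain-ASCII SGML entity takes that path unchanged, because ASCII is a subset of
UTF-8.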

Cheers, Tim Bray
tbray@textuality.com http://www.textuality.com/ +1-604-488-1167