Re: character sets - a summary and a proposal
At 12:22 PM 9/15/96 CDT, Michael Sperberg-McQueen wrote:
[after an *excellent* summary-of-the-position]
>Here's yet another proposal.
>6. Limited Modified Eclecticism: compromise between Eclectic
>Compromise and 100 Flowers:
> - XML data streams may be in any of a number of supported encodings:
> UTF-8, UTF-16, UCS-4, ISO 8859
> - XML data streams must label themselves as to which supported
> encoding they are using, by means of a PI which must be the first
> data in each XML entity.
> - all XML systems must accept XML data in any supported encoding,
> detecting the encoding in use from the internal label;
> they may reject data in other encodings.
> (See note on autodetection, below.)
... other good stuff
Your point about "if it just reads ASCII, it's not really XML" is well-taken;
but setting the bar at a point which includes 8859 *and* UTF *and* UCS for
basic acceptance is, I think, a serious infringement on our design goal #4 that
says XML shall be easy to program. Also, including 8859 but not JIS is
Are there grounds for compromise between minimalism and eclecticism by saying
that (a) here is a list of encodings which should be supported, (b) entities
have to self-label with leading PIs, and (c) all XML implementations *must*
be able to read UTF8 as well as generate it?
Second, might it be clever, for UTF8-encoded entities, to relax the requirement
that all XML entities self-label? Not that UTF8 is morally superior or
anything, but this would have the desirable side-effect of turning a large
proportion of the SGML objects in the world into XML.
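To make the two suggestions concrete, here is a rough sketch of what a reader
might do: look for a leading label, and fall back to UTF8 when there is none.
The label syntax (`<?XML encoding="..."?>`), the function name `decode_entity`,
and the contents of the `SUPPORTED` table are all assumptions made up for
illustration; nothing in this thread fixes them. The sketch also only handles
labels that can be read as ASCII bytes, so it says nothing about UTF-16 or
UCS-4, which would need the autodetection step mentioned in the quoted note.

```python
import re

# Hypothetical label syntax, assumed for illustration only; the thread
# does not fix the exact form of the PI.
LABEL = re.compile(rb'^<\?XML\s+encoding="([A-Za-z0-9._-]+)"\s*\?>')

# Assumed list of supported encodings: upper-cased label -> Python codec name.
SUPPORTED = {
    "UTF-8": "utf-8",
    "ISO-8859-1": "iso-8859-1",
}


def decode_entity(data: bytes) -> str:
    """Decode an entity, using its leading label if it has one.

    An entity with no label is treated as UTF-8, which is the relaxation
    suggested in the paragraph above.
    """
    match = LABEL.match(data)
    if match is None:
        # No label at the very start of the entity: assume UTF-8.
        return data.decode("utf-8")

    name = match.group(1).decode("ascii").upper()
    codec = SUPPORTED.get(name)
    if codec is None:
        # A system may reject data in encodings it does not support.
        raise ValueError(f"unsupported encoding label: {name}")

    # Decode everything after the label with the declared encoding.
    return data[match.end():].decode(codec)
```

With this sketch, an unlabelled UTF8 (and therefore any plain ASCII) entity
decodes with no extra ceremony, while an 8859 entity has to announce itself
first.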
Cheers, Tim Bray
email@example.com http://www.textuality.com/ +1-604-488-1167