character sets

Martin Bryan writes:
>- the reference concrete syntax only permits the use of Latin
>alphanumeric characters in names of elements, attributes and tokens:
>should XML be designed to allow users to define elements, attributes
>and their values in a form that is dependent on their local language,
>or must they restrict themselves to shared names that have meanings
>defined in English only?

Here's my main concern with proposals to restrict XML to specific
coded character sets (other than Unicode or UCS-4):  we have a good
chance here to provide a strong basis for internationalization, and
requiring ISO 646, or any particular flavor of ISO 8859, or even
*all* of the flavors of 8859, is not as good as defining XML
from the ground up as language-neutral.

We should take good care that XML is compatible with the proposals for
internationalization (or, as they say in the trade, i18n) already
formulated by the relevant W3C working group.  Those are good
proposals, and XML should harmonize with them.  (It would be nice to
have a list of what that might entail, though; any volunteers?)

>- the default character set in 8879 matches that of the reference
>concrete syntax: should users be able to select which character set
>is most appropriate for their documents and specify an SGML
>declaration in which only a subset of ISO 10646 is recognized as
>valid while still retaining the reference concrete syntax for markup?

This does not seem, offhand, to help much in keeping XML simple to
understand and implement.  Or am I too pessimistic?

-C. M. Sperberg-McQueen
 ACH / ACL / ALLC Text Encoding Initiative
 University of Illinois at Chicago

All opinions expressed in this note (except those I have quoted from
other authors) are mine.  They are not necessarily those of the Text
Encoding Initiative, its executive committee or other participants, its
sponsors, or its funders.  Anyone who says otherwise is wrong.