Concrete syntax, character sets

Before addressing Martin's questions, a couple of meta-remarks:

- I split the response into two - one on the meta/process issues, this
  on the substantive technical issues - and gave this one a different title.
  Since we're apt to be discussing different things in parallel here, we
  should all make an effort to retitle threads this way.
- This is pretty well the first time these issues have been raised... the
  following represents my current opinion, highly susceptible to modification 
  by opinions & experiences expressed in this group.

At 07:19 PM 9/9/96 +0100, Martin Bryan wrote:

> For example, before choosing whether or
>not to retain the distinction between abstract and reference concrete

If you want such a distinction, use real SGML.  XML should use a hardwired 
concrete syntax.

>- HTML has extended the Quantities defined in the reference concrete syntax:
>should XML offer less flexibility than HTML?
>- the SGML community has already agreed on a new set of Quantity defaults
>for the next version of SGML: should XML offer less flexibility than the
>next version of SGML?

XML should have *no* concept of quantities.  Names, nesting depths, whatever,
can be as large as the application requires.
One straightforward way to do this and preserve compatibility
with SGML is to require an XML processor to be able to write
an appropriate SGML declaration that sets the quantities high enough to make
a particular XML DTD valid.
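
A rough sketch of what that processor step might look like, in Python.
Everything here is illustrative: the function names are made up, only two
quantities (NAMELEN and TAGLVL) are handled, and the reference values of 8
and 24 are quoted from memory and should be checked against ISO 8879.

```python
# Reference concrete syntax values for two quantities (assumed; verify
# against ISO 8879 before relying on them).
REFERENCE_QUANTITIES = {"NAMELEN": 8, "TAGLVL": 24}


def required_quantities(names, max_depth):
    """Return the quantities a given document type actually needs."""
    longest_name = max((len(name) for name in names), default=0)
    return {"NAMELEN": longest_name, "TAGLVL": max_depth}


def quantity_clause(names, max_depth):
    """Build a QUANTITY clause that raises only the quantities that are too small."""
    needed = required_quantities(names, max_depth)
    overrides = [
        f"{key} {value}"
        for key, value in sorted(needed.items())
        if value > REFERENCE_QUANTITIES[key]
    ]
    return " ".join(["QUANTITY SGMLREF"] + overrides)


# Hypothetical DTD with one long element name and deep nesting.
print(quantity_clause(["para", "bibliographic-entry"], max_depth=30))
```

The point of the design is that the XML author never sees a limit; the
limits only get written down when an SGML system needs to be told about them.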

>- the reference concrete syntax only permits the use of Latin alphanumeric
>characters in names of elements, attributes and tokens...
>- only Arabic numerals are recognised in the 1986 version of 8879...

If you want to use anything but 7-bit ASCII in markup, use real SGML.
XML should have the reference concrete syntax hardwired in.
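
To make the restriction concrete, here is a small Python sketch of a name
check under that rule.  It assumes the reference concrete syntax allows an
ASCII letter first, then ASCII letters, digits, "." and "-"; the function
name is hypothetical, and no length limit is applied, in keeping with the
no-quantities position above.

```python
import re

# Assumed name rule: ASCII letter, then ASCII letters, digits, "." or "-".
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9.\-]*")


def is_reference_name(name: str) -> bool:
    """Return True if the name uses only the assumed 7-bit ASCII name characters."""
    return _NAME_PATTERN.fullmatch(name) is not None


for candidate in ["chapter-1", "1chapter", "naïve"]:
    print(candidate, is_reference_name(candidate))
```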

>- the default character set in 8879 matches that of the reference concrete
>syntax: should users be able to select which character set is most
>appropriate for their documents and specify an SGML declaration in which
>only a subset of ISO 10646 is recognized as valid while still retaining the
>reference concrete syntax for markup?

*Good* point... with modern parsing and encoding technology, it seems like
it would be easy, and it would certainly be desirable, for XML 
data not to be limited to small old character sets.  On the other hand, with
XML, ultimate flexibility is of less importance than ease of implementation;
would it be thinkable to say that "all XML data is always in UTF-8"?  It
seems this would break almost nothing and allow almost anything you'd want 
to do.
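
A quick Python illustration of why this looks cheap.  It is only a sketch,
and the sample strings and the helper name are made up: 7-bit ASCII text is
byte-for-byte identical in UTF-8, so existing ASCII documents need no
conversion, while text from other scripts becomes representable.  The
"almost" presumably covers data in 8-bit legacy encodings, which would need
transcoding.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A 7-bit ASCII document is byte-for-byte the same in UTF-8.
ascii_markup = "<title>Concrete syntax</title>"
same_bytes = ascii_markup.encode("ascii") == ascii_markup.encode("utf-8")

# Text outside ASCII is representable too.
greek_ok = is_valid_utf8("<title>σύνταξη</title>".encode("utf-8"))

# An 8-bit legacy encoding such as ISO 8859-1 is generally not valid UTF-8.
latin1_ok = is_valid_utf8("café".encode("latin-1"))

print(same_bytes, greek_ok, latin1_ok)
```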

Cheers, Tim Bray
tbray@textuality.com http://www.textuality.com/ +1-604-488-1167