Re: Concrete syntax, character sets



[Responding to Tim Bray:]
 
| At 07:19 PM 9/9/96 +0100, Martin Bryan wrote:
| 
| > For example, before choosing whether or
| >not to retain the distinction between abstract and reference concrete
| >syntaxes
| 
| If you want such a distinction, use real SGML.  XML should use a hardwired 
| concrete syntax.

I totally agree.

| >- HTML has extended the Quantities defined in the reference concrete syntax:
| >should XML offer less flexibility than HTML?
| >- the SGML community has already agreed on a new set of Quantity defaults
| >for the next version of SGML: should XML offer less flexibility than the
| >next version of SGML?
| 
| XML should have *no* concept of quantities.  Names, nesting depths, whatever,
| can be as large as required to meet the requirements of the application.
| One straightforward way to do this and preserve compatibility
| with SGML is to require an XML processor to have the capability of writing
| an appropriate SGML declaration to set the quantities high enough to make
| a particular XML DTD valid.

I agree with Tim's intent here, but aren't some quantities related to
the particular document instance?  (Correct me if I'm wrong, but I
seem to recall cases where declarations worked fine for me until a
specific instance blew them up.)
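
To make the instance-dependence concrete, here is a rough Python sketch.
It assumes the reference quantity set value TAGLVL = 24 (the limit on
simultaneously open elements) and uses a deliberately naive tag scanner
that ignores comments, marked sections, tag omission, and ">" inside
attribute values.  The function names are hypothetical, chosen only for
illustration.

```python
import re

# Assumed limit: TAGLVL in the reference quantity set (open elements).
REFERENCE_TAGLVL = 24

# Naive matcher for start and end tags; group 1 is "/" for an end tag.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.\-]*)[^>]*>")


def max_open_elements(instance_text):
    """Return the deepest element nesting found in one document instance."""
    depth = 0
    deepest = 0
    for match in TAG.finditer(instance_text):
        if match.group(1) == "/":
            depth -= 1
        else:
            depth += 1
            deepest = max(deepest, depth)
    return deepest


def exceeds_taglvl(instance_text, limit=REFERENCE_TAGLVL):
    """True if this particular instance nests deeper than the limit."""
    return max_open_elements(instance_text) > limit
```

With a recursive content model (a list inside an item inside a list, and
so on), the same DTD accepts a shallow instance that stays within the
limit and a deeply nested one that does not, so a declaration written
from the DTD alone cannot know how high to set this quantity.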

| >- the reference concrete syntax only permits the use of Latin alphanumeric
| >characters in names of elements, attributes and tokens...
| >- only Arabic numerals are recognised in the 1986 version of 8879...
| 
| If you want to use anything but 7-bit ASCII in markup, use real SGML.
| XML should have the reference concrete syntax hardwired in.

Having just gone through a big struggle in WG8 and X3V1 over the ERCS
proposal, I would feel pretty strange about limiting markup to
something that not even Western Europeans could use the way they want
to.  I would like to see some serious discussion of this point.

| >- the default character set in 8879 matches that of the reference concrete
| >syntax: should users be able to select which character set is most
| >appropriate for their documents and specify an SGML declaration in which
| >only a subset of ISO  10646 is recognized as valid while still retaining the
| >reference concrete syntax for markup?
| 
| *Good* point... with modern parsing and encoding technology, it seems like
| it would be easy, and it would certainly be desirable, for XML 
| data not to be limited to small old character sets.  On the other hand, with
| XML, ultimate flexibility is of less importance than ease of implementation;
| would it be thinkable to say that "all XML data is always in UTF8"?  It 
| seems this would break almost nothing and allow almost anything you'd want 
| to do.

It's certainly thinkable to me.  Is it thinkable to say that "all
markup is in UTF8" as well?
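
As a small illustration of why this looks feasible, the Python sketch
below encodes a tag whose name contains non-ASCII characters.  The
element name and the helper names are hypothetical.  It relies on one
property of UTF-8: every byte of a multi-byte character is 0x80 or
above, so the ASCII delimiters "<" and ">" can still be found by a
plain byte scan.

```python
def encode_tag(name):
    """Encode a start tag for `name` as UTF-8 bytes."""
    return ("<" + name + ">").encode("utf-8")


def delimiter_positions(data):
    """Byte offsets of '<' (0x3C) and '>' (0x3E) found by a byte scan."""
    return [i for i, b in enumerate(data) if b in (0x3C, 0x3E)]


# Hypothetical element name with two accented letters (e-acute).
tag = encode_tag("r\u00e9sum\u00e9")

# Only the first and last bytes are delimiters; the bytes of the
# accented letters never collide with ASCII markup characters.
positions = delimiter_positions(tag)
```

A pure-ASCII name encodes to the same bytes it has today, so existing
ASCII markup would be unaffected by such a rule.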

Jon
