Re: A28: syntax of markup declarations?

[this message is much shorter than you think.  The last 166 lines are a
 yacc grammar]

Dr. Charles F. Goldfarb wrote:
| XML should use a proper subset of the ISO 8879 declaration syntax, for several
| reasons:

He then gave a series of reasons...

I have marked some of them as spurious, which is not a very good word.
I do not mean that I do not think they are good points.  I mean that I
think they are not germane to XML.  Charles' point 10 is the most important,
though, and I have copied it at the start.  If you read nothing else,
read that!

| 10. Our objective for XML is to increase the SGML market by making is
| easier to understand and implement. We only get this result if XML *is* SGML;
| otherwise, SGML doesn't change at all.
| If XML is a conforming profile of SGML, it can be
| the core of SGML97 -- the basic conformance level. The rest of SGML would be
| defined as a delta on the core SGML; core XML/SGML users would never have to
| read it.

(I have commented on this point in sequence below, but I have reproduced
it here in case people missed it.  I did not think that changing SGML itself
was a possible goal of a W3C working group -- what did I miss?)


| 1. The necessary subset is small, clean, and easily explained.
This is spurious, since if you didn't use that grammar, you would not have
to explain it at all.  I certainly would not accuse the SGML DTD grammar
of being clean -- you can't even put comments everywhere you can put
white space.  It is quirky, idiosyncratic, and hard to explain.

Perhaps I am not a good teacher.  It is true that with OMITTAG NO,
you get rid of the "- -", and if there is no CDATA, we don't have the
problem that CDATA means different things in different contexts.

| I have attached the grammar to this note. It has fewer than 30 productions.
| (SGML has almost 200.)

Well, that's true.  I have turned it into a YACC-style grammar that can
be read by us mortal C programmers :-), and appended that below.  Actually
I am rusty with yacc and have not tested it for reduce/reduce conflicts,
but I think it's probably OK.  I wouldn't bet my socks on it, though.

I have done this so that people can judge more easily its value; the
SGML-style form is harder to turn directly into a parser using normal
compiler-writing tools.

For my own part, though, I would far rather see something like
Tim & Michael's syntax, and would far rather implement it, given a choice.

| 2. 20,000 or so people already know the DTD language.
| That is 20,000 more than know MGML.
And well over 980,000 fewer than know HTML.
I.e. hardly anyone knows SGML by comparison.  Since it is hoped that
XML will be used by non-SGML users (no?), this is spurious.

| 3. It is the semantics of markup declarations that presents learning
| difficulties, not the syntax. The semantics will be the same in any case.
No, the syntax hinders it -- but only slightly, I agree.
There are a few gotchas -- e.g. you can't put comments in model groups,
and it's impossible to remember which keywords take a # (is it #CDATA or
CDATA, #EMPTY or EMPTY?) and when you need brackets, because it's so
inconsistent.  So it would be very nice to have something cleaner.

| 4. The same is true for implementation.  While a second syntax is a burden,
| it is a relatively small and easily automated one.
I am not sure what you mean by an easily automated implementation here.
It can be done with lex and yacc in under a day, I think, unless there is
something about it I have missed (not unlikely)!
And it will take up maybe 2 or 3 of our 20 pages.
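
To give a feel for how little code the content-model part needs, here is
a rough hand-written sketch in Python rather than lex and yacc.  It is
not taken from any existing parser: the names tokenize, Parser and
parse_model are invented for illustration, it handles only parenthesised
model groups (not EMPTY or ANY), and it assumes the SGML rule that a
single group uses only one kind of connector.

```python
import re

# Tokens of a content model: #PCDATA, names, parentheses, connectors
# and occurrence indicators.  Leading whitespace is skipped.
TOKEN = re.compile(r"\s*(#PCDATA|[A-Za-z][A-Za-z0-9.-]*|[()?*+,|&])")


def tokenize(text):
    """Split a content model into tokens; reject anything unrecognised."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected input at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected or 'a token'}, got {tok!r}")
        self.pos += 1
        return tok

    def occurrence(self):
        # Optional "?", "*" or "+"; the empty string means "exactly once".
        return self.take() if self.peek() in ("?", "*", "+") else ""

    def group(self):
        # group : "(" item (connector item)* ")" occurrence
        self.take("(")
        items, connector = [self.item()], None
        while self.peek() in (",", "|", "&"):
            tok = self.take()
            if connector is not None and tok != connector:
                raise SyntaxError("mixed connectors in one group")
            connector = tok
            items.append(self.item())
        self.take(")")
        return (connector, items, self.occurrence())

    def item(self):
        # item : group | ("#PCDATA" | name) occurrence
        if self.peek() == "(":
            return self.group()
        tok = self.take()
        if tok in {")", "?", "*", "+", ",", "|", "&"}:
            raise SyntaxError(f"expected a name, got {tok!r}")
        return ("name", tok, self.occurrence())


def parse_model(text):
    """Parse one model group and return it as nested tuples."""
    parser = Parser(tokenize(text))
    tree = parser.group()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing input: {parser.peek()!r}")
    return tree
```

Calling parse_model("(title, (para | list)+)") gives a nested tuple with
the connector, the list of items and the occurrence indicator of each
group; a malformed model such as "(a, b | c)" raises SyntaxError.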

I have simplified the grammar below, using "string" instead of system id,
public id, entity value, etc., assuming that the parser will check the
values at a higher level.
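
One such higher-level check might look like the Python sketch below.  It
assumes the ISO 8879 reference concrete syntax, in which a public
identifier is a minimum literal and so may contain only letters, digits,
space, record start and end, and the characters ' ( ) + , - . / : = ?
The function name is invented for illustration, and the check does not
look at the owner//class//language structure of a formal public
identifier.

```python
import string

# Minimum data characters, assuming the reference concrete syntax:
# letters, digits, space, RS (newline), RE (carriage return), "special".
MINIMUM_DATA = set(
    string.ascii_letters + string.digits + " \n\r" + "'()+,-./:=?"
)


def is_minimum_literal(value):
    """True if every character of value is a minimum data character."""
    return all(ch in MINIMUM_DATA for ch in value)
```

With this, a string such as "-//W3C//DTD HTML 3.2//EN" passes, while one
containing an underscore or a double quote is rejected and would have to
be treated as a system identifier or an entity value instead.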

| 5. SGML instance markup is a great language for representing structured
| information. It is a poor language for defining it. Tim's paper.dsd is three
| times the size (in lines) of the attached paper.dtd.

If it is thirty times clearer and three times larger, is it ten times better?

| 6. All SGML tools can handle markup declarations. 
All the Ada tools in the world can handle Ada declarations, but that is
not a reason to use them.  If we wanted to use only SGML tools, there would
be no XML.

The right question is whether XML files can be trivially, automatically
and accurately converted to SGML files.

| 7. There are no SGML interoperability issues because it *is* SGML.
Modulo application conventions, CAPACITY, NAMELEN, RS/RE, etc.

In fact, since HTML is SGML (see RFC1866) why don't we just use that?
HTML 3.2 with the CLASS definition is very close to SGML, if you don't want
to use "obscure" features like marked sections, entities, your own
content models, processing instructions, notation, etc.
There are people on this list (I think) who don't find HTML to be a
very elegant language.  But it is more widely deployed than SGML, and
one reason is that SGML is too complex.

| 8. There is no problem putting markup declarations in "XML masquerading as
| HTML". Declarations just look like long unknown tags.
| (HTML users may even find them familiar for that reason.)
Er, have you tried this?  You are in for some interesting surprises :-)

| 9. XML needs to be a conforming subset of SGML; otherwise it will be seen
| as a competitor to SGML whatever our good intentions to the contrary.

I don't believe this.
If it is a problem, call it SGML Lite.  0.5 :-)

| 10. Our objective for XML is to increase the SGML market by making is
| easier to understand and implement. We only get this result if XML *is* SGML;
| otherwise, SGML doesn't change at all.

| If XML is a conforming profile of SGML, it can be
| the core of SGML97 -- the basic conformance level. The rest of SGML would be
| defined as a delta on the core SGML; core XML/SGML users would never have to
| read it.

I think this is the first really strong point you've made here, but it is
a good one.  If these are your objectives, though, this task should be
done by WG8 and *not* by the W3C.  The W3C is not a standards body, and
can only make recommendations to standards bodies such as the IETF or ISO.

If you want to change SGML, this is not, and cannot be, the forum.
I _do_ agree that your point number 10 is a good one.


DTD		: declarations

declarations	: declaration declarations
		| /* empty (this allows an empty DTD) */

declaration	: elemtype-decl
		| attlist-decl
		| entity-decl
		| notation-decl

elemtype-decl	: "<!ELEMENT" gi modeltype ">"

modeltype	: "EMPTY"
		| "ANY"
		| "(" expression ")" numop_opt

expression	: factor
		| factor binop expression

factor		: term numop_opt
		| "(" expression ")" numop_opt

numop_opt	: numop
		| /* empty */

numop		: "?" | "*" | "+"

binop		: "," | "|" | "&"

term		: "#PCDATA" | gi

attlist-decl	: "<!ATTLIST" gi attdefs ">"

attdefs		: attdef more_attdefs

more_attdefs	: attdef more_attdefs
		| /* empty */

attdef		: gi attvalue attdefault

		/* note: gi is used for any name; the client may want
		 * to keep separate namespaces, but they are not
		 * distinguished syntactically
		 */

attvalue	: "CDATA"
		| tokenised
		| enumlist
		| notations

tokenised	: "ID"
		| "IDREF"
		| "IDREFS"
		| "ENTITY"
		| "NAME"
		| "NAMES"
		| "NUMBER"
		/* do we really need all (any) of those for xml?
		 * why not just have cdata plus a regular expression
		 * to match an attribute value?  Easier to implement,
		 * simpler, more powerful and more general!
		 */

enumlist	: "(" token tokenlist ")"

token		: gi
		| whole_number
		/* is that right? */

tokenlist	: token tokenlist
		| /* empty */

notations	: "NOTATION" "(" gi morenots_opt ")"

morenots_opt	: "|" gi morenots_opt
		| /* empty */

attdefault	: "#REQUIRED"
		| "#IMPLIED"
		| "#FIXED" string
		| string /* irregular: a default value has no keyword */

entity-decl	: "<!ENTITY" entity_spec ">"

entity_spec	: gi entity_value
		| "%"  gi entity_value
		/* Note: this allows a parameter entity (%) to have NDATA;
		 * what would that mean?
		 */

entity_value	: external_spec ndata_entity_opt
		| string

external_spec	: "PUBLIC" string string_opt
		| "SYSTEM" string

string_opt	: string
		| /* empty */

ndata_entity_opt: "NDATA" gi /* data attributes go here? */
		| /* empty */

notation-decl	: "<!NOTATION" gi external_spec ">"

/* Notes:
 * I have assumed that the tokeniser eats unquoted whitespace, and that
 * comments are reduced to whitespace and then eaten.
 * Hence, both comments and whitespace separate tokens.
 * A comment is only recognised between the <! and > of a declaration;
 * comments should really be allowed anywhere, but that would not be
 * compatible with SGML.
 * The tokeniser should also recognise entity references and replace them.
 * The syntax is such that special care needs to be taken with
 * <!ENTITY % xxx "hello">
 * because the whitespace is significant after the % sign -- if there
 * were no space, it would be an entity reference.  This relies on the
 * SGML notion that % and & are only special when followed by a name
 * start character!
 * I have not distinguished between a string, a formal public identifier,
 * a system identifier, a new-fangled-non-sgml(yet)-formal-system-id,
 * and an entity value.  It would make the grammar easier to use if I did,
 * though.
 */
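
The tokeniser assumptions in the notes above can be sketched in a few
lines of Python.  This is only an illustration, not a tested SGML
scanner: the function tokenize and its entities dictionary are
hypothetical, it is handed one declaration at a time (so comments are
only ever seen between the <! and the >), and it has no guard against an
entity that refers to itself.

```python
import re

TOKEN = re.compile(r"""
    (?P<comment> --.*?-- )                    # comment, treated as whitespace
  | (?P<space>   \s+ )
  | (?P<string>  "[^"]*" | '[^']*' )          # quoted literal, kept whole
  | (?P<peref>   %[A-Za-z][A-Za-z0-9.-]*;? )  # % directly before a name start
  | (?P<open>    <![A-Za-z]+ )                # declaration opener
  | (?P<name>    [#]?[A-Za-z0-9][A-Za-z0-9.-]* )  # name, keyword or number
  | (?P<punct>   [>()|,&?*+%] )               # includes a free-standing %
""", re.VERBOSE | re.DOTALL)


def tokenize(decl, entities=None):
    """Return (kind, text) tokens for one markup declaration."""
    entities = entities or {}
    tokens, pos = [], 0
    while pos < len(decl):
        m = TOKEN.match(decl, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {decl[pos]!r} at {pos}")
        pos = m.end()
        kind, text = m.lastgroup, m.group()
        if kind in ("comment", "space"):
            continue  # both merely separate tokens
        if kind == "peref":
            # Replace the reference by the tokens of its replacement text.
            name = text[1:].rstrip(";")
            if name not in entities:
                raise KeyError(f"undeclared parameter entity {name!r}")
            tokens.extend(tokenize(entities[name], entities))
        elif kind == "string":
            tokens.append((kind, text[1:-1]))  # drop the quotes
        else:
            tokens.append((kind, text))
    return tokens
```

With the space after the percent sign, the declaration of xxx yields a
free-standing "%" token followed by the name; without the space, the
same input is read as a reference to a parameter entity xxx and fails
here because no such entity has been declared.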