ERB discussions and decisions from Michael Sperberg-McQueen on 1996-11-14 (w3c-sgml-wg@w3.org from November 1996)

From: Michael Sperberg-McQueen <U35395@UICVM.UIC.EDU>
Date: Thu, 14 Nov 96 15:55:55 CST
To: W3C SGML Working Group <w3c-sgml-wg@w3.org>
Message-Id: <199611150033.TAA28839@www10.w3.org>

The ERB met yesterday, 13 November 1996, to discuss the XML working
draft and approve the distribution of the current text at SGML '96
next week.  We considered a number of topics arising from the draft,
some of which have already been discussed, or are still being
discussed, on this list, and others of which have not received much
discussion.  Present:  Bosak, Bray (intermittently), Clark, DeRose,
Kimber, Maler, Magliery, Paoli, Sharpe, and Sperberg-McQueen.
Absent:  Hollander.

The author's apologies to busy members of the WG who would prefer a
shorter account of the decisions; recent claims on the WG list that
the ERB does not explain or discuss its decisions with the rest of
the WG have led me, perhaps mischievously, to provide as full a
discussion and explanation as my fingers can handle.

There's an executive summary at the end.


Given the number of major topics on which the WG appears not to have
reached consensus and the volume of comment lately, it seems safe to
say that some issues will require ongoing consideration and
discussion, and the text of the working draft which we can distribute
next week will be subject to change in non-trivial ways before we can
leave this phase of the project's work behind.  We considered
dropping the plan to distribute printed copies at SGML '96, in order
not to give a false impression of completeness.  On the whole,
however, the ERB thought that having printed copies available would
be worthwhile, and we decided to go ahead with the plan.  The cover
page will, like the current Web copies, identify the document as a
Working Draft, so the fact that it's not completely stable should be
visible to any reader.  And as the experience of the ERB and WG
shows, having something that appears completed is one of the best
ways to get people to read a draft and comment on it.

Since Henry Thompson raised the question directly:  no, it's not too
late for comments on substantive issues.  The document is a Working
Draft, and when the ERB stops work on it and moves to the next phase,
it will still be a Working Draft until the W3C advances it to Draft
Recommendation status, using the normal W3C procedures.  There is
some sentiment for avoiding the kind of violent swings in philosophy
and technical direction that characterize some working drafts in some
organizations, but in principle and in practice, working drafts are
subject to change, and discussion about what changes to make is
always appropriate unless the rules of the WG make it out of order
(e.g. while we focus on some specific issue).

In the meantime, it is too late for typographic corrections to be
included in the version distributed at SGML '96.

 * Jon Bosak suggested we reconsider the decision to include all the
Cougar entities as predefined entities in XML; examining the list
with more care while preparing it for inclusion in the spec, he had
noticed a number of inconsistencies and infelicities -- especially
the fact that the entity names for some Greek characters are taken
from the ISOgrk1 set and others from the ISOgrk3 set.  In discussion
with Jon, Anders Berglund had also pointed out some other problems
with the entity set.  In view of the negative reaction from some
members of the WG, it also appeared to some ERB members that the
inclusion of these entities, intended as a convenience for users,
would strike some users as the reverse.

Agreed (Paoli abstaining) to remove the Cougar entities from the
spec.

 * Agreed to keep the five entities lt, gt, amp, quot, and apos, for
use in escaping markup delimiters, and to retain their current status
as non-redeclarable.  Rationale:  we agreed long ago that entity
reference was the preferred method of escaping markup characters, and
it's clearly better if tools generating XML can rely on the standard
entity names.  To the lasting disappointment of the humorists among
us, the name squot was dropped in favor of apos, which occurs in 8879
Annex D, set ISOnum, as do the others.
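
As an illustration of the escaping convention described above, here
is a minimal sketch in present-day Python.  It assumes the standard
library modules xml.sax.saxutils and xml.etree.ElementTree, which
implement XML 1.0 as later published rather than this working draft;
the helper name escape_markup is invented for the example.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def escape_markup(text):
    """Escape the five markup delimiters using the standard entity names.

    escape() handles &, < and > itself; the mapping adds quot and apos.
    (Helper name is hypothetical, chosen for this sketch.)
    """
    return escape(text, {'"': "&quot;", "'": "&apos;"})


raw = "AT&T says 1 < 2 > 0, \"yes\" or 'no'"
escaped = escape_markup(raw)

# A generating tool can emit the escaped text without declaring anything,
# because a conforming parser already knows the five names.
document = "<p>" + escaped + "</p>"
round_trip = ET.fromstring(document).text
```

Because the names are fixed and cannot be redeclared, the generating
side and the parsing side need no shared DTD for the round trip to
recover the original string.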

 * Discussed, once more, question C.10 (allow or prohibit
non-deterministic content models).  In a recent meeting, there had
been wide support for reconsidering the decision (reported 6
November) to prohibit such models, for the sake of compatibility
(thus ensuring that SGML-based processors can handle all XML
documents). In particular, it was pointed out that many existing SGML
systems have no trouble at all with non-deterministic models, and
argued that the restriction to determinism is poorly motivated, since
it does not in fact provide serious benefits to implementors (this is
a disputed point) and (pace Charles Goldfarb) has at best a neutral
effect on legibility of content models by end users.  It was
generally thought that the spec would be cleaner without the
restriction.

In this meeting, this discussion was continued.  Some ERB members not
present at the earlier discussion argued against lifting the
prohibition, on the grounds that (a) WG 8 has not agreed to change
this rule in the revision of SGML, and there is no reason to think
such a change likely, (b) there are some widely used SGML systems
(more than one or two) which rely crucially on determinism in the
content model, and (c) some nondeterministic content models have no
deterministic equivalent, so the idea of providing an algorithm for
making all content models deterministic is not feasible.  Determinism
is not particularly important to XML, but in Full SGML, the AND
connector and the definition of start-tag omission interact with it
and make it far more important.  A minority asked what AND has to do
with it, and suggested that all cases of start-tag omission allowed
by the current rules would also be possible with nondeterministic
rules, as experimentation with any LALR(1) parser generator should
show.  We never did clarify how AND makes determinism more important,
and on the other hand we never did hear anything like a proof that
LR(1) parsing could handle all cases of start-tag omission, let alone
-- the really hard part -- an argument showing that LALR(1) parsing
can be documented suitably in ISO-type language.  A digression into
practical linguistics and stylistic criticism loomed, full of anecdotes
about standard-speak, but was luckily averted.

Decision:  retain the prohibition.  This was not unanimous, but my
notes don't record the vote, so I don't recall who besides Tim Bray
was in dissent.
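
For readers who want a concrete picture of what the prohibition
rules out, the following is a rough sketch in Python, not a statement
of how any SGML or XML processor implements the check.  It uses the
usual position-based (first/follow) reading of determinism: a model
is rejected if, at some point, the next element name could match two
different tokens of the model.  The tuple encoding of content models
and the function names are assumptions made for the example, and the
sketch ignores the AND connector and start-tag omission entirely.

```python
from itertools import count


def analyse(node, follow, ids):
    """Return (nullable, first, last) for a model node.

    Nodes are tuples: ("name", x), ("seq", ...), ("alt", ...),
    ("opt", n), ("star", n), ("plus", n).  Each name occurrence becomes
    a position (id, name); follow[pos] collects the positions that may
    come next after it.
    """
    kind = node[0]
    if kind == "name":
        pos = (next(ids), node[1])
        follow[pos] = set()
        return False, {pos}, {pos}
    if kind == "seq":
        nullable, first, last = True, set(), set()
        for child in node[1:]:
            c_null, c_first, c_last = analyse(child, follow, ids)
            for p in last:                 # prefix may be followed by child
                follow[p] |= c_first
            if nullable:                   # everything so far can be empty
                first |= c_first
            last = c_last | (last if c_null else set())
            nullable = nullable and c_null
        return nullable, first, last
    if kind == "alt":
        nullable, first, last = False, set(), set()
        for child in node[1:]:
            c_null, c_first, c_last = analyse(child, follow, ids)
            nullable = nullable or c_null
            first |= c_first
            last |= c_last
        return nullable, first, last
    # "opt", "star", "plus" wrap a single child
    c_null, c_first, c_last = analyse(node[1], follow, ids)
    if kind in ("star", "plus"):           # repetition loops back
        for p in c_last:
            follow[p] |= c_first
    return (c_null or kind != "plus"), c_first, c_last


def is_deterministic(model):
    """True if no choice point offers the same name at two positions."""
    follow = {}
    _, first, _ = analyse(model, follow, count())
    for candidates in [first, *follow.values()]:
        names = [name for _, name in candidates]
        if len(names) != len(set(names)):
            return False
    return True


# ((a, b) | (a, c)): on seeing <a> the parser cannot tell which a it is.
ambiguous = ("alt",
             ("seq", ("name", "a"), ("name", "b")),
             ("seq", ("name", "a"), ("name", "c")))
# (a, (b | c)): same documents accepted, but no such doubt.
factored = ("seq", ("name", "a"),
            ("alt", ("name", "b"), ("name", "c")))
```

Here is_deterministic(ambiguous) is False and
is_deterministic(factored) is True.  The sketch only tests a given
model; it says nothing about whether a deterministic rewrite exists,
which is the subject of point (c) above.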

 * We reconsidered also the decision on question C.14 (reported 6
Nov) to drop SGML's prohibition on overlapping sets of name tokens in
enumerated attribute types.  The SGML revision group is on record as
favoring this change, it seems to be agreed that there is no
technical reason for the prohibition, and dropping it gives users a
much improved tool.  The discussion in the WG, however, seems to
suggest that some portions of the community will be extremely, perhaps
excessively, alarmed if XML anticipates the SGML revision in this
regard.

We considered three options:  standing by the decision reported on 6
November (clean this area up); reversing it (follow ISO 8879:1986,
even though on this point the rule is uniformly thought to be a
design error); and dropping enumerated data types for attributes,
placing them on the list of constructs to be added in a later
revision.  There was almost no
support for dropping enumerated types:  they are extremely useful
both for validation and for documenting the expected range of values,
and they make it possible for authoring systems to provide much
better support for attribute value specification.  The other two
possibilities were very closely matched, but after long discussion
the majority view came to be that the symbolic importance of
guaranteeing that all valid XML documents are valid SGML documents
outweighed the technical arguments.  The SGML community has over time
become accustomed, or at least resigned, to this rule; those members
of the HTML community who care at all about standardization and DTDs
are at least aware of this restriction already, so that its inclusion
in XML will not come as a total shock to them.  The base political
observation that they'll blame WG8, not us, may have occurred to some
minds besides my own, but it was not spoken.  It may not be true, in
any case.

Reversed the decision announced 6 Nov:  XML 1.0 now prohibits overlap
among enumerated attribute types declared in the same attribute-list
declaration.  Dissenting:  Bray, Hollander, Magliery.
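
To make the restored rule concrete:  under it, an attribute list that
declares both compact (yes|no) and ordered (yes|no) for the same
element is not allowed, because the tokens yes and no each occur in
two enumerations.  A small Python sketch of such a check follows; the
attribute names, the dictionary representation of an attribute-list
declaration, and the function name are all invented for illustration.

```python
from collections import defaultdict


def overlapping_tokens(attlist):
    """Map each name token to the attributes that enumerate it,
    keeping only tokens claimed by more than one attribute.

    attlist: dict of attribute name -> iterable of enumerated tokens
    (a stand-in for one attribute-list declaration).
    """
    owners = defaultdict(list)
    for attr, tokens in attlist.items():
        for token in set(tokens):          # ignore repeats within one list
            owners[token].append(attr)
    return {tok: sorted(attrs) for tok, attrs in owners.items()
            if len(attrs) > 1}


# Hypothetical declaration:  compact (yes|no)  ordered (yes|no)
rejected = {"compact": ["yes", "no"], "ordered": ["yes", "no"]}
# Workaround under the rule: make the token sets disjoint.
accepted = {"compact": ["compact", "loose"],
            "ordered": ["ordered", "unordered"]}

clashes = overlapping_tokens(rejected)
no_clashes = overlapping_tokens(accepted)
```

An empty result means the declaration satisfies the prohibition; a
non-empty one names the offending tokens and the attributes that
share them.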

 * We also reconsidered, for the second time, our decision on the
form of EMPTY elements.  Initially, we had agreed (deciding question
B.10 on 30 October) to allow both the form <e/> and the form <e>,
restricting the latter to cases where the element was explicitly
declared EMPTY but allowing <e/> whether a declaration was present or
not.  This was felt unsatisfactory by some members of the WG and ERB
because it requires all parsers to read and parse the DTD, even if
all they want is to detect element boundaries correctly in a single
entity.  It also leads to some unhappy choices between requiring
parsers to fetch external DTD entities *before even starting to parse
the document* or requiring users to include all declarations of EMPTY
elements in the internal subset, which could create maintenance
headaches of epic proportions.  As a result, we revisited the
question in early November (in the course of discussing question D.2,
reported 6 November) and sought ways to allow the <e> form without
effectively requiring all XML processors to read all of the DTD, all
of the time, before parsing.  As reported on 6 November, we agreed to
restrict the use of the <e> form to a set list of known EMPTY
elements, in order to allow users to create, if need be, documents
which are simultaneously processable HTML and valid XML.  There was
some concern about singling out a single DTD for special treatment,
but the importance of HTML on the Web (in the context of Documents on
the Web, *are* there other DTDs?) and the fact that SGML users
already have the same ability we were trying to give HTML users (i.e.
the ability to create valid XML documents processable with their
existing systems) outweighed those concerns.

The ad hoc nature of the decision, however, continued to bother
several people (as well as several members of the WG), so we took it
up again.  Paul Prescod suggested on 8 November that HTML's EMPTY
elements could be handled by defining them (in an XML-friendly
version of the HTML DTD) as empty but not EMPTY, allowing XML-aware
users to write <br></br>, etc., which (a) is legal XML and (b) can be
processed by existing HTML browsers.  Some ERB members felt some
misgivings about tagging EMPTY elements this way, but (a) it works,
(b) it can't be all that bad, since it has been repeatedly suggested
as an enhancement for SGML itself, (c) it works, (d) it's much less
bad than the hard-coded list of elements, and (e) it works.

Agreed (Bray and Sharpe dissenting) to remove the paragraph
requiring XML processors to handle <e>-style tags for HTML 3.2's
EMPTY elements, if they detected that they were handling HTML, and
replace it with a paragraph explaining the Prescod technique of
making valid XML documents which can work with HTML browsers not
fitted out with <e/> handling.
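
The effect of this decision can be seen with any current XML parser.
The sketch below uses Python's xml.etree.ElementTree, which follows
XML 1.0 as later published and so is only an approximation of the
draft under discussion; the helper names and sample strings are
invented for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if the parser accepts the text without consulting a DTD."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def shape(text):
    """List (tag, text, tail) for each child of the root element."""
    root = ET.fromstring(text)
    return [(child.tag, child.text, child.tail) for child in root]


short_form = "<p>line one<br/>line two</p>"        # the <e/> form
prescod_form = "<p>line one<br></br>line two</p>"  # empty but not EMPTY
bare_form = "<p>line one<br>line two</p>"          # the <e> form

same_structure = shape(short_form) == shape(prescod_form)
bare_accepted = is_well_formed(bare_form)
```

The <e/> form and the start-tag/end-tag pair yield the same parsed
structure, while the bare <br> is rejected because nothing tells the
parser where the element ends.  That is the practical consequence of
dropping the hard-coded list:  element boundaries can be found
without reading the DTD.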

 * Agreed (by general assent) to add a required version declaration
in the form of an XML PI at the beginning of the document.  The
version information is to be required, the character-set information
optional.  In the ensuing editorial work, the version, charset, and
RMD information were all merged into a single 'XML declaration'.

 * Agreed (by general assent) to allow processing instructions in the
DTD.  (This may have been decided already, but it wasn't in the
grammar and we voted rather than look it up.  The deadline was
looming pretty high at this point.)

Other items remaining undiscussed and undecided were implicitly
declared editorial questions for purposes of getting copy to the
printer in time for SGML '96 distribution.  The editors resisted the
temptation to seize this opportunity to restore the DSD syntax for
markup declarations.

The spec has now gone to the printer; the editors would like to thank
those members of the WG who sent us corrections and pointed out
errors.  It'll be a materially more complete, correct, and less
confusing document thanks to your efforts.

- C. M. Sperberg-McQueen


Summary:

 * Removed (Paoli abstaining) the Cougar entities from XML 1.0.
 * Retained lt, gt, amp, quot, apos as non-redeclarable entities.
 * Retained ban on nondeterministic content models.
 * Prohibited (Bray, Hollander, Magliery dissenting) overlap among
enumerated types in XML 1.0.
 * Dropped special handling of HTML EMPTY elements, added paragraph
explaining Prescod method of making HTML valid XML (Bray, Sharpe
dissenting).
 * Added version declaration.
 * Allowed processing instructions in DTD.

Received on Thursday, 14 November 1996 19:33:25 UTC