Comments: through clause 3 from Christopher R. Maden on 1996-11-13 (w3c-sgml-wg@w3.org from November 1996)

From: Christopher R. Maden <crm@ebt.com>
Date: Wed, 13 Nov 1996 00:06:41 GMT
To: w3c-sgml-wg@w3.org
Message-Id: <199611130006.AAA29146@phaser.EBT.COM>

Here's what I've found so far.  Nearly all of this is different from
comments sent last night; what's not is explicitly marked "Carried
from 0.01."  I'd appreciate some reaction, even if it's "that was too
long to bother reading".  I feel my last comments went into a void.

As noted previously, all productions checked thus far (through [60])
have no undefined references.

Comments updated for 0.02 of 10 November, PostScript from SunSite.

Clause 1.2, reference 2:
   ISO 10646

Isn't it ISO/IEC 10646?  I can't remember.

Clause 1.3, first example:
   symbol ::= expression

All the productions use ':=', not '::='.

Clause 1.5, production [1]:
   [1] S := (#x0020 | #x000a | #x000d | #x0009)+

This, as has been pointed out, does not allow non-Latin spaces.  I
don't mind the omission of typographic spaces, but CJK spaces (e.g.,
zenkaku) should be included.  At a minimum, this needs inclusion in
the known limitations of the draft until it's been fixed from the
Unicode tables.
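
To make the gap concrete, here is a minimal Python sketch, assuming
production [1] is transcribed literally as a regular-expression
character class.  The names S_DRAFT and is_draft_space are
illustrative only and do not come from the draft.  U+3000
(IDEOGRAPHIC SPACE) is the full-width zenkaku space.

```python
import re

# Production [1] transcribed literally:
#   S := (#x0020 | #x000a | #x000d | #x0009)+
S_DRAFT = re.compile(r"[\u0020\u000a\u000d\u0009]+")

def is_draft_space(text: str) -> bool:
    """Return True if the whole string matches S as written in draft 0.02."""
    return S_DRAFT.fullmatch(text) is not None

print(is_draft_space(" \t\r\n"))  # True: space, tab, CR, LF
print(is_draft_space("\u3000"))   # False: zenkaku space is not in the set
```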

Clause 1.5, production [8]:
   [8] Ignorable := ...

I think the choice of the word "Ignorable" is unfortunate; although
that is not the case, it *implies* that these characters should be
ignored in content, as well.  I don't really have a better suggestion;
maybe "WSChars" for Writing System Characters?

Clause 1.5, paragraph post-production [9]:
   ... a string which matches "-XML-" in a fashion...

When did the ERB decision announced as "1. Reservation of name space"
change from .XML. to -XML-?  Not that it matters, really.

Clause 2.2, first paragraph:
   A character is A character is an atomic...

Carried from 0.01.  Typo.

Clause 2.3, production [22]:
   | '<!DOCTYPE' (Name | S)+ ('[' [^]]* ']')? '>' /* doc type
     declaration */

Carried from 0.01.  The beginning of the production component allows a
jumble, and the end does not allow space between DSC and MDC.  If the
purpose is to simply allow recognition and skipping of the doctype
declaration, then

'<!DOCTYPE' [^>]* ('[' [^]]* ']' S?)? '>'

should suffice; if more restrictive syntax is warranted, then
something like the doctypedecl production (in 2.8) should be invoked.
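
Both complaints can be checked mechanically.  The sketch below is a
rough Python transcription of the two forms, assuming simplified
stand-ins for Name and S (the NAME and S patterns here are my
assumptions, not the draft's full definitions); DRAFT and SKIP are
illustrative names.

```python
import re

# Simplified stand-ins (assumptions, not the draft's full definitions).
NAME = r"[A-Za-z][A-Za-z0-9.-]*"
S = r"[ \t\r\n]+"

# Draft 0.02 component: '<!DOCTYPE' (Name | S)+ ('[' [^]]* ']')? '>'
DRAFT = re.compile(rf"<!DOCTYPE(?:{NAME}|{S})+(?:\[[^\]]*\])?>")

# Suggested skip-only form: '<!DOCTYPE' [^>]* ('[' [^]]* ']' S?)? '>'
SKIP = re.compile(rf"<!DOCTYPE[^>]*(?:\[[^\]]*\](?:{S})?)?>")

jumble = "<!DOCTYPE foo bar  baz>"      # several names, arbitrary spacing
space_before_mdc = "<!DOCTYPE foo [ ] >"  # space between ']' and '>'

print(DRAFT.fullmatch(jumble) is not None)            # True: jumble accepted
print(DRAFT.fullmatch(space_before_mdc) is not None)  # False: space rejected
print(SKIP.fullmatch(space_before_mdc) is not None)   # True
```

The skip-only form still copes with a ">" inside the subset, because
the regex engine backs off from the greedy [^>]* and lets the bracketed
group absorb it.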

Clause 2.3, last paragraph:
   The right angle bracket (>) may be represented using the string
   "&gt;", and must be so represented when it appears in the string
   "]]>", to avoid confusion with the marker for the end of a marked
   section.

It must be made explicit here that this does NOT work in a marked
section.  For SGML reasons, recognition of ]]> as a delimiter outside
a marked section is a problem, but this is not clear to non-SGML
users.  The only reason, in their minds, to escape ">" will be to
prevent the end of the marked section - but entities won't be
recognized there.

A note should also be made that if the sequence "]]>" is needed in a
literal section, escaping of "<" and "&" by entity references will
work, but that a marked section will not.
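
A small Python illustration of the difference, under the assumption
that a present-day XML 1.0 parser (here the standard library's
xml.etree.ElementTree) treats a CDATA section the way the draft
treats its "marked section".  The element name p is arbitrary.

```python
import xml.etree.ElementTree as ET

# Outside a CDATA section, the reference is expanded to ">".
escaped = ET.fromstring("<p>a ]]&gt; b</p>").text

# Inside a CDATA section, "&gt;" is plain character data and is
# delivered exactly as typed; no entity recognition happens there.
in_cdata = ET.fromstring("<p><![CDATA[a &gt; b]]></p>").text

print(repr(escaped))   # 'a ]]> b'
print(repr(in_cdata))  # 'a &gt; b'
```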

Clause 2.4, first paragraph:
   Comments may appear anywhere that character data may, except in a
   marked section (more properly, comments appearing in a marked
   section will not be recognized as such).

Carried from 0.01.  Comments may appear in element content and in the
prolog, as well, no?  In other words, "Comments may appear anywhere,
except in a marked section; i.e., within element content, in mixed
content, or in a document type declaration subset (see doctypedecl)."

Clause 2.6, and in general:

Carried from 0.01.  Wherever the draft uses a term important to ISO
8879 in a different manner from 8879, the term 8879 uses for the
concept should be given for reference.  In this case, the term "marked
section" in XML refers to what 8879 calls "CDATA marked section".
This should be made clear in a note; as 8879 is referenced, some
non-trivial portion of implementers will make reference to it, and
different terminology may confuse them.

Clause 2.7:

Critique revised from 0.01.  I think that specifying two whitespace
modes for the processor is a mistake.  It complicates parsing, with
little gain.  Decisions about whitespace handling will need to be made
by a renderer anyway (e.g., to strip leading and trailing space from
each line in a preformatted block), and an indexer will ignore all of
it.  IOW, the application is saved nothing, and the processor is
complicated.

Preserve all whitespace, period.  Barring this, make the -XML-SPACE
attribute value default to PRESERVE.

Clause 2.7, example:
   <!ATTLIST * -XML-SPACE (PRESERVE|COLLAPSE) #IMPLIED>

Critique revised from 0.01.  Is this line to be included verbatim in
all DTDs?  Is it a model that must be added to the ATTLIST declaration
for every element?  The discussion is not clear.  (Either case - the
necessity for an extra attribute on every element, or a bizarre
deviation from ATTLIST syntax - highlights the weirdness of this
DTD-specified whitespace handling scheme.)

Clause 2.8, production [31]:
   [31] Prolog := EncodingDecl? ...

Production [72] is defined as Encodingdecl, not EncodingDecl.

Clause 2.8, productions [33] and [34]:

Carried from 0.01.  The placement of the production group breaks up
the flow of text; the paragraph after refers to "these two subsets",
and I was very confused as to *which* two until I realized that they
had been referenced in the paragraph prior to the production group.
Move the group down a paragraph, just before the example, maybe.

Clause 2.8, production [33] (and [70]):
   [33] doctypedecl := '<!DOCTYPE' S Name ExternalID? S? ('['
                       internalsubset* ']' S?)? '>'
...
   [70] ExternalID  := 'SYSTEM' Literal

This mandates the form
<!DOCTYPE fooSYSTEM"foo.dtd"[...] >

Spaces (_ps_) are required by ISO 8879 [110] between _document type
name_, _external identifier_, and _document type declaration subset_.

I would recommend changes thusly:
[33] doctypedecl := '<!DOCTYPE' S Name (S ExternalID)? S ('['
                    internalsubset* ']' S?)? '>'
[70] ExternalID  := 'SYSTEM' S Literal
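
As a sanity check, here is a hedged Python sketch that transcribes
both versions of [33]/[70] into regular expressions.  NAME, LITERAL,
and SUBSET are simplified stand-ins of my own (the internal subset is
treated as opaque text without "]"), so this only exercises the
spacing, not the full grammar.

```python
import re

NAME = r"[A-Za-z][A-Za-z0-9.-]*"  # simplified stand-in for Name
S = r"[ \t\r\n]+"                 # production [1]
LITERAL = r'"[^"]*"'              # simplified stand-in for Literal
SUBSET = r"\[[^\]]*\]"            # '[' internalsubset* ']' kept opaque

# Draft: [33] '<!DOCTYPE' S Name ExternalID? S? ('[' ... ']' S?)? '>'
#        [70] 'SYSTEM' Literal
DRAFT_33 = re.compile(
    rf"<!DOCTYPE{S}{NAME}(?:SYSTEM{LITERAL})?(?:{S})?(?:{SUBSET}(?:{S})?)?>"
)

# Recommended: [33] '<!DOCTYPE' S Name (S ExternalID)? S ('[' ... ']' S?)? '>'
#              [70] 'SYSTEM' S Literal
FIXED_33 = re.compile(
    rf"<!DOCTYPE{S}{NAME}(?:{S}SYSTEM{S}{LITERAL})?{S}(?:{SUBSET}(?:{S})?)?>"
)

squeezed = '<!DOCTYPE fooSYSTEM"foo.dtd"[] >'
spaced = '<!DOCTYPE foo SYSTEM "foo.dtd" []>'

print(DRAFT_33.fullmatch(squeezed) is not None)  # True
print(DRAFT_33.fullmatch(spaced) is not None)    # False
print(FIXED_33.fullmatch(spaced) is not None)    # True
```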

Clause 2.8, last example:
   <?XML encoding="UTF-8">

I believe that introduction of the encoding PI at this point is
premature, and will cause confusion.  Discussion of encoding PIs
should be restricted to a discussion within their own section.

Clause 2.9, third paragraph:
    1. attributes with default values, and elements to which these
       attributes apply appear in the document, or

Carried from 0.01.  I think a more applicable phrasing is, "attributes
with default values, and elements to which these attributes apply *and
are not explicitly set* appear in the document..." though this may be
too complex to easily check.

Clause 2.9, last paragraph:
   If no RMD is provided, the effect is identical to an RMD with the
   value ALL.

I feel that NONE should be the default.  The simplest XML document
should not require the RMD at all.

Clause 3.1, second text paragraph:
   The Name in the start- and end-tag rules gives the element's type.

Carried from 0.01.  Strike "rules", or reword this.  "The Name
referred to in the ..." or "The Name in the ... -tags gives...".

Ibid:
   ... and the content of the QuotedCData (the characters between the
   "'" or '"' delimiters) as the attribute value.

Carried from 0.01 (but additional comment below).  Everyone here is
aware that this is the attribute value specification, but we use the
terms interchangeably.  We must NOT do this in the XML spec; it caused
endless headaches when Netscape started to handle entity refs in
attribute value *specifications*.  The discussions about when to use
&amp; and when to use %24 in <a href="..."> went for far too long on
www-html, html-wg, and lynx-dev.

Care must be taken in XML to use the correct terms "attribute value
specification" and "attribute value" as appropriate.  Even though
entity references are not allowed in AVSs in XML 1.0, lack of
confusion now will make going forward easier.
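
To show why the two terms must not be merged, here is a small Python
sketch.  It assumes a present-day XML 1.0 parser (the standard
library's xml.etree.ElementTree), which does resolve references in
attribute value specifications; that differs from the draft under
review, so take it purely as an illustration that the text typed in
the tag and the value reported by the processor can differ.

```python
import xml.etree.ElementTree as ET

# The attribute value specification is the quoted text as typed in the tag.
specification = "doc?a=1&amp;b=2"
source = f'<a href="{specification}"/>'

# The attribute value is what the parser reports after resolving references.
value = ET.fromstring(source).get("href")

print(specification)  # doc?a=1&amp;b=2
print(value)          # doc?a=1&b=2
```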

In addition, don't quote the quotes - this looks *really* confusing,
at least on paper.  It looks like "between the ''''' or ''''
delimiters".

Clause 3.1, post-production [39] paragraph:

The special casing of HTML must be eliminated from the specification.
It will *not* be implemented by most implementors, because they have
separate tools for handling HTML.  Therefore, most XML implementations
will be non-compliant, and this specification becomes moot anyway.

Clause 3.1, production group 17:
   content := (element | PCDATA | MS | PI | Comment)*

Carried from 0.01.  There should be [ VC: Content model ] after that;
i.e., the content of an element will match the content model in the
DTD if the document is valid.

Clause 3.2, first paragraph:
   A textual object is said to be a well-formed... if... it matches
   the production above labeled XML Document,...

Give a production number when they've settled.

Clause 3.2, second list item:
   More simply stated, the elements, delineated by start- and
   end-tags, nest within each other properly.

Carried from 0.01.  Either strike "properly" or define it.  Nesting
makes sense, I think, to the target non-SGML-aware audience; adding
"properly" implies that there's something special that's not being
said.

Clause 3.3.2, production [44] and discussion:
   [44] elements := cp

This allows violations of 8879 productions [116], [126], and [127],
which dictate that any element declaration other than ANY or EMPTY
(for XML's purposes) require grpo and grpc around the content model.
I recommend:

[44] elements := (choice | seq) ('?' | '*' | '+')?

Changing cp (my first thought) isn't good because it's fine to have a
naked Name in a choice or seq construct, just not as the main content
model.

Clause 3.4, productions [49] and [50]:
   [49] AttlistDecl := '<!ATTLIST' S Name AttDef+ S? '>'
   [50] AttDef      := S Name S AttType S Default

This is accurate, but I think a cleaner production would be:

[49] AttlistDecl := '<!ATTLIST' S Name (S AttDef)+ S? '>'
[50] AttDef      := Name S AttType S Default

It better reflects the syntactic components, IMO.

Clause 3.4.1, Validity checks:

ID and Idref do not mention normalization of case; Name token and Name
tokens do.  This is inconsistent with both NAMECASE GENERAL YES and
NAMECASE GENERAL NO.  It should be consistent.

I am opposed to case folding; I think it will be far easier to add it
in XML 2.0 if a workable method is found (which I doubt will happen).
The current method (assuming case folding was intended for ID and
Idref) will produce a different parse in France and in Canada for a
French-language document.  This small XML document:

<!DOCTYPE screwup [
<!ELEMENT screwup (stuff+)>
<!ELEMENT stuff EMPTY>
<!ATTLIST stuff id ID #IMPLIED>
]>
<screwup>
<stuff id="école"/>
<stuff id="ecole"/>
</screwup>

is valid if parsed in Canada, but invalid if parsed in France.  That
is a Bad Thing.  (Should we add an XML PI to indicate intended parsing
locale? d-:)
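
The divergence can be sketched in a few lines of Python, taking the
accented ID above to be "école".  The two folding functions are
assumptions made for illustration: fold_canada keeps the accent on the
capital, fold_france drops it.  Neither is taken from the draft or
from any locale library.

```python
import unicodedata

def fold_canada(name: str) -> str:
    """Upper-case, keeping accents on capitals (assumed Canadian practice)."""
    return name.upper()

def fold_france(name: str) -> str:
    """Upper-case, dropping accents from capitals (assumed practice in France)."""
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.upper()

def ids_are_unique(ids, fold) -> bool:
    """ID validity check: no two IDs may be equal after folding."""
    folded = [fold(i) for i in ids]
    return len(set(folded)) == len(folded)

ids = ["\u00e9cole", "ecole"]  # "école" and "ecole"
print(ids_are_unique(ids, fold_canada))  # True: document is valid
print(ids_are_unique(ids, fold_france))  # False: duplicate ID, invalid
```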

In addition, Name tokens mentions white space reduction; Idref and
Entity Name do not for their plural forms.  Name tokens does not
mention stripping of leading and trailing space; should it?

Clause 3.5 in 0.01/W3C (now missing):

Carried from 0.01.  The DTD summary is no longer needed for empty
elements, and is moot for mixed vs. element content distinction, but
would be a VERY useful way to override the defaulted entities without
requiring DTD parsing.  The receiving non-DTD-speaking application
could say, "This &prod; here does something other than a big ol' pi,
but I don't know what...".

-Chris
-- 
<!NOTATION SGML.Geek PUBLIC "-//GCA//NOTATION SGML Geek//EN">
<!ENTITY crism PUBLIC "-//EBT//NONSGML Christopher R. Maden//EN" SYSTEM
"<URL>http://www.ebt.com <TEL>+1.401.421.9550 <FAX>+1.401.521.2030
<USMAIL>One Richmond Square, Providence, RI 02906 USA" NDATA SGML.Geek>
Received on Tuesday, 12 November 1996 19:17:09 UTC