- From: Christopher R. Maden <crm@ebt.com>
- Date: Wed, 13 Nov 1996 00:06:41 GMT
- To: w3c-sgml-wg@w3.org
Here's what I've found so far. Nearly all of this is different from comments sent last night; what's not is explicitly marked "Carried from 0.01." I'd appreciate some reaction, even if it's "that was too long to bother reading". I feel my last comments went into a void.

As noted previously, all productions checked thus far (through [60]) have no undefined references.

Comments updated for 0.02 of 10 November, PostScript from SunSite.

Clause 1.2, reference 2:

    ISO 10646

Isn't it ISO/IEC 10646? I can't remember.

Clause 1.3, first example:

    symbol ::= expression

All the productions use ':=', not '::='.

Clause 1.5, production [1]:

    [1] S := (#x0020 | #x000a | #x000d | #x0009)+

This, as has been pointed out, does not allow non-Latin spaces. I don't mind the omission of typographic spaces, but CJK spaces (e.g., zenkaku) should be included. At a minimum, this needs inclusion in the known limitations of the draft until it's been fixed from the Unicode tables.

Clause 1.5, production [8]:

    [8] Ignorable := ...

I think the choice of the word "Ignorable" is unfortunate; although not the case, it *implies* that these characters should be ignored in content, as well. I don't really have a better suggestion; maybe "WSChars" for Writing System Characters?

Clause 1.5, paragraph post-production [9]:

    ... a string which matches "-XML-" in a fashion...

When did the ERB decision announced as "1. Reservation of name space" change from .XML. to -XML-? Not that it matters, really.

Clause 2.2, first paragraph:

    A character is A character is an atomic...

Carried from 0.01. Typo.

Clause 2.3, production [22]:

    | '<!DOCTYPE' (Name | S)+ ('[' [^]]* ']')? '>' /* doc type declaration */

Carried from 0.01. The beginning of the production component allows a jumble, and the end does not allow space between DSC and MDC. If the purpose is to simply allow recognition and skipping of the doctype declaration, then

    '<!DOCTYPE' [^>]* ('[' [^]]* ']' S?)? '>'

should suffice; if more restrictive syntax is warranted, then something like the doctypedecl production (in 2.8) should be invoked.

Clause 2.3, last paragraph:

    The right angle bracket (>) may be represented using the string "&gt;", and must be so represented when it appears in the string "]]>", to avoid confusion with the marker for the end of a marked section.

It must be made explicit here that this does NOT work in a marked section. For SGML reasons, recognition of ]]> as a delimiter outside a marked section is a problem, but this is not clear to non-SGML users. The only reason, in their minds, to escape ">" will be to prevent the end of the marked section - but entities won't be recognized there. A note should also be made that if the sequence "]]>" is needed in a literal section, escaping of "<" and "&" by entity references will work, but that a marked section will not.

Clause 2.4, first paragraph:

    Comments may appear anywhere that character data may, except in a marked section (more properly, comments appearing in a marked section will not be recognized as such).

Carried from 0.01. Comments may appear in element content and in the prolog, as well, no? In other words, "Comments may appear anywhere, except in a marked section; i.e., within element content, in mixed content, or in a document type declaration subset (see doctypedecl)."

Clause 2.6, and in general:

Carried from 0.01. Wherever using a term important to ISO 8879 in a different manner from 8879, the term 8879 uses for the concept should be given for reference. In this case, the term "marked section" in XML refers to what 8879 calls "CDATA marked section". This should be made clear in a note; as 8879 is referenced, some non-trivial portion of implementers will make reference to it, and different terminology may confuse them.

Clause 2.7:

Critique revised from 0.01. I think that specifying two whitespace modes for the processor is a mistake. It complicates parsing, with little gain.
Decisions about whitespace handling will need to be made by a renderer anyway (e.g., to strip leading and trailing space from each line in a preformatted block), and an indexer will ignore all of it. IOW, the application is saved nothing, and the processor is complicated. Preserve all whitespace, period. Barring this, make the -XML-SPACE attribute value default to PRESERVE.

Clause 2.7, example:

    <!ATTLIST * -XML-SPACE (PRESERVE|COLLAPSE) #IMPLIED>

Critique revised from 0.01. Is this line to be included verbatim in all DTDs? Is it a model that must be added to the ATTLIST declaration for every element? The discussion is not clear. (Either case - the necessity for an extra attribute on every element, or a bizarre deviation from ATTLIST syntax - highlights the weirdness of this DTD-specified whitespace handling scheme.)

Clause 2.8, production [31]:

    [31] Prolog := EncodingDecl? ...

Production [72] is defined as Encodingdecl, not EncodingDecl.

Clause 2.8, productions [33] and [34]:

Carried from 0.01. The placement of the production group breaks up the flow of text; the paragraph after refers to "these two subsets", and I was very confused as to *which* two until I realized that they had been referenced in the paragraph prior to the production group. Move the group down a paragraph, just before the example, maybe.

Clause 2.8, production [33] (and [70]):

    [33] doctypedecl := '<!DOCTYPE' S Name ExternalID? S? ('[' internalsubset* ']' S?)? '>'
    ...
    [70] ExternalID := 'SYSTEM' Literal

This mandates the form

    <!DOCTYPE fooSYSTEM"foo.dtd"[...] >

Spaces (_ps_) are required by ISO 8879 [110] between _document type name_, _external identifier_, and _document type declaration subset_. I would recommend changes thusly:

    [33] doctypedecl := '<!DOCTYPE' S Name (S ExternalID)? S ('[' internalsubset* ']' S?)? '>'
    [70] ExternalID := 'SYSTEM' S Literal

Clause 2.8, last example:

    <?XML encoding="UTF-8">

I believe that introduction of the encoding PI at this point is premature, and will cause confusion. Discussion of encoding PIs should be restricted to a discussion within their own section.

Clause 2.9, third paragraph:

    1. attributes with default values, and elements to which these attributes apply appear in the document, or

Carried from 0.01. I think a more applicable phrasing is, "attributes with default values, and elements to which these attributes apply *and are not explicitly set* appear in the document..." though this may be too complex to easily check.

Clause 2.9, last paragraph:

    If no RMD is provided, the effect is identical to an RMD with the value ALL.

I feel that NONE should be the default. The simplest XML document should not require the RMD at all.

Clause 3.1, second text paragraph:

    The Name in the start- and end-tag rules gives the element's type.

Carried from 0.01. Strike "rules", or reword this. "The Name referred to in the ..." or "The Name in the ... -tags gives...".

Ibid:

    ... and the content of the QuotedCData (the characters between the "'" or '"' delimiters) as the attribute value.

Carried from 0.01 (but additional comment below). Everyone here is aware that this is the attribute value specification, but we use the terms interchangeably. We must NOT do this in the XML spec; it caused endless headaches when Netscape started to handle entity refs in attribute value *specifications*. The discussions about when to use & and when to use %24 in <a href="..."> went for far too long on www-html, html-wg, and lynx-dev. Care must be taken in XML to use the correct terms "attribute value specification" and "attribute value" as appropriate. Even though entity references are not allowed in AVSs in XML 1.0, lack of confusion now will make going forward easier.

In addition, don't quote the quotes - this looks *really* confusing, at least on paper. It looks like "between the ''''' or '''' delimiters".

Clause 3.1, post-production [39] paragraph:

The special casing of HTML must be eliminated from the specification. It will *not* be implemented by most implementors, because they have separate tools for handling HTML. Therefore, most XML implementations will be non-compliant, and this specification becomes moot anyway.

Clause 3.1, production group 17:

    content := (element | PCDATA | MS | PI | Comment)*

Carried from 0.01. There should be [ VC: Content model ] after that; i.e., the content of an element will match the content model in the DTD if the document is valid.

Clause 3.2, first paragraph:

    A textual object is said to be a well-formed... if... it matches the production above labeled XML Document,...

Give a production number when they've settled.

Clause 3.2, second list item:

    More simply stated, the elements, delineated by start- and end-tags, nest within each other properly.

Carried from 0.01. Either strike "properly" or define it. Nesting makes sense, I think, to the target non-SGML-aware audience; adding "properly" implies that there's something special that's not being said.

Clause 3.3.2, production [44] and discussion:

    [44] elements := cp

This allows violations of 8879 productions [116], [126], and [127], which dictate that any element declaration other than ANY or EMPTY (for XML's purposes) require grpo and grpc around the content model. I recommend:

    [44] elements := (choice | seq) ('?' | '*' | '+')?

Changing cp (my first thought) isn't good because it's fine to have a naked Name in a choice or seq construct, just not as the main content model.

Clause 3.4, productions [49] and [50]:

    [49] AttlistDecl := '<!ATTLIST' S Name AttDef+ S? '>'
    [50] AttDef := S Name S AttType S Default

This is accurate, but I think a cleaner production would be:

    [49] AttlistDecl := '<!ATTLIST' S Name (S AttDef)+ S? '>'
    [50] AttDef := Name S AttType S Default

It better reflects the syntactic components, IMO.
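A rough way to sanity-check that the two forms of [49]/[50] accept the same declarations is to transcribe both into regular expressions and compare them on a few inputs. This is only a sketch: the Python patterns below for S, Name, AttType, and Default are simplified stand-ins assumed for illustration, not the draft's real productions, and the helper name accepted_by is made up here.

```python
import re

# Simplified stand-ins for the draft's tokens (assumptions, not the real productions).
S = r"[ \t\r\n]+"
NAME = r"[A-Za-z][A-Za-z0-9.\-]*"
ATT_TYPE = r"(?:CDATA|ID|IDREF|NMTOKEN)"
DEFAULT = r"(?:#REQUIRED|#IMPLIED)"

# Draft form: the leading space is part of AttDef.
ATTDEF_DRAFT = f"{S}{NAME}{S}{ATT_TYPE}{S}{DEFAULT}"
ATTLIST_DRAFT = re.compile(f"<!ATTLIST{S}{NAME}(?:{ATTDEF_DRAFT})+(?:{S})?>")

# Proposed form: the separating space moves up into AttlistDecl.
ATTDEF_PROPOSED = f"{NAME}{S}{ATT_TYPE}{S}{DEFAULT}"
ATTLIST_PROPOSED = re.compile(f"<!ATTLIST{S}{NAME}(?:{S}{ATTDEF_PROPOSED})+(?:{S})?>")


def accepted_by(decl):
    """Return (draft_accepts, proposed_accepts) for one declaration string."""
    return (
        ATTLIST_DRAFT.fullmatch(decl) is not None,
        ATTLIST_PROPOSED.fullmatch(decl) is not None,
    )


SAMPLES = [
    "<!ATTLIST stuff id ID #IMPLIED>",
    "<!ATTLIST stuff id ID #IMPLIED name CDATA #REQUIRED >",
    "<!ATTLIST stuffid ID #IMPLIED>",  # missing space after the element name
    "<!ATTLIST stuff>",                # no attribute definitions at all
]

if __name__ == "__main__":
    for decl in SAMPLES:
        print(decl, accepted_by(decl))
```

Agreement on a handful of samples is not a proof of equivalence; it only illustrates that moving the separator S out of AttDef changes where the space is described, not which strings are accepted.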
Clause 3.4.1, Validity checks:

ID and Idref do not mention normalization of case; Name token and Name tokens do. This is inconsistent with both NAMECASE GENERAL YES and NAMECASE GENERAL NO. It should be consistent.

I am opposed to case folding; I think it will be far easier to add it in XML 2.0 if a workable method is found (which I doubt will happen). The current method (assuming case folding was intended for ID and Idref) will produce a different parse in France and in Canada for a French-language document. This small XML document:

    <!DOCTYPE screwup [
    <!ELEMENT screwup (stuff+)>
    <!ELEMENT stuff EMPTY>
    <!ATTLIST stuff id ID #IMPLIED>
    ]>
    <screwup>
    <stuff id="école"/>
    <stuff id="ecole"/>
    </screwup>

is valid if parsed in Canada, but invalid if parsed in France. That is a Bad Thing. (Should we add an XML PI to indicate intended parsing locale? d-:)

In addition, Name tokens mentions white space reduction; Idref and Entity Name do not for their plural forms. Name tokens does not mention stripping of leading and trailing space; should it?

Clause 3.5 in 0.01/W3C (now missing):

Carried from 0.01. The DTD summary is no longer needed for empty elements, and is moot for mixed vs. element content distinction, but would be a VERY useful way to override the defaulted entities without requiring DTD parsing. The receiving non-DTD-speaking application could say, "This &prod; here does something other than a big ol' pi, but I don't know what...".

-Chris
--
<!NOTATION SGML.Geek PUBLIC "-//GCA//NOTATION SGML Geek//EN">
<!ENTITY crism PUBLIC "-//EBT//NONSGML Christopher R. Maden//EN" SYSTEM
"<URL>http://www.ebt.com <TEL>+1.401.421.9550 <FAX>+1.401.521.2030
<USMAIL>One Richmond Square, Providence, RI 02906 USA" NDATA SGML.Geek>
Received on Tuesday, 12 November 1996 19:17:09 UTC