PE Handling (again :-)

1 - There are places that PE processing must be disabled, which are not
    identified in the XML specification.  James Clark suggested to me
    that these places are:  "PIs, comments, SystemLiterals and PubidLiterals
    in the DTD" which seems appropriate.  Without such a clarification:

	* The XML spec can't be validated, since its <!-- ... %foo; -->
	  comments cause WFC violations when they can't be expanded.

	* Public IDs (at least in the external subset) can hold the '%'
	  character only with much awkwardness -- since it'd normally
	  flag a PE ref, a public ID like "-//fooCorp//DTD 80% done//EN"
	  would be reported as a fatal violation of PE reference syntax.

    Probably the best fix to the spec is to modify the description for
    each of those constructs to say that PE expansion is not done within
    them, and modify the text in 2.8 (right before the first
    VC) to comment that some productions preclude internal PE expansion
    beyond that in the grammar.
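
    A minimal Python sketch of these exclusions, under stated
    assumptions: the names find_pe_refs and NO_PE_ZONES are made up for
    illustration, the Name pattern is only an approximation of the
    grammar, and SystemLiterals/PubidLiterals are spotted by their
    SYSTEM/PUBLIC keywords rather than by real parsing.  It only shows
    which "%name;" sequences would still count as PE refs if the four
    constructs were exempt.

```python
import re

# Rough approximation of the XML Name production (assumption).
NAME = r"[A-Za-z_:][\w:.\-]*"
LITERAL = r"(?:\"[^\"]*\"|'[^']*')"

# Constructs in which '%' would be treated as ordinary data.
NO_PE_ZONES = re.compile(
    r"<!--.*?-->"                      # comments
    r"|<\?.*?\?>"                      # processing instructions
    r"|\bSYSTEM\s+" + LITERAL +        # SystemLiteral
    r"|\bPUBLIC\s+" + LITERAL + r"(?:\s+" + LITERAL + r")?",  # PubidLiteral (+ SystemLiteral)
    re.DOTALL,
)
PE_REF = re.compile(r"%(" + NAME + r");")


def find_pe_refs(dtd_text):
    """Return PE names referenced outside the no-expansion constructs."""
    visible = NO_PE_ZONES.sub(" ", dtd_text)
    return PE_REF.findall(visible)


sample = """<!-- see %foo; for details -->
<?pi 100% raw %bar; ?>
<!ENTITY ext PUBLIC "-//fooCorp//DTD 80% done//EN" "ext%20file.ent">
%real;
<!ENTITY val "uses %inner; here">
"""
print(find_pe_refs(sample))
```

    With the exemptions, only the refs outside comments, PIs and the
    two literal kinds are reported; the public ID with "80% done" goes
    through untouched.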

2 - There is ambiguity with respect to treatment of PE ref syntax
    within the internal DTD subset in the context of an attribute
    or entity value.  Section 4.4 doesn't cover this case, since
    (in particular) the text for the "PE, Occurs as Attribute Value"
    only talks about the "Outside the DTD" case.

    Meanwhile, back at the "PEs in Internal Subset" WFC, it seems
    this case is quite explicitly covered, and not in the way that
    might be implied by the "Not Recognized" label in 4.4 (whose
    description clearly does not cover the "within DTD" case).
    It says those references are not allowed (vs. "not recognized").

    To be concrete:  I think it's most natural to report both these
    cases as fatal errors, rather than ignore either one.  (The
    first might be valid in an external parameter entity, though
    the second would still be a fatal error.)

	<!DOCTYPE root [ <!ATTLIST root foo CDATA "%pe;"    > ]><root/>
	<!DOCTYPE root [ <!ATTLIST root foo CDATA "%bad-pe" > ]><root/>

    (Imagine an element decl for 'root' if that makes you happy; the
    fatal error is not an "Element Valid" validity error which a user
    chose to treat as fatal.)
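
    A minimal Python sketch of the reporting suggested here, assuming a
    processor that already has the literal text in hand.  The function
    name check_internal_literal and the two message strings are
    hypothetical; this shows the proposed reading, not what any
    existing parser does.

```python
import re

# Rough approximation of a syntactically complete PE reference (assumption).
PE_REF = re.compile(r"%[A-Za-z_:][\w:.\-]*;")


def check_internal_literal(value):
    """Return a fatal-error label for a literal in the internal subset, or None."""
    if "%" not in value:
        return None
    # Remove complete PE refs; any '%' left over is bad PE ref syntax.
    if "%" in PE_REF.sub("", value):
        return "malformed PE reference"
    return "PE reference within a declaration in the internal subset"


for literal in ("%pe;", "%bad-pe", "plain"):
    print(repr(literal), "->", check_internal_literal(literal))
```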

3 - The messy one ... having both a VC and WFC for "Entity Declared".
    I hope it's not controversial that the text there is a bit opaque!
    Although I read Tim's "Annotated XML Spec", it didn't resolve my issues.

    As a starting point, consider the following simplified language
    as the basis for textual improvements (separate statements for
    the "EntityRef" and "PEReference" case would be most clear):
	
	The [parameter or general] entity name in the [parameter/general]
	entity reference must match that in a [parameter/parsed general]
	entity declaration which was previously processed.

    Clearly that has none of the qualifiers that complicate the text
    now found in the spec ... but I'll propose that they all be removed,
    and moreover that only the WFC exist.  (The more I look at the way
    this is all specified, the more confusing it gets -- often a sign
    of a need for some powerful simplification!)


    (a) Consider this example:

	<!DOCTYPE root [ <!ELEMENT root EMPTY> %undeclared-pe; ]>
	<root/>

    One reading of the VC and WFC noted above is that neither one of
    them applies (so many qualifiers!) and that such a document is
    well formed and valid.  I don't think that should be; I think
    that such a document should clearly not be well formed.

    If it's intended that this violate either the VC or WFC, rewriting
    is needed to make this quite clear!  Bulleted lists are used
    elsewhere in the spec for such complex cases, and would help here
    (surely one of the most complex sets of qualifiers in this spec).
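
    For what it's worth, the simplified "declared before use" rule is
    easy to state in code.  The following Python sketch applies it to
    PE references in an internal subset; undeclared_pe_refs is a
    hypothetical name, the patterns only approximate the grammar, and
    refs inside the literals of a PE declaration are deliberately not
    examined.

```python
import re

# Rough approximations of Name and of quoted literals (assumptions).
NAME = r"[A-Za-z_:][\w:.\-]*"
LITERAL = r"\"[^\"]*\"|'[^']*'"

# Either a PE declaration (<!ENTITY % name ...>) or a PE reference (%name;).
TOKEN = re.compile(
    r"<!ENTITY\s+%\s+(?P<decl>" + NAME + r")\s(?:" + LITERAL + r"|[^>\"'])*>"
    + r"|%(?P<ref>" + NAME + r");"
)


def undeclared_pe_refs(internal_subset):
    """Return names of PE refs with no previously processed declaration."""
    declared = set()
    errors = []
    for match in TOKEN.finditer(internal_subset):
        if match.group("decl"):
            declared.add(match.group("decl"))
        elif match.group("ref") not in declared:
            errors.append(match.group("ref"))
    return errors


print(undeclared_pe_refs("<!ELEMENT root EMPTY> %undeclared-pe;"))
```

    Under the simplified rule a non-empty result would be a WF error,
    with none of the current qualifiers involved.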


    (b) One "gotcha" in the simplification above is that the current
    WFC says that PEs must be declared before use ... even though the
    notation to the side of the construct implies that the WFC does
    not apply to PE references.  That seems like a copy/paste bug,
    in that it appears to turn all refs to undeclared PEs into
    WFC issues despite the existence of the VC.  (That is, undeclared
    PEs become fatal WF-ness errors!!)

    Relatedly, the VC applies to the general entity reference syntax,
    and the qualifications to the WFC make the VC apply in the common
    case of an entity declared in an external PE ... refs to undeclared
    parsed general entities become (recoverable) VC errors!!  (Unless
    an interaction with the standalone declaration kicks in; see next.)

    Those results are counterintuitive, but are supported by the spec.


    (c) Another "gotcha" in that simplification is interaction with
    the "Standalone Document Declaration" VC.  In the case of a
    document which is invalid because it's declared as standalone,
    yet which still refers to an externally declared entity, the
    qualifications in the WFC say this should be upgraded to become
    a violation of this WFC.

    I think that the intent of the standalone decl was to facilitate
    safe processing of documents when ignoring external PEs, but the
    same case (undeclared externally defined entity) is discussed
    variously as a WFC error, or a violation of either of two VCs.

    Simpler would be to strike the clause of the "Standalone Document
    Declaration" VC that applies to entity references, perhaps noting
    that a WFC applies, and add a clause to the simplified "Entity
    Declared" WFC text above, something like:

	In the case of nonvalidating processors which do not read
	external parameter entities (the "external DTD subset"), and
	which are processing documents not marked "standalone='yes'",
	this WFC applies only to entity references preceding the first
	external PE that is not processed.

    (Of course that explicitly acknowledges that there are at least
    two subcategories of nonvalidating parser, based on whether they
    read external PEs or not.  That's evident, and I think it'd be
    good to call it out in the conformance section as well.)
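
    A minimal Python sketch of the proposed clause, assuming the
    processor sees the DTD and document as an ordered stream of
    events.  The event kinds ("decl", "ref", "external-pe"), the
    function name and its parameters are all hypothetical; a processor
    that does read external PEs is assumed to supply the declarations
    it found there as further "decl" events.

```python
def entity_declared_errors(events, standalone=False, reads_external_pes=False):
    """Return names of entity refs that violate the simplified WFC."""
    declared = set()
    errors = []
    enforcing = True
    for kind, name in events:
        if kind == "decl":
            declared.add(name)
        elif kind == "external-pe":
            # An unread external PE may hide declarations, so the WFC
            # stops applying unless the document is standalone.
            if not reads_external_pes and not standalone:
                enforcing = False
        elif kind == "ref":
            if enforcing and name not in declared:
                errors.append(name)
    return errors


events = [
    ("decl", "a"),
    ("ref", "a"),
    ("ref", "b"),
    ("external-pe", "ext"),
    ("ref", "c"),
]
print(entity_declared_errors(events))
print(entity_declared_errors(events, standalone=True))
```

    The two calls show the two subcategories: the ref to "c" is only
    reportable when the document is standalone or the external PE was
    actually read.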

    
    That's just a few highlights of what, for me, is a notable problem
    area in the specification.  As noted above, I think the best way
    through this (and related) issues is to just make entity declaration
    always be a WFC issue ... leaving nonvalidating parsers as they now
    stand, not guaranteed to report all such WF errors, although clearly
    stating when that may happen in a conformant manner.


Apologies in advance if parts of this are as unclear to you as those
parts of the spec are to me; this note was written to summarize various
issues we've queued up, and may not capture the discussion (leading
in particular to the "only have the WFC" suggestion) perfectly.

- Dave

Received on Monday, 25 January 1999 14:11:17 UTC