XML 1.0 - unconstrained use of %

I must record my PROFOUND personal[*] disappointment at the step backward
that XML 1.0 took in the use of parameter entities over the way they
were used in the draft specification.  I hope this is a step you can yet
reverse; please forward this message to whatever forum is appropriate for
discussing possible future changes.

I felt that the %foo notation used in the syntax description of XML (not
to be confused with the parameter entity notation in XML/SGML itself)
during its draft stages was a MAJOR step forward in the correction of my
foremost gripe with SGML.

I have spent a substantial part of my career writing parsers for a large
variety of notations (NOT using automatic tools like YACC, but custom-crafted
by hand).  One thing that has always bothered me is the way in which SGML
confuses "parsing" and "evaluation" and I had hoped XML would correct that.
During its draft phase, it appeared to be headed in that direction.  As 
nearly as I can tell, the final specification took a MAJOR step backward
in this regard.

For the purpose of this discussion, I define `parsing' as the interpretation 
of a sequence of characters as a structure.  For example, the notation
"foo" is `parsed' when it becomes a 3-character vector containing 
representations of the characters f, o, and o, in series rather than a 
5-character vector containing representations of the characters ", f, o, o, 
and ", in series.  An important feature of `parsing' as I have defined it
here is that there is an inverse operation `unparsing' which yields the
initial input.  SGML, as nearly as I can discern from its very complex
definition, does not have the property that it can be inverted by unparsing
to the original notation.  The key offender in this is parameter entities.
(Comments and general entities also contribute mildly to the confusion,
but in lesser ways that are fundamentally more tractable, so that solving
only the parameter entity problem would represent, in itself, a major
improvement to SGML.)
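
To make the inversion property concrete, here is a minimal sketch of the
"foo" example in Lisp.  The names parse-literal and unparse-literal are
hypothetical, invented for this illustration, and the sketch assumes a
literal with no embedded quote characters.

```lisp
;; Hypothetical helpers for illustration only; neither comes from any
;; SGML or XML tool.
(defun parse-literal (text)
  "Parse a double-quoted literal such as \"foo\" into a vector of its characters."
  (let ((end (1- (length text))))
    ;; The notation must start and end with a double quote.
    (assert (and (>= end 1)
                 (char= (char text 0) #\")
                 (char= (char text end) #\")))
    (coerce (subseq text 1 end) 'vector)))

(defun unparse-literal (chars)
  "Inverse of PARSE-LITERAL: rebuild the quoted notation from the characters."
  (concatenate 'string "\"" (coerce chars 'string) "\""))
```

Here (parse-literal "\"foo\"") has length 3, and applying unparse-literal
to that result gives back the original 5-character notation.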

For the purpose of this discussion, I define `evaluation' as any
postprocessing which might optionally be done to an expression for some other
purpose than correctly representing the input notation in a way that can be
unparsed or postprocessed.

The severity of the error may be plainly noted by the fact that typical SGML
and HTML parsers yield either transformed SGML or ESIS, and that in either
case, parameter entity information is lost.  (Indeed, I am led slightly
to wonder whether those few companies that have SGML editors consider this
`error' in design to be a tactical commercial advantage, because the language
actively works against a parser design that would serve BOTH as an "editor"
(to parse+edit+unparse) and as a document representation system (to
parse+postprocess).)  To satisfy the editor need, one must NOT expand
parameter entity references, or else one is left with them `edited out' after
the first attempt to edit an SGML source text.  So one immediately knows that
to-ESIS processing, or processing to other SGML (such as is done by NSGMLS),
is not the path to writing an SGML editor, and one knows as a follow-on that
there are no public tools which provide or can be transformed into SGML
editors.

What would it take to fix this problem?  I claim that a simple constraint on
the use of % would make things substantially more tractable, and that is the
% usage that was allowed in XML during draft phases [preferably MINUS the
ability to say %(...) which I consider an abomination; I'll come back to that].

Why is % a problem?  Because SGML is position dependent.  A typical SGML
markup construction might be <!FOO frotz glorp glarn?> where each of a frotz,
a glorp, and a glarn has its own special parsing notation, and where you really
have to know whether you're parsing a frotz or a glorp to do the right thing.
So when you see <!FOO %zap; zoo> you don't know if the %zap; is going to expand
into a single thing, the "zoo" is to be parsed as a glorp, and the glarn is
missing, or whether the %zap; will expand into two things leaving "zoo" to
be parsed as an optional (but present) glarn.  This means you absolutely 
cannot construct a correct parsed representation of a FOO without knowing the
value of parameter entities.  And you can't know those values without 
post-processing previous parse expressions--perhaps including those objects
you have not even finished parsing yet.  That's really sad.  Once you 
substitute the values, you might know the answer, but you only know it for one
possible set of the INCLUDE/IGNORE settings, and you don't really know it
generally.
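
The ambiguity can be shown with a toy model in which a declaration body is
just a list of token strings and the slots are filled by position.  This is
only a sketch under that assumption: expand-tokens and assign-foo-slots are
hypothetical names, and real SGML slot parsing is far richer than this.

```lisp
;; Toy model: ENTITIES is an alist mapping a reference such as "%zap;"
;; to the list of tokens it expands into.  All names are hypothetical.
(defun expand-tokens (tokens entities)
  "Replace each token found in ENTITIES by its replacement tokens."
  (loop for token in tokens
        for entry = (assoc token entities :test #'string=)
        append (if entry (cdr entry) (list token))))

(defun assign-foo-slots (tokens entities)
  "Assign the expanded tokens by position to frotz, glorp, and optional glarn."
  (destructuring-bind (frotz glorp &optional glarn)
      (expand-tokens tokens entities)
    (list :frotz frotz :glorp glorp :glarn glarn)))

;; One token behind %zap; means "zoo" lands in the glorp slot:
;;   (assign-foo-slots '("%zap;" "zoo") '(("%zap;" "a")))
;;   => (:FROTZ "a" :GLORP "zoo" :GLARN NIL)
;; Two tokens behind %zap; push the same "zoo" into the glarn slot:
;;   (assign-foo-slots '("%zap;" "zoo") '(("%zap;" "a" "b")))
;;   => (:FROTZ "a" :GLORP "b" :GLARN "zoo")
```

The same source text yields two different structures, and nothing in the
declaration itself says which one is meant.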

In XML drafts, this was fixed by requiring a % to yield up only the indicated
token so you could define a <!FOO %frotz glorp glarn?> and then it would be
clear that only a frotz could come out of the %zap; in the <!FOO %zap; zoo>
we discussed earlier.  That was enough to allow a parser to correctly parse
the <!FOO ...> without knowing anything about the expansions.  That allowed
a printer to print back out a properly parsed entity with all of its parameter
entities still unexpanded.  That was enough power to write an XML editor that
did not presuppose an expansion of a %.  That was enough power to allow an
extended version of ESIS which included % references.

In Lisp terms, I wrote a function permit-% which worked this way in my parsers:

  (defun parse-foo (stream)
    (let ((frotz (permit-% #'parse-frotz stream))
          (glorp (parse-glorp stream))
          (glarn (parse-glarn stream)))
      (require-> stream)
      (make-foo :frotz frotz :glorp glorp :glarn glarn)))

The permit-% was defined this way:

  (defclass parameter-entity-reference (xml-entity)
    ((name   :type string   :accessor name   :initarg :name)
     (parser :type function :accessor parser :initarg :parser)))

  (defun permit-% (parser stream)
    (cond ((code= (peek-code-after-S stream) #\%)
           ;; Make a "closure" of the PEReference name over the parser
           ;; to be used, so that the parser can be called later if needed.
           (make-instance 'parameter-entity-reference
                          :name (parse-PEReference stream)
                          :parser parser))
          (t ;Not a %, so just go ahead and parse the thing now.
           (funcall parser stream))))

Then later, if I *wanted* the value of the PEReference, I could invoke
the saved parser on the entity reference to get the parsed form that
I would have gotten directly if the parsed stuff had been available.

  (defmethod parsed-replacement-text ((entity parameter-entity-reference))
    (with-input-from-entity (instream (lookup-parameter-entity (name entity)))
      (let ((result (funcall (parser entity) instream)))
        (assure-S-to-eof instream)
        result)))

This implementation worked "mostly fine" in the draft XML implementations, but
the XML 1.0 notation goes back to using overpowerful % expansions with no
constraint as to how many "things" can result from the %.  That makes parsing
hard in a way that I'm very, very sad about.  [I say "mostly fine" because it
had a few minor glitches which are addressed in suggestions (2) and (3) below.]
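
Since the helper functions used in the code above are not shown, the
following self-contained miniature may help in experimenting with the same
deferral idea.  It is a sketch only: it works on lists of token strings
rather than character streams, and pe-ref, reference-token-p, permit-pe,
parse-frotz, parse-foo-tokens, and unparse-foo are all hypothetical
stand-ins rather than the functions from the real parser.

```lisp
;; A deferred reference remembers its name and the parser for its slot.
(defstruct pe-ref name parser)

(defun reference-token-p (token)
  "True for tokens that look like a parameter entity reference, e.g. %zap;"
  (and (> (length token) 2)
       (char= (char token 0) #\%)
       (char= (char token (1- (length token))) #\;)))

(defun permit-pe (parser token)
  "Defer PARSER if TOKEN is a reference; otherwise apply PARSER right away."
  (if (reference-token-p token)
      (make-pe-ref :name token :parser parser)
      (funcall parser token)))

(defun parse-frotz (token)
  "Stand-in slot parser: just tag the token as a frotz."
  (list 'frotz token))

(defun parse-foo-tokens (tokens)
  "Parse (frotz glorp [glarn]); only the frotz slot may be a reference."
  (destructuring-bind (frotz glorp &optional glarn) tokens
    (list :frotz (permit-pe #'parse-frotz frotz) :glorp glorp :glarn glarn)))

(defun unparse-foo (foo)
  "Rebuild the token list, leaving any reference unexpanded."
  (let ((frotz (getf foo :frotz)))
    (remove nil (list (if (pe-ref-p frotz) (pe-ref-name frotz) (second frotz))
                      (getf foo :glorp)
                      (getf foo :glarn)))))
```

Because a reference can only stand for one frotz, parsing ("%zap;" "zoo")
needs no entity table at all, and unparse-foo returns the same two tokens
with the reference intact.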

To summarize, the main thing I want is:

 (1) that, as explained above, % should expand into text whose nature
     and type can be determined without seeing the expansion.  There 
     should not be cases where, like in SGML, you can't parse
        <!FOO %PPP; A B C> 
     because you don't know how many tokens will result from %PPP; and
     so you don't know where to resume parsing.

The two additional constraints on % I'd like to see are the following.  These
were not present even in the draft, and perhaps it was their absence that led
people to believe that the %xxx notation was still overpowerful.  Maybe that's
why you backed out of it wholesale; I can't say.  I think these restrictions
on parameter entities, in BOTH the external and internal DTD subsets, would
allow parameter entities to be used in a more clear and useful way in both
cases:

 (2) That no % immediately inside another % is permissible in the syntax 
     rules.  That is, you can say:

     (a)  [1]  Foo ::= ( alpha | beta )
          [2]  Bar ::= %Foo

     and you can say:

     (b)  [1]  Foo ::= alpha beta
          [2]  Bar ::= %Foo

     since both of these expand into a fixed and predictable number of 
     things and so lead to a deterministic parse.   But you should not
     ever have syntax rules:

     (c)  [1]  Foo ::= %alpha %beta
          [2]  Bar ::= %Foo

     because if you do, then 

          [3]  Zap ::= '<!ZAP' S Bar beta '>'

     is too hard to parse.  You can't tell whether <!ZAP %gunk;> is
     going to have %gunk; correctly expand into an alpha and a beta (because
     of rule 2), or whether it's just going to expand into an alpha (because
     of rule 1) and a beta is going to be missing, or whether it's going to
     expand into a %beta (because of rule 1), which is wrong on its face.

 (3) As mentioned earlier on, I also would like never to see 

     Foo ::= %( alpha | beta )

     for purely linguistic reasons.  I need to be able to separately name
     these things.  So if a Foo is the thing that has a % in it, then the
     parameter-entity-reference can't be closed with :parser #'parse-foo
     since that would allow a % again.  It needs to be :parser #'parse-a-or-b.
     And I hate making up those a-or-b kind of names.  If a or b has a 
     natural name, give it that name in the spec.  Let THAT be the FOO.  Put
     the % in the caller, as in (2)(a) above rather than [BAD]:

     (a)  [1]  Foo ::= %( alpha | beta )
          [2]  Bar ::= Foo

     which gives two names to a FOO but none to the OR of Alpha and Beta. 
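
The restriction in suggestion (2) is mechanical enough to check over a set
of syntax rules.  The sketch below assumes a made-up representation in
which each rule is a list of a name followed by its body items, and a body
item of the form (:% name) stands for %name; percent-item-p and
nested-percent-rules are hypothetical names, not part of any specification
or tool.

```lisp
(defun percent-item-p (item)
  "True if ITEM is a %-marked reference of the form (:% name)."
  (and (consp item) (eq (first item) :%)))

(defun nested-percent-rules (rules)
  "Return the names of rules that apply % to a rule whose body itself uses %."
  (loop for (name . body) in rules
        when (some (lambda (item)
                     (and (percent-item-p item)
                          ;; Look up the referenced rule and scan its body.
                          (some #'percent-item-p
                                (cdr (assoc (second item) rules)))))
                   body)
          collect name))

;; Rule set (a) from suggestion (2) is accepted:
;;   (nested-percent-rules '((foo (:or alpha beta)) (bar (:% foo))))
;;   => NIL
;; Rule set (c) is rejected because Bar applies % to a Foo made of % items:
;;   (nested-percent-rules '((foo (:% alpha) (:% beta)) (bar (:% foo))))
;;   => (BAR)
```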


I hope I've managed to make myself clear here.  If I have not, I invite any
questions that might help you understand more clearly why I'm disappointed by
the present state of affairs or what it might take to appease me...  well,
and not just me.  Hopefully I've made a reasonable case for why this is an
issue a lot of people should care about.

Thanks for your time and consideration.
 --Kent Pitman
   kmp@harlequin.com


- - - - - - - - - -
[*] DISCLAIMER:

 These remarks are mine personally and do not necessarily reflect the
 formal position of any organization (including my employer) with which
 I may be affiliated.

Received on Friday, 24 April 1998 14:23:31 UTC