XML 1.0 - What can parameter entities expand to? from Kent M Pitman on 1998-04-25 (xml-editor@w3.org from April to June 1998)

From: Kent M Pitman <kmp@harlequin.com>
Date: Sat, 25 Apr 98 01:44:14 EDT
To: xml-editor@w3.org
Cc: kmp@harlequin.com
Message-Id: <9804250544.AA04438@excel.harlequin.com>
In some places, for example, the part between the brackets of a doctypedecl
--and I can't tell you how annoyed I get that these things don't have names,
but I'll come back to that at the end--you permit PEReference as if this
were enough to say.  To me, it is not.

Can a PEReference expand into more than one markupdecl in that context?
In general, how much can a PEReference expand into?

If you're writing a parser, and you're at the '[' part, the next thing
you do is to enter a loop parsing things.  Well, hmm.  The loop looks like
on each iteration it might find "nothing" (S), a PEReference, or a markup
declaration.  Now, could a PEReference become more than one markup declaration?
To me, at this point in the code, I think the answer is NO.  If it could,
then why isn't the bnf:

   [a] ( ( markupdecl ( S markupdecl )* ) | S | PEReference )*

to alert me that at any iteration of the outer loop I might end up with
more than one markup declaration (the result of the PEReference).  Or is
it that the PEReference is not really part of the "syntax" but is simply
enabled in this context as part of a low-level stream expansion.  If so,
then the right thing to say is:


   [b] ( markupdecl | PS )* 

where PS is a space that might contain a parameter entity that needs to be
expanded.

Surely if I were to parse this thing as an SGML editor would (not expanding
the %foo;) it would be odd  because the %foo; would occupy a place
in the parse that was not syntactically appropriate.   I'd get back a
list of {{markupdecl} {S} {%foo;} {S}} and that would look ok but if I did
a substitution of its value
        {{markupdecl} {S} {{markupdecl} {markupdecl}} {S}}
that would not be appropriate to the BNF.
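
To make the reshaping problem concrete, here is a minimal Python sketch.  It
is not a real DTD parser: the entity table, the regular expression, and the
function names are all assumptions made up for illustration.  It first parses
without expanding %foo; and then substitutes the parsed replacement text in
place, which yields the nested item that the BNF has no slot for.

```python
import re

# Hypothetical entity table; the name and value are invented for illustration.
ENTITIES = {"foo": "<!ELEMENT a EMPTY><!ELEMENT b EMPTY>"}

# Very rough token classes: a markup declaration, a PE reference, or space.
TOKEN = re.compile(r"(?P<decl><![^>]*>)|(?P<pe>%[A-Za-z_][\w.-]*;)|(?P<s>\s+)")

def parse_subset(text):
    """Split internal-subset text into (kind, text) items without expanding PEs."""
    items, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        items.append((m.lastgroup, m.group()))
        pos = m.end()
    return items

def substitute(items):
    """Replace each PE item with the parsed items of its replacement text."""
    out = []
    for kind, text in items:
        if kind == "pe":
            # The expansion is kept as a nested list to show the shape change.
            out.append(("expansion", parse_subset(ENTITIES[text[1:-1]])))
        else:
            out.append((kind, text))
    return out

tree = parse_subset("<!ELEMENT x EMPTY> %foo; ")
expanded = substitute(tree)
```

Here tree has the flat shape decl, s, pe, s, while the third item of expanded
holds two declarations where the grammar allowed for a single alternative.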

And yet, the only constraints offered on PEReferences where they're defined
say they have to start and stop in the same markup declaration, as if
they're perfectly well allowed to span multiple tokens.

I definitely think that the conservative thing is that when there's a list
like (foo | bar | PEReference) that the ONLY possible expansions of the
PEReference should be "foo" and "bar".  Otherwise, reshaping of the parse tree
later is a problem.  Nowhere that I've found is this constraint specified.
Perhaps I'm just overlooking it?  If not, perhaps it could be added.

And if it's supposed to be the case that more than one markup declaration
can appear, then I strongly encourage you to modify the bnf to accommodate
the truth of the hair that a parser must really endure.

In a sense, this appears to be a casualty of some last minute
transition from the old %xxx notation in the XML drafts to something
more like SGML.  But SGML uses two different kinds of "S" (S and PS).
It also uses the Ee (End of Entity) notation, which seems to be
missing here.  I can't help but think that this omission will come
back to haunt you, since without marking where Ee's can occur, there
are also questions about where an entity can end--e.g., can it end 
mid-token.  For example, I understand the reason %foo;%bar; cannot merge
two tokens (e.g., if %foo; turns to "foo" and %bar; to "bar") forming a single
token (e.g., "foobar") is that the Ee is a PS and so is a token separator.
Without discussing this issue, and without including the SGML spec by
reference (an inclusion I hope you'll steadfastly avoid, since requiring
people to read the SGML spec to handle XML will put XML *way* out of reach
of most people), the whole matter of PEReferences looks to be radically 
underconstrained.
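
A small Python sketch of the token-merging question, using the %foo;%bar;
example above.  The padding rule is only an assumption standing in for
whatever separator the spec intends at an entity boundary; the entity table
and the expand function are hypothetical.

```python
import re

# Hypothetical entity values, matching the %foo;%bar; example in the text.
ENTITIES = {"foo": "foo", "bar": "bar"}
PE_REF = re.compile(r"%([A-Za-z_][\w.-]*);")

def expand(text, pad=""):
    """Expand PE references, wrapping each replacement in `pad` on both sides."""
    return PE_REF.sub(lambda m: pad + ENTITIES[m.group(1)] + pad, text)

# With no separator at entity boundaries the two replacements fuse.
merged = expand("%foo;%bar;").split()

# Treating each entity boundary as a space keeps the tokens apart.
separated = expand("%foo;%bar;", pad=" ").split()
```

Without a separator the result is the single token "foobar"; with one, the
tokens "foo" and "bar" stay distinct.  Which behavior is required is exactly
what I think the spec needs to say.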

- - - - 
[Returning to an issue I alluded to up top:]

Oh, and about that syntax for [28] doctypedecl.  I really do hate things
that are this complex without introducing additional names.  It forces me
to make up names in my hand-written parser, and it virtually assures my
made-up names won't match anyone else's.  And it makes it just plain hard
to talk about the syntax.   I think ALL languages, markup and programming,
should be defined in such a way that conversation about them is made simple
and practical.  I feel as if this language goes to very little trouble to
help in that regard.

In particular, there is a LOT of talk all through the document about the
internal and external DTD subset and yet when it comes to saying where those
things are, they are VERY hard to find for the uninitiated.  You look for them
in the syntax rules, and they are nowhere manifest.  I *assume* the external
DTD subset is what is named by the optional ExternalID in a doctypedecl.  Is
it? Can you find the phrase "internal DTD subset" in bold somewhere in the spec
where it is easy to see it is a defining reference? I can't.  How about
"external DTD subset"? Ditto.

I'd have written:

 [28] doctypedecl ::= '<!DOCTYPE' S Name ExtDTDref? IntDTDsubset? S? '>'

 [28.1] ExtDTDref ::= S ExternalID

 [28.2] IntDTDsubset ::= S? '[' markupdecls ']'

 [28.3] markupdecls ::= ( PS | markupdecl )*



- - - - - 
DISCLAIMER:

 These are my personal feelings and not necessarily the official position
 of any company or organization that I may be affiliated with.
Received on Saturday, 25 April 1998 01:40:55 UTC