- From: Kent M Pitman <kmp@harlequin.com>
- Date: Fri, 24 Apr 98 14:26:48 EDT
- To: xml-editor@w3.org
- Cc: kmp@harlequin.com
I must record my PROFOUND personal[*] disappointment at the step backward that XML 1.0 took in the use of parameter entities over the way they were used in the draft specification. I hope this is a step you can yet reverse; please forward this message to whatever forum is appropriate for discussing possible future changes.

I felt that the %foo notation used in the syntax description of XML (not to be confused with the parameter entity notation in XML/SGML itself) during its draft stages was a MAJOR step forward in the correction of my foremost gripe with SGML.

I have spent a substantial part of my career writing parsers for a large variety of notations (NOT using automatic tools like YACC, but custom-crafted by hand). One thing that has always bothered me is the way in which SGML confuses "parsing" and "evaluation" and I had hoped XML would correct that. During its draft phase, it appeared to be headed in that direction. As nearly as I can tell, the final specification took a MAJOR step backward in this regard.

For the purpose of this discussion, I define `parsing' as the interpretation of a sequence of characters as a structure. For example, the notation "foo" is `parsed' when it becomes a 3-character vector containing representations of the characters f, o, and o, in series rather than a 5-character vector containing representations of the characters ", f, o, o, and ", in series. An important feature of `parsing' as I have defined it here is that there is an inverse operation `unparsing' which yields the initial input.

SGML, as nearly as I can discern from its very complex definition, does not have the property that it can be inverted by unparsing to the original notation. The key offender in this is parameter entities. (Comments and general entities also contribute mildly to the confusion, but in lesser ways that are fundamentally more tractable, so that solving only the parameter entity problem would represent, in itself, a major improvement to SGML.)

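The round-trip property just defined can be made concrete with a minimal Common Lisp sketch of the "foo" example. The names PARSE-STRING-LITERAL and UNPARSE-STRING-LITERAL are hypothetical and chosen only for illustration, and the sketch assumes a literal with no embedded quotes or escapes:

```lisp
;; Hypothetical helpers, not taken from any parser mentioned in this
;; message.  Assumes the literal contains no embedded quotes or escapes.

(defun parse-string-literal (text)
  "Turn the 5-character notation \"foo\" into the 3-character string foo."
  (let ((end (1- (length text))))
    (unless (and (plusp end)
                 (char= (char text 0) #\")
                 (char= (char text end) #\"))
      (error "Not a quoted string literal: ~S" text))
    ;; Keep only the characters between the two quotes.
    (subseq text 1 end)))

(defun unparse-string-literal (string)
  "Inverse of PARSE-STRING-LITERAL: put the surrounding quotes back."
  (concatenate 'string "\"" string "\""))

;; Round trip: unparsing the parsed form yields the initial input.
;; (unparse-string-literal (parse-string-literal "\"foo\""))
```

The point of the sketch is only that nothing needed to reproduce the input is thrown away by the parse step; an expansion step that replaces a reference by its value has no such inverse.
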
For the purpose of this discussion, I define `evaluation' as any postprocessing which might optionally be done to an expression for some other purpose than correctly representing the input notation in a way that can be unparsed or postprocessed.

The severity of the error may be plainly noted by the fact that typical SGML and HTML parsers yield either transformed SGML or ESIS, and that in either case, parameter entity information is lost. (Indeed, I am led slightly to question whether those few companies that have SGML editors consider this `error' in design to be a tactical commercial advantage because the language actively works against a parser design that would BOTH have the property of satisfying the desire to be an "editor" (to parse+edit+unparse) and to be a document representation system (to parse+postprocess).) In order to satisfy the editor need, one needs NOT to expand parameter entity references or else one is left with them `edited out' after the first attempt to edit an SGML source text, so one immediately knows that to-ESIS processing or to other SGML processing (such as is done by NSGMLS) is not the path to writing an SGML editor, and one knows as a follow-on that there are no public tools which provide or can be transformed into SGML editors.

What would it take to fix this problem? I claim that a simple constraint on the use of % would make things substantially more tractable, and that is the % usage that was allowed in XML during draft phases [preferably MINUS the ability to say %(...) which I consider an abomination; I'll come back to that].

Why is % a problem? Because SGML is position dependent. A typical SGML markup construction might be

    <!FOO frotz glorp glarn?>

where each of a frotz, a glorp, and a glarn has its own special parsing notation, and where you really have to know whether you're parsing a frotz or a glorp to do the right thing.
So when you see

    <!FOO %zap; zoo>

you don't know if the %zap; is going to expand into a single thing, the "zoo" is to be parsed as a glorp, and the glarn is missing, or whether the %zap; will expand into two things leaving "zoo" to be parsed as an optional (but present) glarn.

This means you absolutely cannot construct a correct parsed representation of a FOO without knowing the value of parameter entities. And you can't know those values without post-processing previous parse expressions--perhaps including those objects you have not even finished parsing yet. That's really sad. Once you substitute the values, you might know the answer, but you only know it for one possible set of the INCLUDE/IGNORE settings, and you don't really know it generally.

In XML drafts, this was fixed by requiring a % to yield up only the indicated token so you could define a

    <!FOO %frotz glorp glarn?>

and then it would be clear that only a frotz could come out of the %zap; in the <!FOO %zap; zoo> we discussed earlier.

That was enough to allow a parser to correctly parse the <!FOO ...> without knowing anything about the expansions. That allowed a printer to print back out a properly parsed entity with all of its parameter entities still unexpanded. That was enough power to write an XML editor that did not presuppose an expansion of a %. That was enough power to allow an extended version of ESIS which included % references.

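The position-dependence problem with unconstrained expansion can be illustrated with a small Common Lisp sketch. Everything in it is hypothetical: SPLIT-TOKENS, PE-REFERENCE-NAME, EXPAND-TOKENS and FOO-SLOTS are names invented for this illustration, tokens are assumed to be separated by single spaces, and the entity table is a plain alist. It is a toy model of expand-then-parse, not a rendering of any real SGML processor.

```lisp
;; Hypothetical illustration only: none of these functions come from
;; the parsers discussed in this message.

(defun split-tokens (text)
  "Split TEXT on single spaces, dropping empty tokens."
  (loop with start = 0
        for end = (position #\Space text :start start)
        for token = (subseq text start end)
        unless (string= token "") collect token
        while end
        do (setf start (1+ end))))

(defun pe-reference-name (token)
  "Return the name in a %name; token, or NIL if TOKEN is not a reference."
  (let ((end (1- (length token))))
    (when (and (> end 1)
               (char= (char token 0) #\%)
               (char= (char token end) #\;))
      (subseq token 1 end))))

(defun expand-tokens (tokens entities)
  "Replace each %name; token by the tokens of its replacement text.
ENTITIES is an alist mapping entity names to replacement text."
  (loop for token in tokens
        for name = (pe-reference-name token)
        append (if name
                   (split-tokens
                    (or (cdr (assoc name entities :test #'string=))
                        (error "Undefined parameter entity: ~A" name)))
                   (list token))))

(defun foo-slots (source entities)
  "Assign the expanded tokens of SOURCE to frotz, glorp and optional glarn."
  (destructuring-bind (frotz glorp &optional glarn)
      (expand-tokens (split-tokens source) entities)
    (list :frotz frotz :glorp glorp :glarn glarn)))

;; The same source text fills different slots depending on the entity:
;; (foo-slots "%zap; zoo" '(("zap" . "a")))
;;   => (:FROTZ "a" :GLORP "zoo" :GLARN NIL)
;; (foo-slots "%zap; zoo" '(("zap" . "a b")))
;;   => (:FROTZ "a" :GLORP "b" :GLARN "zoo")
```

In this model the role of "zoo" cannot be decided from the source text alone, which is exactly the property that the draft rule (one reference, one indicated token) removed.
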
In Lisp terms, I wrote a function permit-% which worked this way in my parsers:

    (defun parse-foo (stream)
      (let ((frotz (permit-% #'parse-frotz stream))
            (glorp (parse-glorp stream))
            (glarn (parse-glarn stream)))
        (require-> stream)
        (make-foo :frotz frotz :glorp glorp :glarn glarn)))

The permit-% was defined this way:

    (defclass parameter-entity-reference (xml-entity)
      ((name   :type string   :accessor name   :initarg :name)
       (parser :type function :accessor parser :initarg :parser)))

    (defun permit-% (parser stream)
      (cond ((code= (peek-code-after-S stream) #\%)
             ;; Make a "closure" of the PEReference name over the parser
             ;; to be used, so that the parser can be called later if needed.
             (make-instance 'parameter-entity-reference
                            :name (parse-PEReference stream)
                            :parser parser))
            (t ;Not a %, so just go ahead and parse the thing now.
             (funcall parser stream))))

Then later, if I *wanted* the value of the PEReference, I could invoke the saved parser on the entity reference to get the parsed form that I would have gotten directly if the parsed stuff had been available.

    (defmethod parsed-replacement-text ((entity parameter-entity-reference))
      (with-input-from-entity (instream (lookup-parameter-entity (name entity)))
        (let ((result (funcall (parser entity) instream)))
          (assure-S-to-eof instream)
          result)))

This implementation worked "mostly fine" in the draft XML implementations, but the XML 1.0 notation goes back to using overpowerful % expansions with no constraint as to how many "things" can result from the %. That makes parsing hard in a way that I'm very, very sad about. [I say "mostly fine" because it had a few minor glitches which are addressed in suggestions (2) and (3) below.]

To summarize, the main thing I want is:

(1) that, as explained above, % should expand into text whose nature and type can be determined without seeing the expansion.
There should not be cases where, like in SGML, you can't parse <!FOO %PPP; A B C> because you don't know how many tokens will result from %PPP; and so you don't know where to resume parsing.

The two additional constraints on % I'd like to see are the following. These were not present even in the draft, and perhaps it was their absence that led people to believe that the %xxx notation was still overpowerful. Maybe that's why you backed out of it wholesale; I can't say. I think these restrictions on parameter entities, in BOTH the external and internal DTD subsets, would allow parameter entities to be used in a more clear and useful way in both cases:

(2) That no % immediately inside another % is permissible in the syntax rules. That is, you can say:

    (a) [1] Foo ::= ( alpha | beta )
        [2] Bar ::= %Foo

and you can say:

    (b) [1] Foo ::= alpha beta
        [2] Bar ::= %Foo

since both of these expand into a fixed and predictable number of things and so lead to a deterministic parse. But you should not ever have syntax rules:

    (c) [1] Foo ::= %alpha %beta
        [2] Bar ::= %Foo

because if you do, then

        [3] Zap ::= '<!ZAP' S Bar beta >

is too hard to parse. You can't tell whether <!ZAP %gunk;> is going to have %gunk; correctly expand into an alpha and a beta (because of rule 2), or whether it's just going to expand into an alpha (because of rule 1) and a beta is going to be missing, or it's going to expand into a %beta (because of rule 1) which is wrong on its face.

(3) As mentioned earlier on, I also would like never to see

    Foo ::= %( alpha | beta )

for purely linguistic reasons. I need to be able to separately name these things. So if a Foo is the thing that has a % in it, then the parameter-entity-reference can't be closed with :parser #'parse-foo since that would allow a % again. It needs to be :parser #'parse-a-or-b. And I hate making up those a-or-b kind of names. If a or b has a natural name, give it that name in the spec. Let THAT be the FOO.
Put the % in the caller, as in (2)(a) above rather than [BAD]:

    (a) [1] Foo ::= %( alpha | beta )
        [2] Bar ::= Foo

which gives two names to a FOO but none to the OR of Alpha and Beta.

I hope I've managed to make myself clear here. If I have not, I invite any questions that might help you understand more clearly why I'm disappointed by the present state of affairs or what it might take to appease me... well, and not just me. Hopefully I've made a reasonable case for why this is an issue a lot of people should care about.

Thanks for your time and consideration.

--Kent Pitman
  kmp@harlequin.com

- - - - - - - - - -
[*] DISCLAIMER: These remarks are mine personally and do not necessarily reflect the formal position of any organization (including my employer) with which I may be affiliated.

Received on Friday, 24 April 1998 14:23:31 UTC