Minor error/defect in the published XML grammar

I've had a chat with Tim Bray over this.  I'll post you the first of two
"errors" I detected in the xml spec on http://www.w3.org/TR/REC-xml when
I generated a parser automatically from the grammar on the html page.
I'll post the second if we reach agreement about this one!


==================
There's an interesting defect in http://www.w3.org/TR/REC-xml.
Look at the two productions:

[43] content ::= (element | CharData | Reference | CDSect | PI | Comment)*
[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)        

You'll see that [14] matches epsilon (i.e.  the empty match) and [43]
allows it to be repeated arbitrarily often.  This should lead to an
infinite loop on a parser generated by a compiler compiler that has full
Turing complexity, since it can't guarantee to detect productions that
may reduce to epsilon ahead of time :-(.

Since [43] is the only place that CharData is referenced, I suppose
you meant a + instead of a * in [14]. Thus:

[14] CharData ::= [^<&]+ - ([^<&]* ']]>' [^<&]*)
==================

Talking with Tim, he tells me that (my conclusions after the elipsis)

a) most authors built hand generated parsers, so ... they might be avoiding
   this problem without thinking about it, since anyone good enough to
   do the former is good enough to do the latter
b) most authors built recursive descent parsers, so ... they ought to
   see the problem. They can't avoid it. Only bottom-up parsers should
   be able to avoid the interpretation of the infinite union of empty
   sets as the unending computation.
c) it apparantly hasn't surfaced as a problem in practice.

I believe I saw the problem because I got wrong at the first attempt
the definition of some of the  token matches (I didn't notice you were
using ^ to mean complement). That lead to a lot of failed matches on x
in (x*)*, and that is a successful match on nothing for x*, repeated
forever. Try putting an illegal character where you expect CharData
in content. That ought to cause the same effect in a current parser.

I haven't tried it ..  nor have I tried any of the other parsers out
there.  I'm just pointing out the fact that the grammar presentation has
a defect as it stands, and the defect is easily corrected. I think it should
be. You've gone to considerable trouble in other areas of the grammar
not to let ground productions match epsilon. CharData is notionally
a character, at least, not nothing.

I have generated a parser from the corrected grammar, and the parser
works on the xml examples I've tried. I'd be grateful for a test suite!

There is another defect of this kind. It occurs in rules 63,64,65.
I haven't looked beyond these two.


-- 
Peter T. Breuer                   MA CASM PhD. Ing., Asoc. Prof. 
Area de Ingenieria Telematica	  E-mail: ptb@it.uc3m.es
Dpto. Ingenieria		  Tel: +34 91 624 91 80
Universidad Carlos III de Madrid  Fax: +34 91 624 94 30/65
Butarque 15, E-28911 Leganes      URL: http://www.it.uc3m.es/~ptb
Spain

Received on Sunday, 16 April 2000 06:12:55 UTC