- From: Peter T. Breuer <ptb@it.uc3m.es>
- Date: Sun, 16 Apr 2000 12:11:04 +0200 (MET DST)
- To: xml-editor@w3.org
I've had a chat with Tim Bray over this. I'll post you the first of two "errors" I detected in the xml spec on http://www.w3.org/TR/REC-xml when I generated a parser automatically from the grammar on the html page. I'll post the second if we reach agreement about this one! ================== There's an interesting defect in http://www.w3.org/TR/REC-xml. Look at the two productions: [43] content ::= (element | CharData | Reference | CDSect | PI | Comment)* [14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*) You'll see that [14] matches epsilon (i.e. the empty match) and [43] allows it to be repeated arbitrarily often. This should lead to an infinite loop on a parser generated by a compiler compiler that has full Turing complexity, since it can't guarantee to detect productions that may reduce to epsilon ahead of time :-(. Since [43] is the only place that CharData is referenced, I suppose you meant a + instead of a * in [14]. Thus: [14] CharData ::= [^<&]+ - ([^<&]* ']]>' [^<&]*) ================== Talking with Tim, he tells me that (my conclusions after the elipsis) a) most authors built hand generated parsers, so ... they might be avoiding this problem without thinking about it, since anyone good enough to do the former is good enough to do the latter b) most authors built recursive descent parsers, so ... they ought to see the problem. They can't avoid it. Only bottom-up parsers should be able to avoid the interpretation of the infinite union of empty sets as the unending computation. c) it apparantly hasn't surfaced as a problem in practice. I believe I saw the problem because I got wrong at the first attempt the definition of some of the token matches (I didn't notice you were using ^ to mean complement). That lead to a lot of failed matches on x in (x*)*, and that is a successful match on nothing for x*, repeated forever. Try putting an illegal character where you expect CharData in content. That ought to cause the same effect in a current parser. I haven't tried it .. nor have I tried any of the other parsers out there. I'm just pointing out the fact that the grammar presentation has a defect as it stands, and the defect is easily corrected. I think it should be. You've gone to considerable trouble in other areas of the grammar not to let ground productions match epsilon. CharData is notionally a character, at least, not nothing. I have generated a parser from the corrected grammar, and the parser works on the xml examples I've tried. I'd be grateful for a test suite! There is another defect of this kind. It occurs in rules 63,64,65. I haven't looked beyond these two. -- Peter T. Breuer MA CASM PhD. Ing., Asoc. Prof. Area de Ingenieria Telematica E-mail: ptb@it.uc3m.es Dpto. Ingenieria Tel: +34 91 624 91 80 Universidad Carlos III de Madrid Fax: +34 91 624 94 30/65 Butarque 15, E-28911 Leganes URL: http://www.it.uc3m.es/~ptb Spain
Received on Sunday, 16 April 2000 06:12:55 UTC