RE: XML grammar error? from Grosso, Paul on 2011-12-21 (xml-editor@w3.org from October to December 2011)

From: Grosso, Paul <pgrosso@ptc.com>
Date: Wed, 21 Dec 2011 11:44:13 -0500
To: "bacchi raffaele" <bacchi_raffaele@lycos.com>, <xml-editor@w3.org>
Message-ID: <9B2DE9094C827E44988F5ADAA6A2C5DA04503B3D@HQ-MAIL9.ptcnet.ptc.com>



> -----Original Message-----
> From: bacchi raffaele [mailto:bacchi_raffaele@lycos.com]
> Sent: Monday, 2011 December 12 3:45
> To: xml-editor@w3.org
> Subject: XML grammar error?
> 
> Hi,
> I think that rule [20] (and other similar) are wrong:
> CData ::= (Char* - (Char* ']]>' Char*))
> The purpose of the rule is to match (reduce) any Char sequence not
> containing ']]>'.
> But this result is not achieved since the Char definition includes ']'
> and '>' so the exception part of the rule:
> -(Char* ']]>' Char*)
> is ambiguous. Most parsers solve the ambiguity by applying the rule
> "reduce as soon, as much as possible"
> thus the rule will always mismatch because the first Char* reduces also
> the sequence ']]>' and the next terminal ']]>' will never match.


There is no ambiguity here.  A - B matches a string if A matches it and
B does not.  The regular expression (in conventional notation)
/^.*]]>.*$/ matches any string that contains at least one ']]>'.
It is ambiguous only in the sense that if the string contains several
occurrences of ']]>', different matchers will match the ']]>' in the
pattern against the first or the last of them.  But that makes no
difference to the meaning of the pattern.
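
To make the set-difference reading concrete, here is a minimal Python
sketch.  It is an illustration only, not anything from the spec: '.'
with DOTALL stands in for the Char production (an assumption, since
Char excludes some code points), and the name is_cdata is hypothetical.

```python
import re

# Assumption: '.' with DOTALL approximates the XML Char production.
A = re.compile(r".*", re.DOTALL)            # Char*
B = re.compile(r".*\]\]>.*", re.DOTALL)     # Char* ']]>' Char*

def is_cdata(s: str) -> bool:
    """Return True if s is in A - B: A matches s and B does not."""
    in_a = A.fullmatch(s) is not None
    in_b = B.fullmatch(s) is not None
    return in_a and not in_b
```

Under these assumptions, is_cdata("a ]] > b") holds while
is_cdata("x]]>y") does not, regardless of how the matcher resolves
the two Char* in B.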

Specifically, a leftmost-longest matcher will first match the first
Char* against the whole string, then attempt to match ']' and fail.
It will then reduce the Char* by one character and try again to match
']'.  Iff there is a ']]>' in the string, it will eventually be matched
as a result of the shortening of the first Char*; the second Char* will
then match whatever is left.  If the string contains more than one
']]>', the rightmost will be the one that matches.

By way of contrast, a DFA matcher will match the leftmost occurrence 
of ']]>'.  But as stated, exactly which ']]>' is matched is irrelevant.
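
The point about which ']]>' gets matched can be sketched in Python.
Python's re module is a backtracking engine, so the lazy quantifier
below is only a stand-in (my assumption) for a matcher that settles on
the leftmost occurrence; the function name delimiter_positions is
hypothetical.

```python
import re

# Greedy first Char*: backtracks from the end, so it finds the last ']]>'.
GREEDY = re.compile(r"(.*)\]\]>(.*)", re.DOTALL)
# Lazy first Char*: grows from the start, so it finds the first ']]>'.
LAZY = re.compile(r"(.*?)\]\]>(.*)", re.DOTALL)

def delimiter_positions(s: str):
    """Return the offset of the ']]>' each pattern chose, or None if no match."""
    g = GREEDY.fullmatch(s)
    l = LAZY.fullmatch(s)
    return (g.end(1) if g else None, l.end(1) if l else None)
```

The two patterns can disagree on the offset, but they always agree on
whether the string matches at all, which is the only thing the
exception in rule [20] depends on.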


> I think the rule (and other similar) should be written:
> Cdata ::= ( Char - ']]>' )*

This will not work, since it says to match a single character that is
not the three-character sequence ']]>'.  No single character can be
three characters, so the exclusion removes nothing and the rule matches
every character, including the ones that make up ']]>'.
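
A small Python sketch of the proposed rule, read literally, shows the
problem.  The helper names are hypothetical, and "any one-character
string" stands in for Char (an assumption).

```python
def char_minus_cdend(c: str) -> bool:
    """Char - ']]>': c is one character and is not the string ']]>'."""
    # A one-character string can never equal a three-character string,
    # so the second test is always true.
    return len(c) == 1 and c != "]]>"

def matches_proposed(s: str) -> bool:
    """( Char - ']]>' )*: every character of s passes the test above."""
    return all(char_minus_cdend(c) for c in s)
```

With this reading, matches_proposed("x]]>y") is true, so the proposed
rule would accept exactly the strings it is meant to exclude.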


Paul Grosso
for the XML Core WG

Received on Wednesday, 21 December 2011 16:44:34 UTC