Re: FW: XML grammar error?

From: John Cowan <cowan@mercury.ccil.org>
Date: Mon, 12 Dec 2011 11:16:59 -0500
To: "Grosso, Paul" <pgrosso@ptc.com>
Cc: public-xml-core-wg@w3.org
Message-ID: <20111212161658.GA25263@mercury.ccil.org>

bacchi raffaele scripsit:

> I think that rule [20] (and other similar) are wrong:
>
> CData ::= (Char* - (Char* ']]>' Char*))
>
> The purpose of the rule is to match (reduce) any Char sequence not
> containing ']]>'.  But this result is not achieved since the Char
> definition includes ']' and '>' so the exception part of the rule:
>
> -(Char* ']]>' Char*)
>
> is ambiguous. Most parsers solve the ambiguity by applying the rule
> "reduce as soon, as much as possible" thus the rule will always
> mismatch because the first Char* reduces also the sequence ']]>'
> and the next terminal ']]>' will never match.

There is no ambiguity here.  A - B matches if A matches, provided B does
not also match what A matches.  The regular expression (in conventional
notation) /^.*]]>.*$/ matches any string that contains at least one ']]>'.
It is ambiguous in the sense that if there are multiple occurrences of ']]>'
in the string, different matchers will match ']]>' in the pattern against
the first or the last.  But that makes no difference to the meaning of
the pattern.
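The difference semantics described above can be sketched in Python. This is only an illustration, not anything taken from the XML Recommendation: the names A, B and matches_cdata are hypothetical, and '.' with re.DOTALL is assumed as a stand-in for Char, which is really a narrower set of code points.

```python
import re

# Assumption: '.' with DOTALL approximates the XML Char production.
A = re.compile(r".*", re.DOTALL)           # Char*
B = re.compile(r".*\]\]>.*", re.DOTALL)    # Char* ']]>' Char*


def matches_cdata(s: str) -> bool:
    """A - B: s matches A, provided s does not also match B."""
    return A.fullmatch(s) is not None and B.fullmatch(s) is None
```

Only the yes/no answer of B is consulted, so how B internally splits the string among its three parts never affects the result.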

Specifically, a leftmost-longest matcher will first match the first
Char* against the whole string, then attempt to match ']' and fail.
It will then reduce the Char* by one character and try again to match
']'.  Iff there is a ']]>' in the string, it will eventually be matched
as a result of the shortening of the first Char*; the second Char* will
then match whatever is left.  If there is more than one, the rightmost
will be the one that matches.

Per contra, a DFA matcher will match the leftmost occurrence of ']]>'.
But as stated, exactly which ']]>' is matched is simply irrelevant.
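The two behaviours can be observed with a small Python sketch. Python's re module is a backtracking matcher, not a DFA, so the non-greedy quantifier is used here only as an assumed stand-in for a matcher that picks the leftmost ']]>'. The function name split_at_marker is hypothetical.

```python
import re


def split_at_marker(s: str, greedy: bool):
    """Return (before, after) around the ']]>' that the matcher selects,
    or None if the string contains no ']]>' at all."""
    first = r"(.*)" if greedy else r"(.*?)"
    m = re.fullmatch(first + r"\]\]>(.*)", s, re.DOTALL)
    return None if m is None else (m.group(1), m.group(2))


sample = "x]]>y]]>z"
rightmost = split_at_marker(sample, greedy=True)    # greedy first part
leftmost = split_at_marker(sample, greedy=False)    # minimal first part
```

Both calls succeed on the same strings and fail on the same strings; they differ only in where the split falls, which is the point made above.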

> I think the rule (and other similar) should be written:
> 
> Cdata ::= ( Char - ']]>' )*

This, of course, is nonsense, since it says to match a single character
which is not a three-character sequence.  No single character can be three
characters, so it will match every character.
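A rough Python sketch of why the proposed rule excludes nothing, under the same difference semantics. The function names are hypothetical, and any one-character string is assumed to count as a Char for the purposes of the illustration.

```python
def char_minus_literal(ch: str, literal: str = "]]>") -> bool:
    """Char - ']]>': ch is one character and is not the string ']]>'.
    A one-character string can never equal a three-character one,
    so the second test always passes."""
    return len(ch) == 1 and ch != literal


def proposed_cdata(s: str) -> bool:
    """( Char - ']]>' )*: every character passes the test above."""
    return all(char_minus_literal(c) for c in s)
```

Under this reading proposed_cdata("a]]>b") is true, so the rewritten rule would accept exactly the strings it was meant to exclude.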

-- 
Man has no body distinct from his soul,              John Cowan
for that called body is a portion of the soul        cowan@ccil.org
discerned by the five senses,                        http://www.ccil.org/~cowan
the chief inlets of the soul in this age.  --William Blake
Received on Monday, 12 December 2011 16:17:28 UTC