Re: [Bug 24769] New: SVG Path BNF is ambigious from Dr. Olaf Hoffmann on 2014-02-24 (www-svg@w3.org from February 2014)

From: Dr. Olaf Hoffmann <Dr.O.Hoffmann@gmx.de>
Date: Mon, 24 Feb 2014 21:15:47 +0100
To: Joe Gregorio <jcgregorio@google.com>, www-svg@w3.org
Message-Id: <201402242115.47978.Dr.O.Hoffmann@gmx.de>
Joe Gregorio:

>
> OK, so I think part of the problem here is my use of the ambiguous terms
> first-match vs longest-match. What we are really seeing here is a
> difference between a PEG (parsing expression grammar) and a CFG (context
> free grammar).
>
> http://en.wikipedia.org/wiki/Parsing_expression_grammar
> http://en.wikipedia.org/wiki/Context-free_grammar
> http://en.wikipedia.org/wiki/Comparison_of_parser_generators
>
> The BNF supplied is a valid CFG, parsers built for CFGs do backtracking,

The SVG draft and the SVG recommendations note, that they use BNF,
and what I learned from the wikipedia articles you cited: BNF is always of
the CFG type, therefore there should be no surprise, that this PEG is
not applicable just by identifying, that BNF is used.

...
>
> This isn't a theoretical problem, I am building a parser for SVG paths and
> as a start
> I copied the supplied BNF into a parser generator, unfortunately the parser
> generator
> I am using is a PEG and not a CFG parser generator:
>

If you use the wrong parser, it can be expected to get surprising results ;o)

...
>   https://github.com/dmajda/pegjs#expression1--expression2----expressionn
>
> I am apparently not the only one to get tripped up by this, as this parser
> also has the same problem:
>
>   https://www.npmjs.org/package/svg-path-parser
>
> If you change the following line in their example:
>
>     , 'A25,35 -80 1,1 450,220 Z'
>
> to:
>
>     , 'A25.0,35 -80 1,1 450,220 Z'
>
> then the parser will crash on what should be a valid path, since the 25
> gets
> parsed as an"integer-constant" and leaves .0 on the input,
> which fails to parse as anything in the grammar.

Well, thats good, the crash indicates clearly, that the parser needs to be 
fixed to get it right for SVG path data...

The recommendations note about this:
'The processing of the BNF must consume as much of a given BNF production as 
possible, stopping at the point when a character is encountered which no 
longer satisfies the production. Thus, in the string "M 100-200", the first 
coordinate for the "moveto" consumes the characters "100" and stops upon 
encountering the minus sign because the minus sign cannot follow a digit in 
the production of a "coordinate". The result is that the first coordinate 
will be "100" and the second coordinate will be "-200".

Similarly, for the string "M 0.6.5", the first coordinate of the "moveto" 
consumes the characters "0.6" and stops upon encountering the second decimal 
point because the production of a "coordinate" only allows one decimal point. 
The result is that the first coordinate will be "0.6" and the second 
coordinate will be ".5".'

Therefore clearly there is no ambiguity, in your case
'25.0' still satifies the production of a coordinate, therefore it does not 
stop after '25'.

'.0' is of the type fractional-constant, therefore
'M.0.0.0.0' creates no crash, but a path of length zero for example, to
be presented effectively as a circle, if stroke-linecap round  applies
with finite size.
For your example the trouble for your parser should appear
either at the flag position (indicating maybe another problem of the
parser; implementation notes prose, deviating from the BNF: 
'Any nonzero value for either of the flags fA or fS is taken to 
mean the value 1.') or the last coordinate, because suddenly 
this is to much for A, respectively not enough for multiple elliptical arcs.

As you can see, because there is already additional prose.
The BNF produced content is not the only valid content, 
that needs to be consumed by a proper parser anyway...


>
> In this case I think the best solution is to rearrange the grammar
> slightly as suggested, as it makes it valid for both PEG and CFG.

You suggest not to use BNF for SVG 2 anymore?
Or at least to use some advanced BNF?

> The alternative would be to add a sentence to the spec that the BNF
> is a CFG and not to be used in a PEG.

This seems to be already the case, because it notes, that BNF is
used and it does not note that the order of  list items is of any
importance, therefore nobody should expect, that it has any relevance?
If I ask: 'do you want to 'pay 100Euro' or  to 'get 50Cent'?', at least I do 
not expect, that everyone will prefer to 'pay 100Euro', just because I 
mention this first or the expression is longer ;o) Longer expression as 
for CSS would not be meaningful as well here, compare for example 
0001000, 1000.0, +1.0E+003

But an additional informational hint for implementors can be of course
helpful, if such an idea came already up, that the order implicates any
preference (maybe we should meet concerning the pay/get question,
if this matters for you in such 'or item lists' generally - or tell me the 
names of companies, that use such PEG decision makers for  
'or item lists' ;o).

I think, this notation only describes, what authors should chose to write
and implementors can expect to get from a correct file (they should however
expect worse as ok as mentioned in the implementation notes), this is not 
necessarily a suggestion for a parser, at least, this is not mentioned 
currently.
Therefore in practice, to write a parser it should be no problem to
rearrange the order of the 'or list items' to get a maybe more efficient
parser work - but this is maybe still not sufficient to get it all.
Maybe it is even more effective, for which path data fragments the
difference exactly matters. If your parser has an abstraction, that
covers both integer-constant and floating-point-constant - would it
be a problem to replace this 'integer-constant | floating-point-constant' 
with one expression 'infloco' in your pseudo-BNF for the PEG parser?
With some more optimisation maybe you can speed up this parser
even more without doing something wrong - if you need to check
flags anyway to differ from the BNF, maybe you can assume one
univeral type 'realNumber' for all and check and fix later.

If such drafts are written only for parsers, it will get soon a problem
for authors to understand at all, what they can finally write.
This BNF seems still simple enough to be read by humans ...


Olaf
Received on Monday, 24 February 2014 20:16:18 UTC