- From: Dr. Olaf Hoffmann <Dr.O.Hoffmann@gmx.de>
- Date: Mon, 24 Feb 2014 21:15:47 +0100
- To: Joe Gregorio <jcgregorio@google.com>, www-svg@w3.org
Joe Gregorio: > > OK, so I think part of the problem here is my use of the ambiguous terms > first-match vs longest-match. What we are really seeing here is a > difference between a PEG (parsing expression grammar) and a CFG (context > free grammar). > > http://en.wikipedia.org/wiki/Parsing_expression_grammar > http://en.wikipedia.org/wiki/Context-free_grammar > http://en.wikipedia.org/wiki/Comparison_of_parser_generators > > The BNF supplied is a valid CFG, parsers built for CFGs do backtracking, The SVG draft and the SVG recommendations note, that they use BNF, and what I learned from the wikipedia articles you cited: BNF is always of the CFG type, therefore there should be no surprise, that this PEG is not applicable just by identifying, that BNF is used. ... > > This isn't a theoretical problem, I am building a parser for SVG paths and > as a start > I copied the supplied BNF into a parser generator, unfortunately the parser > generator > I am using is a PEG and not a CFG parser generator: > If you use the wrong parser, it can be expected to get surprising results ;o) ... > https://github.com/dmajda/pegjs#expression1--expression2----expressionn > > I am apparently not the only one to get tripped up by this, as this parser > also has the same problem: > > https://www.npmjs.org/package/svg-path-parser > > If you change the following line in their example: > > , 'A25,35 -80 1,1 450,220 Z' > > to: > > , 'A25.0,35 -80 1,1 450,220 Z' > > then the parser will crash on what should be a valid path, since the 25 > gets > parsed as an"integer-constant" and leaves .0 on the input, > which fails to parse as anything in the grammar. Well, thats good, the crash indicates clearly, that the parser needs to be fixed to get it right for SVG path data... The recommendations note about this: 'The processing of the BNF must consume as much of a given BNF production as possible, stopping at the point when a character is encountered which no longer satisfies the production. Thus, in the string "M 100-200", the first coordinate for the "moveto" consumes the characters "100" and stops upon encountering the minus sign because the minus sign cannot follow a digit in the production of a "coordinate". The result is that the first coordinate will be "100" and the second coordinate will be "-200". Similarly, for the string "M 0.6.5", the first coordinate of the "moveto" consumes the characters "0.6" and stops upon encountering the second decimal point because the production of a "coordinate" only allows one decimal point. The result is that the first coordinate will be "0.6" and the second coordinate will be ".5".' Therefore clearly there is no ambiguity, in your case '25.0' still satifies the production of a coordinate, therefore it does not stop after '25'. '.0' is of the type fractional-constant, therefore 'M.0.0.0.0' creates no crash, but a path of length zero for example, to be presented effectively as a circle, if stroke-linecap round applies with finite size. For your example the trouble for your parser should appear either at the flag position (indicating maybe another problem of the parser; implementation notes prose, deviating from the BNF: 'Any nonzero value for either of the flags fA or fS is taken to mean the value 1.') or the last coordinate, because suddenly this is to much for A, respectively not enough for multiple elliptical arcs. As you can see, because there is already additional prose. The BNF produced content is not the only valid content, that needs to be consumed by a proper parser anyway... > > In this case I think the best solution is to rearrange the grammar > slightly as suggested, as it makes it valid for both PEG and CFG. You suggest not to use BNF for SVG 2 anymore? Or at least to use some advanced BNF? > The alternative would be to add a sentence to the spec that the BNF > is a CFG and not to be used in a PEG. This seems to be already the case, because it notes, that BNF is used and it does not note that the order of list items is of any importance, therefore nobody should expect, that it has any relevance? If I ask: 'do you want to 'pay 100Euro' or to 'get 50Cent'?', at least I do not expect, that everyone will prefer to 'pay 100Euro', just because I mention this first or the expression is longer ;o) Longer expression as for CSS would not be meaningful as well here, compare for example 0001000, 1000.0, +1.0E+003 But an additional informational hint for implementors can be of course helpful, if such an idea came already up, that the order implicates any preference (maybe we should meet concerning the pay/get question, if this matters for you in such 'or item lists' generally - or tell me the names of companies, that use such PEG decision makers for 'or item lists' ;o). I think, this notation only describes, what authors should chose to write and implementors can expect to get from a correct file (they should however expect worse as ok as mentioned in the implementation notes), this is not necessarily a suggestion for a parser, at least, this is not mentioned currently. Therefore in practice, to write a parser it should be no problem to rearrange the order of the 'or list items' to get a maybe more efficient parser work - but this is maybe still not sufficient to get it all. Maybe it is even more effective, for which path data fragments the difference exactly matters. If your parser has an abstraction, that covers both integer-constant and floating-point-constant - would it be a problem to replace this 'integer-constant | floating-point-constant' with one expression 'infloco' in your pseudo-BNF for the PEG parser? With some more optimisation maybe you can speed up this parser even more without doing something wrong - if you need to check flags anyway to differ from the BNF, maybe you can assume one univeral type 'realNumber' for all and check and fix later. If such drafts are written only for parsers, it will get soon a problem for authors to understand at all, what they can finally write. This BNF seems still simple enough to be read by humans ... Olaf
Received on Monday, 24 February 2014 20:16:18 UTC