- From: Norm Tovey-Walsh <norm@saxonica.com>
- Date: Tue, 18 Jan 2022 08:07:23 +0000
- To: ixml <public-ixml@w3.org>
- Message-ID: <m21r15saep.fsf@saxonica.com>
Hello, In order to parse ixml with my parser, I have to do a little bit of rule-rewriting. I need to turn optionality and repetition into new nonterminals so that all of the rules have the form: nonterminal: seqence, of, terminals, and, nonterminals. I also allow repeated definitions for the same nonterminal to represent alternatives. For example: chapter: para+. might become chapter: paraplus. -paraplus: para, parastar. -parastar: para, parastar. -parastar: para. (I did that rewrite in my head, I make no claims that it’s either optimal or even correct.) This is of no interest to the user, of course, because it all happens automatically. And it has no impact on the ixml output because the generated nonterminals are all marked “-”, they produce nothing. I say all of this only because, as a consequence of this rewriting, the “raw” results that I get from parsing an ixml grammar look a little weird. Nevertheless, for your amusement, in all it’s ~10,000 node glory, here is an SVG diagram of the shared packed parse forest (SPPF) of the text of the ixml grammar (with most occurrences of insiginficant whitespace removed) parsed with ixml: https://nwalsh.com/scratch/sppf/ixml.svg That’s sort of goofy and I doubt it’s useful. It take about 3.5 seconds for my implementation to produce the parse and it takes graphviz about 100 seconds to draw it. So I win. :-) Here’s a simple grammar for arithmetic expressions (in a slightly different BNF format, apologies): Sum => Sum, ["+";"-"], Product Sum => Product Product => Product, ["*";"/"], Factor Product => Factor Factor => '(', Sum, ')' Factor => Number Number => ["0"-"9"], Number Number => ["0"-"9"] The SPPF representation of a parse of “1+(2*3-4)” with that grammar is a little easier to read: https://nwalsh.com/scratch/sppf/sum.svg Each node in the graph is labelled with either a symbol or the partially completed Earley state that it represents. The numbers following the label are the range of characters that it spans. The extra intermediate nodes in the graph are added to limit its size. In my diagrams, nonterminals are oval, Earley states are rectangles, and terminals are houses. A graph of an ambiguous parse has nodes with overlapping ranges. A graph with loops is infinitely ambiguous. (At least, that’s my current understanding of how to interpret the graph; I could be wrong.) Be seeing you, norm -- Norm Tovey-Walsh Saxonica
Received on Tuesday, 18 January 2022 08:42:08 UTC