- From: Fredrik Öhrström <oehrstroem@gmail.com>
- Date: Sat, 25 Jan 2025 12:07:39 +0100
- To: public-ixml@w3.org
- Message-ID: <CALZT+jALUVsG2Cr5PtmL9xYAxFEDYBcDqRWOdFCK7rKZwhhPGw@mail.gmail.com>
I would like to argue that ambiguity in IXML grammars can be very useful. I.e. we expect mutiple parse results and they are useful. The reason is that many languages are indeed ambiguous, but we are so used to hiding this fact, that we forgot. If we add the necessary features to IXML to deal with these languages already in the parser, then we will end up with a complex parser generator that will never stop growing in complexity because there is always something more to add. Take C for example, we have the dangling else statement, we have the builtin type (int) vs user declared type (mynum), we have the T**c which can be a multiplication of T and *c, or if it is a variable declaration of a pointer c of type T**, we have a+++b which could be (a++)+b or a+(++b) but the parser picks the longest match. The power of IXML is that we can generate XML, then continue to work on the XML using XSLT. Thus if we can preserve the ambiguity so that XSLT can work on it, then then we can resolve the ambiguities in XSLT. For example, we could drop the identification of the int keyword in the IXML parser and have all types be user defined types. The XSLT could detect that some are built in like int, and translate those nodes to built-in keyword nodes. The xmq tool uses the yaep parser (Yet Another Earley Parser) for IXML. The yeap parser written by Vladimir Makarov, is a very efficient implementation which minimizes memory by reusing data structures for new input tokens as much as possible: https://github.com/libxmq/xmq/blob/master/src/main/c/yaep/src/yaep.c To speed up the parse, yaep uses bitsets (arrays of int64_t:s) for testing wether it should complete/predict chart items. This is a "negative" lookahead, if it can see that it is impossible to proceed with the rule it will skip adding it. This includes a merge of the bitsets of its children! Assuming C has ~60 lexer token types, that would require 60 bits, which fits in a long int register/memory cell. But it can handle any number of bits, since it uses arrays of terminal_bitset_t (the typedef for long ints) Alas, when I was adapting yeap to parse unicode characters directly, the problem was that ~[] would translate into 154998 bits (the number of allocated unicode characters) which takes 18KiB per step in the parsing, that is a lot of memory per dotted rule, compared the the original 16 bytes or so. I was pondering how to modify the function complete_and_predict_new_state_set to have a special case for charsets, which is somewhat tricky, which is why this approached stalled. But then I came up with a better solution. The xmq tool pre-scans the input to collect exactly which unicode characters are used. Then it modifies the charset rules to test only for the unicode characters actually used! So if you parse a file with only A and B:s in it. (e.g. ABBABABAABBBBABA) then the grammar: content = [L]+. will translate into this yeap grammar structure: (use --verbose to see this intermediate grammar printed) content → -|0 -|0 → [L] -|0 → -|0 [L] -[L] → 'A' -[L] → 'B' where you see that the [L] charset only requires two bits in the bitset. A maps bit 1 and B maps to bit 2. Anyway, the yeap parsing is quick and it generates all possible parses, if you request so (add --ixml-all-parses) and it will insert AMBIGUOUS nodes in the generated tree. Store the below grammar as ambig.ixml program = word. word = int | variable. int = 'int'. variable = [Ll]+. xmq --ixml-all-parses --ixml=ambig.ixml -i "int" will print: program { AMBIGUOUS { word { variable = int } word { int = int } } } if you add to-xml you get: <program><AMBIGUOUS><word><variable>int</variable></word><word><int>int</int></word></AMBIGUOUS></program> You could send this DOM to XLST and have it detect the AMBIGUOUS nodes and handle them based on language specific knowledge. This would require the ixml specification to detail where AMBIGUOUS nodes are inserted. But since yeap was designed to parse C it already has a solution to resolve the ambiguous cases. It can assign a integer cost to each rule and then compose (serialize in ixml lingo) the tree with the least cost. I believe this is a know solution that has been tried before by Norm. I have added test-support for this to the xmq version 3.2.2. If you modify ambig.ixml with a "=<". The =< for the variable rule assigns the cost 1 to this rule. Rules without < have cost 0. You can do =<<< to assign the cost 3 etc. program = word. word = int | variable. int = 'int'. variable =< [Ll]+. Now if you run: xmq --ixml=ambig.ixml -i "int" it will print: FOUND smaller cost 1 word FOUND smaller cost 0 word ixml: Warning! The input can be parsed in multiple ways, ie it is ambiguous! program { word { int = int } } Dont worry about the debug/warning, they will be cleaned up. This is test-code. The point is that it will always pick the parse tree with the least cost. I think the cost solution which executes after the parsing has been performed is a better solution than explicit NOT prefix rule or EXCLUDE postfix rule. One reason is that both NOT and EXCLUDE might force you to maintain a separate list of rules that are less costly than the current. If this list gets out of sync with the original rules, you introduce a bug in the grammar. Also the size of the rule with higher costs grows with the number of rules to exclude. Whereas the cost assignment is just in a single location. Other use cases for a catch-all rule that is clearly ambiguous, since it by design, matches more specific rules, are when you decode a protocol where the framing of each command is known but the content is not. If you have a catch-all you can parse future versions of the protocol and discard commands that you do not know. Extending the grammar is easy. Store the following grammar as proto.ixml protocol = s, -framed_cmd++s, s. -s = -[Zs;#a;#d]*. -framed_cmd = -'(', cmd, -')'. -cmd = keep | send | unknown. keep = -'KEEP ', [N]+. send = -'SEND ', part++-','. part = [N]+. unknown =< ~[")"]*. xmq --ixml=proto.ixml -i "(KEEP 123) (SEND 1,2,3) (TRADE 7)" to-xml | xmllint --format - will print: <protocol> <keep>123</keep> <send> <part>1</part> <part>2</part> <part>3</part> </send> <unknown>TRADE 7</unknown> </protocol> If you drop the < cost marker to get an ixml grammar that works for ixampl, coffeepot, markup-blitz, they will generate the same output, by luck, since the grammar is ambiguous. However xmq based on yaep, without the cost marker you will get three unknowns instead, ie in this case the ambiguity is much more visible in xmq. Imagine having a protocol with 100 commands, then maintaining the NOT/EXCLUDE list would be rather annoying. In this protocol example, it makes sense to parse "keep" and "send" in the ixml grammar since they have internal structure that is easily parsed with ixml, instead of sending all commands as unknowns to XSLT for decoding. Here is a mind-bender that might give you inspiration! Are there other situations where we want to keep all ambiguous parses? Yes, I have played with parsing binary data in HEX format (eg 12440281FFC3932B... ) For byte aligned protocols where the whole byte, or a nybble is enough to determine the parse then this works fine. However often you do have meanings of bits inside the nybble. So how can you extract the bits? (You could write the grammar to work on strings with 01 (011010101101) instead of HEX, but it is a bit inefficient.) Lets add a new rule marker "*" to indicate that any AMBIGUOUS nodes below the rule should be dropped. I.e. we mark the rule to tell the grammar, yes I really want all interpretations below this node. Now the grammar below works with the latest xmq in github (you have to compile it). but again this is test code. Store it into bits.ixml. message = c_field. *c_field = c_hi, c_lo. c_hi>bit = hex_bit_none | hex_bit_1, +"FIVE" | hex_bit_2, +"SIX" | hex_bit_3, +"SEVEN" | hex_bit_4, +"EIGHT". c_lo>bit = hex_bit_none | hex_bit_1, +"ONE" | hex_bit_2, +"TWO" | hex_bit_3, +"THREE" | hex_bit_4, +"FOUR". -byte = hex, hex. -hex = ['0'-'9';'A'-'F']. { 0001 0011 0101 0111 1001 1011 1101 1111 } -hex_bit_1 = -['13579BDF']. { 0010 0011 0110 0111 1010 1011 1110 1111 } -hex_bit_2 = -['2367ABEF']. { 0100 0101 0110 0111 1100 1101 1110 1111 } -hex_bit_3 = -['4567CDEF']. { 1000 1001 1010 1011 1100 1101 1110 1111 } -hex_bit_4 = -['89ABCDEF']. -hex_bit_none = -'0', +"-". Run: xmq --ixml=bits.ixml -i "FF" will print message { c_field { bit = FIVE bit = SIX bit = SEVEN bit = EIGHT bit = ONE bit = TWO bit = THREE bit = FOUR } } because all 8 bits in the byte FF are set. Run: xmq --ixml=bits.ixml -i "C2" will print message { c_field { bit = SEVEN bit = EIGHT bit = TWO } } Ok, this is merely food for thought. I could just pass the c_field to the XSLT and extract the bits there, but it is interesting that a value can have multiple meanings and that a deterministic all-possible parses parse could be used to extract all possible meanings. Summary: Perhaps we should avoid using the word ambiguous for grammars where we really design multiple possible parse results. (user defined types in C, extensible protocols etc). I prefer the cost approach to select a single parse out of these multiple possible parses in IXML. It is already tested and easy to understand and the grammars are easier to maintain. For more complex parsing problens that cannot be handled with the cost alone, instead of trying to resolve the multiple parses in the IXML grammar directly, can we push the multiple possible parses to XSLT in a deterministic way and resolve the problem there?
Received on Saturday, 25 January 2025 11:08:10 UTC