A Case For Ambiguity (mutiple possible parse results) from Fredrik Öhrström on 2025-01-25 (public-ixml@w3.org from January 2025)

From: Fredrik Öhrström <oehrstroem@gmail.com>
Date: Sat, 25 Jan 2025 12:07:39 +0100
To: public-ixml@w3.org
Message-ID: <CALZT+jALUVsG2Cr5PtmL9xYAxFEDYBcDqRWOdFCK7rKZwhhPGw@mail.gmail.com>
I would like to argue that ambiguity in IXML grammars can be very useful.
I.e. we expect mutiple parse results and they are useful.

The reason is that many languages are indeed ambiguous, but we are so
used to hiding this fact, that we forgot.  If we add the
necessary features to IXML to deal with these languages already in the
parser,
then we will end up with a complex parser generator that will never
stop growing in complexity because there is always something more to
add.

Take C for example, we have the dangling else statement, we have the
builtin type (int) vs user declared type (mynum), we have the T**c
which can be a multiplication of T and *c, or if it is a variable
declaration of a pointer c of type T**, we have a+++b which could be
(a++)+b or a+(++b) but the parser picks the longest match.

The power of IXML is that we can generate XML, then continue to work
on the XML using XSLT. Thus if we can preserve the ambiguity so that
XSLT can work on it, then then we can resolve the ambiguities in XSLT.

For example, we could drop the identification of the int keyword in the
IXML parser
and have all types be user defined types. The XSLT could detect that some
are
built in like int, and translate those nodes to built-in keyword nodes.

The xmq tool uses the yaep parser (Yet Another Earley Parser) for IXML.
The yeap parser written by Vladimir Makarov, is a very efficient
implementation
which minimizes memory by reusing data structures for new input tokens as
much as possible:
https://github.com/libxmq/xmq/blob/master/src/main/c/yaep/src/yaep.c

To speed up the parse, yaep uses bitsets (arrays of int64_t:s) for
testing wether it should complete/predict chart items. This is a
"negative" lookahead, if it can see that it is impossible to proceed
with the rule it will skip adding it. This includes a merge of the
bitsets of its children!

Assuming C has ~60 lexer token types, that would require 60 bits,
which fits in a long int register/memory cell.  But it can handle any
number of bits, since it uses arrays of terminal_bitset_t (the typedef
for long ints)

Alas, when I was adapting yeap to parse unicode characters directly,
the problem was that ~[] would translate into 154998 bits (the number
of allocated unicode characters) which takes 18KiB per step in the
parsing, that is a lot of memory per dotted rule, compared the the original
16 bytes or so.

I was pondering how to modify the function
complete_and_predict_new_state_set
to have a special case for charsets, which is somewhat tricky, which is why
this approached stalled.

But then I came up with a better solution.

The xmq tool pre-scans the input to collect exactly which unicode
characters are used.  Then it modifies the charset rules to test only
for the unicode characters actually used!

So if you parse a file with only A and B:s in it. (e.g. ABBABABAABBBBABA)
then the grammar:

content = [L]+.

will translate into this yeap grammar structure:
(use --verbose to see this intermediate grammar printed)

content →  -|0
-|0 →   [L]
-|0 →  -|0  [L]
-[L] →   'A'
-[L] →   'B'

where you see that the [L] charset only requires two bits in the bitset.
A maps bit 1 and B maps to bit 2.

Anyway, the yeap parsing is quick and it generates all possible
parses, if you request so (add --ixml-all-parses) and it will insert
AMBIGUOUS nodes in the generated tree. Store the below grammar as
ambig.ixml

program = word.
word = int | variable.
int = 'int'.
variable = [Ll]+.

xmq --ixml-all-parses --ixml=ambig.ixml -i "int"
will print:
program {
    AMBIGUOUS {
        word {
            variable = int
        }
        word {
            int = int
        }
    }
}
if you add to-xml you get:
<program><AMBIGUOUS><word><variable>int</variable></word><word><int>int</int></word></AMBIGUOUS></program>

You could send this DOM to XLST and have it detect the AMBIGUOUS nodes
and handle them based on language specific knowledge. This would require
the ixml
specification to detail where AMBIGUOUS nodes are inserted.

But since yeap was designed to parse C it already has a solution to resolve
the ambiguous cases. It can assign a integer cost to each rule and
then compose (serialize in ixml lingo) the tree with the least
cost. I believe this is a know solution that has been tried before by Norm.

I have added test-support for this to the xmq version 3.2.2. If you
modify ambig.ixml with a "=<". The =< for the variable rule assigns the
cost 1
to this rule. Rules without < have cost 0. You can do =<<< to assign the
cost 3 etc.

program = word.
word = int | variable.
int = 'int'.
variable =< [Ll]+.

Now if you run:

xmq --ixml=ambig.ixml -i "int"

it will print:
FOUND smaller cost 1 word
FOUND smaller cost 0 word
ixml: Warning! The input can be parsed in multiple ways, ie it is ambiguous!
program {
    word {
        int = int
    }
}

Dont worry about the debug/warning, they will be cleaned up. This is
test-code.  The point is that it will always pick the parse tree with
the least cost.

I think the cost solution which executes after the parsing has been
performed is a better solution than explicit NOT prefix rule or
EXCLUDE postfix rule.

One reason is that both NOT and EXCLUDE might force you to maintain a
separate list of rules that are less costly than the current.

If this list gets out of sync with the original rules, you introduce a
bug in the grammar. Also the size of the rule with higher costs grows
with the number of rules to exclude. Whereas the cost assignment is
just in a single location.

Other use cases for a catch-all rule that is clearly ambiguous, since
it by design, matches more specific rules, are when you decode a
protocol where the framing of each command is known but the content is
not. If you have a catch-all you can parse future versions of the protocol
and discard commands that you do not know. Extending the grammar is easy.
Store the following grammar as proto.ixml

protocol = s, -framed_cmd++s, s.
-s = -[Zs;#a;#d]*.
-framed_cmd = -'(', cmd, -')'.
-cmd = keep | send | unknown.
keep = -'KEEP ', [N]+.
send = -'SEND ', part++-','.
part = [N]+.
unknown =< ~[")"]*.

xmq --ixml=proto.ixml -i "(KEEP 123) (SEND 1,2,3) (TRADE 7)" to-xml |
xmllint --format -

will print:

<protocol>
  <keep>123</keep>
  <send>
    <part>1</part>
    <part>2</part>
    <part>3</part>
  </send>
  <unknown>TRADE 7</unknown>
</protocol>

If you drop the < cost marker to get an ixml grammar that works for
ixampl, coffeepot, markup-blitz, they will generate the same output, by
luck, since the grammar is ambiguous.  However xmq based on yaep, without
the cost marker you will get three unknowns instead, ie in this case
the ambiguity is much more visible in xmq.

Imagine having a protocol with 100 commands, then maintaining the
NOT/EXCLUDE list would be rather annoying.

In this protocol example, it makes sense to parse "keep" and "send" in
the ixml grammar since they have internal structure that is easily
parsed with ixml, instead of sending all commands as unknowns to XSLT
for decoding.

Here is a mind-bender that might give you inspiration! Are there
other situations where we want to keep all ambiguous parses?

Yes, I have played with parsing binary data in HEX format (eg
12440281FFC3932B... ) For byte aligned protocols where the whole byte,
or a nybble is enough to determine the parse then this works fine.

However often you do have meanings of bits inside the nybble.  So how
can you extract the bits? (You could write the grammar to work on
strings with 01 (011010101101) instead of HEX, but it is a bit
inefficient.)

Lets add a new rule marker "*" to indicate that any AMBIGUOUS nodes
below the rule should be dropped. I.e. we mark the rule to tell the
grammar, yes I really want all interpretations below this node.
Now the grammar below works with the latest xmq in github (you have to
compile it).
but again this is test code. Store it into bits.ixml.

message = c_field.
*c_field = c_hi, c_lo.
c_hi>bit = hex_bit_none |
           hex_bit_1, +"FIVE" |
           hex_bit_2, +"SIX" |
           hex_bit_3, +"SEVEN" |
           hex_bit_4, +"EIGHT".
c_lo>bit = hex_bit_none |
           hex_bit_1, +"ONE" |
           hex_bit_2, +"TWO" |
           hex_bit_3, +"THREE" |
           hex_bit_4, +"FOUR".
-byte = hex, hex.
-hex = ['0'-'9';'A'-'F'].
{ 0001 0011 0101 0111 1001 1011 1101 1111 }
-hex_bit_1 = -['13579BDF'].
{ 0010 0011 0110 0111 1010 1011 1110 1111 }
-hex_bit_2 = -['2367ABEF'].
{ 0100 0101 0110 0111 1100 1101 1110 1111 }
-hex_bit_3 = -['4567CDEF'].
{ 1000 1001 1010 1011 1100 1101 1110 1111 }
-hex_bit_4 = -['89ABCDEF'].
-hex_bit_none = -'0', +"-".

Run: xmq --ixml=bits.ixml -i "FF"

will print

message {
    c_field {
        bit = FIVE
        bit = SIX
        bit = SEVEN
        bit = EIGHT
        bit = ONE
        bit = TWO
        bit = THREE
        bit = FOUR
    }
}

because all 8 bits in the byte FF are set.

Run: xmq --ixml=bits.ixml -i "C2"

will print

message {
    c_field {
        bit = SEVEN
        bit = EIGHT
        bit = TWO
    }
}

Ok, this is merely food for thought. I could just pass the c_field to
the XSLT and extract the bits there, but it is interesting that a
value can have multiple meanings and that a deterministic all-possible
parses parse could be used to extract all possible meanings.

Summary:

Perhaps we should avoid using the word ambiguous for grammars where we
really design multiple possible parse results. (user defined types in C,
extensible protocols etc).

I prefer the cost approach to select a single parse out of these multiple
possible parses in IXML.  It is already tested and easy to understand and
the
grammars are easier to maintain.

For more complex parsing problens that cannot be handled with the cost
alone, instead of trying to resolve the multiple parses in the IXML
grammar directly, can we push the multiple possible parses to XSLT in
a deterministic way and resolve the problem there?
Received on Saturday, 25 January 2025 11:08:10 UTC