Some (minor) issues arising from testing a new IXML implementation from John Lumley on 2022-05-11 (public-ixml@w3.org from May 2022)

From: John Lumley <john@saxonica.com>
Date: Wed, 11 May 2022 12:31:27 +0100
To: ixml <public-ixml@w3.org>
Message-ID: <89de514c-2742-bde2-bc31-9409ae6c46c8@saxonica.com>
During development and testing of my IXML implementation a few (minor) 
issues have arisen that are worth recording. I have managed to run all 
the tests in the test suite with a minimum of 6 failures, so I've 
exercised many (most?) of the current IXML features. These issues, in no 
particular order, are:


        Tests assume execution on a Unix machine

Several of the tests have grammars that are oriented around a line-based 
structure and as such use -#a or ~[.... #a] terminals to match (or not 
match) a line ending. Unfortunately, on a Windows machine a line end in 
the test input is of course represented by #d,#a. The pragmatic approach 
I took was to strip all carriage returns out of the input string within 
my test-driver. However, perhaps the tests themselves should be altered 
to accommodate this. At a cursory glance the test-sets/cases involved are:

  * ambiguous: ambig4, css, lf2
  * correct: address, diary, diary2, diary3, lf, para-test, vcard, ranges
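
As a minimal sketch of the carriage-return stripping described above, 
assuming a test-driver written in Python (the function name 
normalise_line_endings and the sample input are hypothetical, not taken 
from any actual implementation):

```python
def normalise_line_endings(text: str) -> str:
    """Drop every carriage return (#d) so that CRLF line ends become bare LF (#a)."""
    # Assumption: a lone carriage return never carries meaning in the test inputs.
    return text.replace("\r", "")


# Hypothetical test input as it might be read on a Windows machine.
windows_input = "line one\r\nline two\r\n"
unix_input = normalise_line_endings(windows_input)
```

This removes every #d, including any that do not precede a #a, which is 
simple but would be wrong for a test that deliberately contains a 
carriage return.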


        Comment retention within promiscuous use of whitespace

The current spec makes no distinction between whitespace with and 
whitespace without embedded comments. As I have written my own IXML 
parser (rather than bootstrapping my Earley parser with an ixml.xml 
start state to parse the input ixml grammar) it has proven irksome to 
keep track of where comments are within deeper structures, so that any 
subsequent export of the ixml in XML format will contain the original 
comments in the original locations. Between rules is easy of course, and 
not too complex between sequence items, but there are some very 
ambiguous situations which may arise for any implementation attempting 
to 'round-trip' and input/serialise a grammar via the XML format.

For example consider the rule:

    a : ^ "b" .

which should parse to:

    <rule name="a">
         <literal tmark="^" string="b"/>
    </rule>

Now consider the grammar definition of quoted:

    -quoted: (tmark, s)?, string, s.

and some additional comments added to the input rule:

    a {1} : {2} ^ {3} "b" {4} .

which might parse as:

    <rule name="a">
         <comment>1</comment>
         <comment>2</comment>
         <literal tmark="^" string="b">
             <comment>4</comment>
         </literal>
    </rule>

but where should comment 3 be stored? Perhaps it should be the first 
child of the literal, as the tmark that precedes it takes an attribute 
position and the string that follows it also takes an attribute position. 
Comment 4 is /certainly/ part of the literal as it appears in its 
trailing s non-terminal. So now we have the conundrum: we cannot tell 
where the name-terminating ':' character sat relative to comments 1 and 
2. The colon could have been before, between or after the comments, and 
the parse for all three would be identical. The same holds for comments 
3 and 4 with respect to the (quoted) string characters.
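
A rough illustration of the information loss, as a Python sketch: if an 
exporter keeps only the comments in document order, the three placements 
of the colon collapse to the same result. The helper comments_in_order 
is hypothetical, and it assumes comments are not nested and that no 
string literal contains a brace, which real ixml does not guarantee.

```python
import re

# Assumption: comments are non-nested and braces never occur inside strings.
COMMENT = re.compile(r"\{([^{}]*)\}")


def comments_in_order(rule_text: str) -> list[str]:
    """Return the text of each comment in the order it appears in the rule."""
    return [m.group(1) for m in COMMENT.finditer(rule_text)]


# The colon placed before, between and after comments 1 and 2.
variants = [
    'a : {1} {2} ^ {3} "b" {4} .',
    'a {1} : {2} ^ {3} "b" {4} .',
    'a {1} {2} : ^ {3} "b" {4} .',
]
results = [comments_in_order(v) for v in variants]
# True when every variant yields the same ordered comment list.
all_same = all(r == results[0] for r in results)
```

Since all three variants yield the same ordered list, nothing in that 
list records where the colon was.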

I'm not saying this is critical, or even perhaps important, but it is a 
drawback, and it arises in part from a rather promiscuous permission of 
(comment-including) whitespace, mostly between unary pre- and post-fix 
operators and their operands (mark, repetition), which I frankly often 
find distracting.


        Miscellaneous

The test ixml/ixml-one-line needs whitespace after most of the 
rule-concluding periods, to meet the current spec.

-- 
*John Lumley* MA PhD CEng FIEE
john@saxonica.com
Received on Wednesday, 11 May 2022 11:32:05 UTC