Re: Some (minor) issues arising from testing a new IXML implementation

Regarding comments within whitespace: I ran into a similar issue when converting STAR files to XML, and as you mentioned, the difficulty is pronounced when attempting to round-trip. Comments typically need to be retained in document order to serve their purpose, but ambiguities like the ones you show below can disrupt that.

The example I provided for STAR is https://balisage.net/Proceedings/vol26/html/Gryk01/BalisageVol26-Gryk01.html#appendixB

Michael
________________________________
From: John Lumley <john@saxonica.com>
Sent: Wednesday, May 11, 2022 6:31 AM
To: ixml <public-ixml@w3.org>
Subject: Some (minor) issues arising from testing a new IXML implementation


During development and testing of my IXML implementation a few (minor) issues have arisen that are worth recording. I have managed to run all the tests in the test suite with a minimum of 6 failures, so I've exercised many (most?) of the current IXML features. These issues, in no particular order, are:

Tests assume execution on a Unix machine

Several of the tests have grammars that are oriented around a line-based structure and as such use -#a or ~[.... #a] terminals to (not) match a line ending. Unfortunately, when running on a Windows machine, a line end in the test input is represented by #d,#a. The pragmatic approach I took was to strip all carriage returns out of the input string within my test driver. However, perhaps the tests themselves should be altered to accommodate this. At a cursory glance, the test sets/cases involved are:

  *   ambiguous: ambig4, css, lf2
  *   correct: address, diary, diary2, diary3, lf, para-test, vcard, ranges
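
A minimal sketch of the carriage-return workaround described above, assuming a test driver written in Python. The function names are hypothetical and are not taken from any actual ixml implementation:

```python
def strip_carriage_returns(text: str) -> str:
    """Remove every #d (carriage return) so that #d,#a line ends become plain #a."""
    return text.replace("\r", "")


def read_test_input(path: str) -> str:
    # newline="" switches off Python's own newline translation, so the only
    # normalisation applied to the test input is the explicit one above.
    with open(path, encoding="utf-8", newline="") as f:
        return strip_carriage_returns(f.read())
```

Note that this also drops any carriage return that is not part of a line end, which is acceptable only if no test input relies on a bare #d.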

Comment retention within promiscuous use of whitespace

The current spec makes no distinction between whitespace with and whitespace without embedded comments. As I have written my own IXML parser (rather than bootstrapping my Earley parser with an ixml.xml start state to parse the input ixml grammar), it has proven irksome to keep track of where comments are within deeper structures, so that any subsequent export of the ixml in XML format will contain the original comments in the original locations. Between rules is easy of course, and between sequence items is not too complex, but some very ambiguous situations may arise for any implementation attempting to 'round-trip' a grammar, i.e. to input it and serialise it via the XML format.

For example consider the rule:

a : ^ "b" .

which should parse to:

<rule name="a">
    <literal tmark="^" string="b"/>
</rule>

Now consider the grammar definition of quoted:

-quoted: (tmark, s)?, string, s.

and some additional comments added to the input rule:

a {1} : {2} ^ {3} "b" {4} .


which might parse as:

<rule name="a">
    <comment>1</comment>
    <comment>2</comment>
    <literal tmark="^" string="b">
        <comment>4</comment>
    </literal>
</rule>


but where should comment 3 be stored? Perhaps it should be the first child of the literal, as the tmark that precedes it takes an attribute position and the trailing string also takes an attribute position. Comment 4 is certainly part of the literal, as it appears in its trailing s non-terminal. So now we have the conundrum: we cannot distinguish where the name-terminating ':' character sat around comments 1 and 2. The colon could have been before, between or after the comments; the parse for all three would be identical. The same holds for comments 3 and 4 with respect to the (quoted) string characters.
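
The ambiguity can be illustrated with a small sketch. This is a toy model in Python, not a real ixml parser: it only handles rules of the shape shown above, it assumes comments after the tmark become children of the literal (one possible choice, as discussed), and the names `tokenize` and `rule_to_xml` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Toy tokens: {comment}, "string", one of : = ^ . - , or a simple name.
TOKEN = re.compile(r'\s*(\{[^{}]*\}|"[^"]*"|[:=^.-]|[A-Za-z]\w*)')


def tokenize(src):
    src = src.strip()
    pos, tokens = 0, []
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def rule_to_xml(src):
    name, *rest = tokenize(src)
    rule = ET.Element("rule", {"name": name})
    literal = None
    for tok in rest:
        if tok.startswith("{"):
            # Comments before the tmark/string go to the rule, later ones to the literal.
            parent = rule if literal is None else literal
            ET.SubElement(parent, "comment").text = tok[1:-1]
        elif tok in ("^", "-"):
            literal = ET.SubElement(rule, "literal", {"tmark": tok})
        elif tok.startswith('"'):
            if literal is None:
                literal = ET.SubElement(rule, "literal")
            literal.set("string", tok[1:-1])
        # ':' / '=' and the closing '.' leave no trace in the tree.
    return ET.tostring(rule, encoding="unicode")


variants = [
    'a {1} : {2} ^ {3} "b" {4} .',
    'a : {1} {2} ^ {3} "b" {4} .',
    'a {1} {2} : ^ {3} "b" {4} .',
    'a {1} : {2} ^ "b" {3} {4} .',
]
distinct = {rule_to_xml(v) for v in variants}
```

Under these assumptions all four inputs serialise to the same XML, so `distinct` holds a single string and the original position of the colon and of the quoted string relative to the comments cannot be recovered from the output.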

I'm not saying this is critical, or even perhaps important, but it is a drawback, and in part it arises from a somewhat promiscuous permission of (comment-including) whitespace, mostly between unary pre- and post-fix operators and their operands (mark, repetition), which I frankly often find distracting.




Miscellaneous

The test ixml/ixml-one-line needs whitespace after most of the rule-concluding periods, to meet the current spec.

--
John Lumley MA PhD CEng FIEE
john@saxonica.com

Received on Wednesday, 11 May 2022 18:02:11 UTC