- From: John Lumley <john@saxonica.com>
- Date: Wed, 11 May 2022 12:31:27 +0100
- To: ixml <public-ixml@w3.org>
- Message-ID: <89de514c-2742-bde2-bc31-9409ae6c46c8@saxonica.com>
During development and testing of my IXML implementation a few (minor)
issues have arisen that are worth recording. I have managed to run all
the tests in the test suite with a minimum of 6 failures, so I've
exercised many (most?) of the current IXML features. These issues, in no
particular order, are:
Tests assume execution on a Unix machine
Several of the tests have grammars that are oriented around a line-based
structure and as such use -#a or ~[.... #a] terminals to match (or not
match) a line ending. Unfortunately, of course, when running on a Windows
machine a line end in the test input is represented by #d,#a. The
pragmatic approach I took was to strip all carriage returns out of the
input string within my test-driver. However, perhaps the tests themselves
should be altered to accommodate this. At a cursory glance the
test-sets/cases involved are:
* ambiguous: ambig4, css, lf2
* correct: address, diary, diary2, diary3, lf, para-test, vcard, ranges,
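The carriage-return workaround described above can be sketched as follows. This is only an illustration in Python, not the code of my test-driver; the function name strip_carriage_returns and the sample input are hypothetical.

```python
def strip_carriage_returns(text: str) -> str:
    """Remove every carriage return (#d) so only line feed (#a) ends a line.

    Hypothetical helper: a test-driver would call this on the test input
    before handing the string to the parser.
    """
    return text.replace("\r", "")


# Sample input as it might be read on a Windows machine (#d,#a line ends).
windows_input = "line one\r\nline two\r\n"
normalised = strip_carriage_returns(windows_input)
```

Note that this removes every #d, not only those immediately before #a, so it would also alter any test whose input is meant to contain a bare carriage return.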
Comment retention within promiscuous use of whitespace
The current spec makes no distinction between whitespace with and
whitespace without embedded comments. As I have written my own IXML
parser (rather than bootstrapping my Earley parser with an ixml.xml
start state to parse the input ixml grammar) it has proven irksome to
keep track of where comments are within deeper structures, so that any
subsequent export of the ixml in XML format will contain the original
comments in the original locations. Between rules is easy of course, and
not too complex between sequence items, but there are some very
ambiguous situations which may arise for any implementation attempting
to 'round-trip' a grammar, that is, to input it and serialise it via the
XML format.
For example consider the rule:
a : ^ "b" .
which should parse to:
<rule name="a">
<literal tmark="^" string="b"/>
</rule>
Now consider the grammar definition of quoted:
-quoted: (tmark, s)?, string, s.
and some additional comments added to the input rule:
a {1} : {2} ^ {3} "b" {4} .
which might parse as:
<rule name="a">
<comment>1</comment>
<comment>2</comment>
<literal tmark="^" string="b">
<comment>4</comment>
</literal>
</rule>
but where should comment 3 be stored? Perhaps it should be the first
child of the literal, as the tmark that precedes it takes an attribute
position and the string that follows it also takes an attribute position.
Comment 4 is /certainly/ part of the literal as it appears in its
trailing s non-terminal. So now we have the conundrum: we cannot tell
where the name-terminating ':' character sat relative to comments 1 and
2. The colon could have been before, between or after the comments, and
the parse for all three would be identical. The same holds for comments
3 and 4 with respect to the (quoted) string characters.
I'm not saying this is critical, or even perhaps important, but it is a
drawback and in part arises from some promiscuous permission of
(comment-including) whitespace, mostly between unary prefix and postfix
operators and their operands (mark, repetition), which I frankly often
find distracting.
Miscellaneous
The test ixml/ixml-one-line needs whitespace after most of the
rule-concluding periods, to meet the current spec.
--
*John Lumley* MA PhD CEng FIEE
john@saxonica.com
Received on Wednesday, 11 May 2022 11:32:05 UTC