Re: Insertions from Norm Tovey-Walsh on 2022-04-02 (public-ixml@w3.org from April 2022)

From: Norm Tovey-Walsh <norm@saxonica.com>
Date: Sat, 02 Apr 2022 11:07:41 +0100
To: public-ixml@w3.org
Message-ID: <m2r16f6ak2.fsf@Hackmatack.fritz.box>

Steven Pemberton <steven.pemberton@cwi.nl> writes:
> The nice thing about "+" is that it is the obvious opposite of "-",
> but has the disadvantage that it is already used for repeats.
> The nice thing about "^" is that the syntax doesn't change, and it
> looks like a proof-reader's insert mark.

Assuming we change the repeat0 and repeat1 marks to ** and ++, we get:

  ^test: "a"+, +"insertion", "b"+, "c"++",".

or

  ^test: "a"+, ^"insertion", "b"+, "c"++",".

This new use for “+” is at the beginning of the string, not the end,
which seems different enough to me. And I think users are going to come
to think of “^” as “insert an element” so there’s some possibility for
confusion about what “^"insertion"” means.

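I suppose you might accidentally drop the comma and type
"a"++"insertion", which would silently do the wrong thing: it would
parse as a separated repeat rather than a repeat followed by an
insertion. We could go with a completely different character like “&”,
indicative of entities in XML, perhaps? Or ⎀. :-)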

If it’s strictly a choice between “+” and “^”, on balance, I think “+”
is better.

Et voilà:

$ cat insdata.ixml
data: @xmlns, value+-",".
xmlns: +"http://example.com/data".
value: pos; neg.
-pos: +"+", digit+.
-neg: +"-", -"(", digit+, -")".
-digit: ["0"-"9"].

$ coffeepot -g:insdata.ixml "100,(200),300,(400)" -pp
<data xmlns="http://example.com/data">
   <value>+100</value>
   <value>-200</value>
   <value>+300</value>
   <value>-400</value>
</data>
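
For readers who want to see what the insertions contribute, here is a rough
hand-written Python stand-in for the grammar above. It is only a sketch: it
is not an ixml processor and has nothing to do with how coffeepot works, and
the function name render is made up for illustration. The comments map each
branch back to the rule it imitates.

```python
import re


def render(data: str) -> str:
    """Hand-written imitation of insdata.ixml (hypothetical helper)."""
    values = []
    for item in data.split(","):                 # value+-","  (commas deleted)
        if re.fullmatch(r"[0-9]+", item):
            # -pos: +"+", digit+.   The "+" is inserted, not read from input.
            values.append("+" + item)
        elif re.fullmatch(r"\([0-9]+\)", item):
            # -neg: +"-", -"(", digit+, -")".   Parens deleted, "-" inserted.
            values.append("-" + item[1:-1])
        else:
            raise ValueError(f"no parse for {item!r}")
    body = "".join(f"   <value>{v}</value>\n" for v in values)
    # xmlns: +"http://example.com/data".   The attribute value is pure insertion.
    return f'<data xmlns="http://example.com/data">\n{body}</data>'


print(render("100,(200),300,(400)"))
```

The signs and the namespace URI never appear in the input; they come only
from the inserted strings.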

It was an interesting challenge to implement. The description that an
insertion is “always treated as an empty string so it always matches”
was not at all helpful as an implementation strategy (despite a couple
of hours of fruitlessly believing it might be).

My parser matches a sequence of tokens. For Invisible XML, that’s always
a sequence of characters. There is no “empty sequence” character that
occurs between two characters that can be said to “match” anything.

Although I managed to twist things such that there were “empty sequence
tokens” in the right-hand-sides of rules, and I persuaded the parser to
skip over them, “always matching” in a sense, they didn’t actually
generate any nodes in the parse forest so they were invisible.

It might have been possible to reconstruct them after the fact from the
partial states in the forest, but I wasn’t at all confident that would
work reliably. There’s a delicate bit of binary balancing going on in
the shared packed parse forest and having parts of states that the
balancer was blind to seemed like a catastrophe in the making.

In the end, I went with a strategy of decorating the preceding symbol in
the rule with an “insertion” attribute containing the text to insert. I
decorate the rule symbol with the attribute if there’s no preceding
symbol. Then, with a little bit of care, it’s possible to generate text
nodes in the output tree in the right places.
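
A minimal sketch of that decoration step, in Python rather than the actual
implementation language, and with every name (Symbol, Rule, attach_insertions,
the "sym"/"ins" tags) invented for illustration. It assumes a rule's
right-hand side has already been flattened into a list of tagged items, which
is a simplification.

```python
from dataclasses import dataclass, field


@dataclass
class Symbol:
    name: str             # terminal text or nonterminal name
    insertion: str = ""   # text to emit right after this symbol


@dataclass
class Rule:
    name: str
    symbols: list = field(default_factory=list)
    insertion: str = ""   # text to emit before the first symbol


def attach_insertions(name, rhs):
    """Build a Rule from items tagged ("sym", name) or ("ins", text).

    Insertions never become symbols the parser has to match. Each one is
    attached to the preceding symbol, or to the rule itself when nothing
    precedes it.
    """
    rule = Rule(name)
    for kind, value in rhs:
        if kind == "ins":
            if rule.symbols:
                rule.symbols[-1].insertion += value
            else:
                rule.insertion += value
        else:
            rule.symbols.append(Symbol(value))
    return rule
```

With this shape, a right-hand side like "a", +"insertion", "b" leaves the
parser with only the two symbols "a" and "b"; the text "insertion" rides
along on "a" and can be written out when the tree is built.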

Both of my unit tests for insertions pass, so it must be bug free,
right? :-)

                                        Be seeing you,
                                          norm

--
Norm Tovey-Walsh
Saxonica
Received on Saturday, 2 April 2022 10:31:58 UTC