- From: Norm Tovey-Walsh <norm@saxonica.com>
- Date: Sat, 02 Apr 2022 11:07:41 +0100
- To: public-ixml@w3.org
- Message-ID: <m2r16f6ak2.fsf@Hackmatack.fritz.box>
Steven Pemberton <steven.pemberton@cwi.nl> writes:

> The nice thing about "+" is that it is the obvious opposite of "-",
> but has the disadvantage is that it is already used for repeats.
> The nice thing about "^" is that the syntax doesn't change, and it
> looks like a proof-reader's insert mark.

Assuming we change the repeat0 and repeat1 marks to ** and ++, we get:

    ^test: "a"+, +"insertion", "b"+, "c"++",".

or

    ^test: "a"+, ^"insertion", "b"+, "c"++",".

This new use for “+” is at the beginning of the string, not the end, which seems different enough to me. And I think users are going to come to think of “^” as “insert an element” so there’s some possibility for confusion about what “^"insertion"” means.

I suppose you might accidentally type

    "a"++"insertion"

which would silently do the wrong thing. We could go with a completely different character like “&”, indicative of entities in XML, perhaps? Or ⎀. :-)

If it’s strictly a choice between “+” and “^”, on balance, I think “+” is better.

Et voilà:

    $ cat insdata.ixml
    data: @xmlns, value+-",".
    xmlns: +"http://example.com/data".
    value: pos; neg.
    -pos: +"+", digit+.
    -neg: +"-", -"(", digit+, -")".
    -digit: ["0"-"9"].

    $ coffeepot -g:insdata.ixml "100,(200),300,(400)" -pp
    <data xmlns="http://example.com/data">
       <value>+100</value>
       <value>-200</value>
       <value>+300</value>
       <value>-400</value>
    </data>

It was an interesting challenge to implement. The description that an insertion is “always treated as an empty string so it always matches” was not at all helpful as an implementation strategy (despite a couple of hours of fruitlessly believing it might be).

My parser matches a sequence of tokens. For Invisible XML, that’s always a sequence of characters. There is no “empty sequence” character that occurs between two characters that can be said to “match” anything.
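For readers without an ixml processor to hand, the effect of the insertions in insdata.ixml can be imitated by hand. The following is only a rough Python sketch of what the grammar produces for that one input format; it is not how coffeepot works, and the function name `render` and the three-space indentation are assumptions made for illustration.

```python
import re


def render(text: str) -> str:
    """Imitate insdata.ixml by hand: plain digits become +N,
    parenthesised digits become -N, and the namespace is inserted."""
    values = []
    for item in text.split(","):
        # Either "(digits)" (the neg rule) or "digits" (the pos rule).
        m = re.fullmatch(r"\(([0-9]+)\)|([0-9]+)", item)
        if m is None:
            raise ValueError(f"not a valid value: {item!r}")
        if m.group(1) is not None:
            # neg: insert "-", drop the parentheses.
            values.append("-" + m.group(1))
        else:
            # pos: insert "+".
            values.append("+" + m.group(2))
    # The xmlns attribute value exists only as an insertion in the grammar.
    lines = ['<data xmlns="http://example.com/data">']
    lines += [f"   <value>{v}</value>" for v in values]
    lines.append("</data>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("100,(200),300,(400)"))
```

The point of the comparison is that none of the "+", "-", or namespace text appears in the input; it comes entirely from the grammar.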
Although I managed to twist things such that there were “empty sequence tokens” in the right-hand-sides of rules, and I persuaded the parser to skip over them, “always matching” in a sense, they didn’t actually generate any nodes in the parse forest so they were invisible.

It might have been possible to reconstruct them after the fact from the partial states in the forest, but I wasn’t at all confident that would work reliably. There’s a delicate bit of binary balancing going on in the shared packed parse forest and having parts of states that the balancer was blind to seemed like a catastrophe in the making.

In the end, I went with a strategy of decorating the preceding symbol in the rule with an “insertion” attribute containing the text to insert. I decorate the rule symbol with the attribute if there’s no preceding symbol. Then, with a little bit of care, it’s possible to generate text nodes in the output tree in the right places.

Both of my unit tests for insertions pass, so it must be bug free, right? :-)

Be seeing you,
  norm

--
Norm Tovey-Walsh
Saxonica
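The decoration strategy described in the message (attach the inserted text to the preceding symbol, or to the rule itself when nothing precedes it) can be sketched roughly as below. This is not coffeepot's code: the `Symbol` and `Rule` classes, the `attach_insertions` and `emit` helpers, and the `("sym", ...)` / `("ins", ...)` item encoding are all hypothetical names chosen for illustration, and the sketch handles only flat text, not a parse forest.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Symbol:
    name: str            # terminal or nonterminal, as written in the rule
    insertion: str = ""  # text to emit right after this symbol's match


@dataclass
class Rule:
    name: str
    rhs: List[Symbol] = field(default_factory=list)
    insertion: str = ""  # text to emit before the first symbol's match


def attach_insertions(name: str, items: List[Tuple[str, str]]) -> Rule:
    """Build a rule whose right-hand side contains no insertion tokens.

    Each ("ins", text) item is folded into the preceding symbol, or into
    the rule itself when no symbol precedes it.
    """
    rule = Rule(name)
    for kind, value in items:
        if kind == "ins":
            if rule.rhs:
                rule.rhs[-1].insertion += value
            else:
                rule.insertion += value
        elif kind == "sym":
            rule.rhs.append(Symbol(value))
        else:
            raise ValueError(f"unknown item kind: {kind!r}")
    return rule


def emit(rule: Rule, matched: List[str]) -> str:
    """Rebuild output text from the text each symbol contributed."""
    if len(matched) != len(rule.rhs):
        raise ValueError("one matched string is needed per symbol")
    out = [rule.insertion]
    for sym, text in zip(rule.rhs, matched):
        out.append(text)
        out.append(sym.insertion)
    return "".join(out)


if __name__ == "__main__":
    # -neg: +"-", -"(", digit+, -")".  Deleted terminals contribute "".
    neg = attach_insertions(
        "neg", [("ins", "-"), ("sym", "("), ("sym", "digit+"), ("sym", ")")]
    )
    print(emit(neg, ["", "200", ""]))
```

Because the insertions live on ordinary symbols, the parser only ever sees symbols that consume input, which is the property the message says the “empty sequence token” approach lacked.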
Received on Saturday, 2 April 2022 10:31:58 UTC