Some notes about dynamic renaming

As part of ACTION 2023-04-11-d from the last meeting, attached are some 
notes about issues on dynamic (computed) renaming of elements and 
attributes in iXML.

John Lumley



  iXML: Dynamic renaming of elements or attributes

John Lumley - 2023may09

One possible extension to iXML serialisation semantics involves the 
dynamic creation of the name of an element or attribute in the final XML 
result tree. This would be necessary for situations such as parsing 
XML-like structures using iXML grammars.

In the general case the name could involve the computation of a string 
value from either some portion of the input string or some function over 
the partially-parsed 'serialisation' tree at the point where the element 
or attribute itself is being created.

To illustrate and discuss this operation and the issues that might arise 
therefrom, I'm going to use the example of parsing XML given in the 
Balisage paper 
https://www.balisage.net/Proceedings/vol27/html/Sperberg-McQueen01/BalisageVol27-Sperberg-McQueen01.html. 
Whilst that paper is primarily concerned with the addition of pragma 
support, this example does illustrate some of the issues that dynamic 
renaming will raise. (I am not of course advocating using iXML to parse 
XML, but the syntax of serialised XML is after all one with which the 
reader /will/ be familiar.)

The base iXML grammar is:

|{ A grammar for a small subset of XML, as an illustration. }
document: ws?, element, ws? .
element: starttag, content, endtag; soletag .
-starttag: -"<", @gi, (ws, attribute)*, ws?, -">".
-endtag: -"</", @gi2, (ws, attribute)*, ws?, -">".
-soletag: -"<", @gi, (ws, attribute)*, ws?, -"/>".
attribute: @name, ws?, -"=", ws?, @value.
value: dqstring; sqstring.
-dqstring: dq, ~['"']*, dq.
-sqstring: sq, ~["'"]*, sq.
-dq: -['"'].
-sq: -["'"].
{ allow at most one PCDATA block between pieces of markup }
-content: PCDATA?, ((processing-instruction; comment; element)++(PCDATA?), PCDATA?)?.
PCDATA: (~["<>&"]; "&amp;"; "&lt;"; "&gt;"; "&apos;"; "&quot;")+.
processing-instruction: -"<?", @name, ws, @pi-data, -"?>".
comment: -"<--", comment-data, -"-->".
gi: name.
gi2: name.
{ name is left as an exercise for the reader. }
-ws: (#20; #A; #C; #9)+.
|

which when given an input: |<abc def="ghi"/>| will produce an output tree of

|<document>
  <element gi="abc">
    <attribute name="def" value="ghi"/>
  </element>
</document>
|

Obviously a simple XSLT transformation along the lines of:

|<xsl:template match="element">
  <xsl:element name="{@gi}">
    <xsl:apply-templates/>
  </xsl:element>
</xsl:template>
<xsl:template match="attribute">
  <xsl:attribute name="{@name}" select="@value"/>
</xsl:template>
...
|

will generate the intended XML tree, but what directives might we add to 
the original iXML grammar to achieve the same result?
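
For readers who prefer to see the post-processing step outside XSLT, the same renaming can be sketched in Python with the standard `xml.etree.ElementTree` module. This is only an illustration of the transformation described above: the function name `rename` is made up for the example, and text content (PCDATA) is ignored for brevity.

```python
import xml.etree.ElementTree as ET

# The raw iXML output tree for the input <abc def="ghi"/>.
RAW = """<document>
  <element gi="abc">
    <attribute name="def" value="ghi"/>
  </element>
</document>"""


def rename(node):
    """Rebuild an <element> node under the name carried by its gi attribute."""
    out = ET.Element(node.get("gi"))
    for child in node:
        if child.tag == "attribute":
            # <attribute name="n" value="v"/> becomes the real attribute n="v".
            out.set(child.get("name"), child.get("value"))
        elif child.tag == "element":
            out.append(rename(child))
    return out


raw = ET.fromstring(RAW)
result = rename(raw.find("element"))
print(ET.tostring(result, encoding="unicode"))
```

The point of the rest of this note is whether the grammar itself can express what `rename` does here, without a separate pass written in another language.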

For simplicity the following discussion will concentrate on 
the |soletag| form, that is, a simplification of the first few rules to

|element: soletag .
-soletag: -"<", @gi, (ws, attribute)*, ws?, -"/>".
attribute: @name, ws?, -"=", ws?, @value.
|

though the conclusions apply to a more general XML input with opening 
and closing tags.

Firstly we will need to mark that the names of the element and 
attribute(s) require dynamic computation:

|NAME element: soletag .
-soletag: -"<", @gi, (ws, attribute)*, ws?, -"/>".
NAME attribute: @name, ws?, -"=", ws?, @value.
|

where *NAME* is a recognised token/directive. Then we have to 
describe /how/ the string value of the respective name should be computed. 
The first issue is: over what 'input' should the name be computed? 
Obviously the original input string from the point of matching the given 
non-terminal is available, e.g. for the attribute the 
string |def="ghi"/>|, and some formula over that could be declared. But 
of course we already have components of a partially-serialised tree 
available, which has, by following the iXML mark directives, helpfully 
discarded syntactic constants, such as whitespace, |=| and |"|:

|<... name="def" value="ghi"/> |

so the obvious choice would be to support some sort of XPath-like 
expression, the atomisation of which will provide the necessary name. 
Thus we can decorate:

|NAME(@gi) element: soletag .
-soletag: -"<", @gi, (ws, attribute)*, ws?, -"/>".
NAME(@name) attribute: @name, ws?, -"=", ws?, @value.
|

where the XPath expressions (|@gi| and |@name|) are evaluated with a 
'temporary' constructed element (|element| and |attribute| respectively) as 
context item. With these computed names we can construct the final tree 
components, yielding

|<abc gi="abc">
  <def name="def" value="ghi"/>
</abc>
|

Adding the attribute serialisation directive to attribute, |NAME(@name) @ 
attribute: @name....|, gives us:

|<abc gi="abc" def="defghi"/> |

which is probably not quite what we really want. Firstly we have 
the |@gi| attribute remaining, when its only real purpose was to capture 
the name of the element, and secondly, the value of the |@def| attribute 
contains not just the expected value, but the value of the 
'name-carrying' attribute |name="def"| as well.
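
The |defghi| value follows from the way an attribute-marked nonterminal is serialised: its value is the concatenation of all the text it retains, and the name carrier is part of that text. A minimal Python sketch, assuming a made-up list-of-pairs model of the temporary |attribute| node (the helper names are hypothetical and are not part of iXML or of any implementation):

```python
# Hypothetical model: the temporary 'attribute' node as (child name, text)
# pairs, after whitespace, "=" and the quotes have been discarded by "-".
temp_attribute = [("name", "def"), ("value", "ghi")]


def computed_name(node, carrier):
    """Stand-in for NAME(@carrier): the text of the first child so named."""
    return next(text for child, text in node if child == carrier)


def attribute_string_value(node):
    """An attribute-marked nonterminal keeps all of its remaining text."""
    return "".join(text for _child, text in node)


name = computed_name(temp_attribute, "name")
value = attribute_string_value(temp_attribute)
print(f'{name}="{value}"')  # prints def="defghi"
```

Nothing in this model tells the serialiser that the |name| child has already served its purpose, which is the gap the following paragraphs try to close.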

So somehow in this type of renaming we will need to suppress the 
'temporary carriers' in the final output. We already have a 
'suppression' mark |-|, so it's tempting to try that, changing the 
carriers to element rather than attribute form:

|NAME(@gi) element: soletag .
-soletag: -"<", @gi, (ws, attribute)*, ws?, -"/>".
NAME(name) attribute: -name, ws?, -"=", ws?, value.
|

But now when we come to compute the name by executing the XPath 
expression |name| on the temporary tree, a failure ensues, as 
there /isn't/ an element with name |name|: the serialisation of that 
element has been suppressed by the mark |-|. The atomisation of the XPath 
expression is now an empty string, which is of course not a valid XML 
NCName.
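
The failure can be reproduced with a small Python sketch. It assumes that the temporary |attribute| element looks like the string below once |-name| has removed the |name| wrapper but kept its text; `findtext` is used as a rough stand-in for atomising the XPath expression, and the `NCNAME` pattern is a deliberately simplified, ASCII-only approximation of the real production.

```python
import re
import xml.etree.ElementTree as ET

# Simplified NCName test (ASCII only); real XML names allow far more characters.
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")

# Assumed shape of the temporary 'attribute' element after "-name":
# the text "def" survives, but no child element called 'name' does.
temp = ET.fromstring("<attribute>def<value>ghi</value></attribute>")

# Rough stand-in for atomising the XPath expression `name`.
computed = temp.findtext("name", default="")

print(repr(computed), bool(NCNAME.match(computed)))  # prints '' False
```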

So suppose we somehow arrange to compute the name /before/ any suppression 
has been performed, i.e. on the full tree, where the leaves contain all 
the characters in the input, in sequence. Indeed there is now 
a |name| element and the execution of the XPath |name| would now yield a 
non-empty string, which is actually correct: |def|.

So far so good, but let us look at another, very simple example:

|type: -"type:", tName.
{[name .]} tName: -'"', [L]+, -'"'.
|

which with input |type:"foo"| would produce an output:

|<type>
  <foo>foo</foo>
</type>
|

If we now suppress the |[L]+| and compute the name /before/ any suppression 
is considered, the name would now be computed as |"foo"| (including the 
quotation marks) and therefore invalid as an XML name. Using the single 
suppression mark |-|, there is no way to distinguish between terms which 
should be discarded completely, and those that should be 
discarded /after/ other information has been extracted from them.
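
The difference between the two moments of evaluation can be made concrete with a few lines of Python. The sketch models the terminals matched by |tName| as (mark, text) pairs, which is an assumption made for illustration only, and again uses a simplified ASCII-only name check.

```python
import re

NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")  # simplified, ASCII only

# Terminals matched by tName for the input type:"foo", as (mark, text) pairs.
# "-" is the ordinary suppression mark; "" means unmarked.
t_name = [("-", '"'), ("", "foo"), ("-", '"')]


def string_value(terminals, honour_marks):
    """Concatenate terminal text, optionally skipping "-"-marked terminals."""
    return "".join(text for mark, text in terminals
                   if not (honour_marks and mark == "-"))


before = string_value(t_name, honour_marks=False)  # full tree: "foo" with quotes
after = string_value(t_name, honour_marks=True)    # marks applied: foo

for candidate in (before, after):
    print(candidate, bool(NCNAME.match(candidate)))
```

Evaluating on the full tree picks up the quotation marks, while evaluating after suppression works here only because the name-carrying text itself is unmarked.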

This then leads to a tentative conclusion that not only does there need 
to be a 'dynamic name' directive, there would, in most likely use cases, 
have to be some form of *DROP* or distinguishable suppression directive, e.g.

|NAME(@gi) element: soletag .
-soletag: -"<", DROP @gi, (ws, attribute)*, ws?, -"/>".
NAME(@name) attribute: DROP @name, ws?, -"=", ws?, @value.
|

Effectively we now have a two-phase serialization. In the first 
pass, *DROP* is ignored (but suppression marks (|-|) are not) 
and *NAME* is computed; in the second pass (using the stored name for the 
node) *DROP* is honoured.
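
A minimal Python sketch of such a two-phase serialisation follows. Everything in it is hypothetical: the `Node` class, the `drop` and `name_from` fields and both functions are invented to model the proposed *NAME* and *DROP* directives, ordinary |-| suppression is assumed to have been applied already, and attribute values are not escaped.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str                        # nonterminal name, e.g. "attribute"
    mark: str = ""                   # "@" = serialise as attribute, "" = element
    drop: bool = False               # hypothetical DROP directive
    name_from: Optional[str] = None  # hypothetical NAME(@x): carrier child name
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def string_value(node, honour_drop):
    """Concatenated text of a node, optionally skipping DROP-marked parts."""
    if honour_drop and node.drop:
        return ""
    return node.text + "".join(string_value(c, honour_drop) for c in node.children)


def final_name(node):
    """Phase 1: DROP is ignored, so the name carrier is still visible."""
    if node.name_from is None:
        return node.name
    carrier = next(c for c in node.children if c.name == node.name_from)
    return string_value(carrier, honour_drop=False)


def serialise(node):
    """Phase 2: emit XML under the computed name, now honouring DROP."""
    name = final_name(node)
    kept = [c for c in node.children if not c.drop]
    attrs = "".join(f' {final_name(c)}="{string_value(c, True)}"'
                    for c in kept if c.mark == "@")
    body = "".join(serialise(c) for c in kept if c.mark != "@")
    return f"<{name}{attrs}>{body}</{name}>" if body else f"<{name}{attrs}/>"


# Temporary tree for the input <abc def="ghi"/> under the DROP grammar above.
tree = Node("element", name_from="gi", children=[
    Node("gi", mark="@", drop=True, text="abc"),
    Node("attribute", mark="@", name_from="name", children=[
        Node("name", mark="@", drop=True, text="def"),
        Node("value", mark="@", text="ghi"),
    ]),
])
print(serialise(tree))  # prints <abc def="ghi"/>
```

In this model the carriers contribute to the name in the first phase and to nothing in the second, so neither the stray |gi| attribute nor the |defghi| value appears.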

Perhaps this approach with 'two-pass serialization' needs to be smarter 
and should only work with pairings of a *NAME* and an associated set 
of *DROP* directives. In effect |DROP @gi| is associated with (bound 
to?) |NAME(@gi)| and similarly |DROP @name| with |NAME(@name)|. It is 
tempting to use the XPath expression as some form of 'id' to act as the 
'linkage' but there might be repetition within different local scopes 
and we might also use more complex XPath expressions in name 
computation, such as |'bar:' || @name| to generate a namespaced element.

This needs some experimentation.

/What might happen if 'nested' *NAME* directives exist, with an outer one 
extracting a name from an inner one? This needs some thought./

Received on Tuesday, 9 May 2023 12:19:52 UTC