- From: John Lumley <john@saxonica.com>
- Date: Tue, 9 May 2023 13:19:23 +0100
- To: ixml <public-ixml@w3.org>
- Message-ID: <b0dea648-49ef-182e-4038-7794942aedd9@saxonica.com>
As part of ACTION 2023-04-11-d from the last meeting, attached are some
notes about issues on dynamic (computed) renaming of elements and
attributes in iXML.
John Lumley
iXML: Dynamic renaming of elements or attributes
John Lumley - 2023may09
One possible extension to iXML serialisation semantics involves the
dynamic creation of the name of an element or attribute in the final XML
result tree. This would be necessary for situations such as parsing
XML-like structures using iXML grammars.
In the general case the name could involve the computation of a string
value from either some portion of the input string or some function over
the partially-parsed 'serialisation' tree at the point where the element
or attribute itself is being created.
To illustrate and discuss this operation and the issues that might arise
therefrom, I'm going to use the example of parsing XML given in the
Balisage paper
https://www.balisage.net/Proceedings/vol27/html/Sperberg-McQueen01/BalisageVol27-Sperberg-McQueen01.html.
Whilst that paper is primarily concerned with the addition of pragma
support, this example does illustrate some of the issues that dynamic
renaming will raise. (I am not of course advocating using iXML to parse
XML, but the syntax of serialised XML is after all one with which the
reader /will/ be familiar.)
The base iXML grammar is:
|{ A grammar for a small subset of XML, as an illustration. }
document: ws?, element, ws? .
element: starttag, content, endtag; soletag .
-starttag: -"<", @gi, (ws, attribute)*, ws?, -">".
-endtag: -"</", @gi2, (ws, attribute)*, ws?, -">".
-soletag: -"<", @gi, (ws, attribute)*, ws?, -"/>".
attribute: @name, ws?, -"=", ws?, @value.
value: dqstring; sqstring.
-dqstring: dq, ~['"']*, dq.
-sqstring: sq, ~["'"]*, sq.
-dq: -['"'].
-sq: -["'"].
{ allow at most one PCDATA block between pieces of markup }
-content: PCDATA?, ((processing-instruction; comment; element)++(PCDATA?), PCDATA?)?.
PCDATA: (~["<>&"]; "&amp;"; "&lt;"; "&gt;"; "&apos;"; "&quot;")+.
processing-instruction: -"<?", @name, ws, @pi-data, -"?>".
comment: -"<--", comment-data, -"-->".
gi: name.
gi2: name.
{ name is left as an exercise for the reader. }
-ws: (#20; #A; #C; #9)+.
|
which, when given the input |<abc def="ghi"/>|, will produce an output
tree of
|<document>
  <element gi="abc">
    <attribute name="def" value="ghi"/>
  </element>
</document>
|
Obviously a simple XSLT transformation along the lines of:
|<xsl:template match="element">
  <xsl:element name="{@gi}">
    <xsl:apply-templates/>
  </xsl:element>
</xsl:template>
<xsl:template match="attribute">
  <xsl:attribute name="{@name}" select="@value"/>
</xsl:template>
...
|
will generate the intended XML tree, but what directives might we add to
the original iXML grammar to achieve the same result?
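For comparison, the same post-processing step can be sketched outside
XSLT. The following is only an illustration in Python, using the
standard library's |xml.etree.ElementTree|. The function name |rename|
is hypothetical, and the sketch assumes the iXML output tree shown
above, with no text directly inside |element|.

```python
import xml.etree.ElementTree as ET

def rename(node):
    """Rebuild the iXML output tree, replacing carrier nodes by real names."""
    if node.tag == "element":
        # The element name is carried by the gi attribute.
        out = ET.Element(node.get("gi"))
        for child in node:
            if child.tag == "attribute":
                # name/value carriers become a real attribute on the new element.
                out.set(child.get("name"), child.get("value"))
            else:
                out.append(rename(child))
        return out
    # Any other node is copied unchanged, recursing into its children.
    out = ET.Element(node.tag, dict(node.attrib))
    out.text = node.text
    for child in node:
        out.append(rename(child))
    return out

raw = ET.fromstring(
    '<document><element gi="abc">'
    '<attribute name="def" value="ghi"/>'
    '</element></document>'
)
print(ET.tostring(rename(raw), encoding="unicode"))
```

The carriers |gi|, |name| and |value| are consumed while the new nodes
are built, which is exactly the behaviour a grammar-level directive
would have to reproduce.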
For simplicity the following discussion will concentrate on the
|soletag| form, that is, a simplification of the first few rules to
|element: soletag .
-soletag: -"<", @gi, (ws, attribute)*, ws?, -"/>".
attribute: @name, ws?, -"=", ws?, @value.
|
though the conclusions apply to a more general XML input with opening
and closing tags.
Firstly we will need to mark that the names of the element and
attribute(s) require dynamic computation:
|NAME element: soletag .
-soletag: -"<", @gi, (ws, attribute)*, ws?, -"/>".
NAME attribute: @name, ws?, -"=", ws?, @value.
|
where *NAME* is a recognised token/directive. Then we have to describe
/how/ the string value of the respective name should be computed.
The first issue is: over what 'input' should the name be computed?
Obviously the original input string from the point of matching the given
non-terminal is available, e.g. for the attribute the string
|def="ghi"/>|, and some formula over that could be declared. But of
course we already have components of a partially-serialised tree
available, which has, by following the iXML mark directives, helpfully
discarded syntactic constants such as whitespace, |=| and |"|:
|<... name="def" value="ghi"/>
|
so the obvious choice would be to support some sort of XPath-like
expression, the atomisation of which will provide the necessary name.
Thus we can decorate:
|NAME(@gi) element: soletag .
-soletag: -"<", @gi, (ws, attribute)*, ws?, -"/>".
NAME(@name) attribute: @name, ws?, -"=", ws?, @value.
|
where the XPath expressions (|@gi| and |@name|) are evaluated with a
'temporary' constructed element (|element| and |attribute| respectively)
as context item. With these computed names we can construct the final
tree components, yielding
|<abc gi="abc">
  <def name="def" value="ghi"/>
</abc>
|
Adding the attribute serialisation directive to attribute,
|NAME(@name) @attribute: @name....|, gives us:
|<abc gi="abc" def="defghi"/>
|
which is probably not quite what we really want. Firstly we have the
|@gi| attribute remaining, when its only real purpose was to capture
the name of the element, and secondly, the value of the |@def| attribute
contains not just the expected value, but the value of the
'name-carrying' attribute |name="def"| as well.
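The leftover carrier can be reproduced with a small model. This is a
sketch in Python, not the behaviour of any existing iXML processor:
|evaluate_name| and |as_attribute| are hypothetical helpers, and the
temporary element is assumed to hold exactly the two attributes shown
above.

```python
import xml.etree.ElementTree as ET

# Temporary element built for the `attribute` rule, after the usual iXML
# marks have discarded whitespace, "=" and the quotes.
temp = ET.Element("attribute", {"name": "def", "value": "ghi"})

def evaluate_name(node, attr):
    """Stand-in for NAME(@attr): read one attribute of the temporary element."""
    return node.get(attr, "")

def as_attribute(node, computed_name):
    """Serialise the renamed node as an attribute. Every retained character
    under the node contributes to the value, the name carrier included."""
    return computed_name, "".join(node.attrib.values())

attr_name, attr_value = as_attribute(temp, evaluate_name(temp, "name"))
print(attr_name, attr_value)
```

Nothing in this model tells the serialiser that |name| has already done
its job, so its characters end up in the value.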
So somehow in this type of renaming we will need to suppress the
'temporary carriers' in the final output. We already have a
'suppression' mark |-|, so it's tempting to try that, changing the
carriers to element rather than attribute form:
|NAME(@gi) element: soletag .
-soletag: -"<", @gi, (ws, attribute)*, ws?, -"/>".
NAME(name) attribute: -name, ws?, -"=", ws?, value.
|
But now when we come to compute the name by executing the XPath
expression |name| on the temporary tree, a failure ensues, as there
/isn't/ an element with name |name|, as the serialisation of its
element has been suppressed by the mark |-|. The atomisation of the
XPath expression is now an empty string, which is of course not a valid
XML NCName.
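A minimal sketch of this failure, in Python. The regular expression is
an ASCII-only approximation of the NCName production, used here purely
for illustration, and |is_ncname| is a hypothetical helper.

```python
import re
import xml.etree.ElementTree as ET

# ASCII-only approximation of NCName (the real production allows many
# more characters); an assumption made for this sketch.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

def is_ncname(candidate):
    """Return True if the string matches the simplified NCName pattern."""
    return NCNAME_ASCII.fullmatch(candidate) is not None

# Temporary tree for the `attribute` rule: -name was suppressed, so only
# the <value> child exists.
temp = ET.Element("attribute")
ET.SubElement(temp, "value").text = "ghi"

# Stand-in for NAME(name): the string value of the child element `name`,
# or the empty string when no such child exists.
computed = temp.findtext("name", default="")
print(repr(computed), is_ncname(computed))
```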
So suppose we somehow arrange to compute the name /before/ any
suppression has been performed, i.e. on the full tree, where the leaves
contain all the characters in the input, in sequence. Indeed there is
now a |name| element and the execution of the XPath |name| would now
yield a
non-empty string, which is in fact the correct one, |def|.
So far so good, but let us look at another, very simple example:
|type: -"type:", tName.
{[name .]} tName: -'"', [L]+, -'"'.
|
which with input |type:"foo"| would produce an output:
|<type>
  <foo>foo</foo>
</type>
|
If we now suppress the |[L]+| and compute the name /before/ any
suppression is considered, the name would now be computed as |"foo"|
(including the quotation marks) and therefore invalid as an XML name.
Using the single suppression mark |-|, there is no way to distinguish
between terms which should be discarded completely, and those that
should be discarded /after/ other information has been extracted from
them.
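The three possible orderings can be compared in a small Python sketch.
The labels |"discard"| and |"carrier"| are hypothetical: they stand for
exactly the distinction that a single |-| mark cannot express.

```python
# Terminals matched by tName for the input type:"foo". Both kinds would be
# written with the same "-" mark in the grammar; the labels are assumptions.
terminals = [('"', "discard"), ("foo", "carrier"), ('"', "discard")]

def name_from(terminals, suppressed_kinds):
    """Concatenate the text of all terminals whose kind is not suppressed."""
    return "".join(text for text, kind in terminals
                   if kind not in suppressed_kinds)

before_any = name_from(terminals, set())                   # quotes included
after_all = name_from(terminals, {"discard", "carrier"})   # nothing left
wanted = name_from(terminals, {"discard"})                 # letters only

print(repr(before_any), repr(after_all), repr(wanted))
```

Only the third ordering gives a usable name, and it requires knowing
which suppressed terms are carriers.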
This then leads to a tentative conclusion that not only does there need
to be a 'dynamic name' directive, there would, in most likely use cases,
have to be some form of *DROP* or distinguishable suppression directive,
e.g.
|NAME(@gi) element: soletag .
-soletag: -"<", DROP @gi, (ws, attribute)*, ws?, -"/>".
NAME(@name) attribute: DROP @name, ws?, -"=", ws?, @value.
|
Effectively we now have a two-phase serialization. In the first pass,
*DROP* is ignored (but suppression marks (|-|) are not) and *NAME* is
computed; in the second pass (using the stored name for the node)
*DROP* is honoured.
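The two passes can be sketched as follows, in Python. All function names
are hypothetical, the temporary elements are assumed to carry their
children as attributes as in the earlier examples, and the |attribute|
rule is assumed to also carry the |@| mark so that it serialises as an
attribute.

```python
import xml.etree.ElementTree as ET

def pass_one(temp, name_attr):
    """Pass 1: DROP is ignored, so the carrier is still there for NAME to read."""
    return temp.get(name_attr)

def pass_two_element(temp, stored_name, drop):
    """Pass 2: build the final element under the stored name, honouring DROP."""
    kept = {k: v for k, v in temp.attrib.items() if k not in drop}
    return ET.Element(stored_name, kept)

def pass_two_attribute(temp, stored_name, drop):
    """Pass 2 for an @-marked node: the value is whatever survives DROP."""
    value = "".join(v for k, v in temp.attrib.items() if k not in drop)
    return stored_name, value

# NAME(@name) @attribute: DROP @name, ..., @value.
attr_temp = ET.Element("attribute", {"name": "def", "value": "ghi"})
attr_name, attr_value = pass_two_attribute(
    attr_temp, pass_one(attr_temp, "name"), {"name"})

# NAME(@gi) element: ..., DROP @gi, ...
elem_temp = ET.Element("element", {"gi": "abc"})
final = pass_two_element(elem_temp, pass_one(elem_temp, "gi"), {"gi"})
final.set(attr_name, attr_value)

print(ET.tostring(final, encoding="unicode"))
```

Here each *DROP* set is passed alongside the *NAME* it belongs to, which
sidesteps the question of how the two would be linked in a real grammar.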
Perhaps this approach with 'two-pass serialization' needs to be smarter
and should only work with pairings of a *NAME* and an associated set of
*DROP* directives. In effect |DROP @gi| is associated with (bound to?)
|NAME(@gi)| and similarly |DROP @name| with |NAME(@name)|. It is
tempting to use the XPath expression as some form of 'id' to act as the
'linkage' but there might be repetition within different local scopes
and we might also use more complex XPath expressions in name
computation, such as |'bar:'||@name| to generate a namespaced element.
This needs some experimentation.
/What might happen if 'nested' *NAME* directives exist, with an outer
one extracting a name from an inner one? This needs some thought./
Attachments
- text/plain attachment: rename.md
Received on Tuesday, 9 May 2023 12:19:52 UTC