- From: John Lumley <john@saxonica.com>
- Date: Tue, 9 May 2023 13:19:23 +0100
- To: ixml <public-ixml@w3.org>
- Message-ID: <b0dea648-49ef-182e-4038-7794942aedd9@saxonica.com>
As part of ACTION 2023-04-11-d from the last meeting, attached are some notes about issues on dynamic (computed) renaming of elements and attributes in iXML.

John Lumley

iXML: Dynamic renaming of elements or attributes

John Lumley - 2023may09

One possible extension to iXML serialisation semantics involves the dynamic creation of the name of an element or attribute in the final XML result tree. This would be necessary for situations such as parsing XML-like structures using iXML grammars. In the general case the name could involve the computation of a string value from either some portion of the input string or some function over the partially-parsed 'serialisation' tree at the point where the element or attribute itself is being created.

To illustrate and discuss this operation and the issues that might arise from it, I'm going to use the example of parsing XML given in the Balisage paper (https://www.balisage.net/Proceedings/vol27/html/Sperberg-McQueen01/BalisageVol27-Sperberg-McQueen01.html). Whilst that paper is primarily concerned with the addition of pragma support, this example does illustrate some of the issues that dynamic renaming will raise. (I am not of course advocating using iXML to parse XML, but the syntax of serialised XML is after all one with which the reader /will/ be familiar.)

The base iXML grammar is:

    { A grammar for a small subset of XML, as an illustration. }
    document: ws?, element, ws? .
    element: starttag, content, endtag; soletag .
    -starttag: -"<", @gi, (ws, attribute)*, ws?, -">".
    -endtag: -"</", @gi2, (ws, attribute)*, ws?, -">".
    -soletag: -"<", @gi, (ws, attribute)*, ws?, -"/>".
    attribute: @name, ws?, -"=", ws?, @value.
    value: dqstring; sqstring.
    -dqstring: dq, ~['"']*, dq.
    -sqstring: sq, ~["'"]*, sq.
    -dq: -['"'].
    -sq: -["'"].
    { allow at most one PCDATA block between pieces of markup }
    -content: PCDATA?, ((processing-instruction; comment; element)++(PCDATA?), PCDATA?)?.
    PCDATA: (~["<>&"]; "&amp;"; "&lt;"; "&gt;"; "&apos;"; "&quot;")+.
    processing-instruction: -"<?", @name, ws, @pi-data, -"?>".
    comment: -"<!--", comment-data, -"-->".
    gi: name.
    gi2: name.
    { name is left as an exercise for the reader. }
    -ws: (#20; #A; #C; #9)+.

which when given an input: |<abc def="ghi"/>| will produce an output tree of

    <document>
      <element gi="abc">
        <attribute name="def" value="ghi"/>
      </element>
    </document>

Obviously a simple XSLT transformation along the lines of:

    <xsl:template match="element">
      <xsl:element name="{@gi}">
        <xsl:apply-templates/>
      </xsl:element>
    </xsl:template>
    <xsl:template match="attribute">
      <xsl:attribute name="{@name}" select="@value"/>
    </xsl:template>
    ...

will generate the intended XML tree, but what directives might we add to the original iXML grammar to achieve the same result? For simplicity the following discussion will concentrate on the |soletag| form, that is a simplification of the first few rules to

    element: soletag .
    -soletag: -"<", @gi, (ws, attribute)*, ws?, -"/>".
    attribute: @name, ws?, -"=", ws?, @value.

though the conclusions apply to a more general XML input with opening and closing tags.

Firstly we will need to mark that the names of the element and attribute(s) require dynamic computation:

    NAME element: soletag .
    -soletag: -"<", @gi, (ws, attribute)*, ws?, -"/>".
    NAME attribute: @name, ws?, -"=", ws?, @value.

where *NAME* is a recognised token/directive. Then we have to describe /how/ the string value of the respective name should be computed.

The first issue is: over what 'input' should the name be computed? Obviously the original input string from the point of matching the given non-terminal is available, e.g. for the attribute the string |def="ghi"/>|, and some formula over that could be declared. But of course we already have components of a partially-serialised tree available, which has, by following the iXML mark directives, helpfully discarded syntactic constants such as whitespace, |=| and |"|:

    <... name="def" value="ghi"/>

so the obvious choice would be to support some sort of XPath-like expression, the atomisation of which will provide the necessary name. Thus we can decorate:

    NAME(@gi) element: soletag .
    -soletag: -"<", @gi, (ws, attribute)*, ws?, -"/>".
    NAME(@name) attribute: @name, ws?, -"=", ws?, @value.

where the XPath expressions (|@gi| and |@name|) are evaluated with a 'temporary' constructed element (|element| and |attribute| respectively) as context item. With these computed names we can construct the final tree components, yielding

    <abc gi="abc">
      <def name="def" value="ghi"/>
    </abc>

Adding the attribute serialisation directive to attribute, |NAME(@name) @ attribute: @name....|, gives us:

    <abc gi="abc" def="defghi"/>

which is probably not quite what we really want. Firstly we have the |@gi| attribute remaining, when its only real purpose was to capture the name of the element, and secondly, the value of the |@def| attribute contains not just the expected value, but the value of the 'name-carrying' attribute |name="def"| as well. So somehow in this type of renaming we will need to suppress the 'temporary carriers' in the final output.

We already have a 'suppression' mark |-|, so it's tempting to try that, changing the carriers to element rather than attribute form:

    NAME(@gi) element: soletag .
    -soletag: -"<", @gi, (ws, attribute)*, ws?, -"/>".
    NAME(name) attribute: -name, ws?, -"=", ws?, value.

But now when we come to compute the name by executing the XPath expression |name| on the temporary tree, a failure ensues, as there /isn't/ an element with name |name|, since the serialisation of its element has been suppressed by the mark |-|. The atomisation of the XPath expression is now an empty string, which is of course not a valid XML NCName.

So suppose we somehow arrange to compute the name /before/ any suppression has been performed, i.e. on the full tree, where the leaves contain all the characters in the input, in sequence. Indeed there is now a |name| element and the execution of the XPath |name| would now yield a non-empty string, which is actually correct, as |def|. So far so good, but let us look at another, very simple example:

    type: -"type:", tName.
    {[name .]} tName: -'"', [L]+, -'"'.

which with input: |type:"foo"| would produce an output:

    <type>
      <foo>foo</foo>
    </type>

If we now suppress the |[L]+| and compute the name /before/ any suppression is considered, the name would now be computed as |"foo"| (including the quotation marks) and therefore be invalid as an XML name. Using the single suppression mark |-|, there is no way to distinguish between terms which should be discarded completely, and those that should be discarded /after/ other information has been extracted from them.

This then leads to a tentative conclusion that not only does there need to be a 'dynamic name' directive, there would, in most likely use cases, have to be some form of *DROP* or distinguishable suppression directive, e.g.

    NAME(@gi) element: soletag .
    -soletag: -"<", DROP @gi, (ws, attribute)*, ws?, -"/>".
    NAME(@name) attribute: DROP @name, ws?, -"=", ws?, @value.

Effectively we now have a two-phase serialisation. In the first pass, *DROP* is ignored (but suppression marks (|-|) are not) and *NAME* is computed; in the second pass (using the stored name for the node) *DROP* is honoured.

Perhaps this approach with 'two-pass serialisation' needs to be smarter and should only work with pairings of a *NAME* and an associated set of *DROP* directives. In effect |DROP @gi| is associated with (bound to?) |NAME(@gi)| and similarly |DROP @name| with |NAME(@name)|. It is tempting to use the XPath expression as some form of 'id' to act as the 'linkage', but there might be repetition within different local scopes, and we might also use more complex XPath expressions in name computation, such as |'bar:'||@name| to generate a namespaced element. This needs some experimentation.
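The two-pass idea above can be tried out with a small Python sketch. Everything in it is an assumption made for illustration and is not part of iXML or of any existing processor: the `Node` class, its `name_expr` and `drop` fields (standing in for *NAME* and *DROP*), and the name evaluator, which understands only `@attr` and `child` rather than real XPath. The example tree uses the `@ attribute` variant of the grammar, so that the renamed node ends up as an attribute.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union
import xml.etree.ElementTree as ET


@dataclass
class Node:
    """One nonterminal of a parse tree (hypothetical model, not an iXML API)."""
    name: str
    children: List[Union["Node", str]] = field(default_factory=list)
    mark: str = "^"                  # "^" element, "@" attribute, "-" suppressed
    name_expr: Optional[str] = None  # hypothetical NAME(...) argument
    drop: bool = False               # hypothetical DROP directive


def text_of(node, honour_drop):
    """String value of a node, skipping DROP-marked children when asked."""
    if isinstance(node, str):
        return node
    return "".join(text_of(c, honour_drop) for c in node.children
                   if isinstance(c, str) or not (honour_drop and c.drop))


def add_text(parent, text):
    """Append character data using ElementTree's text/tail model."""
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def fill(parent, node, honour_drop):
    """Serialise node's children into parent, following the marks."""
    for child in node.children:
        if isinstance(child, str):
            add_text(parent, child)
        elif honour_drop and child.drop:
            continue                          # pass 2 only: discard the carrier
        elif child.mark == "-":
            fill(parent, child, honour_drop)  # splice content into the parent
        elif child.mark == "@":
            parent.set(dynamic_name(child), text_of(child, honour_drop))
        else:
            fill(ET.SubElement(parent, dynamic_name(child)), child, honour_drop)


def dynamic_name(node):
    """Pass 1: compute the name on a temporary element, with DROP ignored."""
    if node.name_expr is None:
        return node.name
    temp = ET.Element(node.name)
    fill(temp, node, honour_drop=False)
    if node.name_expr.startswith("@"):        # tiny stand-in for XPath
        value = temp.get(node.name_expr[1:])
    else:
        value = temp.findtext(node.name_expr)
    if not value:
        raise ValueError(f"NAME({node.name_expr}) is empty for {node.name}")
    return value


def serialise(root):
    """Pass 2: build the final tree, honouring DROP and the computed names."""
    result = ET.Element(dynamic_name(root))
    fill(result, root, honour_drop=True)
    return result


# Parse tree for the input <abc def="ghi"/> under the NAME/DROP grammar.
tree = Node("element", name_expr="@gi", children=[
    Node("soletag", mark="-", children=[
        Node("gi", ["abc"], mark="@", drop=True),
        Node("attribute", mark="@", name_expr="@name", children=[
            Node("name", ["def"], mark="@", drop=True),
            Node("value", ["ghi"], mark="@"),
        ]),
    ]),
])
print(ET.tostring(serialise(tree), encoding="unicode"))
```

For this tree the result is an element named `abc` with a single attribute `def` whose value is `ghi`. Setting `drop=False` on the two carriers reproduces the unwanted `gi="abc"` and `def="defghi"` of the single-pass case. Note that the sketch recomputes names on every call rather than storing them, and it does not address how a *DROP* would be bound to a particular *NAME*.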
/What might happen if 'nested' *NAME* directives exist, with an outer one extracting a name from an inner one? This needs some thought./
Attachments
- text/plain attachment: rename.md
Received on Tuesday, 9 May 2023 12:19:52 UTC