- From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
- Date: Mon, 11 Mar 2002 11:27:40 -0000
- To: <w3c-rdfcore-wg@w3.org>
- Message-ID: <CEECKEAMDAJDDEDGJNBECECCCAAA.jjc@hpl.hp.com>
After what I have heard in the telecon, I think it is worth stepping through
some very simple examples, being aware of what xslt makes of them.
This is a fairly long message, sorry.
We will arrive at a single 'complete' proposal for xml literal.

The only thing we are considering here is namespaces within the xml literal
"<foo/>".

There is a zip file attached, but it is only needed if you wish to run the
examples with your own version of xslt. I have used saxon 6.4.

Outline
=======
0: Assumptions
1: Namespaces That Aren't Used Should Be Ignored
2: Use of Exclusive Canonicalization
3: Difficulties with QNames as Attribute Values
4: InclusiveNameSpaces & Attribute Value "Literal"
5: Comments
6: A Proposal
7: What's the other path?

0: Assumptions
==============

I assume:
- we do not want "namespace pollution"
- we want RDF/XML to be processable through XSLT without getting corrupted.
- following Eric's comments about comments, that we do not want to lose
  potentially relevant information.

The second condition is tested using the copy transform taken verbatim from
the XSLT recommendation (copy.xsl in zip):
[[[
<!-- This program is taken from the XSLT recommendation:
http://www.w3.org/TR/1999/REC-xslt-19991116#copying
-->
<!-- For example, the identity transformation can be written using xsl:copy
as follows:
-->

<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>
]]]

1: Namespaces That Aren't Used Should Be Ignored
================================================

So applying this to file a_1.xml i.e.
[[[
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:eg="http://example.org/" >
  <rdf:Description>
    <eg:a rdf:parseType="Literal">
      <foo/>
    </eg:a>
  </rdf:Description>
</rdf:RDF>
]]]

We get c_1.xml:
[[[
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:eg="http://example.org/">
  <rdf:Description>
    <eg:a rdf:parseType="Literal">
      <foo/>
    </eg:a>
  </rdf:Description>
</rdf:RDF>
]]]

The very similar a_2.xml:
[[[
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:eg="http://example.org/" >
  <rdf:Description>
    <eg:a rdf:parseType="Literal">
      <foo></foo>
    </eg:a>
  </rdf:Description>
</rdf:RDF>
]]]
is 'copied' to c_2.xml which is identical to c_1.xml.
This is an example of how differences that are not in infoset are ignored by
XSLT.

Now, slightly more to the point, in a_3.xml we have a difference that is in
infoset:
[[[
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:EG="http://example.org/" >
  <rdf:Description>
    <EG:a rdf:parseType="Literal">
      <foo/>
    </EG:a>
  </rdf:Description>
</rdf:RDF>
]]]
The namespace prefix eg has been replaced by the namespace prefix EG.

c_3.xml, the result of copying a_3, is not surprising:
[[[
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:EG="http://example.org/">
  <rdf:Description>
    <EG:a rdf:parseType="Literal">
      <foo/>
    </EG:a>
  </rdf:Description>
</rdf:RDF>
]]]

At this stage, it appears as though changing the namespace prefix has not
changed the xml literal (which doesn't use any namespaces!).

However, a different transform extracts the xml literal from its element and
makes it a complete xml document.

The first two examples (i.e.
x_1.xml and x_2.xml) in the zip are
[[[
<a>
  <foo xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
       xmlns:eg="http://example.org/"/>
</a>
]]]

Whereas the third example (x_3.xml) is:
[[[
<a>
  <foo xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
       xmlns:EG="http://example.org/"/>
</a>
]]]

NOTE
Both namespaces are part of the <foo/> element as far as XSLT is concerned,
and the namespace prefixes matter.
Thus as far as xslt is concerned, the xml literals in a_1.xml and a_3.xml are
different, even though both are "<foo/>" surrounded by identical whitespace.

OPINION (uncontroversial?)
=======
I regard these extracts as illustrating "namespace pollution".
I think that the two documents a_1.xml and a_3.xml describe the same RDF
graph despite the difference between them (prefix "eg" replaced by prefix
"EG").

Moving on to a_4.xml this is:
[[[
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:eg="http://example.org/" >
  <rdf:Description>
    <eg:a rdf:parseType="Literal">
      <foo xmlns:eg="http://example.org/" />
    </eg:a>
  </rdf:Description>
</rdf:RDF>
]]]
If this is our RDF input file, the author may expect that the namespace "eg"
is present on the xml literal. If you look at the xml (as text) it is indeed
there!
But ...
If we xslt copy this we get c_4.xml
[[[
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:eg="http://example.org/">
  <rdf:Description>
    <eg:a rdf:parseType="Literal">
      <foo/>
    </eg:a>
  </rdf:Description>
</rdf:RDF>
]]]
which is the same as c_1.xml and c_2.xml.

What has happened is that the data model used by XSLT uses namespace
attributes to compute the namespaces on the elements and then discards them.
The new namespace declaration does not change the namespaces on that element
("eg" was already in scope) and hence is ignored completely.
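
The observation that a repeated declaration adds nothing to the namespaces in
scope can also be checked outside XSLT. What follows is only a minimal Python
sketch, not part of the zip: it uses the standard library's xml.dom.minidom,
the helper names in_scope_namespaces and literal_scope are assumptions made
for illustration, and the documents are inline copies of a_1.xml and a_4.xml
with the whitespace inside the literal dropped. It approximates the namespace
nodes of the XSLT data model by walking up the ancestors, and it handles
prefixed declarations only.

```python
from xml.dom import minidom

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Inline copies of a_1.xml and a_4.xml (whitespace inside the literal dropped).
A_1 = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:eg="http://example.org/">
  <rdf:Description>
    <eg:a rdf:parseType="Literal"><foo/></eg:a>
  </rdf:Description>
</rdf:RDF>"""
A_4 = A_1.replace("<foo/>", '<foo xmlns:eg="http://example.org/"/>')


def in_scope_namespaces(element):
    """Return {prefix: uri} for the prefixed namespaces in scope on element."""
    scope = {}
    node = element
    while node is not None and node.nodeType == node.ELEMENT_NODE:
        for name, value in node.attributes.items():
            if name.startswith("xmlns:"):
                # Declarations nearer the element win, so never overwrite.
                scope.setdefault(name[len("xmlns:"):], value)
        node = node.parentNode
    return scope


def literal_scope(rdf_xml):
    """Namespaces in scope on the <foo> element of an RDF/XML document."""
    foo = minidom.parseString(rdf_xml).getElementsByTagName("foo")[0]
    return in_scope_namespaces(foo)


# The redundant xmlns:eg on <foo> leaves the in-scope set unchanged.
print(literal_scope(A_1) == literal_scope(A_4))
```

Under these assumptions the two dictionaries compare equal, which is the same
reason the copy transform cannot tell a_4.xml from a_1.xml.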
Indeed running the extract transform to get x_4.xml we also get (almost) the
same as before:
[[[
<a>
  <foo xmlns:eg="http://example.org/"
       xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>
</a>
]]]
(note that Saxon has reversed the order of the namespace attributes, this is
not in infoset, and should be ignored)

However putting the same text string into the context of a_3 we get a_5.xml:
[[[
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:EG="http://example.org/" >
  <rdf:Description>
    <EG:a rdf:parseType="Literal">
      <foo xmlns:eg="http://example.org/" />
    </EG:a>
  </rdf:Description>
</rdf:RDF>
]]]
This one is distinguishable under XSLT from all the others.
If we look at the "copied" file c_5.xml we see that the extra namespace
declaration does not vanish:
[[[
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:EG="http://example.org/">
  <rdf:Description>
    <EG:a rdf:parseType="Literal">
      <foo xmlns:eg="http://example.org/"/>
    </EG:a>
  </rdf:Description>
</rdf:RDF>
]]]
Moreover looking at the extract file x_5.xml, we see that the literal has
more namespaces than previously:
[[[
<a>
  <foo xmlns:eg="http://example.org/"
       xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
       xmlns:EG="http://example.org/"/>
</a>
]]]

My take on these examples is that:
- the first three examples are all the same.
  (the first two have identical infoset, the third is identical text).
- so is the fourth, because XSLT cannot distinguish it from the first.
- also the fifth is the same as the fourth because the text version is self
  contained and identical.

i.e. all the examples (1 to 6) are basically "<foo/>" which only refers to
the default namespace and so any other namespace declaration is irrelevant!!

This differs from Infoset which sees the namespace attributes and the
namespaces as part of the element content, and from XSLT which doesn't see
the namespace attributes but does see *all* the namespaces as part of the
element content.
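
The claim that each of these literals only refers to the default namespace
can be sketched in code too. The Python below is illustrative only (standard
library, nothing from the zip): the helper names wrap, prefix_of,
used_prefixes and literal_prefixes are hypothetical, and the documents are
abbreviated stand-ins for a_1.xml, a_3.xml and a_5.xml without the whitespace
around the literal. It collects the prefixes that actually occur in the
element and attribute names of the literal, writing "" for the default
namespace, and ignores the declarations themselves.

```python
from xml.dom import minidom

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def wrap(prefix, literal):
    """Build an RDF/XML document that binds `prefix` and holds `literal`."""
    return (
        f'<rdf:RDF xmlns:rdf="{RDF}" xmlns:{prefix}="http://example.org/">'
        f'<rdf:Description><{prefix}:a rdf:parseType="Literal">{literal}'
        f"</{prefix}:a></rdf:Description></rdf:RDF>"
    )


def prefix_of(qname):
    """Prefix of a qualified name, or "" when the name has no prefix."""
    return qname.split(":", 1)[0] if ":" in qname else ""


def used_prefixes(element):
    """Prefixes used by the names of element, its attributes and descendants."""
    used = {prefix_of(element.tagName)}
    for name, _value in element.attributes.items():
        # Skip namespace declarations; unprefixed attributes use no namespace.
        if name != "xmlns" and not name.startswith("xmlns:") and ":" in name:
            used.add(prefix_of(name))
    for child in element.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            used |= used_prefixes(child)
    return used


def literal_prefixes(rdf_xml):
    """Prefixes used by the <foo> literal inside an RDF/XML document."""
    foo = minidom.parseString(rdf_xml).getElementsByTagName("foo")[0]
    return used_prefixes(foo)


A_1 = wrap("eg", "<foo/>")
A_3 = wrap("EG", "<foo/>")
A_5 = wrap("EG", '<foo xmlns:eg="http://example.org/"/>')

for doc in (A_1, A_3, A_5):
    print(sorted(literal_prefixes(doc)))
```

All three documents yield the same one-element result, the default namespace,
whichever prefixes happen to be declared around or on the literal.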

2: Use of Exclusive Canonicalization
====================================

There is only one XML spec that I am aware of which worries in this sort of
way about referring to namespaces; ignoring ones that are not used.
That spec is the exclusive canonicalization spec.
The key concept is:
http://www.w3.org/TR/2002/CR-xml-exc-c14n-20020212#def-visibly-utilizes
[[[
An element E in a document subset visibly utilizes a namespace declaration,
i.e. a namespace prefix P and bound value V, if E or an attribute node in the
document subset with parent E has a qualified name in which P is the
namespace prefix. A similar definition applies for an element E in a document
subset that visibly utilizes the default namespace declaration, which occurs
if E has no namespace prefix
]]]
(that is the only new concept in exc-c14n).

Using this concept we could imagine a statement like:
"An xml literal includes the namespaces which are visibly utilized by that
literal, and no others."
or (more strongly)
"An xml literal is formed by taking the exclusive canonicalization of the
element content."

Either of these statements would be consistent with all the examples 1 to 6
being of the same literal.
Note that the examples 4, 5 and 6 in which the original XML has explicit
namespace declarations within the xml literal *do not* visibly use those
namespaces, and so the namespace declarations are simply ignored.

I use the transform smaller.xsl in the zip to make the files s_1.xml etc.
These are like the x_1.xml etc but without the invisible namespaces.

<aside>
Two other possibilities other than using at least the concept of visible
utilization from exc-c14n are:
- follow M&S in *not* addressing the namespace in xml literal issue.
- do our own thing independent of XML groups.
I do not see either of these as attractive.
</aside>

3: Difficulties with QNames as Attribute Values
===============================================

Moving on to example 7
a_7.xml:
[[[
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:eg="http://example.org/"
         xmlns:q="http://example.org/q" >
  <rdf:Description>
    <eg:a rdf:parseType="Literal">
      <foo bar="q:name"/>
    </eg:a>
  </rdf:Description>
</rdf:RDF>
]]]

As far as XSLT is concerned the qname in the attribute value is well-formed.
If we take the extract x_7.xml we see:
[[[
<a>
  <foo xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
       xmlns:eg="http://example.org/"
       xmlns:q="http://example.org/q"
       bar="q:name"/>
</a>
]]]
and the "q" namespace is still around.

But the rule of ignoring invisible namespaces applies also to the q namespace
and so s_7.xml is:
[[[
<a>
  <foo bar="q:name"/>
</a>
]]]

Note that even if we use a_8.xml where the namespace is only declared on the
xml literal it still is "invisible" by the definition used.
a_8.xml
[[[
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:eg="http://example.org/" >
  <rdf:Description>
    <eg:a rdf:parseType="Literal">
      <foo bar="q:name" xmlns:q="http://example.org/q"/>
    </eg:a>
  </rdf:Description>
</rdf:RDF>
]]]
still shows the following visible part
s_8.xml (same as s_7.xml)
[[[
<a>
  <foo bar="q:name"/>
</a>
]]]

This situation is envisaged by exclusive canonicalization and they have three
solutions, all clunky:
[[[
+ the XML must be modified so that use of the namespace prefix involved is
  visible or
+ the namespace declarations must appear and be bound to the same values in
  every context in which the XML will be interpreted or
+ the prefixes for such namespaces must appear in the InclusiveNamespaces
  PrefixList
  a special parameter to list the unusual namespace prefixes which are needed
  despite being invisible.
]]]

The first one means getting the document author to add
q:ignoreMe="please"
as an attribute to the element and hoping that it doesn't cause problems.
The second one would work for a small set of well-known namespace prefixes.
The third one is the only general purpose solution.

4: InclusiveNameSpaces & Attribute Value "Literal"
==================================================

Within RDF/XML unfortunately, the natural way to list these unusual namespace
prefixes would be to use additional xmlns declarations. This doesn't work if
we wish to be XSLT-safe. XSLT systematically ignores such declarations which
repeat something that is already in scope.
A good example of a likely case is xmlns:xsd. xsd is a prefix that is likely
to be defined at the top level, and may occur in a qname in an attribute
value in an xml literal in RDF!

We could list these unusual namespaces using an additional attribute e.g.
rdfns:xsd ....
This suffers from not being very backward compatible.

Oh dear, what we could do is decide to add these unusual namespaces after the
word Literal within the parseType.

e.g. a_9.xml
[[[
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:eg="http://example.org/"
         xmlns:q="http://example.org/q" >
  <rdf:Description>
    <eg:a rdf:parseType="Literal q">
      <foo bar="q:name"/>
    </eg:a>
  </rdf:Description>
</rdf:RDF>
]]]

This would identify q as a namespace prefix to be treated unusually, i.e. as
always visible on any element in which it is in scope.
Thus the xml literal is (the hand-written x_9.xml)
[[[
<a>
  <foo xmlns:q="http://example.org/q" bar="q:name"/>
</a>
]]]

So there we have it.

We can prevent namespace pollution using the concept of visible utilization.
We can allow the unusual namespace use (e.g. qnames as attribute values) by
listing the unusual namespaces on the parseType value. This is XSLT safe.
For greater precision, particularly for defining equality, we can specify the
use of XML canonicalization.

5: Comments
===========

This message is based around one extreme of the solution space.
We try and fully specify what an xml literal is, and we try and get it right.
This will give maximum interoperability, at the cost of difficulty for
implementors.

As such I support Eric's remarks:
"I strongly vote in favor of preserving comments. I have a knee-jerk
reaction to deleting any information, and I believe this is what would
be expected by content produces that take time to include comments."

Thus example a_10 is different from a_1
a_10.xml
[[[
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:eg="http://example.org/">
  <rdf:Description>
    <eg:a rdf:parseType="Literal"><!-- this comment is part of the literal -->
      <foo/>
    </eg:a>
  </rdf:Description>
</rdf:RDF>
]]]
the literal is extracted as s_10.xml
[[[
<a><!-- this comment is part of the literal -->
  <foo/>
</a>
]]]

6: A Proposal
=============

This proposal is based on maximally specifying the behaviour to minimize
interoperability problems, at the expense of requiring work from
implementors.

There is the assumption that the WG wishes to address xml literals that:
- use namespaces
- use namespaces within attribute values

Propose that:
- rdf:parseType="Literal" is replaced by rdf:parseType=literal
  where literal is a list of names starting with the name "Literal"
- the value of such literal is the xml literal with string component given
  by the exclusive canonicalization of the element content.
- that the c14n used includes comments
- that the c14n used uses the second and subsequent names from the value of
  the rdf:parseType attribute as the InclusiveNameSpace Prefix list parameter
  to the exclusive c14n algorithm.
- equality between the string components of xml literals is given by binary
  equality
- close the xml literal issues.

Moreover, I could be actioned to draft an appendix to the syntax doc showing
how minimal RDF implementations that:
- do not need equality
- (and/or) can assume a complete set of namespaces for xml literals
can be implemented satisfactorily without use of a c14n module.

7: What's the other path?
=========================

If the above proposal looks too heavy, I would suggest dropping qnames in
attribute values from the level of ambition, and merely trying to not prevent
implementations from treating unusual namespaces unusually.
We would then stick with "Literal" and "Resource" as the only two values of
parseType.

Vagueness is possible about precisely what string is produced.
A more limited interoperability could be achieved by concentrating the spec
on the equality of literals.

I am happy to produce a second proposal based around that path.

Jeremy
Attachments
- application/x-zip-compressed attachment: literal.zip
Received on Monday, 11 March 2002 06:30:07 UTC