Possible solutions for ISSUE 87 from Mark Birbeck on 2008-03-13 (public-rdf-in-xhtml-tf@w3.org from March 2008)

From: Mark Birbeck <mark.birbeck@x-port.net>
Date: Thu, 13 Mar 2008 14:00:51 +0000
To: "W3C RDFa task force" <public-rdf-in-xhtml-tf@w3.org>
Cc: www-rdf-interest@w3.org, "Jeremy J. Carroll" <jjc@hpl.hp.com>
Message-ID: <a707f8300803130700g63c9e6d6gdb2c9ac3909d30f5@mail.gmail.com>
Hello all,

During our discussions last week, I suggested that there are a number
of ways that we could tackle the rdf:XMLLiteral question. However, the
more I've delved into this, the more I've had to conclude that we
can't solve it, at least not in any straightforward way.

I've presented the details below, and I'm also copying to the RDF
interest list, because I believe there is an issue of interpretation
here, in relation to RDF Concepts [1], that may impact our resolution.
(In particular, there may be a view that we can be more liberal than I
am being, in which case we might be able to add more explicit support
after all.) I'm also CCing Jeremy because he wrote some interesting
comments on XML literals in the context of reviewing the early RDFa
drafts, and if anyone can find a way through this, it will be him! (No
pressure... ;) )


CONTEXT

If we run a Last Call conformant RDFa parser over the following:

  <h2 property="dc:title" datatype="rdf:XMLLiteral">
    E = mc<sup>2</sup>: The Most Urgent Problem of Our Time
  </h2>

we get an XML literal that obviously contains XHTML, but doesn't have
the XHTML namespace anywhere.

To be correct according to RDF Concepts, the parsed output would need to be:

  <> dc:title
    "E = mc<sup xmlns="http://www.w3.org/1999/xhtml">2</sup>: ...
    ... The Most Urgent Problem of Our Time"^^rdf:XMLLiteral .

Note the addition of the default namespace.
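
To make the difference concrete, here is a minimal Python sketch. It is not taken from any RDFa parser; the helper names dumb_literal and literal_with_default_namespace are invented for illustration, and I am assuming the <h2> sits in a document that declares the XHTML namespace as the default. The first helper copies the inner content as-is, the second adds the default namespace to each top-level child element.

```python
from xml.dom import minidom

# Assumption: the enclosing document declares XHTML as the default namespace.
SOURCE = (
    '<h2 xmlns="http://www.w3.org/1999/xhtml" '
    'property="dc:title" datatype="rdf:XMLLiteral">'
    'E = mc<sup>2</sup>: The Most Urgent Problem of Our Time</h2>'
)


def dumb_literal(element):
    """Copy the inner content verbatim; inherited namespaces are lost."""
    return "".join(child.toxml() for child in element.childNodes)


def literal_with_default_namespace(element):
    """Copy the inner content, declaring the default namespace on each
    top-level child element that is in a namespace but has no prefix."""
    parts = []
    for child in element.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            child = child.cloneNode(True)  # leave the source tree untouched
            if (not child.prefix and child.namespaceURI
                    and not child.hasAttribute("xmlns")):
                child.setAttribute("xmlns", child.namespaceURI)
        parts.append(child.toxml())
    return "".join(parts)


h2 = minidom.parseString(SOURCE).documentElement
dumb = dumb_literal(h2)
fixed = literal_with_default_namespace(h2)
print(dumb)
print(fixed)
```

This only handles the default namespace on the outermost children; prefixed namespaces and nested elements need the fuller treatment discussed below.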


EXCLUSIVE CANONICALISATION

The RDF Concepts document says that an XML literal needs to be
"exclusive Canonical XML". The algorithm for this is obtained from the
Exclusive XML Canonicalization spec [2], and essentially dictates that
currently in-scope namespaces must be placed on the apex node, and
that all 'visibly utilised' namespaces must appear on the most
appropriate start tag, if that namespace has not been defined on an
ancestor.

For example, the Exclusive Canonicalization of this:

  <div>
    <svg:rect ...>
      <xf:input ...>...</xf:input>
      <img ... />
    </svg:rect>
  </div>

would be this:

  <div xmlns="...">
    <svg:rect xmlns:svg="..." ...>
      <xf:input xmlns:xf="..." ...>...</xf:input>
      <img ... />
    </svg:rect>
  </div>

The root <div> is the 'apex node'.
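
As a rough illustration of the placement rule, the following Python sketch re-serialises a tree so that each namespace declaration appears on the first start tag that visibly uses it. It covers only the namespace-placement step, not full Exclusive Canonicalisation (no whitespace, entity or attribute-value normalisation), and the function name render is invented. The namespace URIs are filled in by me for the sake of a runnable example, since the fragments above elide them.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape, quoteattr


def is_ns_decl(attr):
    return attr.name == "xmlns" or attr.name.startswith("xmlns:")


def render(node, rendered=None):
    """Serialise node, emitting a namespace declaration only where the
    namespace is visibly used and not already rendered on an ancestor."""
    rendered = rendered or {}
    if node.nodeType == node.TEXT_NODE:
        return escape(node.data)
    if node.nodeType != node.ELEMENT_NODE:
        return ""  # comments and other node types are dropped in this sketch

    attrs = [a for a in node.attributes.values() if not is_ns_decl(a)]

    # Visibly utilised: the element's own prefix plus any attribute prefixes.
    used = {node.prefix or "": node.namespaceURI or ""}
    for a in attrs:
        if a.prefix:
            used[a.prefix] = a.namespaceURI

    scope = dict(rendered)
    decls = []
    for prefix in sorted(used):
        uri = used[prefix]
        if scope.get(prefix, "") != uri:
            scope[prefix] = uri
            name = "xmlns:" + prefix if prefix else "xmlns"
            decls.append(f" {name}={quoteattr(uri)}")

    plain = "".join(
        f" {a.name}={quoteattr(a.value)}"
        for a in sorted(attrs, key=lambda a: a.name)
    )
    body = "".join(render(child, scope) for child in node.childNodes)
    return f"<{node.tagName}{''.join(decls)}{plain}>{body}</{node.tagName}>"


# Assumed URIs; every declaration starts out on the root element.
SOURCE = (
    '<div xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns:svg="http://www.w3.org/2000/svg" '
    'xmlns:xf="http://www.w3.org/2002/xforms">'
    '<svg:rect><xf:input><xf:label>x</xf:label></xf:input><img/></svg:rect>'
    '</div>'
)
out = render(minidom.parseString(SOURCE).documentElement)
print(out)
```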


PROBLEMS FOR IMPLEMENTATIONS

The problems that we have with this in RDFa parsers fall into two
categories: those that simply involve implementing the algorithm, and
those that relate to the data having to be interpreted as an XPath
data model.


PROBLEMS: ALGORITHM

From the algorithm's point of view, the easy part is that the apex
node must contain all currently active namespaces; we have these,
because they are the currently in-scope prefix mappings in our
processing rules. We could therefore easily 'dump' those onto the apex
node.

However, the next part is slightly more tricky, in that any "visibly
utilised" namespace must be added to the correct start tag, if it's
not already on an ancestor. Actually, it's stronger than that in that
the namespace must *not* appear if it has been defined by an ancestor.
The following would therefore be incorrect:

  <div xmlns="...">
    <svg:rect xmlns:svg="..." ...>
      <xf:input xmlns:xf="..." ...>
        <xf:label xmlns:xf="..." ...>...</xf:label>
      </xf:input>
      <img ... />
    </svg:rect>
  </div>

The reason why this would be 'wrong' (so to speak) is that the XForms
label element does not need the XForms namespace, since it is already
present on the XForms input control.

(As explained at the end, I think this is an unnecessary restriction,
and has unfortunate consequences.)
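
A small Python sketch of a check for this condition follows. It is an illustration only, with an invented function name (redundant_declarations) and with namespace URIs that I have filled in because the fragment above elides them. It walks the tree and reports every declaration that repeats, with the same value, one already made by an ancestor.

```python
from xml.dom import minidom


def redundant_declarations(node, in_scope=None):
    """Return (tag name, attribute name) pairs for namespace declarations
    that duplicate an identical declaration on an ancestor."""
    in_scope = in_scope or {}
    scope = dict(in_scope)
    found = []
    for attr in node.attributes.values():
        if attr.name == "xmlns" or attr.name.startswith("xmlns:"):
            if in_scope.get(attr.name) == attr.value:
                found.append((node.tagName, attr.name))
            scope[attr.name] = attr.value
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            found.extend(redundant_declarations(child, scope))
    return found


# Assumed URIs; xf:label repeats the declaration made on xf:input.
SOURCE = (
    '<div xmlns="http://www.w3.org/1999/xhtml">'
    '<svg:rect xmlns:svg="http://www.w3.org/2000/svg">'
    '<xf:input xmlns:xf="http://www.w3.org/2002/xforms">'
    '<xf:label xmlns:xf="http://www.w3.org/2002/xforms">x</xf:label>'
    '</xf:input><img/></svg:rect></div>'
)
problems = redundant_declarations(minidom.parseString(SOURCE).documentElement)
print(problems)
```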


PROBLEMS: XPATH DATA MODEL

But the bigger problem I foresee is that the XML literal must be
processed using the XPath data model, which means sorting out things
like entities, removing comments, and so on. This seems to imply that
an RDFa parser would need to support an XML parser, which seems an
unfortunate requirement.
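
To give a feel for what that processing involves, here is a short Python sketch using the standard library's xml.etree.ElementTree.canonicalize (Python 3.8 or later). Note the assumption: that function implements Canonical XML 2.0, not the Exclusive Canonicalization 1.0 that RDF Concepts refers to, so it stands in here only to show the kind of work an XML parser does on the content. The sample markup is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample: character references, a comment and an empty element.
SOURCE = '<p>caf&#233; &amp; <!-- editorial note -->cr&#232;me<br/></p>'

# canonicalize() parses the input and re-serialises it; comments are
# dropped by default (with_comments=False).
canonical = ET.canonicalize(SOURCE)
print(canonical)
```

In the result the comment is gone, the numeric character references come out as the characters themselves, the ampersand stays escaped, and the empty element is written as a start and end tag pair. None of that can be done by copying the source text.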


ARE THERE ANY EASY SOLUTIONS?

I'm afraid that I don't believe there are any easy solutions. If we
explicitly say that we are creating XML literals, then I don't see how
they can be anything other than 'proper' XML literals, as laid down by
the RDF Concepts document, and that means Exclusive Canonicalisation.
In turn, that means namespaces have to be sorted out, entities have to
be encoded or decoded, and so on.

So...my gut feeling is that RDFa should not 'support' XML literals in
this release.

However, we _should_ reserve all of the necessary architecture, such
as saying that @datatype="rdf:XMLLiteral" is reserved but undefined,
that @property with no @content but with child elements is undefined,
and so on.

Of course, for the sake of producing useful software, implementers
would be advised to create a 'dumb' XML literal, by simply copying the
inner content of the child elements. We can say something like "we'll
look for implementer experience to help guide this part of the spec in
a future version". But the main point is that I don't think we can say
we are properly supporting XML literals unless we support Exclusive
Canonicalisation, and that is quite a burden.


SIDE NOTES

My feeling is that this is not a problem of our making, and that XML
literals are just pretty badly defined. The problem in my view is not
that they rely on Exclusive Canonicalisation, but that they do so in
the wrong way.

Any comparison that takes place between values would have to be
achieved by parsing those values in an XML parser anyway (as RDF
Concepts also says), and making a comparison at the level of the
infoset. This means that these two fragments of XML would match when
compared in this way:

  <div xmlns="...">
    <svg:rect xmlns:svg="..." ...>
      <xf:input xmlns:xf="..." ...>
        <xf:label xmlns:xf="..." ...>...</xf:label>
      </xf:input>
      <img ... />
    </svg:rect>
  </div>

  <div xmlns="..." xmlns:svg="..." xmlns:xf="...">
    <svg:rect ...>
      <xf:input ...>
        <xf:label ...>...</xf:label>
      </xf:input>
      <img ... />
    </svg:rect>
  </div>

However, the first fragment is not strictly 'exclusively
canonicalised', due to the extra namespace. So the process should be
to canonicalise, and then compare.
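
The match at the level of the parsed tree can be sketched in Python with the standard library's ElementTree, which records each element under its expanded name and does not keep track of where a declaration was written. The function name same_tree is invented, and the namespace URIs are my own fill-ins for the elided ones above.

```python
import xml.etree.ElementTree as ET


def same_tree(a, b):
    """Compare two parsed elements by expanded name, attributes, text
    content and children, ignoring where namespaces were declared."""
    return (
        a.tag == b.tag
        and a.attrib == b.attrib
        and (a.text or "") == (b.text or "")
        and (a.tail or "") == (b.tail or "")
        and len(a) == len(b)
        and all(same_tree(x, y) for x, y in zip(a, b))
    )


# Assumed URIs. FIRST repeats xmlns:xf on xf:label; SECOND declares
# everything on the root.
FIRST = (
    '<div xmlns="http://www.w3.org/1999/xhtml">'
    '<svg:rect xmlns:svg="http://www.w3.org/2000/svg">'
    '<xf:input xmlns:xf="http://www.w3.org/2002/xforms">'
    '<xf:label xmlns:xf="http://www.w3.org/2002/xforms">Name</xf:label>'
    '</xf:input><img/></svg:rect></div>'
)
SECOND = (
    '<div xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns:svg="http://www.w3.org/2000/svg" '
    'xmlns:xf="http://www.w3.org/2002/xforms">'
    '<svg:rect><xf:input><xf:label>Name</xf:label></xf:input>'
    '<img/></svg:rect></div>'
)

matches = same_tree(ET.fromstring(FIRST), ET.fromstring(SECOND))
print(matches)
```

The two strings differ, but the parsed trees compare equal.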

But what RDF Concepts does is to say (effectively) that we should
canonicalise the XML, and then store it. And then later on, if we want
to compare, we already have the canonicalised form. But the big
problem with this is that we are no longer able to simply store
structured mark-up that we want to round-trip, without comparing it to
anything.

What RDF Concepts should have done, in my opinion, is to use the idea
of an XML literal simply to indicate the datatype, as a kind of flag,
and leave the Exclusive Canonicalisation to the act of comparison. If
data is simply being stored for later retrieval then
why go to lots of effort to store it in an 'unambiguous' way? In
particular, why require that all RDF applications must support an XML
parser?
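
A minimal Python sketch of that alternative, under stated assumptions: the class name StoredXMLLiteral is invented, the literal is assumed to be a single well-formed element, and the standard library's canonicalize (Canonical XML 2.0, Python 3.8 or later) stands in for Exclusive Canonicalisation. The raw mark-up is stored untouched, and canonicalisation happens only when two values are compared.

```python
import xml.etree.ElementTree as ET


class StoredXMLLiteral:
    """Keep the mark-up exactly as received; canonicalise lazily."""

    def __init__(self, raw):
        self.raw = raw            # returned unchanged on retrieval
        self._canonical = None    # filled in on first comparison

    def canonical(self):
        if self._canonical is None:
            # Stand-in for Exclusive Canonicalisation (see assumptions).
            self._canonical = ET.canonicalize(self.raw)
        return self._canonical

    def __eq__(self, other):
        if not isinstance(other, StoredXMLLiteral):
            return NotImplemented
        return self.canonical() == other.canonical()

    def __hash__(self):
        return hash(self.canonical())


a = StoredXMLLiteral('<span b="2" a="1">x<!-- draft --></span>')
b = StoredXMLLiteral('<span a="1" b="2">x</span>')

untouched_before = a._canonical is None   # nothing parsed yet
equal = a == b                            # parsing happens here
print(equal, a.raw)
```

An application that only stores and retrieves such values never invokes the XML parser at all.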


But since this is not in our power to control, I think punting it to a
future version of RDFa makes some sense. And in the short-term,
implementers can add 'dumb' support to their parsers.


(I've not really discussed other possible solutions such as inventing
our own XHTML datatype, since I think they are the wrong way to go,
and I didn't get the sense that anyone was completely enthusiastic
about that route, on the call. But there are some angles to it, if
people really feel we must have a solution now, rather than postponing
this to a future version of RDFa.)

Regards,

Mark

[1] <http://www.w3.org/TR/rdf-concepts/>
[2] <http://www.w3.org/TR/xml-exc-c14n/>

-- 
  Mark Birbeck

  mark.birbeck@x-port.net | +44 (0) 20 7689 9232
  http://www.x-port.net | http://internet-apps.blogspot.com

  x-port.net Ltd. is registered in England and Wales, number 03730711
  The registered office is at:

    2nd Floor
    Titchfield House
    69-85 Tabernacle Street
    London
    EC2A 4RR
Received on Thursday, 13 March 2008 14:01:33 UTC