Re: XMLLiteral handling in RDFa in HTML from Philip Taylor on 2009-05-26 (public-html@w3.org from May 2009)

From: Philip Taylor <pjt47@cam.ac.uk>
Date: Tue, 26 May 2009 09:34:17 +0100
To: Manu Sporny <msporny@digitalbazaar.com>
CC: RDFa mailing list <public-rdf-in-xhtml-tf@w3.org>, HTMLWG WG <public-html@w3.org>
Message-ID: <4A1BA989.5080809@cam.ac.uk>

Manu Sporny wrote:
> [...]
> [17:37:12] Shane McCarron: we have no option about well formedness of
> xml literals.  its a requirement
> [17:37:18] … not our requirement.  RDF

Slightly off-topic: RDF seems to always require canonicalisation 
(http://www.w3.org/TR/rdf-concepts/#dfn-rdf-XMLLiteral - "encoding as 
UTF-8 yields exclusive Canonical XML (with comments, with empty 
InclusiveNamespaces PrefixList)"). 
http://www.w3.org/2006/07/SWD/RDFa/testsuite/xhtml1-testcases/0011.sparql 
seems to ignore that requirement since it allows different ways of 
serialising the XML. Is that intentional?
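
As a rough illustration of why canonicalisation matters when comparing literals: two serialisations of the same element differ as strings but become identical once canonicalised. This is only a sketch. It uses Python's xml.etree.ElementTree.canonicalize (Python 3.8+), which implements C14N 2.0 rather than the Exclusive Canonical XML that RDF Concepts names, and the two sample strings are made up for the example.

```python
import xml.etree.ElementTree as ET

# Two hypothetical serialisations of the same element: different quoting,
# attribute order, whitespace and empty-element syntax.
variant_a = '<em class="x" id="y"/>'
variant_b = "<em id='y'   class='x'></em>"

# NOTE: canonicalize() is C14N 2.0, used here only as a stand-in for
# Exclusive Canonical XML.
canon_a = ET.canonicalize(variant_a)
canon_b = ET.canonicalize(variant_b)

print(variant_a == variant_b)  # raw strings differ
print(canon_a == canon_b)      # canonical forms can be compared directly
```

A test that accepts several raw serialisations is in effect doing this comparison by hand.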

More general comment: How is this different to XMLLiterals in 
RDFa-in-XHTML? When you're implementing that, you can't just copy bytes 
from the data input stream directly - you at least have to insert xmlns 
declarations to ensure the output is namespace-well-formed. And if the 
XMLLiteral contains some &entity; that's defined in the XHTML page's DTD 
then it will have to be expanded out so that it's correct once it's 
separated from the DTD, and so on. As far as I can see, that's pretty 
much impossible to implement unless you parse the whole page with an XML 
parser and then use an XML serialisation algorithm on an element 
sub-tree to get the XMLLiteral.
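
A minimal sketch of that parse-then-serialise approach, using Python's xml.etree.ElementTree. The page source, the &co; entity, the property value and the xml_literal helper are all hypothetical, and the output is not canonicalised; the point is only that the parser expands the entity and the serialiser adds the xmlns declaration that the extracted fragment needs.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical XHTML page: the entity is defined in the DTD internal subset
# and the XHTML namespace is declared only on the root element.
PAGE = (
    '<!DOCTYPE html [<!ENTITY co "Example Co">]>'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><div property="ex:note">Made by <em>&co;</em> in 2009</div></body>'
    '</html>'
)

def xml_literal(page_source, property_value):
    """Serialise the children of the first element with the given @property."""
    root = ET.fromstring(page_source)  # parse the whole page first
    for elem in root.iter():
        if elem.get("property") == property_value:
            parts = [escape(elem.text or "")]
            for child in elem:
                # tostring() emits the child, its tail text, and whatever
                # namespace declarations the child needs to stand alone.
                parts.append(ET.tostring(child, encoding="unicode"))
            return "".join(parts)
    return None

literal = xml_literal(PAGE, "ex:note")
print(literal)
```

The prefix ElementTree picks for the XHTML namespace is its own choice, which is one more reason a real implementation would need a canonicalisation step on top of this.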

It seems logical to me that RDFa-in-HTML should work the same way - you 
parse the whole page with an HTML parser and then use exactly the same 
XML serialisation algorithm as before.

(I understand that it may be impossible to specify that behaviour if 
you're relying on HTML4, since HTML4 doesn't specify how to parse into a 
structure that can be serialised as XML; but that's why I'd want to base 
it on the HTML5 parsing algorithm instead, which makes this all quite 
easy :-) )

> [17:36:32] … I think we have two really nasty issues right now: The
> first being xmlns + case sensitivity: both of which are fixed if we move
> to @prefix and declare that prefix is always case-sensitive (and
> implement the legacy case-insensitivity stuff for xmlns that we've been
> talking about in the community over the past couple of days).

Related to those two issues: a consequence of implementing XMLLiterals using 
HTML5's parsing and XML-serialisation algorithms is that content like:

   <div property="..."><span xmlns:foo="..."></span></div>

would fail to generate an XMLLiteral. The 'xmlns:foo' gets parsed into 
an attribute with local name "xmlns:foo" in no namespace. That local 
name is not an NCName, so it's impossible to serialise as XML, and the 
XML serialisation algorithm will fail.
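
The failing check can be sketched as follows. Python's html.parser is not an implementation of the HTML5 parsing algorithm; it is used here only as a stand-in that likewise reports the attribute name as the single string "xmlns:foo". The NCName pattern is an ASCII-only simplification of the real production, and the sample markup fills in the elided attribute values with made-up ones.

```python
import re
from html.parser import HTMLParser

# Simplified, ASCII-only approximation of the NCName production
# (an XML Name that contains no colon). The real production also
# allows many non-ASCII characters.
_NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

def is_ncname(name):
    return _NCNAME_ASCII.fullmatch(name) is not None

class AttrCollector(HTMLParser):
    """Collects every attribute name seen on a start tag."""
    def __init__(self):
        super().__init__()
        self.attr_names = []

    def handle_starttag(self, tag, attrs):
        self.attr_names.extend(name for name, _ in attrs)

def unserialisable_attrs(markup):
    """Return attribute names that could not be written out as XML local names."""
    collector = AttrCollector()
    collector.feed(markup)
    collector.close()
    return [name for name in collector.attr_names if not is_ncname(name)]

# Hypothetical values in place of the "..." in the example above.
SAMPLE = '<div property="ex:p"><span xmlns:foo="http://example.org/"></span></div>'
print(unserialisable_attrs(SAMPLE))
```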

Some possible solutions for this issue:

* Change the HTML5 parsing algorithm so xmlns:foo gets local name "foo" 
in the XML Namespaces namespace. (That seems very unlikely to happen, 
because of backward-compatibility issues with existing content.)

* Add some ugly hacks in the serialisation process, e.g. find all 
attributes named "xmlns:foo" and pretend they were called "foo" in the 
XML Namespaces namespace while serialising.

* Don't support XMLLiterals.

* Discourage the use of xmlns:foo attributes, and replace them with 
@prefix or something.

(In all but the last of those cases, xmlns:foo would still be a problem 
for any other tool that attempts to convert HTML to XML (using HTML5's 
parsing rules), e.g. http://services.philip.html5.org/html-to-xhtml/ 
strips out the attributes entirely because they can't be represented in 
XML.)
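
To make the second option and the stripping behaviour concrete, here is a rough sketch of a start-tag serialiser with a switch between the two. The start_tag function, its keep_xmlns_prefixes flag and the (name, value) attribute model are inventions for this example; this is not the HTML5 serialisation algorithm, nor how the html-to-xhtml service is implemented.

```python
from xml.sax.saxutils import quoteattr

def start_tag(tag, attrs, keep_xmlns_prefixes):
    """Serialise a start tag from (name, value) pairs as an HTML parser reports them.

    Names containing a colon cannot be written as plain XML attributes.
    With keep_xmlns_prefixes=True, "xmlns:foo" is re-emitted as a namespace
    declaration (the "ugly hack"); otherwise every such name is dropped.
    """
    parts = [tag]
    for name, value in attrs:
        if ":" in name:
            prefix, _, local = name.partition(":")
            is_decl = prefix == "xmlns" and local and ":" not in local
            # An empty value is skipped: xmlns:foo="" is not a legal
            # declaration in Namespaces in XML 1.0.
            if keep_xmlns_prefixes and is_decl and value:
                parts.append(f"xmlns:{local}={quoteattr(value)}")
            continue  # anything else with a colon is stripped
        parts.append(f"{name}={quoteattr(value)}")
    return "<" + " ".join(parts) + ">"

attrs = [("xmlns:foo", "http://example.org/"), ("class", "note")]
print(start_tag("span", attrs, keep_xmlns_prefixes=True))
print(start_tag("span", attrs, keep_xmlns_prefixes=False))
```

Either way the special-casing lives in the serialiser, so every other HTML-to-XML tool would have to repeat it.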

-- 
Philip Taylor
pjt47@cam.ac.uk

Received on Tuesday, 26 May 2009 08:34:54 UTC