Re: XMLLiterals and c14n from Philip Taylor on 2009-09-17 (public-rdf-in-xhtml-tf@w3.org from September 2009)

From: Philip Taylor <pjt47@cam.ac.uk>
Date: Thu, 17 Sep 2009 01:40:37 +0100
To: Ben Adida <ben@adida.net>
CC: Ivan Herman <ivan@w3.org>, public-rdf-in-xhtml-tf@w3.org
Message-ID: <4AB18585.2020306@cam.ac.uk>

Ben Adida wrote:
> Ivan, Philip,
> 
> I had an action to follow up on this issue, but looking at this thread
> I'm a little bit confused.
> 
>> The bottomline is that you are right. Neither N3/Turtle nor SPARQL
>> includes any automatic canonicalization of XML Literals (in contrast to
>> RDF/XML), nor will the new version of SPARQL do it.
> 
> So, this is a bit disconcerting. As I mentioned on the call, does RDFa
> need to worry about float canonicalization, too? Does SPARQL consider
> "2.0"^^xsd:float different from "2.00"^^xsd:float?

I'll try to explain how I understand these things, and see if it makes 
anything clearer. (Please tell me if I'm wrong: I noticed a few 
significant misconceptions of my own when reading these documents 
again, and probably still have some more. Apologies if I just make 
things more confused!)

RDF datatypes (http://www.w3.org/TR/rdf-concepts/#section-Datatypes) 
have a value space (the set of all possible values of that type), a 
lexical space (a set of strings that represent values from the value 
space), and a mapping from the lexical space to the value space 
(effectively a parser for that datatype).

In the abstract RDF syntax 
(http://www.w3.org/TR/rdf-concepts/#section-Graph-syntax), triples 
contain typed literals which have a lexical form (a string). Literals 
are equal only if their lexical forms are equal. E.g., in the abstract 
RDF model, the literal with lexical form "2.0" and datatype xsd:float is 
different from the literal with lexical form "2.00" and datatype xsd:float.

The N3/Turtle serialisation of an RDF triple has a 1:1 mapping between a 
literal's lexical form and the N3/Turtle string serialisation of it, so 
any discussion of abstract lexical forms applies equally to the strings 
in concrete N3/Turtle syntax.

RDF says it is "in error" (but "not syntactically ill-formed") if a 
literal has a lexical form which is not in the lexical space of its 
datatype. So the literal "xyzzy" with datatype xsd:float can exist, but 
is in error, and no value (from the value space) is associated with the 
literal. But "2.0" and "2.00" are both in the lexical space, and both 
associated with the value 2.
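
To make the distinction concrete, here is a rough Python sketch of the
model described above. It is only an illustration: the Literal class,
the value() helper and the regex are hypothetical names of my own, and
the regex is a simplified approximation of the xsd:float lexical space.
Literal equality compares lexical forms; value() plays the role of the
lexical-to-value mapping and returns None for a literal that is in
error.

```python
import re
from dataclasses import dataclass
from typing import Optional

XSD_FLOAT = "http://www.w3.org/2001/XMLSchema#float"

# Simplified approximation of the xsd:float lexical space (an assumption
# made for this sketch, not a copy of the XML Schema grammar).
_FLOAT_LEXICAL = re.compile(
    r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?|-?INF|NaN"
)


@dataclass(frozen=True)
class Literal:
    """A typed literal: equality is on (lexical form, datatype) only."""
    lexical_form: str
    datatype: str


def value(lit: Literal) -> Optional[float]:
    """Lexical-to-value mapping for xsd:float; None means "in error"."""
    if lit.datatype != XSD_FLOAT:
        raise NotImplementedError("this sketch only handles xsd:float")
    if _FLOAT_LEXICAL.fullmatch(lit.lexical_form) is None:
        return None  # not in the lexical space, so no value is associated
    # Python floats are 64-bit while xsd:float is 32-bit; ignored here.
    return float(lit.lexical_form)


a = Literal("2.0", XSD_FLOAT)
b = Literal("2.00", XSD_FLOAT)
print(a == b)                               # False: different literals
print(value(a) == value(b))                 # True: same value
print(value(Literal("xyzzy", XSD_FLOAT)))   # None: literal is in error
```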

The lexical space of rdf:XMLLiteral is defined to be strings which are 
exclusive canonical XML (and no other strings). So the literal with 
lexical form "<x></x>" with datatype rdf:XMLLiteral is fine (it's in the 
lexical space, and is associated with the value "<x></x>"), but the 
literal with lexical form "<x />" is in error (it's not canonical XML, 
so it's not in the lexical space) and no value is associated with the 
literal (so it is effectively meaningless).
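
A membership test for that lexical space could be sketched roughly as
follows. Big caveat: Python's xml.etree.ElementTree.canonicalize()
(Python 3.8+) implements C14N 2.0, not the Exclusive XML
Canonicalization that rdf:XMLLiteral refers to, so this is only an
approximation that I'd expect to agree for simple, namespace-free
markup. The function name and the dummy <r> wrapper (used so that mixed
content fragments can be parsed) are my own assumptions.

```python
from xml.etree import ElementTree as ET


def looks_canonical(fragment: str) -> bool:
    """Approximate test: is `fragment` already in canonical form?

    Uses C14N 2.0 as a stand-in for exclusive canonical XML and ignores
    namespaces entirely (both are assumptions of this sketch).
    """
    wrapped = "<r>" + fragment + "</r>"
    try:
        return ET.canonicalize(wrapped) == wrapped
    except ET.ParseError:
        return False  # not even well-formed, so certainly not canonical


print(looks_canonical("<x></x>"))  # True
print(looks_canonical("<x />"))    # False
```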


When an RDFa processor generates an RDF triple with a typed literal, it 
needs to decide what lexical form to give the literal. For all 
non-XMLLiteral types (e.g. xsd:float), the input document supplies a 
string (via @content or the concatenation of child text nodes), and 
clearly the lexical form should simply be that string, even if it's not 
in the lexical space and is therefore in error. No other behaviour would 
be sane or possible: the RDFa processor doesn't know the details of all 
datatypes anyone might ever use, and it would be ugly to have just a few 
special cases.

For XMLLiteral, the lexical form of the literal is constructed by 
serialising the child nodes from the input document. Since the RDFa 
processor has special knowledge of XMLLiteral and already has control 
over the lexical form it generates, it really ought to avoid the "in 
error" case (strings which are not in the lexical space, i.e. 
non-canonical XML) by ensuring it does a proper canonical serialisation.
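
The two cases above might look something like this in a toy processor.
This is a sketch under several assumptions, not how any real RDFa
implementation works: lexical_form() and its parameters are
hypothetical, namespaces are ignored, and ElementTree's canonicalize()
does C14N 2.0 rather than exclusive canonicalisation, so it only stands
in for "a proper canonical serialisation" on simple input.

```python
from typing import Optional
from xml.etree import ElementTree as ET

RDF_XMLLITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"
XSD_FLOAT = "http://www.w3.org/2001/XMLSchema#float"


def lexical_form(element_xml: str, datatype: str,
                 content: Optional[str] = None) -> str:
    """Pick the lexical form for a typed literal found on one element."""
    elem = ET.fromstring(element_xml)
    if datatype != RDF_XMLLITERAL:
        # Pass the input string through untouched, even if it turns out
        # not to be in the datatype's lexical space.
        if content is not None:
            return content
        return "".join(elem.itertext())  # all text inside the element
    # XMLLiteral: serialise the element's children canonically. The
    # children are copied into an attribute-free wrapper whose tags are
    # stripped again afterwards.
    wrapper = ET.Element("wrapper")
    wrapper.text = elem.text
    wrapper.extend(list(elem))
    canonical = ET.canonicalize(ET.tostring(wrapper, encoding="unicode"))
    return canonical[len("<wrapper>"):-len("</wrapper>")]


print(lexical_form("<span>2.00</span>", XSD_FLOAT))       # 2.00
print(lexical_form("<p>a <b /> c</p>", RDF_XMLLITERAL))   # a <b></b> c
```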


So... Canonicalisation of XMLLiterals is only relevant because that's 
how their lexical space is defined and because the RDFa processor ought 
to avoid avoidable erroneous output. The lexical space of xsd:float 
allows lots of different strings for each value, and there's no 
requirement for them to be in any kind of canonical form, so 
canonicalisation of floats is irrelevant (at least as far as RDFa is 
concerned).

> Does SPARQL consider
> "2.0"^^xsd:float different from "2.00"^^xsd:float?

RDF considers them to be different literals, but SPARQL specially 
defines the '=' operator over numeric types (including xsd:float) to 
perform a numeric comparison of the values of the literals (where the 
values are derived by applying the datatype's lexical-to-value mapping 
to the literal's lexical form). So SPARQL thinks they're equal according 
to '=' (I don't know if it has other notions of equality too).
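
Roughly, in Python terms (a sketch only: literals are modelled as plain
(lexical form, datatype) tuples, and both function names are my own).
same_term() is the RDF notion of literal equality; SPARQL does also
define a sameTerm function which, as far as I understand it, behaves
this way. numeric_equals() is the '=' behaviour for numeric types.

```python
from typing import Tuple

XSD_FLOAT = "http://www.w3.org/2001/XMLSchema#float"

Literal = Tuple[str, str]  # (lexical form, datatype IRI)


def same_term(a: Literal, b: Literal) -> bool:
    """RDF literal equality: same lexical form and same datatype."""
    return a == b


def numeric_equals(a: Literal, b: Literal) -> bool:
    """'='-style comparison: map each lexical form to a value first.

    Assumes both literals are numeric and valid. Python's float() is
    used as a stand-in for the lexical-to-value mapping; it accepts
    some strings that xsd:float does not, and raises ValueError on
    strings it cannot parse.
    """
    return float(a[0]) == float(b[0])


x = ("2.0", XSD_FLOAT)
y = ("2.00", XSD_FLOAT)
print(same_term(x, y))       # False: the lexical forms differ
print(numeric_equals(x, y))  # True: both map to the value 2
```

Nothing is canonicalised anywhere in this: the stored strings stay as
they are, and only the comparison looks at values.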

> And if SPARQL *does* canonicalize floats, then why wouldn't it also
> canonicalize XMLLiterals?

It doesn't canonicalise floats, it just maps them onto the value space 
(converting them from strings into numbers) before comparing them. The 
strings can validly be anything in the lexical space of xsd:float, which 
includes both "2.0" and "2.00".

Similarly the strings for XMLLiterals can validly be anything in the 
lexical space of rdf:XMLLiteral, which is defined to only include 
exclusive canonical XML strings.

> My action was to express the sentiment that this is not part of the RDFa
> scope: we're just parsing a syntax and creating an RDF graph with typed
> values. I still think that's the case.

In the non-XMLLiteral case, I don't think RDFa needs to say anything 
(the lexical form is simply the input string). For XMLLiteral, 
implementations ought to ensure the lexical form of the literal is a 
valid member of the lexical space of XMLLiteral, i.e. is an exclusive 
canonical XML string, and I think it would help implementers if this were 
explicit in the RDFa spec where it tells them to serialise the element 
contents.

(A concrete implementation of this might not necessarily do the 
canonicalisation itself, e.g. if the output syntax is RDF/XML then it 
can happily represent the abstract notion of lexical form with some 
non-canonical XML markup because RDF/XML defines it to be equivalent. 
But when the output syntax is N3/Turtle, which has a 1:1 mapping between 
serialisation and the lexical form, the serialisation really does need 
to be exclusive canonical XML.)

Examples and test cases should never define invalid triples, i.e. 
triples with lexical forms that are not in their datatype's lexical 
space. (That doesn't affect any examples in the RDFa-in-XHTML spec, but 
it does affect some in HTML+RDFa, and some tests.)

I think that's about all there is to it ;-)

-- 
Philip Taylor
pjt47@cam.ac.uk
Received on Thursday, 17 September 2009 00:41:19 UTC