- From: Philip Taylor <pjt47@cam.ac.uk>
- Date: Thu, 17 Sep 2009 01:40:37 +0100
- To: Ben Adida <ben@adida.net>
- CC: Ivan Herman <ivan@w3.org>, public-rdf-in-xhtml-tf@w3.org

Ben Adida wrote:
> Ivan, Philip,
>
> I had an action to follow up on this issue, but looking at this thread I'm a little bit confused.
>
>> The bottomline is that you are right. Neither N3/Turtle nor SPARQL includes any automatic canonicalization of XML Literals (in contrast to RDF/XML), nor will the new version of SPARQL do it.
>
> So, this is a bit disconcerting. As I mentioned on the call, does RDFa need to worry about float canonicalization, too? Does SPARQL consider "2.0"^^xsd:float different from "2.00"^^xsd:float?

I'll try to explain how I understand these things, and see if it makes anything clearer (or if anyone can tell me I'm wrong - I had a few significant misconceptions that I noticed when reading these documents again, and probably have some more):

(also, apologies if I just make things more confused!)

RDF datatypes (http://www.w3.org/TR/rdf-concepts/#section-Datatypes) have a value space (the set of all possible values of that type) and a lexical space (a set of strings that represent values from the value space), and a mapping from the lexical space to the value space (effectively a parser for that datatype).

In the abstract RDF syntax (http://www.w3.org/TR/rdf-concepts/#section-Graph-syntax), triples contain typed literals which have a lexical form (a string). Literals are equal only if their lexical forms are equal. E.g., in the abstract RDF model, the literal with lexical form "2.0" and datatype xsd:float is different from the literal with lexical form "2.00" and datatype xsd:float.

The N3/Turtle serialisation of an RDF triple has a 1:1 mapping between a literal's lexical form and the N3/Turtle string serialisation of it, so any discussion of abstract lexical forms applies equally to the strings in concrete N3/Turtle syntax.

RDF says it is "in error" (but "not syntactically ill-formed") if a literal has a lexical form which is not in the lexical space of its datatype.
So the literal "xyzzy" with datatype xsd:float can exist, but is in error, and no value (from the value space) is associated with the literal. But "2.0" and "2.00" are both in the lexical space, and both associated with the value 2.

The lexical space of rdf:XMLLiteral is defined to be strings which are exclusive canonical XML (and no other strings). So the literal with lexical form "<x></x>" with datatype rdf:XMLLiteral is fine (it's in the lexical space, and is associated with the value "<x></x>"), but the literal with lexical form "<x />" is in error (it's not canonical XML, so it's not in the lexical space) and no value is associated with the literal (so it is effectively meaningless).

When an RDFa processor generates an RDF triple with a typed literal, it needs to decide what lexical form to give the literal.

For all non-XMLLiteral types (e.g. xsd:float), the input document supplies a string (via @content or the concatenation of child text nodes), and clearly the lexical form should simply be that string (even if it's not in the lexical space and is therefore in error), because no other behaviour would be sane or possible (given that the RDFa processor doesn't know the details of all datatypes anyone might ever use, and it would be ugly to have just a few special cases).

For XMLLiteral, the lexical form of the literal is constructed by serialising the child nodes from the input document. Since the RDFa processor has special knowledge of XMLLiteral and already has control over the lexical form it generates, it really ought to avoid the "in error" case (i.e. strings which are not in the lexical space, i.e. non-canonical XML) by ensuring it does a proper canonical serialisation.

So... Canonicalisation of XMLLiterals is only relevant because that's how their lexical space is defined and because the RDFa processor ought to avoid avoidable erroneous output.
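
The distinctions above (lexical form vs. value, and "in error" literals) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names TypedLiteral and float_value are hypothetical, the regular expression only approximates the xsd:float lexical space, Python's double-precision float stands in for the 32-bit xsd:float value space, and xml.etree.ElementTree.canonicalize (Python 3.8+) implements C14N 2.0 rather than Exclusive XML Canonicalization, so it is only a stand-in that happens to agree on this simple case.

```python
import re
from dataclasses import dataclass
from typing import Optional
from xml.etree.ElementTree import canonicalize  # C14N 2.0, a stand-in only

XSD_FLOAT = "http://www.w3.org/2001/XMLSchema#float"

# Approximate xsd:float lexical space: decimal or scientific notation,
# plus the special values INF, -INF and NaN. (Assumption: simplified.)
_FLOAT_LEXICAL = re.compile(
    r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?|-?INF|NaN"
)


@dataclass(frozen=True)
class TypedLiteral:
    """Abstract RDF typed literal: equality compares lexical form and datatype."""
    lexical_form: str
    datatype: str


def float_value(lit: TypedLiteral) -> Optional[float]:
    """Lexical-to-value mapping for xsd:float.

    Returns None when the literal is "in error", i.e. its lexical form
    is outside the (approximated) lexical space, so it has no value.
    """
    if lit.datatype != XSD_FLOAT:
        return None
    if not _FLOAT_LEXICAL.fullmatch(lit.lexical_form):
        return None
    return float(lit.lexical_form)


a = TypedLiteral("2.0", XSD_FLOAT)
b = TypedLiteral("2.00", XSD_FLOAT)
bad = TypedLiteral("xyzzy", XSD_FLOAT)

print(a == b)                              # different literals in the abstract model
print(float_value(a) == float_value(b))    # same value in the value space
print(float_value(bad))                    # in error: no value

# What a processor would do before emitting an XMLLiteral: serialise canonically.
print(canonicalize("<x />"))
```

With this model, a and b are distinct literals that map to the same value, bad can exist but has no value, and the non-canonical "<x />" is rewritten to a form with explicit start and end tags before it is used as a lexical form.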
The lexical space of xsd:float allows lots of different strings for each value, and there's no requirement for them to be in any kind of canonical form, so canonicalisation of floats is irrelevant (at least as far as RDFa is concerned).

> Does SPARQL consider "2.0"^^xsd:float different from "2.00"^^xsd:float?

RDF considers them to be different literals, but SPARQL specially defines the '=' operator over numeric types (including xsd:float) to perform a numeric comparison of the values of the literals (where the values are derived by applying the datatype's lexical-to-value mapping to the literal's lexical form). So SPARQL thinks they're equal according to '=' (I don't know if it has other different notions of equality too).

> And if SPARQL *does* canonicalize floats, then why wouldn't it also canonicalize XMLLiterals?

It doesn't canonicalise floats, it just maps them onto the value space (converting them from strings into numbers) before comparing them. The strings can validly be anything in the lexical space of xsd:float, which includes both "2.0" and "2.00". Similarly the strings for XMLLiterals can validly be anything in the lexical space of rdf:XMLLiteral, which is defined to only include exclusive canonical XML strings.

> My action was to express the sentiment that this is not part of the RDFa scope: we're just parsing a syntax and creating and RDF graph with typed values. I still think that's the case.

In the non-XMLLiteral case, I don't think RDFa needs to say anything (the lexical form is simply the input string). For XMLLiteral, implementations ought to ensure the lexical form of the literal is a valid member of the lexical space of XMLLiteral, i.e. is an exclusive canonical XML string, and I think it may help implementers if this was explicit in the RDFa spec where it tells them to serialise the element contents. (A concrete implementation of this might not necessarily do the canonicalisation itself, e.g.
if the output syntax is RDF/XML then it can happily represent the abstract notion of lexical form with some non-canonical XML markup because RDF/XML defines it to be equivalent. But when the output syntax is N3/Turtle, which has a 1:1 mapping between serialisation and the lexical form, the serialisation does need to really be exclusive canonical XML.)

Examples and test cases should never define invalid triples, i.e. triples with lexical forms that are not in their datatype's lexical space. (That doesn't affect any examples in the RDFa-in-XHTML spec, but it does in HTML+RDFa, and it affects some tests.)

I think that's about all there is to it ;-)

-- 
Philip Taylor
pjt47@cam.ac.uk

Received on Thursday, 17 September 2009 00:41:19 UTC