Re: Exact format for XML Literals? from Ivan Herman on 2009-09-14 (public-rdf-dawg@w3.org from July to September 2009)

From: Ivan Herman <ivan@w3.org>
Date: Mon, 14 Sep 2009 12:06:53 +0200
To: Bijan Parsia <bparsia@cs.man.ac.uk>
CC: "Seaborne, Andy" <andy.seaborne@hp.com>, Axel Polleres <axel.polleres@deri.org>, W3C SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <4AAE15BD.4060101@w3.org>
Hi Bijan,

Bijan Parsia wrote:
> On 14 Sep 2009, at 05:58, Ivan Herman wrote:
> 
>> Andy,
>>
>> Here is a concrete example. Say our data is:
>>
>> <rdf:RDF xmlns:rdf="..." xmlns:ex="...">
>> <rdf:Description rdf:about="">
>>   <ex:p rdf:parseType="Literal">
>>      <ex:bla1   a="something" q="and" b="something else"    />
>>   </ex:p>
>> </rdf:Description>
>> </rdf:RDF>
>>
>> My question is: what is the result of
>>
>> PREFIX ex: <...>
>> ASK WHERE {
>>   ?a ex:p
>>     "<ex:bla1 q="and"
>>         b="something else"     a="something"/>"^^rdf:Literal .
>> }
>>

(I just realized that I wanted to use rdf:XMLLiteral in the example.
Sorry about that...)

>> My feeling is that the answer should be 'true', regardless of the fact
>> that the two literals are different in the order of the attributes and
>> the usage of white spaces.
> 
> Since comparisons are normally in "term" space, i.e., lexical space, my
> feeling is different.
> 

Hm. We really do have different feelings:-).

So if I have the data as

<> ex:a "1.00"^^xsd:float .

then

ASK WHERE { ?a ex:a "1.0"^^xsd:float . }

should return false? Is it then in the realm of the entailment regimes,
in the sense that it would require D-entailment to be able to say
'true'? That may well be the answer (and we may want to think about this
when discussing entailment regimes)...
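
To make the two readings concrete, here is a minimal Python sketch (not SPARQL, and not taken from any spec) that contrasts term equality, which compares lexical form and datatype, with value equality, which first maps the lexical form into the value space. The `Literal` tuple and both helper functions are hypothetical names invented for this illustration. Only xsd:float is handled, and Python's double-precision `float` stands in for the 32-bit xsd:float value space, which is a simplification.

```python
from typing import NamedTuple

XSD_FLOAT = "http://www.w3.org/2001/XMLSchema#float"


class Literal(NamedTuple):
    """A typed literal as a (lexical form, datatype IRI) pair."""
    lexical: str
    datatype: str


def term_equal(a: Literal, b: Literal) -> bool:
    """Simple matching: same lexical form and same datatype."""
    return a == b


def value_equal(a: Literal, b: Literal) -> bool:
    """Datatype-aware matching, sketched for xsd:float only."""
    if a.datatype == b.datatype == XSD_FLOAT:
        # Map both lexical forms into the value space before comparing.
        return float(a.lexical) == float(b.lexical)
    # Unknown datatype: fall back to term equality.
    return term_equal(a, b)


data = Literal("1.00", XSD_FLOAT)   # literal in the data
query = Literal("1.0", XSD_FLOAT)   # literal in the ASK pattern

print(term_equal(data, query))   # False
print(value_equal(data, query))  # True
```

Under the first function the ASK pattern does not match; under the second it does, which is roughly the difference a datatype-aware entailment regime would make.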

>> The RDF/XML spec explicitly says that, in the case above, the XML part
>> is transformed into the 'correct' lexical form when creating the
>> abstract RDF triple (which is defined in the term of canonicalized XML).
> 
> That seems to be a bug in RDF/XML, frankly. The lexical space of
> XMLLiteral is *not* the canonicalized form and I don't see why the parse
> phase should say anything about it. (Do systems generally adhere to this
> part of the spec?) No other datatype, to my knowledge, *requires*
> canonicalization (though XML Schema 1.1 provides for a canonicalization
> for all of them, I believe).
> 
> http://www.w3.org/TR/2003/WD-rdf-concepts-20030123/#dfn-rdf-XMLLiteral
> 
> """The lexical space contains all pairs ( string, lang ) where lang is
> any language identifier [RFC-3066] in lowercase, and string is
> well-balanced, self-contained XML element content [XML], for which the
> XML document corresponding to the pair is a well-formed XML document
> [XML] that also conforms to XML Namespaces [XML-NS]."""
> 

The Rec says:

[[[
The lexical space
is the set of all strings:
  - which are well-balanced, self-contained XML content [XML];
  - for which encoding as UTF-8 [RFC 2279] yields exclusive Canonical
XML (with comments, with empty InclusiveNamespaces PrefixList )
[XML-XC14N];
  - for which embedding between an arbitrary XML start tag and an end
tag yields a document conforming to XML Namespaces [XML-NS]
]]] http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#section-XMLLiteral

(you seem to have referred to the WD).

I am not arguing at this point whether this is right or wrong (see
below). And indeed you are right that no other datatype requires some
sort of a canonicalization.

> 
> But even if you buy that coming from RDF/XML you'll end up with
> canonicalized lexical forms, not every source must do that. AFAICT,
> SPARQL is silent on canonicalization...XMLLiteral is just another
> datatyped literal. So those would definitely not match.
> 

O.k., I agree with your analysis that SPARQL is silent on that. Then my
question is, in fact: is this o.k.? Shouldn't SPARQL do the same as
RDF/XML, which explicitly refers to canonicalization?

If not, the only way I could get a 'yes' answer to my original question
would be to canonicalize the whole thing myself, i.e.:

PREFIX ex: <...>
ASK WHERE {
  ?a ex:p
    "<ex:bla1 xmlns:ex="..." a="something" b="something else"
q="and"/>"^^rdf:XMLLiteral .
}
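
As a rough illustration of what "canonicalize the whole thing myself" involves, the following Python sketch runs both spellings of the element through the standard library's `xml.etree.ElementTree.canonicalize` (Python 3.8 or later). Two assumptions to flag: the namespace IRI `http://example.org/` is a placeholder for the elided "...", and `canonicalize` implements C14N 2.0 rather than the Exclusive XML Canonicalization 1.0 that RDF Concepts cites, so treat the output as indicative only.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace IRI; the original message elides it as "...".
EX = "http://example.org/"

# The element as it appears in the RDF/XML data.
as_written_in_data = (
    f'<ex:bla1 xmlns:ex="{EX}"   a="something" q="and" b="something else"    />'
)

# The element as it appears in the ASK query: different attribute
# order, different white space inside the tag.
as_written_in_query = (
    f'<ex:bla1 xmlns:ex="{EX}" q="and"\n'
    f'        b="something else"     a="something"/>'
)

# canonicalize() returns the canonical serialization as a string.
canon_data = ET.canonicalize(as_written_in_data)
canon_query = ET.canonicalize(as_written_in_query)

print(as_written_in_data == as_written_in_query)  # False: the raw strings differ
print(canon_data == canon_query)                  # True: the canonical forms agree
print(canon_data)
```

Canonical XML sorts the attributes and also writes an empty element as a start/end tag pair, so the printed form ends in `</ex:bla1>` rather than `/>`. A literal that is meant to match the canonical lexical form exactly would need to be spelled that way.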

(As an aside, Andy asked what the RDFa use case is. Well, the full RDFa
test suite is based on using SPARQL ASK on the outcome of RDFa
processing, and the question was what exactly should be put into the
SPARQL code for tests related to XMLLiteral generation, as well as the
corresponding Turtle examples.)

>> Does the SPARQL spec say the same?
>>
>> Note that this is _not_ the same case as if we replaced the two literals
>> with, say, 1.0 and 1.00, declaring both to be floats. The way XML
>> Literal is currently defined is such that the lexical form (not the
>> value space!) is the canonical XML version.
> 
> This is false. See above.

I am not sure what 'This' refers to here...

> If it were true, then semantically the first graph would have a
> not-well-formed literal, thus, semantically, would not be an instance of
> rdfs:Literal.
> 

I am not sure I understand that.:-(

>> I.e., referring to the fact that the comparison of literals should be
>> done in the value space does not cover the XML Literal case.
> 
> ? Er...you mean that the comparison should be done in the lexical space
> cuts no ice? But surely it does :)
> 
> How about errata on RDF XML syntax and RDF concepts to change, in the
> former case, the parsing to simply check for well formedness (with
> namespaces) and the latter to make the value and the lexical spaces
> identical. We can then add functions such as
> "equalUnderCanonicalization" which would apply to *any* datatype.
> 
> I know your default reply ("it's impossible") 

:-)

>                                                 but is it really? Does
> anyone really think these things *aren't* bugs (at the very least, the
> generality of the lexical form is in tension with the strictness of the
> parsing spec)? Plus, it *removes* code.
> 

Reporting a bug against the RDF documents is perfectly possible. But this
should be done by trying to understand how this part of the Rec was
created, i.e., by contacting the original editors, and probably by
consulting the community in some way or other. It is, however, on the
borderline whether this is a bug or a change in the Rec; the latter may
become more touchy indeed.

Personally, I do not really understand the reasons for this definition
either. I am not a very good expert in XML, but I would have expected
the lexical space of XMLLiteral to be well-formed XML, and the value
space to be the canonical XML version, or maybe even the Infoset as an
abstract representation of the XML content. But, again, I am not an
expert on all the details of XML:-(

> Cheers,
> Bijan.
> 
> Cheers,
> Bijan.

Cheers back (twice:-)

Ivan


-- 

Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
PGP Key: http://www.ivan-herman.net/pgpkey.html
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Monday, 14 September 2009 10:07:32 UTC