Re: Exact format for XML Literals? from Ivan Herman on 2009-09-14 (public-rdf-dawg@w3.org from July to September 2009)

From: Ivan Herman <ivan@w3.org>
Date: Mon, 14 Sep 2009 12:15:06 +0200
To: "Seaborne, Andy" <andy.seaborne@hp.com>
CC: Axel Polleres <axel.polleres@deri.org>, W3C SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <4AAE17AA.90307@w3.org>
Seaborne, Andy wrote:
> 
>> -----Original Message-----
>> From: Ivan Herman [mailto:ivan@w3.org]
>> Sent: 14 September 2009 05:59
>> To: Seaborne, Andy
>> Cc: Axel Polleres; W3C SPARQL Working Group
>> Subject: Re: Exact format for XML Literals?
>>
>> Andy,
>>
>> Here is a concrete example. Say our data is:
>>
>> <rdf:RDF xmlns:rdf="..." xmlns:ex="...">
>> <rdf:Description rdf:about="">
>>     <ex:p rdf:parseType="Literal">
>>        <ex:bla1   a="something" q="and" b="something else"    />
>>     </ex:p>
>> </rdf:Description>
>> </rdf:RDF>
>>
>> My question is: what is the result of
>>
>> PREFIX ex: <...>
>> ASK WHERE {
>>     ?a ex:p
>>       '''<ex:bla1 q="and"
>>           b="something else"     a="something"/>'''^^rdf:XMLLiteral .
>> }
>>
>> My feeling is that the answer should be 'true', regardless of the fact
>> that the two literals are different in the order of the attributes and
>> the usage of white spaces. The RDF/XML spec explicitly says that, in the
>> case above, the XML part is transformed into the 'correct' lexical form
>> when creating the abstract RDF triple (which is defined in terms of
>> canonicalized XML). Does the SPARQL spec say the same?
> 
> The RDF/XML spec says that because it is necessary for XML processing reasons.   It does not apply to the case above because there is no XML processing (e.g. XML namespaces set further out).
> 

Canonical XML would require the attributes to be in an alphanumeric
order (ie, a="something" b="something else" q="and") and no extra space
between '<' and '/>'. Do you mean to say that the RDF/XML code above is
not a legal XML Literal?
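
To make the attribute-ordering point concrete, here is a minimal Python
sketch. It is an illustration under assumptions, not what any SPARQL engine
does: it uses xml.etree.ElementTree.canonicalize (Python 3.8+), which
implements C14N 2.0 rather than the Exclusive XML Canonicalization that RDF
Concepts refers to, and the namespace URI is hypothetical because the
examples above elide it as "...". For a fragment this simple the two
algorithms should agree on attribute order and whitespace inside tags.

```python
from xml.etree.ElementTree import canonicalize  # available since Python 3.8

# Hypothetical namespace declaration; the original examples elide the URI.
NS = 'xmlns:ex="http://example.org/ns#"'

# The XML as written in the RDF/XML data and in the query, respectively.
data_form = f'<ex:bla1 {NS}   a="something" q="and" b="something else"    />'
query_form = f'<ex:bla1 {NS} q="and"\n    b="something else"     a="something"/>'

# Canonicalization sorts the attributes and drops whitespace inside the tag.
canon_data = canonicalize(data_form)
canon_query = canonicalize(query_form)

print(canon_data)
print(canon_data == canon_query)
```

The two source strings differ, but their canonical forms are identical,
which is the comparison I would expect the ASK query to be based on. Note
that the canonical form also writes the empty element as a start/end tag
pair rather than with '/>'.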

> The definition of XMLLiteral makes no mention of transformation.  It simply defines the lexical space.  It's just an illegal XMLLiteral otherwise, much like "foo"^^xsd:decimal.
> 
>> Note that this is _not_ the case as if we replaced the two literals
>> with, say, 1.0 and 1.00 declaring both to be floats.
> 
> A SPARQL processor may or may not equate "1.0"^^xsd:float and "1.00"^^xsd:float as floats in a graph pattern.  They are "=" in a FILTER.  The design does not require xsd:float understanding in the triple pattern matching; it allows a system to canonicalize on input or not, it allows comparison to be value sensitive or not if the graph is not holding canonical lexical forms (see the BGP extension section).
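
The difference between matching on lexical forms and comparing values can
be sketched in a few lines of Python. This is only an illustration of the
two notions of equality, with Python strings standing in for lexical forms
and Python floats standing in for xsd:float values; it does not model any
particular SPARQL engine.

```python
# Lexical forms as they would appear in the graph and in the query.
lex_a = "1.0"
lex_b = "1.00"

# Term (lexical) equality: what simple triple-pattern matching may use.
same_term = lex_a == lex_b

# Value equality: what "=" in a FILTER compares for numeric types.
same_value = float(lex_a) == float(lex_b)

print(same_term, same_value)
```
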
>  
>> The way XML Literal
>> is currently defined is such that the lexical form (not the value
>> space!) is the canonical XML version.
> 
> Yes.
> 
>> Ie, by referring to the fact that
>> the comparison of literal should be done in the value space does not
>> cover the XML Literal case.
> 
> No - the defn of XML literal says the lexical space is canonicalized XML but it does not say it must be transformed in any way.  The parsing of RDF/XML is not relevant unless the characters come from an RDF/XML document where the namespaces etc need to be pushed down for consistency.
> 
> The example is an illegal XML literal in the same way "foo"^^xsd:decimal or "<b>"^^rdf:XMLLiteral is illegal - the string part is not in the lexical space.  It's outside the definition of the datatype (and a graph containing it is inconsistent) - the processor might do something about that but it's fixing up an error, not required to by spec.
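
The "is the string in the lexical space?" reading can be sketched as a pair
of membership checks in Python. The function names are hypothetical, the
decimal pattern is my transcription of the XSD lexical representation, and
the XML check is only an approximation: it wraps the content in a dummy
root element and uses Python's C14N 2.0 implementation instead of Exclusive
XML Canonicalization, and it ignores the namespace condition entirely.

```python
import re
from xml.etree.ElementTree import ParseError, canonicalize

# Optional sign, digits, optional fractional part (per XSD decimal).
DECIMAL_RE = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")


def in_decimal_lexical_space(s: str) -> bool:
    """True if s matches the xsd:decimal lexical pattern sketched above."""
    return DECIMAL_RE.fullmatch(s) is not None


def looks_canonical(fragment: str) -> bool:
    """True if the fragment is well-formed and already in canonical form."""
    wrapped = f"<r>{fragment}</r>"  # dummy root so mixed content parses
    try:
        return canonicalize(wrapped) == wrapped
    except ParseError:
        return False  # not well-balanced XML at all, e.g. "<b>"


print(in_decimal_lexical_space("foo"))
print(looks_canonical("<b>"))
print(looks_canonical("<bla   b='something' a='else'>and else</bla>"))
print(looks_canonical('<bla a="else" b="something">and else</bla>'))
```

Under this reading only the last string is accepted; the others are
rejected, not repaired, which is exactly the behaviour being argued for in
the quoted paragraph.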
> 
> 
> 
> Coming back to my original question - why isn’t the XML literal supplied in canonical form if it's from RDFa?
> 

As I said in my other mail: the question that came up in relation to
RDFa is how the SPARQL based test cases should be properly formulated.
That is how we got to this issue...

Ivan





>  Andy
> 
>> Cheers
>>
>> Ivan
>>
>> Seaborne, Andy wrote:
>>> Ivan,
>>>
>>> What is the use case from RDFa?  Can we have a concrete example to discuss?
>>> In particular, why is the literal given not already canonicalized when
>> forming the query?
>>> SPARQL already allows bad lexical forms ("hello"^^xsd:decimal) - the
>> definition of the datatype says something and the data is wrong with respect
>> to that in the same way as with XMLLiteral.
>>>
>>> There are other ways to consider, such as providing an explicit operation to
>> produce a canonical form:
>>> { ?s ?p ?o .
>>>   FILTER (?o = XC14N("<bla   b='something' a='else'>and else</bla>"^^rdf:XMLLiteral))
>>> }
>>>
>>> At the moment, a SPARQL engine is not required to have special
>> understanding of XML-Literals in FILTERs.  We could document what XMLLiteral
>> casting means and that it includes canonicalization (or be a warning/error -
>> more consistent - in which case have a "canonical" function).
>>> { ?s ?p ?o .
>>>   FILTER (?o = rdf:XMLLiteral("<bla   b='something' a='else'>and else</bla>"))
>>> }
>>>
>>> (definition of XMLLiteral)
>>>>>> [[[
>>>>>> The lexical space is the set of all strings:
>>>>>> - which are well-balanced, self-contained XML content [XML];
>>>>>> - for which encoding as UTF-8 [RFC 2279] yields exclusive Canonical XML
>>>>>> [...][XML-XC14N]
>>>>>> - for which embedding between an arbitrary XML start tag and an end tag
>>>>>> yields a document conforming to XML Namespaces [XML-NS]
>>>>>> ]]]
>>> The definition defines the lexical space as a set of strings which are
>> UTF-8 encoded canonical forms and says nothing outside that.  It does not say
>> canonicalization must be applied to produce a legal lexical form from
>> otherwise illegal forms.
>>> This seems the same to me as the way XSD primitive datatypes are defined
>> [3] e.g.
>>> [[[
>>> 3.2.3.1 Lexical representation
>>>
>>> decimal has a lexical representation consisting of a finite-length sequence
>> of decimal digits (#x30-#x39) separated by a period as a decimal indicator.
>> An optional leading sign is allowed. If the sign is omitted, "+" is assumed.
>> Leading and trailing zeroes are optional. If the fractional part is zero, the
>> period and following zero(es) can be omitted. For example: -1.23,
>> 12678967.543233, +100000.00, 210.
>>> ]]]
>>>
>>>>>> Note that the RDF/XML specification goes a little bit further: in point
>>>>>> 7.2.17 of the RDF/XML spec[2] it explicitly says:
>>>>>>
>>>>>> [[[
>>>>>> l is transformed into the lexical form of an XML literal in the RDF graph
>>>>>> ]]]
>>>>>>
>>>>>> and refers to the XC14N algorithm explicitly. Ie, the XML extract above
>>>>>> is perfectly valid for RDF/XML. However, the current SPARQL spec is
>>>>>> silent about this.
>>> This text is in the RDF/XML Syntax Specification and applies to RDF/XML syntax
>> and to parsing RDF/XML.
>>> It makes sense to me in the context of XML processing because in XML there
>> are external (in the character string being processed) factors like namespace
>> and language which nest in the whole document.  SPARQL isn't in the same
>> situation.
>>>  Andy
>>>
>>> [3] http://www.w3.org/TR/xmlschema-2/#decimal
>>>
>>>> -----Original Message-----
>>>> From: public-rdf-dawg-request@w3.org [mailto:public-rdf-dawg-request@w3.org]
>>>> On Behalf Of Ivan Herman
>>>> Sent: 09 September 2009 11:52
>>>> To: Axel Polleres
>>>> Cc: W3C SPARQL Working Group
>>>> Subject: Re: Exact format for XML Literals?
>>>>
>>>> Axel, that quote is in the RDF Concept standard[1], the SPARQL group
>>>> will not change that...
>>>>
>>>> What I think we ought to do is to put something like the RDF/XML spec
>>>> says, ie, that the literal in the graph pattern is 'transformed' into an
>>>> RDF XML Literal.
>>>>
>>>> Ivan
>>>>
>>>>
>>>>
>>>> [1] http://www.w3.org/TR/rdf-concepts
>>>>
>>>> Axel Polleres wrote:
>>>>> I guess just dropping
>>>>> "
>>>>>> - for which encoding as UTF-8 [RFC 2279] yields exclusive Canonical XML
>>>>>> [...][XML-XC14N]
>>>>> "
>>>>> is not sufficient?
>>>>>
>>>>> I.e. aren't the first and third item enough?
>>>>> What am I missing here?
>>>>>
>>>>> Thanks,
>>>>> Axel
>>>>>
>>>>> On 8 Sep 2009, at 08:24, Ivan Herman wrote:
>>>>>
>>>>>> Guys,
>>>>>>
>>>>>> an issue came up in the RDFa task force that has relevance on the SPARQL
>>>>>> syntax. It may be that this will lead to a need to tighten up the SPARQL
>>>>>> language specification's language (no new feature here). It is related
>>>>>> to the way XML Literals are represented in the query language (well,
>>>>>> essentially, in Turtle...). The question is whether the following
>>>>>> extract is valid or not:
>>>>>>
>>>>>> a:bla b:blabla
>>>>>>  "<bla   b='something' a='else'>and else</bla>"^^rdf:XMLLiteral.
>>>>>>
>>>>>> The lexical space of XML Literal is defined by the RDF concept document
>>>>>> and it says:
>>>>>>
>>>>>> [[[
>>>>>> The lexical space is the set of all strings:
>>>>>> - which are well-balanced, self-contained XML content [XML];
>>>>>> - for which encoding as UTF-8 [RFC 2279] yields exclusive Canonical XML
>>>>>> [...][XML-XC14N]
>>>>>> - for which embedding between an arbitrary XML start tag and an end tag
>>>>>> yields a document conforming to XML Namespaces [XML-NS]
>>>>>> ]]]
>>>>>>
>>>>>> the important point is the usage of XC14N. A cursory read of this text
>>>>>> would mean that, in SPARQL, one would have to write a canonical XML for
>>>>>> an XML Literal (which is not the case in the example above).
>>>>>>
>>>>>> Note that the RDF/XML specification goes a little bit further: in point
>>>>>> 7.2.17 of the RDF/XML spec[2] it explicitly says:
>>>>>>
>>>>>> [[[
>>>>>> l is transformed into the lexical form of an XML literal in the RDF graph
>>>>>> ]]]
>>>>>>
>>>>>> and refers to the XC14N algorithm explicitly. Ie, the XML extract above
>>>>>> is perfectly valid for RDF/XML. However, the current SPARQL spec is
>>>>>> silent about this.
>>>>>>
>>>>>> It is fairly obvious that the same should happen in SPARQL (and in
>>>>>> Turtle): the parser should, conceptually, apply a canonicalization
>>>>>> algorithm on the XML content in the literal. But it may be better to say
>>>>>> that explicitly in the document, similarly to RDF/XML...
>>>>>>
>>>>>> Am I missing something?
>>>>>>
>>>>>> Ivan
>>>>>>
>>>>>> [1] http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral
>>>>>> [2] http://www.w3.org/TR/rdf-syntax-grammar/#section-grammar-productions
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Ivan Herman, W3C Semantic Web Activity Lead
>>>>>> Home: http://www.w3.org/People/Ivan/
>>>>>> mobile: +31-641044153
>>>>>> PGP Key: http://www.ivan-herman.net/pgpkey.html
>>>>>> FOAF: http://www.ivan-herman.net/foaf.rdf
>>>>> --
>>>>> Dr. Axel Polleres
>>>>> Digital Enterprise Research Institute, National University of Ireland,
>>>>> Galway
>>>>> email: axel.polleres@deri.org <mailto:axel.polleres@deri.org>  url:
>>>>> http://www.polleres.net/
>>>>>
>>>>>
>>>>>

-- 

Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
PGP Key: http://www.ivan-herman.net/pgpkey.html
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Monday, 14 September 2009 10:24:00 UTC