Re: Closing ISSUE-13 from Pat Hayes on 2012-05-11 (public-rdf-wg@w3.org from May 2012)

From: Pat Hayes <phayes@ihmc.us>
Date: Fri, 11 May 2012 01:24:25 -0500
To: Richard Cyganiak <richard@cyganiak.de>
Cc: David Wood <david@3roundstones.com>, public-rdf-wg Group WG <public-rdf-wg@w3.org>
Message-Id: <2E439D19-7102-49A3-8551-3FBC6876AA51@ihmc.us>
Thanks for taking the trouble to explain, but I still don't get it.

On May 10, 2012, at 4:43 PM, Richard Cyganiak wrote:

> Pat,
> 
> The canonical mapping simply flags some of the lexical forms (exactly one per value) as canonical. That is all. This implies no expectation that consumers will find canonicalized data, and it implies no expectation that producers should canonicalize.

Then I fail to see what the point is of even having the idea mentioned in the definition of the datatype. I am worried that some readers will assume that some expectations are being implied, for otherwise why mention the matter?

> 
> Let me give you an example.

OK

> In the RDB2RDF WG, we get values out of a relational database, convert them to XSD datatypes, and stick them into IRIs so that we get IRIs that identify the database records. Like <http://example.com/mytable/ID=42>. Now the thing is, as far as we know, the database could give us 42 or +42 or 042 — these are all equal and would all be semantically correct results to a database query. This may be a non-issue for integer columns, but it is a very real problem for decimals, floats and date/time types. To ensure that different implementations would produce equal IRIs from equal values, we need to canonicalize.

Sorry, but this seems to be exactly an example of canonicalization indeed coming along with expectations. As you point out, there is a NEED to canonicalize here. This seems to be exactly counter to your point, above, which you are telling me it illustrates.

> Conveniently, the XSD spec defines canonical mappings for all these XSD datatypes. So our spec simply says that you MUST use the canonical form of the literal when creating these IRIs. So an implementation that produces <mytable/ID=42> conforms, while <mytable/ID=+42> or <mytable/ID=042> don't.
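
To make the quoted example concrete, here is a minimal sketch of the kind of canonical mapping being described, for xsd:integer only. The function names (canonical_integer, record_iri) and the IRI layout are assumptions made for illustration; they are not taken from the RDB2RDF or XSD specifications.

```python
import re

# Lexical forms of xsd:integer: an optional sign followed by ASCII digits.
_INTEGER = re.compile(r"([+-]?)([0-9]+)")


def canonical_integer(lexical: str) -> str:
    """Return the canonical lexical form of an xsd:integer literal.

    Canonical form: no leading '+', no leading zeros, and zero is "0".
    """
    match = _INTEGER.fullmatch(lexical.strip())
    if match is None:
        raise ValueError(f"not an xsd:integer lexical form: {lexical!r}")
    sign, digits = match.groups()
    digits = digits.lstrip("0") or "0"
    # "-0" denotes the same value as "0", so the sign is dropped for zero.
    if sign == "-" and digits != "0":
        return "-" + digits
    return digits


def record_iri(base: str, table: str, column: str, lexical: str) -> str:
    """Build a record IRI from a column value (hypothetical IRI layout)."""
    return f"{base}{table}/{column}={canonical_integer(lexical)}"
```

Under these assumptions, the database results 42, +42 and 042 all yield http://example.com/mytable/ID=42, so two independent implementations mint the same IRI for the same value.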

> 
> Another example. Some RDF stores like to canonicalize on input, because that's cheaper than comparing at query time. A store that wants to go the whole way and properly support value-based comparison for all datatypes including rdf:XMLLiteral will need *some* way of canonicalizing literals. And certainly it would be convenient for the implementer if the RDF specs already point to a method for canonicalizing these literals. Users of the RDF stores may never know any of this: the implementation may just apply canonicalization, create a hash of the result, store that in its index, and use it for comparisons.
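
The store behaviour described in the quoted paragraph (canonicalize on input, hash, index, compare by hash) can be sketched roughly as follows. This is an illustration under stated assumptions, not how any particular RDF store works: LiteralIndex and literal_key are hypothetical names, each literal is assumed to be a single-rooted XML element, and the canonicalizer is Python's standard-library C14N 2.0 implementation rather than the Exclusive XML Canonicalization named in the resolution.

```python
import hashlib
from xml.etree.ElementTree import canonicalize  # C14N 2.0, Python 3.8+


def literal_key(xml_text: str) -> str:
    """Hash of the canonical form; equal canonical forms give equal keys."""
    canonical = canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class LiteralIndex:
    """Toy index that canonicalizes XML literals once, at insertion time."""

    def __init__(self) -> None:
        self._by_key = {}

    def add(self, xml_text: str) -> str:
        key = literal_key(xml_text)
        # Keep the first lexical form seen for each canonical key.
        self._by_key.setdefault(key, xml_text)
        return key

    def contains(self, xml_text: str) -> bool:
        # A lookup compares keys only; stored literals are not re-parsed.
        return literal_key(xml_text) in self._by_key
```

With this sketch, adding `<a y="2" x="1"/>` and then looking up `<a x="1" y="2"></a>` finds a match, because both spellings canonicalize to the same string. Users of such a store would see only the lookup result, never the canonical form.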
> 
> Now, coming from another direction: I note that RDF Semantics has lots of content that isn't *necessary* for anything. For example, let's look at the notion of leanness. It is completely unnecessary for explaining entailment. And there's no way for a data publisher to indicate that they've gone to all the effort of publishing lean graphs. So users of an RDF graph will never know if it is lean. So why bother? It just seems pointless and confuses things. But we could define two syntaxes (Turtle and LTurtle) so that publishers can show that their data is nice and tidy.
> 
> Yes, that proposal completely misses the point of leanness — it's simply a property that some RDF graphs have while others don't, and that property is sometimes useful when reasoning (in the brainy/talky sense) about RDF.
> 
> Same with canonicalization — it's the kind of thing that *some* implementers or writers of derived specs will need/want, and since we have the notion already (from RDF 2004 where it was required), we can just as well include it, rather than forcing everyone to independently rediscover that there was a canonicalization mechanism described in the RDF 2004 spec that got cut out for RDF 1.1

By all means let us mention it, even draw attention to it. But I see no reason to include this (purely informative) information in the very definition of the datatype. It is completely irrelevant to the definition of the actual datatype; including it there seems to imply a relevance which is illusory. In your analogy to leanness, above, we did not include the definition of leanness inside the definition of RDF graph.

Pat

> 
> Best,
> Richard
> 
> 
> 
> On 10 May 2012, at 22:00, Pat Hayes wrote:
> 
>> Um. Can I raise the issue I was trying (but failing) to clarify through IRC yesterday? It concerns canonical lexical forms. I may be simply not understanding the issue, in which case please someone tell me so. 
>> 
>> Seems to me that if we define a canonical lexical form but also make it optional, then this is worse than just not mentioning it at all. Consider A who publishes some data and B who queries the data. No matter what A does, B does not know that the data is in canonical form, since conformity does not require this. So, B cannot rely on its being canonical, and must proceed cautiously, under the presumption that some data might be uncanonical, and process it (at his expense) to allow for this. And, to repeat, B must do this even if A has, in fact, taken the trouble to canonicalize all the data. Now, A can figure this out himself ahead of time, and so can conclude, correctly, that it is simply not worth the trouble to put his data into canonical form, since users of it will never know it is, and will have to treat it as potentially uncanonical anyway, so why bother? So, it seems to me, an optional canonicalization is simply pointless. It serves only to confuse things.
>> 
>> One way out of this problem is to *require* canonical lexical forms, but this puts all the onus onto the publisher. Another possible way to proceed is to have two closely related datatypes, one of them just like the other but with its lexical form compulsory. So we might have rdf:CXMLLiteral whose specification is exactly like rdf:XMLLiteral except that the lexical form is *required* to be http://www.w3.org/TR/xml-exc-c14n/ . The point being that this enables A to, in effect, publish the fact that his data is nice and tidy, so that B can rely on it. AFAIKS, this idea is essentially zero cost to implementors and users, and has a very small cost to the WG.  
>> 
>> But as I say, I might be missing the whole point. 
>> 
>> Pat
>> 
>> 
>> On May 9, 2012, at 11:45 AM, David Wood wrote:
>> 
>>> Hi all,
>>> 
>>> Today, we resolved [1]:
>>> [[
>>> RESOLVED: in RDF 1.1: [a] XMLLiterals are optional; [b] lexical space consists of well-formed XML fragments; [c] the canonical lexical form is http://www.w3.org/TR/xml-exc-c14n/, as defined in RDF 2004; [d] the value space consists of (normalized) DOM trees.
>>> ]]
>>> 
>>> Richard's proposal [2], that evolved into this resolution, was meant to close ISSUE-13 [3].  So, I have changed the status of ISSUE-13 to "pending approval" and suggest that the implementation of this resolution be considered editorial in nature.
>>> 
>>> Any objections?
>>> 
>>> Regards,
>>> Dave
>>> 
>>> [1] http://www.w3.org/2011/rdf-wg/meeting/2012-05-09#resolution_1
>>> [2] http://lists.w3.org/Archives/Public/public-rdf-wg/2012May/0006.html
>>> [3] https://www.w3.org/2011/rdf-wg/track/issues/13
>> 
>> ------------------------------------------------------------
>> IHMC                                     (850)434 8903 or (650)494 3973   
>> 40 South Alcaniz St.           (850)202 4416   office
>> Pensacola                            (850)202 4440   fax
>> FL 32502                              (850)291 0667   mobile
>> phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes

------------------------------------------------------------
IHMC                                     (850)434 8903 or (650)494 3973   
40 South Alcaniz St.           (850)202 4416   office
Pensacola                            (850)202 4440   fax
FL 32502                              (850)291 0667   mobile
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Friday, 11 May 2012 06:58:21 UTC