Re: RDF PATCH and Downstream consequences of blank nodes [was Re: SPARQL Profile for PATCH] from David Booth on 2014-09-24 (semantic-web@w3.org from September 2014)

From: David Booth <david@dbooth.org>
Date: Wed, 24 Sep 2014 11:48:32 -0400
To: Tim Berners-Lee <timbl@w3.org>
CC: SW-forum Web <semantic-web@w3.org>
Message-ID: <5422E7D0.4000108@dbooth.org>

Hi Tim,

On 09/24/2014 01:59 AM, Tim Berners-Lee wrote:
> On 2014-09-23, at 22:34, David Booth <david@dbooth.org> wrote:
>
>> BTW, I want to draw attention to the fact that the need for
>> defining an RDF-specific PATCH operation is *entirely* a
>> consequence of RDF's allowance of unrestricted blank nodes.  I do
>> not think that blank nodes should be eliminated from RDF, but I am
>> convinced that RDF's current treatment of blank nodes is a
>> significant design flaw that has *many* downstream effects that are
>> ultimately detrimental to RDF's adoption. The need for RDF PATCH is
>> another example.
>
> I disagree.

I acknowledge that there are some who would not consider this problem to 
be a "design flaw" -- that's a value judgement.   But I think the 
detrimental downstream effects of unrestricted blank nodes are 
indisputable, and the consequential need for an RDF-specific PATCH 
operation is a good example.

> Most people use blank nodes in a reasonable fashion.

Agreed.  That is the point behind Well Behaved RDF: if we judiciously 
restrict the use of blank nodes in a way that largely coincides with the 
way they are most commonly used, then we can still have the most 
important benefits of blank nodes while eliminating the problems caused 
by *unconstrained* blank nodes.

>
>>
>> Unix/linux diff and patch utilities have been used successfully for
>> *decades*, with many other information representations.  Imagine
>> how simple and easy it would be if we could just generate canonical
>> N-Triples and use standard diff and patch against that!
>
> You can do that if you want. With RDF you have a graph, not a tree.
> Not an ordered list. There are canonicalization algorithms as you
> know. Just not a standard.

A PATCH operation can only work if the canonicalization is *standard*. 
Otherwise when a patch is sent from one party to another, the sending 
party may have used a different canonicalization algorithm than the 
receiving party assumes, and the result will be garbage.
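
To illustrate, here is a minimal Python sketch. The example.org IRIs, the blank node labels and the triple_diff helper are all made up for illustration, and sorting lines is only a stand-in for a real canonicalization algorithm. Both parties serialize the same graph, but they label the blank node differently, so a line-based diff reports changes that are not really there:

```python
def canonical_lines(ntriples):
    """Sort the non-empty N-Triples lines.

    This is a canonical form only when the graph has no blank nodes,
    because blank node labels are arbitrary and differ between serializers.
    """
    return sorted({line.strip() for line in ntriples.splitlines() if line.strip()})


def triple_diff(old, new):
    """Return (removed, added) lines, the raw material for a line-based patch."""
    old_lines = set(canonical_lines(old))
    new_lines = set(canonical_lines(new))
    return sorted(old_lines - new_lines), sorted(new_lines - old_lines)


# The same graph, serialized by two parties with different blank node labels.
sender = """
_:a <http://example.org/name> "Alice" .
<http://example.org/doc> <http://example.org/author> _:a .
"""
receiver = """
<http://example.org/doc> <http://example.org/author> _:b0 .
_:b0 <http://example.org/name> "Alice" .
"""

# The same graph again, but with a minted IRI instead of a blank node.
sender_ground = """
<http://example.org/alice> <http://example.org/name> "Alice" .
<http://example.org/doc> <http://example.org/author> <http://example.org/alice> .
"""
receiver_ground = """
<http://example.org/doc> <http://example.org/author> <http://example.org/alice> .
<http://example.org/alice> <http://example.org/name> "Alice" .
"""

if __name__ == "__main__":
    print(triple_diff(sender, receiver))                # spurious differences
    print(triple_diff(sender_ground, receiver_ground))  # no differences
```

With the ground (blank-node-free) version the two serializations compare as identical, which is what one would need for standard diff and patch to be usable.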

>
> For many of my applications, the app which changes the data knows
> what it is changing and so the diff does not have to be generated
> from the new and old graph. The UI widget which changes the value
> knows what it is changing.

Sure, but the PATCH operation still must be defined with respect to an 
agreed-upon "old graph", even if that "old graph" is not actually 
materialized.

>
>
>> But we can't, because blank nodes are unstable across RDF
>> serializations and no canonical way to generate them has been
>> standardized.  This, in turn is because generating a canonical form
>> of unrestricted RDF is a hard problem (NP-complete), because of
>> blank nodes.  The problem is *much* easier if the use of blank
>> nodes is limited to *implicit* blank nodes -- those that are
>> generated implicitly by the use of square brackets "[]" or
>> parentheses "()" for lists in Turtle -- and indeed this is the vast
>> majority of blank node use.  (See "Everything You Always Wanted to
>> Know About Blank Nodes", by Hogan, Arenas, Mallea and Polleres:
>> http://www.websemanticsjournal.org/index.php/ps/article/viewFile/365/387
>> )
>>
>> For this reason the use of "Well Behaved RDF" was proposed, which
>> limits the use of blank nodes to implicit blank nodes:
>> http://dbooth.org/2013/well-behaved-rdf/Booth-well-behaved-rdf.pdf
>> I don't know if Well Behaved RDF is the best solution to this
>> problem. Maybe someone will come along with a better idea.  But I
>> am convinced that the current treatment of blank nodes in RDF is a
>> serious problem that we should fix in order to make RDF simpler to
>> use, understand and adopt.
>
> I think that in theory the paper you quote
> http://www.websemanticsjournal.org/index.php/ps/article/viewFile/365/387
> does a nice job and in practice people tend to use them in
> a constrained way.

Yes, that's part of my point.

> You have one definition of well-behaved RDF.
>
> Two others are:
>
> Locally: - Each blank node can be uniquely identified WITHIN THE
> GRAPH IN QUESTION by a path from a non-blank node or constant.
>
> Globally: - Each blank node can be uniquely identified WITHIN THE
> WORLD by a path over [inverse] functional properties  to a non-blank
> node or constant.

Yes, the definition that I proposed was intended as a straw man to 
initiate discussion, but some other definition may ultimately be preferred.

The local/global techniques that you mention might be a good basis for 
canonical naming of blank nodes, or Skolemization.  Sandro (I think) 
once suggested that another way RDF could have approached blank nodes 
was that they should be treated merely as syntactic devices for 
auto-generated URIs.  The techniques you mention above might be useful 
in such an approach.  This would basically amount to a Skolemization 
technique that would predictably generate the same URI from the same 
blank node.
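
As a rough sketch of what such a predictable Skolemization might look like (this is not a standardized algorithm, and the BASE namespace, the tuple representation of triples and the skolemize function are my own assumptions for illustration), the Python below names each blank node by hashing the smallest (subject, predicate) path that reaches it from an already-named node. It roughly follows the "locally" definition above, and it gives up if a path does not identify a single blank node:

```python
import hashlib

# Hypothetical namespace for the minted IRIs, chosen only for illustration.
BASE = "http://example.org/.well-known/genid/"


def is_blank(term):
    """Terms are plain strings in N-Triples syntax; blank nodes start with '_:'."""
    return term.startswith("_:")


def skolemize(triples):
    """Replace blank nodes with IRIs derived from a path from a named node.

    Blank nodes that are never the object of a triple with a named subject
    are left as they are.
    """
    names = {}  # blank node label -> minted IRI
    while True:
        # For each unnamed blank node, pick the smallest path that reaches it
        # from a subject that is already an IRI. The choice depends only on
        # IRIs, not on the arbitrary blank node labels.
        candidates = {}
        for s, p, o in triples:
            subject = names.get(s, s)
            if is_blank(o) and o not in names and not is_blank(subject):
                path = (subject, p)
                if o not in candidates or path < candidates[o]:
                    candidates[o] = path
        if not candidates:
            break
        paths = list(candidates.values())
        if len(set(paths)) != len(paths):
            raise ValueError("a path does not uniquely identify a blank node")
        for label, (subject, p) in candidates.items():
            digest = hashlib.sha1(f"{subject} {p}".encode("utf-8")).hexdigest()
            names[label] = f"<{BASE}{digest}>"
    return {(names.get(s, s), p, names.get(o, o)) for s, p, o in triples}


# The same graph with two different blank node labels and triple orders.
g1 = [
    ("<http://example.org/doc>", "<http://example.org/author>", "_:a"),
    ("_:a", "<http://example.org/name>", '"Alice"'),
]
g2 = [
    ("_:x", "<http://example.org/name>", '"Alice"'),
    ("<http://example.org/doc>", "<http://example.org/author>", "_:x"),
]

if __name__ == "__main__":
    print(skolemize(g1) == skolemize(g2))
```

Both inputs end up with the same minted IRI for the author node, so the resulting triples could be sorted and compared line by line. Graphs in which two blank nodes share the same path are rejected here; under this approach the author would mint a URI by hand for those cases.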

>
> We discussed this in a diff paper ages ago...  I can't find the paper
> but here is some code from 2005:
>
> https://www.w3.org/2000/10/swap/delta.py
>
> This generates diffs. cwm will accept  diffs to patch a graph. See
> https://www.w3.org/2000/10/swap/test/delta/detailed.tests
>
>
>> I really don't like having to make excuses for RDF when it cannot
>> be used in a similar way as nearly every other information
>> representation -- such as being able to easily compare two RDF
>> documents for "equality" (which in RDF becomes a complex graph
>> isomorphism problem) or generate a simple diff and patch -- all
>> because of RDF's unrestricted treatment of blank nodes.
>
> Here is some code from 2003: https://www.w3.org/2000/10/swap/cant.py
>
> This is what we use for testing equality of graphs in the CWM
> regression test suite. It works in practice on graphs we use.
>
>> Clearly this is not something that the Linked Data Platform working
>> group can fix.  But I think it is important to bring it to people's
>> attention, in the hope that we will someday soon have the
>> creativity and gumption to fix it.
>
> The LDP WG defines a constrained form of "non-pathological" RDF in
> which their patch system works.
>
>> I should also acknowledge that there are some who do not feel that
>> RDF's treatment of blank nodes is a problem.  Fine.  It may not be
>> a problem to an elite few who are well steeped in the subtleties of
>> description logic, model theory and RDF Semantics, and who don't
>> mind having to use RDF-specific tools instead of generic tools.
>> But having tried for over 10 years to explain RDF to a wider
>> audience of regular software developers, I am convinced that
>> subtleties like RDF's treatment of blank nodes *are* a problem to a
>> much wider audience of *potential* RDF users who would be more
>> inclined to adopt RDF if it didn't have complexities like this.  As
>> it is they are more likely to stick with JSON or XML, whose
>> complexities they already know, rather than venturing into the
>> obscure and esoteric world of RDF.
>>
>
> I think developers (like me) left to their own devices will produce
> RDF graphs which are typically tree-like and well-behaved in the ways
> they need, in fact.

Yes, again that is the main motivation behind Well Behaved RDF: it can 
be a minimally invasive restriction because in most cases it is what RDF 
users are already doing.  In the few cases where it isn't, the RDF 
author can mint a URI.

>
>> RDF tools are not as mature as those for XML or even JSON, which is
>> much younger than RDF.  I believe blank nodes are one specific
>> reason they're not.  The fact that we still don't even have a
>> simple, standard way to compare RDF documents and compute diffs and
>> patches, is a perfect example.
>
> I think blank nodes are valuable and essential.  I would hate to have
> to clutter my data with names for those nodes.

Under Well-Behaved RDF, in the vast majority of cases, you wouldn't have 
to clutter your data with extra URIs.  You could still use *implicit* 
blank nodes as much as you want, without causing problems.  Only in 
those few, potentially problematic cases would you need to mint a URI 
instead of using a blank node.  I personally think that would be a 
reasonable trade-off, in order to reduce some of RDF's complexity.

David

>
> RDF tools have been around for a while.
>
> (I have it on my (rather long) agenda to move the SWAP (cwm etc)
> suite from the old CVS repo to github, as it could do with some TLC,
> but I use it all the time.   I just tried the CVS to git import and
> it fails "Fetching RDFSink.py   v 1.3.2.1 \n Unknown: error " ...
> sigh)
>
> Repo - http://dev.w3.org/cvsweb/2000/10/swap/
>
> Tim
>
>> David
>>
>>
>
Received on Wednesday, 24 September 2014 15:49:03 UTC