Re: Patch Issue: Blank Nodes and Tractability from Sandro Hawke on 2013-09-08 (public-ldp-patch@w3.org from September 2013)

From: Sandro Hawke <sandro@w3.org>
Date: Sun, 08 Sep 2013 12:00:10 -0400
To: Andy Seaborne <andy@apache.org>
CC: public-ldp-patch@w3.org
Message-ID: <522C9F0A.50601@w3.org>
On 09/08/2013 06:33 AM, Andy Seaborne wrote:
> On 08/09/13 04:17, Sandro Hawke wrote:
>> In theory, patching a large branching subgraph of blank nodes is
>> intractable.   It's the graph isomorphism problem, which has 
>> complexity NP.
>>
>> I think rdf lists don't have this problem, at least with a clever
>> implementation, because the subgraph of blank nodes doesn't
>> significantly branch.
>>
>> I wonder if there are many RDF graphs that do have this problem.   I can
>> easily imagine paths which are 2-4 blank nodes deep, coming out of
>> record structures that are mapped to RDF, but even an exponential search
>> is okay when the exponent is only 4.  The question is: are there, in
>> practice, subgraphs where you have to traverse 20 or 30 blank nodes,
>> with branching, to get to the part you need to alter? If we don't know
>> of any, how confident are we there won't be any real need for them?
>>
>> Also, do SPARQL systems handle rdf:lists efficiently enough? Can I
>> append to an rdf:List like this:
>>
>> DELETE { ?last rdf:rest rdf:nil }
>> INSERT { ?last rdf:rest ( :NewLastElement ) }
>> WHERE  { :myList :myProp ?n1.  ?n1 rdf:rest ?n2.  ?n2 rdf:rest ?n3.
>> .....    rdf:rest ?last }
>
> rdf:rest*
>  Arbitrary length path.
>
> DELETE { ?last rdf:rest rdf:nil }
> INSERT { ?last rdf:rest ( :NewLastElement ) }
> WHERE  { :myList :myProp ?list .
>          ?list rdf:rest* ?last .
>          ?last rdf:rest rdf:nil }
>

Sorry, I should have disallowed that shortcut.     Please consider my 
question to be about inserting an element n items into a list that's 
much longer.   Also, assume the items in the list are all the same, so 
you can't use value-matching to find the desired spot in the list.

So there really would have to be n variables.    Do you know how typical 
SPARQL systems will do with thousands of variables like that?

>> where the chain is 20 or 200 or 2000 or even 20,000 elements long and
>> have the the system handle it comfortably?
>
> It has to run down the list if all you give it is the head, but it's a 
> loop, not storing and growing internal state as it goes.
>
>> If the answers to the above questions are not good,
>
> I think it is also hard to understand.  The app developer has been 
> dropped into the triple representation of a list when the date may 
> have shown ( ... ).  It's data assembler.
>

Indeed -- if this is part of a system, it would have to be buried where 
most developers wouldn't know about it.

>> then we might want a
>> patch format that sidesteps the whole problem, for instance by having a
>> limited way to refer to blank nodes in the protocol. For example,
>> patches are often written relative to a particular resource
>> representation, pointed to by an ETag.   I don't see any technical
>> reason the patch couldn't say "I'm using the same blank-node-names as
>> that representation does", and if the server cares enough to retain that
>> mapping, it can do patch processing in linear time.
>
> I agree that dropping down to the label is a good idea and necessary 
> if we don't restrict the styles of RDF that can be effectively managed 
> with a patch.
>

*nod*

Perhaps we should strawpoll on these four possible requirements. I 
believe they're all easy to do if we use labels and 2, 3, and 4 are 
impossible to do if we do not.

1. It must be possible to write a patch to go from any graph G1 to any 
other graph G2
2. It must be possible to write a tractable (PTIME and PSPACE) patch to 
go from any graph G1 to any other graph G2
3. Processing all patches must be tractable.   (That is, not only must 
it be possible to write a tractable patch, but it must be impossible to 
write an intractable one.)
4. Creating a patch to go from any G1 to any G2 must itself be a 
tractable problem.


> It has to be the system's choice of internal label:
>

But which system - the patch creator or the patch receiver?

> 1/ you can't use the label as given by syntax because it may be 
> repeated in unrelated updates.
>
> Update 1: add triples:
>
> <#peter> foaf:knows _:a .
> _:a foaf:name "Paul" .
>
> Update 2: add triples:
>
> <#peter> foaf:knows _:a .
> _:a foaf:name "Patricia" .
>
> I can imagine template-driven updates doing that.
>
> 2/ In Turtle the system generates them for [] and ()
>
>
> The question is how to find the internal label.
>
> SPARQL's STR(?blank) could be used -- in the base spec says it's undef 
> (=error, and so extensible) to call it on a blank node.  Or 
> IRI(?bnode) could be overloaded for skolemization (several systems 
> already do variations on this).
>

Yeah, I think we need to move up from the language to the protocol for 
this, and think about what both ends of the conversation are doing.

Here's a protocol that I think works:

1.  Alice publishes version 1 of an n-triples file at URL U1, with ETag 
"v1".

2.  Bob creates a patch against U1 v1.  He uses the same blank node 
labels  that Alice used, for the blank nodes she had.  For new blank 
nodes, he makes up new labels that Alice has not used in v1.   His patch 
basically consists of two n-triples strings, one with all the triples to 
be deleted, one with all the triples to be added, and he specifies it's 
a patch against v1.

3.  Alice receives and applies the patch, giving the result ETag 
"vBob1".   She only needs to tell Bob that she has accepted the patch 
and given it ETag "vBob1" for them to maintain their shared state.

4.  If Alice gets a concurrent patch from Charlie (that is, another 
patch against v1), and she can determine it does not conflict with Bob's 
patch, she can confirm it to Charlie (creating vCharlie1), then send 
each of them patches to bring them both up to vBob1Charlie1 (or v2, or 
whatever - it's probably really named by a hash).    In the patches she 
constructs for each of them, she will cause them to rename any of the 
blank nodes they created whose names happen to conflict with each other.

There are a few painful things about this protocol, though, that perhaps 
we can improve.

1.  It's annoying to have to use n-triples.  We could move up to Turtle 
while banning () and [].  Other RDF syntaxes would also be fine, as long 
as we disallow the constructs which generate blank nodes.

2.  It's annoying to not have () and [] in step 1.   We could allow them 
if we had a standard naming for the blank nodes they generate.   I think 
we could do that although it would be a bit tricky - I think top-down 
and bottom-up parsers would want to number them differently.   Maybe we 
could just say they're named _:gN where N is a decimal number, starting 
from 1, incremented each time a blank node is implied, moving throught 
the file in order.  (Test case:   "<a> <b> (1 ( 2 [<c> [<d> 3], [<e> 4]] 
5) (6))" or something.)    I imagine Bob would use a special Turtle (or 
whatever) parser which followed this rule, knowing he was going to be 
doing patching.

3.  It's annoying to not have () and [] in step 2.  This is different 
from in step one, because (1) we can't just number from 1, or we'll be 
conflicting with all of Alice' names, and (2) it's a patch syntax not 
(necessarily) an RDF syntax.    Perhaps what's needed is some directive 
in the language or in the metadata which specifies a numeric starting 
point for numbering these generated blank node lables.

4.  The renaming in Step 4 could be painful.   This could be addressed 
by having easy renaming in the patch language.   That is, not just "add 
triple <s> <p> <o>" and "delete triples <s> <p> <o>", but also "rename 
node <old> <new>".  Actually, if we're doing this stuff with numbering 
new blank nodes, and lists, we'd probably want "rename blank node range 
<old-start> <old-end> <new-start>".    So if Bob used blank node numbers 
51-205, and charlie used blank node numbers 51-70, then Alice might 
decide to renumber Charlies blank nodes in constructing vBobCharlie1.   
Then in giving Charlie the patch from vCharlie1 to vBobCharlie1 she'd 
include renumberBlanks(51,70,206).

This is starting to look like the start of a reasonable system. :-)

I should point out that in step 4 I brought in something we've never 
talked about -- that it's important for LDP servers to be able to give 
patches to LDP clients, as well.    Just like a PUT is a very heavy way 
to change a triple, a GET is a very heavy way to find out a triple has 
changed.   So I think we need to expect servers to provide streams of 
patches to clients.

In the LDP style, one way to do this would be metadata on U1, either 
embedded or in HTTP Link headers, saying { U1 ldp:patchToCurrent P1 }, 
and then one can GET P1 will return a patch from U1 to the current 
version.   (I'd also like ldp:patchToNext which gives a URL for a 
longpoll URL, for a patch to whatever the next version is.)

       -- Sandro

>     Andy
>
>>
>>          -- Sandro
>>
>>
>
>
>
Received on Sunday, 8 September 2013 16:00:17 UTC