Re: Use case for g-snaps from Sandro Hawke on 2011-09-30 (public-rdf-prov@w3.org from September 2011)

From: Sandro Hawke <sandro@w3.org>
Date: Fri, 30 Sep 2011 08:10:34 -0400
To: Richard Cyganiak <richard@cyganiak.de>
Cc: public-rdf-prov@w3.org
Message-ID: <1317384634.5766.95.camel@waldron>
<http://lists.w3.org/Archives/Public/public-rdf-prov/2011Sep/0022>
rr:agreement 0.99.

Below I expand on two points:

- yes, I agree, let's only give URIs to g-boxes (not g-snaps)
- how do we practically support static g-boxes?

On Fri, 2011-09-30 at 11:11 +0200, Richard Cyganiak wrote:
> Sandro,
> 
> On 30 Sep 2011, at 06:52, Sandro Hawke wrote:
> > SOLUTION A: Charlie Publishes a Copy of G1
> 
> That's a perfectly workable solution AFAICT.
> 
> >        I've implemented this kind of thing, but it always makes me a
> >        bit nervous, because Charlie could change page2.  
> 
> In SOLUTION B, Charlie could change the embedded graph literal in just the same way. This is a shared limitation between SOLUTION A and SOLUTION B. It's a problem only if Alice doesn't trust Charlie. In that case we'd need a trusted third party that takes snapshots of things on the web (like the Wayback Machine, but perhaps with snapshotting on demand). On the named graphs level, that solution looks exactly the same except that it puts page2 on a different domain.
> 
> >        This seems quite inefficient, but it might not be as bad as the
> >        alternatives.
> 
> This inefficiency is again shared with SOLUTION B. SOLUTION B requires making a copy of G1 too. SOLUTION A requires an extra HTTP request, SOLUTION B bloats Charlie's graph. Their relative efficiency depends on the size of G1. SOLUTION A is more efficient than SOLUTION B if G1 is large.
> 
> The inefficiency in SOLUTION A can be avoided if Charlie publishes a timestamp and/or hash for G1, as you describe in SOLUTION C.
> 
> > SOLUTION-C: Charlie Characterizes G1
> > 
> >        Maybe there's a way to know about Errol changing the graph
> >        without transmitting the graph. For instance, Charlie might
> >        include a hash of the contents:
> 
> I'd say that hashes, timestamps and so on are clearly out of scope for RDF-WG.
> 
> > It's not clear to me yet which parts of this are our domain to
> > standardize.    Certainly "{...}" or "turtleText" are.  
> 
> Those would be in scope for RDF-WG.

So, the way we're talking here, SOLN-A and SOLN-B are very close
parallels, differing only in whether the contents are in-line or
out-of-band.  I think that means the graph identifier in formats with
nice in-line syntaxes (eg TriG) is really identifying a g-box.  So the
other triples, referring to Errol's statement, don't have to change
when one switches between SOLN-A and SOLN-B.

Under normal operation, it would be equivalent to say:

        1.  In Turtle:
        
                <http://example.org/foo> <p> <o>.
                
        while at http://example.org/foo is the Turtle:
        
                <a> <b> <c>.
                
or

        2.  In something like TriG:
        
                <http://example.org/foo> <p> <o>.
                <http://example.org/foo> { <a> <b> <c>. }
                
                
I'm thinking that's a simple and workable approach.  I'm not sure if
that's what you were proposing or not.
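
To make the equivalence concrete, here is a rough Python sketch.  It is
only an illustration under stated assumptions: triples are plain string
tuples, the dict WEB is a hypothetical stand-in for dereferencing a URI,
and the "default"/"named" layout is just one convenient way to model a
TriG-like document.  None of these names come from a spec.

```python
# Toy model: triples are (subject, predicate, object) string tuples.
FOO = "http://example.org/foo"

# Hypothetical stand-in for the web: URI -> triples served at that URI.
WEB = {FOO: {("a", "b", "c")}}

# Form 1 (Turtle): only the statement about the g-box; contents are out-of-band.
form1 = {"default": {(FOO, "p", "o")}, "named": {}}

# Form 2 (TriG-like): the same statement, with the contents given in-line.
form2 = {"default": {(FOO, "p", "o")}, "named": {FOO: {("a", "b", "c")}}}


def contents(dataset, uri, web=WEB):
    """Return the triples in the g-box `uri`: in-line if present, else fetched."""
    named = dataset["named"]
    if uri in named:
        return named[uri]
    return web[uri]
```

In both forms the triples that mention <http://example.org/foo> are the
same, and contents() returns the same set; only where the contents come
from differs.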

This means Charlie can't exactly say, "I agree with Errol's RDF graph
(g-snap) G1" because he can't make an identifier for the g-snap G1;
instead he says "I agree with Errol's RDF graph which I have copied to
this g-box, <foo>".

I think I like that design -- never having g-snaps identified directly
-- so people have less to get confused about.  It's like a programming
language that always passes by value, never by reference, so there's
less confusion.   We can't get rid of g-boxes -- those are files with
RDF in them -- so let's get rid of (direct) g-snaps.

I guess it's also like how people don't generally make up names for
numbers.  They either serialize the number, or give a name to a slot
that holds a number that might be edited (eg "the world population in
2000").  I wonder if there is a parallel to Pi or e -- a few particular
RDF graphs to which it would be good to give standard identifiers.

So, I guess I'm with you on not having a mechanism for directly
attaching URIs to g-snaps.   People can attach them to g-boxes, and if
they are confident it won't change, they can just think of it as a
g-snap.   Hmmmm.

> > Maybe "cameFrom"
> > and "hashWhenFetched".   Probably not "agreement", at least not in this
> > fuzzy form.
> 
> None of those are in scope for RDF-WG. They are in scope for PROV-WG.

Sounds reasonable to me.

> (Another point regarding your use case: Errol shouldn't have fixed his mistake in place, but deleted the old assertions and published a corrected account under a new address. The latter should be considered best practice in situations of this kind. We can't really expect Charlie to do extra work to ensure that Errol can fix the mistake in place – the incentives are not right. His motive is probably only to prevent Alice from making a poor decision based on Errol's disinformation, not to protect Errol's reputation.)

Excellent point.   But surely there are RDF documents on the web that
are going to be changing in place, like people's foaf files....  How
would you allow that?

One approach is like W3C TRs -- there's a "latest URI", where the
contents change, and a new "snapshot URI" every time the contents
change.  (And old snapshots can be deleted to save space whenever you
want.)  I think this is a good practice, but can we really ask everyone
with a foaf file to follow it...?   Maybe....   Yeah.....
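
A minimal sketch of that latest/snapshot pattern, in Python, might look
like the following.  The class name GBox, the "/snapshot/N" URI layout
and the prune() helper are all assumptions made up for illustration;
W3C TRs use dated URIs, and nothing here is a proposed standard.

```python
class GBox:
    """A 'latest' URI whose contents may change, plus one snapshot URI per change."""

    def __init__(self, latest_uri):
        self.latest_uri = latest_uri
        self.snapshots = {}   # snapshot URI -> contents at that time
        self.current = None   # snapshot URI the latest URI currently resolves to
        self._count = 0       # never reused, so pruned URIs are not recycled

    def publish(self, content):
        """Store new contents under a fresh snapshot URI and repoint 'latest'."""
        self._count += 1
        uri = f"{self.latest_uri}/snapshot/{self._count}"
        self.snapshots[uri] = content
        self.current = uri
        return uri

    def get(self, uri):
        """Resolve either the latest URI or a snapshot URI; None if gone."""
        if uri == self.latest_uri:
            uri = self.current
        return self.snapshots.get(uri)

    def prune(self, keep=1):
        """Delete old snapshots to save space, keeping the newest `keep`."""
        uris = list(self.snapshots)  # dicts preserve insertion order
        for uri in uris[: max(len(uris) - keep, 0)]:
            del self.snapshots[uri]
```

Someone citing a snapshot URI gets stable contents for as long as the
snapshot exists; someone citing the latest URI accepts that it changes.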

I've never implemented it, but I've often thought about making snapshot
URIs include a secure hash of the contents.   So Errol would publish his
first statement at:

http://errol.example.org/check-sha/13ae3ec8f7c3b8f814ab8f1da9510ebdc0f8c740f1763f825429e9e8c3c21878

and Charlie would copy it over to 

http://charlie.example.org/check-sha/13ae3ec8f7c3b8f814ab8f1da9510ebdc0f8c740f1763f825429e9e8c3c21878

Here I'm suggesting "check-sha" would signal to receivers that they
SHOULD confirm the contents.  That means they wouldn't have to trust
Errol or Charlie not to maliciously or accidentally change things.  It
would essentially force people to follow the practice of making a new
URI every time they want to change the contents.
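
A receiver-side check along these lines could be sketched in Python as
below.  Two assumptions are made loudly here: that the 64 hex digits in
the example URIs are a SHA-256 digest, and that the digest is taken over
the raw bytes of the document (the message does not say which hash or
which canonical form to use).  The function names are hypothetical.

```python
import hashlib
from typing import Optional
from urllib.parse import urlparse

MARKER = "check-sha"  # path segment suggested in the message


def snapshot_uri(base: str, content: bytes) -> str:
    """Build a snapshot URI that embeds the SHA-256 (assumed) of the raw bytes."""
    digest = hashlib.sha256(content).hexdigest()
    return f"{base.rstrip('/')}/{MARKER}/{digest}"


def expected_digest(uri: str) -> Optional[str]:
    """Return the digest embedded in a check-sha URI, or None for other URIs."""
    parts = urlparse(uri).path.split("/")
    if len(parts) >= 2 and parts[-2] == MARKER:
        return parts[-1].lower()
    return None


def verify(uri: str, content: bytes) -> bool:
    """True if the fetched bytes match the digest named in the URI."""
    expected = expected_digest(uri)
    if expected is None:
        raise ValueError("not a check-sha URI")
    return hashlib.sha256(content).hexdigest() == expected
```

Because the digest depends only on the bytes, Errol's original and
Charlie's copy end in the same path segment, and a receiver can verify
either one without trusting the host that served it.  Hashing raw bytes
also means a different serialization of the same graph gets a different
URI.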

This would not allow content negotiation on snapshots, although it could
still be used on the "latest version" so maybe that's okay.   Con-neg on
the latest version could pass along the snapshot URI for that
content-type.  

It's also doing unauthorized URI inspection; I suppose we could fix that
by making it be .well-known/check-sha.   I bet we'd get into an
interesting conversation with some IETF folks over that.  :-)    There
may be a way to integrate this with Memento; I don't remember how it
works, exactly.

/me goes back and rereads http://www.w3.org/2003/08/introhash/v2 which
is a little dated but still cool.   :-)    Something like that might be
good for folks who want a secure latest-version URI, but it's probably
too complicated for the current deployment environment.

   -- Sandro
Received on Friday, 30 September 2011 12:10:42 UTC