- From: Luc Moreau <L.Moreau@ecs.soton.ac.uk>
- Date: Fri, 30 Sep 2011 14:01:09 +0100
- To: public-rdf-prov@w3.org
Hi Sandro,

Something may be immaterial but we may still want to talk about its
provenance: e.g. provenance of an idea.

But here, maybe you want to say that g-snaps do not have an observable
incarnation; otherwise, they would be g-texts.

I think it would be useful to go through a concrete use case. The data
journalism example was not explicit about the way rdf data was exposed.
We would have to be explicit.

Luc

On 09/30/2011 01:41 PM, Sandro Hawke wrote:
> On Fri, 2011-09-30 at 13:29 +0100, Luc Moreau wrote:
>
>> Hi Sandro,
>>
>> This discussion has become quite technical, and I am not sure I
>> understand all the implications. I am surprised by your suggestion
>> not to name g-snaps.
>>
> Me, too. It's something of a change provoked by Richard asking me to
> get more detailed in the use case.
>
>> What I liked about the distinction between g-snap/g-box is that it
>> allowed me to talk about the content and the container. From a
>> provenance perspective, we may want to say different things about
>> them. The content was generated by an rdb2rdf converter launched by
>> Luc, whereas the container is this rdf file, which Sandro stored on
>> his local disk.
>>
> The problem, I think, is that g-snaps are inherently immaterial (being
> mathematical sets); they are quite a bit more abstract than some other
> computer concepts like strings (which are g-texts if they contain a
> serialization of a g-snap) and files (which are g-boxes if they contain
> a serialization of a g-snap). We know how to work with strings (copy
> them) and files (use their names), but it's not clear we can or should
> work more directly with g-snaps than by working with the associated
> strings and/or files. (For this discussion, I'm thinking of a Web
> page as just a kind of remote file.)
>
>> From a provenance viewpoint, we require the thing we talk about in
>> the provenance to be identifiable. With URI-less g-snaps, this is
>> going to become more challenging.
>>
> The conclusion I'm coming to here is that our best option is to talk
> about g-snaps indirectly, by talking about g-texts or g-boxes which
> contain their serializations.
>
> Is there a situation where you're thinking this might not work very
> well? (Perhaps we should work through some bit of the Data Journalism
> scenario...)
>
>> I may also have misunderstood the g-concepts.
>> What do you think?
>>
> Nope, sounds like you've got it. :-)
>
>     -- Sandro
>
>> Cheers,
>> Luc
>>
>> On 30/09/2011 13:10, Sandro Hawke wrote:
>>
>>> <http://lists.w3.org/Archives/Public/public-rdf-prov/2011Sep/0022>
>>>     rr:agreement 0.99.
>>>
>>> Below I expand on two points:
>>>
>>>   - yes, I agree, let's only give URIs to g-boxes (not g-snaps)
>>>   - how do we practically support static g-boxes?
>>>
>>> On Fri, 2011-09-30 at 11:11 +0200, Richard Cyganiak wrote:
>>>
>>>> Sandro,
>>>>
>>>> On 30 Sep 2011, at 06:52, Sandro Hawke wrote:
>>>>
>>>>> SOLUTION A: Charlie Publishes a Copy of G1
>>>>>
>>>> That's a perfectly workable solution AFAICT.
>>>>
>>>>> I've implemented this kind of thing, but it always makes me a
>>>>> bit nervous, because Charlie could change page2.
>>>>>
>>>> In SOLUTION B, Charlie could change the embedded graph literal in
>>>> just the same way. This is a shared limitation between SOLUTION A
>>>> and SOLUTION B. It's a problem only if Alice doesn't trust Charlie.
>>>> In that case we'd need a trusted third party that takes snapshots
>>>> of things on the web (like the Wayback Machine, but perhaps with
>>>> snapshotting on demand). On the named graphs level, that solution
>>>> looks exactly the same except that it puts page2 on a different
>>>> domain.
>>>>
>>>>> This seems quite inefficient, but it might not be as bad as the
>>>>> alternatives.
>>>>>
>>>> This inefficiency is again shared with SOLUTION B. SOLUTION B
>>>> requires making a copy of G1 too. SOLUTION A requires an extra HTTP
>>>> request, SOLUTION B bloats Charlie's graph.
>>>> Their relative efficiency depends on the size of G1. SOLUTION A is
>>>> more efficient than SOLUTION B if G1 is large.
>>>>
>>>> The inefficiency in SOLUTION A can be avoided if Charlie publishes a
>>>> timestamp and/or hash for G1, as you describe in SOLUTION C.
>>>>
>>>>> SOLUTION-C: Charlie Characterizes G1
>>>>>
>>>>> Maybe there's a way to know about Errol changing the graph
>>>>> without transmitting the graph. For instance, Charlie might
>>>>> include a hash of the contents:
>>>>>
>>>> I'd say that hashes, timestamps and so on are clearly out of scope
>>>> for RDF-WG.
>>>>
>>>>> It's not clear to me yet which parts of this are our domain to
>>>>> standardize. Certainly "{...}" or "turtleText" are.
>>>>>
>>>> Those would be in scope for RDF-WG.
>>>>
>>> So, I think the way we're talking here, making SOLN-A and SOLN-B be very
>>> close parallels, differing only in whether the contents are in-line or
>>> out-of-band, ... I think that means the Graph identifier part of the
>>> formats supporting nice in-line syntaxes (eg TriG) is really identifying
>>> a g-box. So the other triples, referring to Errol's statement, don't
>>> have to change when one switches between SOLN-A and SOLN-B.
>>>
>>> Under normal operation, it would be equivalent to say:
>>>
>>> 1. In Turtle:
>>>
>>>        <http://example.org/foo> <p> <o>.
>>>
>>>    while at http://example.org/foo is the Turtle:
>>>
>>>        <a> <b> <c>.
>>>
>>> or
>>>
>>> 2. In something like TriG:
>>>
>>>        <http://example.org/foo> <p> <o>.
>>>        <http://example.org/foo> {<a> <b> <c>. }
>>>
>>> I'm thinking that's a simple and workable approach. I'm not sure if
>>> that's what you were proposing or not.
>>>
>>> This means Charlie can't exactly say, "I agree with Errol's RDF graph
>>> (g-snap) G1" because he can't make an identifier for the g-snap G1;
>>> instead he says "I agree with Errol's RDF graph which I have copied to
>>> this g-box, <foo>".
>>>
>>> I think I like that design -- never having g-snaps identified directly
>>> -- so people have less to get confused about. It's like a programming
>>> language that always passes by value, never by reference, so there's
>>> less confusion. We can't get rid of g-boxes -- those are files with
>>> RDF in them -- so let's get rid of (direct) g-snaps.
>>>
>>> I guess it's also like how people don't generally make up names for
>>> numbers. They either serialize the number, or give a name to a slot
>>> that holds a number that might be edited (eg "the world population in
>>> 2000"). I wonder if there is a parallel to Pi or e -- a few particular
>>> RDF graphs to which it would be good to give standard identifiers.
>>>
>>> So, I guess I'm with you on not having a mechanism for directly
>>> attaching URIs to g-snaps. People can attach them to g-boxes, and if
>>> they are confident it won't change, they can just think of it as a
>>> g-snap. Hmmmm.
>>>
>>>>> Maybe "cameFrom"
>>>>> and "hashWhenFetched". Probably not "agreement", at least not in this
>>>>> fuzzy form.
>>>>>
>>>> None of those are in scope for RDF-WG. They are in scope for PROV-WG.
>>>>
>>> Sounds reasonable to me.
>>>
>>>> (Another point regarding your use case: Errol shouldn't have fixed
>>>> his mistake in place, but deleted the old assertions and published
>>>> a corrected account under a new address. The latter should be
>>>> considered best practice in situations of this kind. We can't
>>>> really expect Charlie to do extra work to ensure that Errol can fix
>>>> the mistake in place – the incentives are not right. His motive is
>>>> probably only to prevent Alice from making a poor decision based on
>>>> Errol's disinformation, not to protect Errol's reputation.)
>>>>
>>> Excellent point. But surely there are RDF documents on the web that
>>> are going to be changing in place, like people's foaf files.... How
>>> would you allow that?
>>>
>>> One approach is like W3C TRs -- there's a "latest URI", where the
>>> contents change, and a new "snapshot URI" every time the contents
>>> change. (And old snapshots can be deleted to save space whenever you
>>> want.) I think this is a good practice, but can we really ask everyone
>>> with a foaf file to follow it...? Maybe.... Yeah.....
>>>
>>> I've never implemented it, but I've often thought about making snapshot
>>> URIs include a secure hash of the contents. So Errol would publish his
>>> first statement at:
>>>
>>> http://errol.example.org/check-sha/13ae3ec8f7c3b8f814ab8f1da9510ebdc0f8c740f1763f825429e9e8c3c21878
>>>
>>> and Charlie would copy it over to
>>>
>>> http://charlie.example.org/check-sha/13ae3ec8f7c3b8f814ab8f1da9510ebdc0f8c740f1763f825429e9e8c3c21878
>>>
>>> Here I'm suggesting "check-sha" would signal to receivers that they
>>> SHOULD confirm the contents. That means they wouldn't have to trust
>>> Errol or Charlie not to maliciously or accidentally change things. It
>>> would essentially force people to follow the practice of making a new
>>> URI every time they want to change the contents.
>>>
>>> This would not allow content negotiation on snapshots, although it could
>>> still be used on the "latest version" so maybe that's okay. Con-neg on
>>> the latest version could pass along the snapshot URI for that
>>> content-type.
>>>
>>> It's also doing unauthorized URI inspection; I suppose we could fix that
>>> by making it be .well-known/check-sha. I bet we'd get into an
>>> interesting conversation with some IETF folks over that. :-) There
>>> may be a way to integrate this with Memento; I don't remember how it
>>> works, exactly.
>>>
>>> /me goes back and rereads http://www.w3.org/2003/08/introhash/v2 which
>>> is a little dated but still cool. :-) Something like that might be
>>> good for folks who want a secure latest-version URI, but it's probably
>>> too complicated for the current deployment environment.
>>>
>>>     -- Sandro
>>>
>>
>

--
Professor Luc Moreau
Electronics and Computer Science   tel:   +44 23 8059 4487
University of Southampton          fax:   +44 23 8059 2865
Southampton SO17 1BJ               email: l.moreau@ecs.soton.ac.uk
United Kingdom                     http://www.ecs.soton.ac.uk/~lavm
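
The "check-sha" snapshot URIs that Sandro describes in the quoted message could be sketched roughly as follows. This is only an illustration under stated assumptions, not anything proposed in the thread: the message does not name a hash algorithm (SHA-256 is assumed here because the example digests are 64 hex characters long), the hash is assumed to be taken over the raw serialized bytes rather than over a canonical form of the graph, and the helper names `snapshot_uri` and `verify_snapshot` are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit

# Path marker taken from the quoted message; everything else is assumed.
CHECK_SHA_SEGMENT = "check-sha"


def snapshot_uri(base: str, content: bytes) -> str:
    """Build a snapshot URI ending in the SHA-256 hex digest of content."""
    digest = hashlib.sha256(content).hexdigest()
    return f"{base.rstrip('/')}/{CHECK_SHA_SEGMENT}/{digest}"


def verify_snapshot(uri: str, content: bytes) -> bool:
    """Return True if content hashes to the digest embedded in the URI."""
    segments = urlsplit(uri).path.strip("/").split("/")
    if len(segments) < 2 or segments[-2] != CHECK_SHA_SEGMENT:
        raise ValueError("not a check-sha URI")
    expected = segments[-1].lower()
    return hashlib.sha256(content).hexdigest() == expected


# Errol publishes a statement; Charlie copies the same bytes to his own site.
original = b"<a> <b> <c>.\n"
errol = snapshot_uri("http://errol.example.org", original)
charlie = snapshot_uri("http://charlie.example.org", original)

# A receiver can confirm Charlie's copy without trusting either publisher.
print(verify_snapshot(charlie, original))
# Any in-place edit of the contents no longer matches the URI.
print(verify_snapshot(charlie, b"<a> <b> <d>.\n"))
```

Because the digest depends only on the bytes, Errol's and Charlie's URIs end in the same path segment, and changing the contents requires minting a new URI. Hashing raw bytes also means two different serializations of the same g-snap get different URIs, which is consistent with the thread's point that these URIs name g-boxes or g-texts rather than g-snaps.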
Received on Friday, 30 September 2011 13:01:56 UTC