Re: provenance experiments from Paul Groth on 2010-12-07 (www-archive@w3.org from December 2010)

From: Paul Groth <pgroth@gmail.com>
Date: Tue, 07 Dec 2010 22:10:13 +0100
To: Dan Brickley <danbri@danbri.org>
CC: Toby Inkster <mail@tobyinkster.co.uk>, Damian Steer <pldms@mac.com>, "www-archive@w3.org" <www-archive@w3.org>
Message-ID: <4CFEA2B5.6030603@vu.nl>
Hi Dan,

Cool! Some comments on OPMV, hopefully to clarify things.

OPMV provides you with some simple primitives about provenance and gives 
them some semantics.

The basic one is opm:wasDerivedFrom which states that something was 
derived from something else. It's transitive.

So in your example case, we could infer that the surf page was derived 
from information in Freebase via the New York Times article. If you look 
at the dc:source property, its plain-text description has that meaning, 
but it isn't defined that way in the RDF.
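
To make the transitivity point concrete, here is a minimal sketch in plain Python (no RDF library). The node names and the one-step edges are hypothetical stand-ins for the surf page, the NYT article and the Freebase entry; they are not taken from the actual markup.

```python
# Hypothetical one-step derivation edges, written as (derived, source).
# These names are illustrative assumptions, not URIs from the real pages.
DERIVED_FROM = [
    ("surf_page", "nyt_article"),
    ("nyt_article", "freebase_entry"),
]


def all_sources(edges, start):
    """Return everything `start` was derived from, directly or indirectly."""
    # Index the edges: derived thing -> set of its direct sources.
    graph = {}
    for derived, source in edges:
        graph.setdefault(derived, set()).add(source)

    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for source in graph.get(node, ()):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


print(sorted(all_sources(DERIVED_FROM, "surf_page")))
# → ['freebase_entry', 'nyt_article']
```

One side effect of walking the chain this way: if a thing ever shows up in its own set of sources, the chain contains a loop.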

Another cool thing is that OPM assumes you're talking about a state in 
time of the thing. So the surf page at time t was derived from the 
article at time t-1. If you provide time annotations, one can then 
reason about whether the provenance chain is consistent. If you don't, 
that's OK too.
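
A rough sketch of that consistency check, again in plain Python. The dates and node names below are made up for illustration, and the rule used (a source must not be dated later than the thing derived from it) is only one simple reading of "consistent", not something OPMV prescribes in this form.

```python
from datetime import date

# Hypothetical time annotations; the dates are assumptions for illustration.
OBSERVED_AT = {
    "surf_page": date(2010, 12, 7),
    "nyt_article": date(2010, 11, 14),
    "freebase_entry": date(2010, 6, 1),
}

# Hypothetical derivation edges, written as (derived, source).
DERIVED_FROM = [
    ("surf_page", "nyt_article"),
    ("nyt_article", "freebase_entry"),
]


def inconsistent_edges(edges, observed_at):
    """Return edges whose source is dated after the thing derived from it.

    Edges with a missing time annotation on either end are skipped.
    """
    problems = []
    for derived, source in edges:
        t_derived = observed_at.get(derived)
        t_source = observed_at.get(source)
        if t_derived is None or t_source is None:
            continue  # no annotation, so nothing to check
        if t_source > t_derived:
            problems.append((derived, source))
    return problems


print(inconsistent_edges(DERIVED_FROM, OBSERVED_AT))
# → []
```

Things without a time annotation are simply left out of the check, so partial annotations still give a partial answer.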

The other parts of OPMV let you describe the process that took place. In 
this case, that is useful if you want to say who made the actual 
derivation (e.g. John Smith).

On your comment about fragility:
I think in general if we could get people to mark up their pages with 
some sort of provenance information it would be great even if it breaks 
sometimes. My best example of this is retweeting. It's not the most 
robust thing in the world but it still is extremely useful.

Do you have next steps? Is there something I can do to help?

Thanks,
Paul


Dan Brickley wrote:
> (cc-ing www-archive so I can find these notes again...)
>
>
> Ok so just blundering around 2 or 3 related design spaces here, with
> skeleton of a use case... exploring
> http://buzzword.org.uk/2009/rdfa4/spec mixed with Paul's example.
>
> http://svn.foaf-project.org/foaftown/2010/prov/
>
> ...has surf3.html which was derived from Paul's opmv experiments, and
> still uses that style of markup. Oh, I added in some more factual info
> (age, gender) since otherwise I don't see any value in triple
> provenance; a simple 'based in part on' chain would otherwise be fine.
>
> ./graphit.pl is a script using Toby's graph-naming RDFa perl parser
>
> nyt-example.html is an imaginary file about this (real) Kelly
> character which I imagine the NYT hosting. It says "here's some stuff
> which we sourced from Freebase, and here's some stuff we're just
> telling you."  And pointers off to Freebase who have their own RDF and
> sourcing thing going on, we might hope. A side theme here is
> distinguishing static properties whose erm facticity doesn't change
> over time (dates of birth) from those that go predictably stale quite
> quickly (eg. age). So if we can crawl back the provenance trail to
> find dateOfBirth instead of age, that's kinda nice.
>
>
> rapper -i rdfa nyt-example.html http://nyt.example.com/people/kelly_slater/
> rapper: Parsing URI
> file:///Users/danbri/working/foaf/foaftown/2010/prov/nyt-example.html
> with parser rdfa and base URI
> http://nyt.example.com/people/kelly_slater/
> rapper: Serializing with serializer ntriples and base URI
> http://nyt.example.com/people/kelly_slater/
> <http://nyt.example.com/people/kelly_slater/#id>
> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
> <http://xmlns.com/foaf/0.1/Person>  .
> <http://nyt.example.com/people/kelly_slater/#id>
> <http://xmlns.com/foaf/0.1/name>  "Kelly Slater" .
> <http://nyt.example.com/people/kelly_slater/#id>
> <http://xmlns.com/foaf/0.1/age>  "38" .
> <http://nyt.example.com/people/kelly_slater/#id>
> <http://xmlns.com/foaf/0.1/dateOfBirth>  "1972-02-11" .
> <http://nyt.example.com/people/kelly_slater/#id>
> <http://xmlns.com/foaf/0.1/gender>  "male" .
> <http://nyt.example.com/people/kelly_slater/>
> <http://xmlns.com/foaf/0.1/primaryTopic>
> <http://nyt.example.com/people/kelly_slater/#id>  .
> <http://nyt.example.com/people/kelly_slater/#from_freebase>
> <http://purl.org/dc/terms/source>
> <http://www.freebase.com/view/en/kelly_slater>  .
> <http://www.nytimes.com/2009/09/18/sports/18surfing.html>
> <http://xmlns.com/foaf/0.1/primaryTopic>
> <http://nyt.example.com/people/kelly_slater/#id>  .
> <http://www.nytimes.com/2010/11/14/sports/14surfing.html>
> <http://xmlns.com/foaf/0.1/primaryTopic>
> <http://nyt.example.com/people/kelly_slater/#id>  .
> <http://www.nytimes.com/2006/08/20/sports/playmagazine/20slater-irons.html>
> <http://xmlns.com/foaf/0.1/topic>
> <http://nyt.example.com/people/kelly_slater/#id>  .
>
>
> rapper -i rdfa surf3.html
> rapper: Parsing URI
> file:///Users/danbri/working/foaf/foaftown/2010/prov/surf3.html with
> parser rdfa
> rapper: Serializing with serializer ntriples
> <http://opmv.googlecode.com/svn/trunk/js/example/>
> <http://www.w3.org/1999/xhtml/vocab#stylesheet>
> <http://opmv.googlecode.com/svn/trunk/js/example/./style.css>  .
> <http://opmv.googlecode.com/svn/trunk/js/example/>
> <http://www.w3.org/1999/xhtml/vocab#meta>
> <http://opmv.googlecode.com/svn/trunk/js/example/#>  .
> <http://opmv.googlecode.com/svn/trunk/js/example/#quote>
> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
> <http://purl.org/net/opmv/ns#Artifact>  .
> _:bnode0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
> <http://xmlns.com/foaf/0.1/Person>  .
> _:bnode0 <http://xmlns.com/foaf/0.1/name>  "Kelly Slater" .
> _:bnode0 <http://xmlns.com/foaf/0.1/age>  "38" .
> <http://opmv.googlecode.com/svn/trunk/js/example/#aggregation>
> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
> <http://purl.org/net/opmv/ns#Process>  .
> <http://opmv.googlecode.com/svn/trunk/js/example/#aggregation>
> <http://purl.org/net/opmv/ns#used>
> <http://www.nytimes.com/2010/03/14/sports/14surf.html>  .
> <http://opmv.googlecode.com/svn/trunk/js/example/#aggregation>
> <http://purl.org/net/opmv/ns#wasPerformedBy>  "John Smith" .
> <http://opmv.googlecode.com/svn/trunk/js/example/#quote>
> <http://purl.org/net/opmv/ns#wasGeneratedBy>
> <http://opmv.googlecode.com/svn/trunk/js/example/#aggregation>  .
>
> That's the flat triples view. But say we wanted to drill into the Web
> from the blog post, and figure out 'ok, so is this kelly really 38?'
> (apologies to Kelly if you're googling yourself btw), is graph=
> addition to RDFa any use? So the perl script in there runs Toby's
> stuff which partitions the above triples into different
> graphs/buckets.
>
> Useful? Well I don't know. Intriguing for sure. There are also lots of
> design options around what URIs to use, and complicated here 'cos I
> rigged my surf3.html to use Paul's repo as base href, so the css and
> image relative links work.
>
> Here's how surf3.html partitions itself:
>
> TellyClub:prov danbri$ ./graphit.pl surf3.html
>
> # Graph URI: _:RDFaDefaultGraph
> <http://opmv.googlecode.com/svn/trunk/js/example/>
> <http://www.w3.org/1999/xhtml/vocab#stylesheet>
> <http://opmv.googlecode.com/svn/trunk/js/example/style.css>  .
>
>
> # Graph URI: http://opmv.googlecode.com/svn/trunk/js/example/nyt-example.html
> [] a <http://xmlns.com/foaf/0.1/Person>  ;
> 	<http://xmlns.com/foaf/0.1/age>  "38" ;
> 	<http://xmlns.com/foaf/0.1/name>  "Kelly Slater" .
> <http://opmv.googlecode.com/svn/trunk/js/example/#aggregation>
> <http://purl.org/net/opmv/ns#used>
> <http://www.nytimes.com/2010/03/14/sports/14surf.html>  ;
> 	<http://purl.org/net/opmv/ns#wasPerformedBy>  "John Smith" ;
> 	a<http://purl.org/net/opmv/ns#Process>  .
> <http://opmv.googlecode.com/svn/trunk/js/example/#quote>
> <http://purl.org/net/opmv/ns#wasGeneratedBy>
> <http://opmv.googlecode.com/svn/trunk/js/example/#aggregation>  ;
> 	a<http://purl.org/net/opmv/ns#Artifact>  .
>
>
> ...this separates the triples it claims to have gotten from NYT from
> other stuff in the page. The graph URI is nyt-example.html so let's
> look at that now:
>
> TellyClub:prov danbri$ ./graphit.pl nyt-example.html
>
> # Graph URI: http://nyt.example.com/people/kelly_slater/#catalog
> <http://www.nytimes.com/2006/08/20/sports/playmagazine/20slater-irons.html>
> <http://xmlns.com/foaf/0.1/topic>
> <http://nyt.example.com/people/kelly_slater/#id>  .
> <http://www.nytimes.com/2009/09/18/sports/18surfing.html>
> <http://xmlns.com/foaf/0.1/primaryTopic>
> <http://nyt.example.com/people/kelly_slater/#id>  .
> <http://www.nytimes.com/2010/11/14/sports/14surfing.html>
> <http://xmlns.com/foaf/0.1/primaryTopic>
> <http://nyt.example.com/people/kelly_slater/#id>  .
>
>
> # Graph URI: http://nyt.example.com/people/kelly_slater/#from_freebase
> <http://nyt.example.com/people/kelly_slater/>
> <http://xmlns.com/foaf/0.1/primaryTopic>
> <http://nyt.example.com/people/kelly_slater/#id>  .
> <http://nyt.example.com/people/kelly_slater/#from_freebase>
> <http://purl.org/dc/terms/source>
> <http://www.freebase.com/view/en/kelly_slater>  .
> <http://nyt.example.com/people/kelly_slater/#id>  a
> <http://xmlns.com/foaf/0.1/Person>  ;
> 	<http://xmlns.com/foaf/0.1/age>  "38" ;
> 	<http://xmlns.com/foaf/0.1/dateOfBirth>  "1972-02-11" ;
> 	<http://xmlns.com/foaf/0.1/gender>  "male" ;
> 	<http://xmlns.com/foaf/0.1/name>  "Kelly Slater" .
>
> ...again this separates things that the NYT is supposedly telling us
> (eg. metadata about its catalogue of articles) from facts it
> associates (via dc:source here) with Freebase.
>
> Now both of these pages could be lying or mistaken of course, as could
> Freebase. The appeal I see with partitioning the RDFa into graph'd
> chunks is that we can associate a dc:source with each bit of info. I'm
> not entirely sure how OPMV fits in here, but that's not surprising as
> I've plenty of reading left to do. So re named graph URIs, we'd
> probably not want to use the proposed URIs directly when loading into
> a quad store, and use some generated uuid or whatever instead, so that
> mischievous names would be harmless. But this does seem to suggest
> ways of pointing back down the chain to source files, and maybe also
> detecting loops even? Easy to imagine a Wikipedia page acquiring a
> 'source' pointer to the NYT article, not realising that it was sourced
> from Freebase which got it at Wikipedia in the first place.
>
> So this is all intriguing but also gives me the feeling it might be a
> bit fragile...
>
> </thinking_out_loud>
>
> cheers,
>
> Dan
>
> ps. I had a similar experiment last year,
> http://svn.foaf-project.org/foaftown/2009/headstream/readme.txt ...
> which was about separating the things some social site says about the
> user (and may have generated w/ stats, fact checked etc) from the
> things they say about themselves. In the absence of named graph RDFa I
> used SPARQL constructs to implement the partitioning. Kinda worked.

-- 
Dr. Paul Groth (p.t.groth@vu.nl)
http://www.few.vu.nl/~pgroth/
Postdoc
Knowledge Representation & Reasoning Group
Artificial Intelligence Section
Department of Computer Science
VU University Amsterdam
Received on Tuesday, 7 December 2010 21:10:47 UTC