provenance experiments from Dan Brickley on 2010-12-07 (www-archive@w3.org from December 2010)

From: Dan Brickley <danbri@danbri.org>
Date: Tue, 7 Dec 2010 17:06:14 +0100
To: Paul Groth <pgroth@few.vu.nl>, Toby Inkster <mail@tobyinkster.co.uk>
Cc: Damian Steer <pldms@mac.com>, www-archive@w3.org
Message-ID: <AANLkTi=_wjcW_vM7tkPP_2y2Vrt-dPLX=tPx-XYwt1Tu@mail.gmail.com>
(cc-ing www-archive so I can find these notes again...)


Ok so just blundering around 2 or 3 related design spaces here, with
a skeleton of a use case... exploring
http://buzzword.org.uk/2009/rdfa4/spec mixed with Paul's example.

http://svn.foaf-project.org/foaftown/2010/prov/

...has surf3.html which was derived from Paul's OPMV experiments, and
still uses that style of markup. Oh, I added in some more factual info
(age, gender) since otherwise I don't see any value in triple
provenance; a simple 'based in part on' chain would be fine.

./graphit.pl is a script using Toby's graph-naming RDFa perl parser

nyt-example.html is an imaginary file about this (real) Kelly
character which I imagine the NYT hosting. It says "here's some stuff
which we sourced from Freebase, and here's some stuff we're just
telling you", with pointers off to Freebase who have their own RDF and
sourcing thing going on, we might hope. A side theme here is
distinguishing static properties whose erm facticity doesn't change
over time (dates of birth) from those that go predictably stale quite
quickly (eg. age). So if we can crawl back the provenance trail to
find dateOfBirth instead of age, that's kinda nice.
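
A rough sketch of why dateOfBirth is the nicer thing to find: age can
be recomputed from it for any date, so it never goes stale. The
function name age_on is made up for illustration; the date of birth
and the date of this mail are the only values taken from the examples
here.

```python
from datetime import date


def age_on(date_of_birth: str, on: date) -> int:
    """Return the age in whole years on the given day.

    date_of_birth is an ISO 8601 string such as "1972-02-11".
    age_on is a hypothetical helper, not part of FOAF or any library.
    """
    born = date.fromisoformat(date_of_birth)
    years = on.year - born.year
    # Knock a year off if the birthday has not yet come round this year.
    if (on.month, on.day) < (born.month, born.day):
        years -= 1
    return years


# Age on the day this mail was sent, derived from the static property.
print(age_on("1972-02-11", date(2010, 12, 7)))
```

A stored foaf:age of "38" is only right for a year; the same call with
a later date gives the current answer without anyone editing the page.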


rapper -i rdfa nyt-example.html http://nyt.example.com/people/kelly_slater/
rapper: Parsing URI
file:///Users/danbri/working/foaf/foaftown/2010/prov/nyt-example.html
with parser rdfa and base URI
http://nyt.example.com/people/kelly_slater/
rapper: Serializing with serializer ntriples and base URI
http://nyt.example.com/people/kelly_slater/
<http://nyt.example.com/people/kelly_slater/#id>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://xmlns.com/foaf/0.1/Person> .
<http://nyt.example.com/people/kelly_slater/#id>
<http://xmlns.com/foaf/0.1/name> "Kelly Slater" .
<http://nyt.example.com/people/kelly_slater/#id>
<http://xmlns.com/foaf/0.1/age> "38" .
<http://nyt.example.com/people/kelly_slater/#id>
<http://xmlns.com/foaf/0.1/dateOfBirth> "1972-02-11" .
<http://nyt.example.com/people/kelly_slater/#id>
<http://xmlns.com/foaf/0.1/gender> "male" .
<http://nyt.example.com/people/kelly_slater/>
<http://xmlns.com/foaf/0.1/primaryTopic>
<http://nyt.example.com/people/kelly_slater/#id> .
<http://nyt.example.com/people/kelly_slater/#from_freebase>
<http://purl.org/dc/terms/source>
<http://www.freebase.com/view/en/kelly_slater> .
<http://www.nytimes.com/2009/09/18/sports/18surfing.html>
<http://xmlns.com/foaf/0.1/primaryTopic>
<http://nyt.example.com/people/kelly_slater/#id> .
<http://www.nytimes.com/2010/11/14/sports/14surfing.html>
<http://xmlns.com/foaf/0.1/primaryTopic>
<http://nyt.example.com/people/kelly_slater/#id> .
<http://www.nytimes.com/2006/08/20/sports/playmagazine/20slater-irons.html>
<http://xmlns.com/foaf/0.1/topic>
<http://nyt.example.com/people/kelly_slater/#id> .


rapper -i rdfa surf3.html
rapper: Parsing URI
file:///Users/danbri/working/foaf/foaftown/2010/prov/surf3.html with
parser rdfa
rapper: Serializing with serializer ntriples
<http://opmv.googlecode.com/svn/trunk/js/example/>
<http://www.w3.org/1999/xhtml/vocab#stylesheet>
<http://opmv.googlecode.com/svn/trunk/js/example/./style.css> .
<http://opmv.googlecode.com/svn/trunk/js/example/>
<http://www.w3.org/1999/xhtml/vocab#meta>
<http://opmv.googlecode.com/svn/trunk/js/example/#> .
<http://opmv.googlecode.com/svn/trunk/js/example/#quote>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://purl.org/net/opmv/ns#Artifact> .
_:bnode0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://xmlns.com/foaf/0.1/Person> .
_:bnode0 <http://xmlns.com/foaf/0.1/name> "Kelly Slater" .
_:bnode0 <http://xmlns.com/foaf/0.1/age> "38" .
<http://opmv.googlecode.com/svn/trunk/js/example/#aggregation>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://purl.org/net/opmv/ns#Process> .
<http://opmv.googlecode.com/svn/trunk/js/example/#aggregation>
<http://purl.org/net/opmv/ns#used>
<http://www.nytimes.com/2010/03/14/sports/14surf.html> .
<http://opmv.googlecode.com/svn/trunk/js/example/#aggregation>
<http://purl.org/net/opmv/ns#wasPerformedBy> "John Smith" .
<http://opmv.googlecode.com/svn/trunk/js/example/#quote>
<http://purl.org/net/opmv/ns#wasGeneratedBy>
<http://opmv.googlecode.com/svn/trunk/js/example/#aggregation> .

That's the flat triples view. But say we wanted to drill into the Web
from the blog post, and figure out 'ok, so is this Kelly really 38?'
(apologies to Kelly if you're googling yourself btw), is the graph=
addition to RDFa any use? So the Perl script in there runs Toby's
stuff which partitions the above triples into different
graphs/buckets.

Useful? Well I don't know. Intriguing for sure. There are also lots of
design options around what URIs to use, and it's complicated here 'cos I
rigged my surf3.html to use Paul's repo as base href, so the CSS and
image relative links work.

Here's how surf3.html partitions itself:

TellyClub:prov danbri$ ./graphit.pl surf3.html

# Graph URI: _:RDFaDefaultGraph
<http://opmv.googlecode.com/svn/trunk/js/example/>
<http://www.w3.org/1999/xhtml/vocab#stylesheet>
<http://opmv.googlecode.com/svn/trunk/js/example/style.css> .


# Graph URI: http://opmv.googlecode.com/svn/trunk/js/example/nyt-example.html
[] a <http://xmlns.com/foaf/0.1/Person> ;
	<http://xmlns.com/foaf/0.1/age> "38" ;
	<http://xmlns.com/foaf/0.1/name> "Kelly Slater" .
<http://opmv.googlecode.com/svn/trunk/js/example/#aggregation>
<http://purl.org/net/opmv/ns#used>
<http://www.nytimes.com/2010/03/14/sports/14surf.html> ;
	<http://purl.org/net/opmv/ns#wasPerformedBy> "John Smith" ;
	a <http://purl.org/net/opmv/ns#Process> .
<http://opmv.googlecode.com/svn/trunk/js/example/#quote>
<http://purl.org/net/opmv/ns#wasGeneratedBy>
<http://opmv.googlecode.com/svn/trunk/js/example/#aggregation> ;
	a <http://purl.org/net/opmv/ns#Artifact> .


...this separates the triples it claims to have gotten from NYT from
other stuff in the page. The graph URI is nyt-example.html so let's
look at that now:

TellyClub:prov danbri$ ./graphit.pl nyt-example.html

# Graph URI: http://nyt.example.com/people/kelly_slater/#catalog
<http://www.nytimes.com/2006/08/20/sports/playmagazine/20slater-irons.html>
<http://xmlns.com/foaf/0.1/topic>
<http://nyt.example.com/people/kelly_slater/#id> .
<http://www.nytimes.com/2009/09/18/sports/18surfing.html>
<http://xmlns.com/foaf/0.1/primaryTopic>
<http://nyt.example.com/people/kelly_slater/#id> .
<http://www.nytimes.com/2010/11/14/sports/14surfing.html>
<http://xmlns.com/foaf/0.1/primaryTopic>
<http://nyt.example.com/people/kelly_slater/#id> .


# Graph URI: http://nyt.example.com/people/kelly_slater/#from_freebase
<http://nyt.example.com/people/kelly_slater/>
<http://xmlns.com/foaf/0.1/primaryTopic>
<http://nyt.example.com/people/kelly_slater/#id> .
<http://nyt.example.com/people/kelly_slater/#from_freebase>
<http://purl.org/dc/terms/source>
<http://www.freebase.com/view/en/kelly_slater> .
<http://nyt.example.com/people/kelly_slater/#id> a
<http://xmlns.com/foaf/0.1/Person> ;
	<http://xmlns.com/foaf/0.1/age> "38" ;
	<http://xmlns.com/foaf/0.1/dateOfBirth> "1972-02-11" ;
	<http://xmlns.com/foaf/0.1/gender> "male" ;
	<http://xmlns.com/foaf/0.1/name> "Kelly Slater" .

...again this separates things that the NYT is supposedly telling us
(eg. metadata about its catalogue of articles) from facts it
associates (via dc:source here) with Freebase.
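
To make the bucket idea concrete, here is a minimal sketch in plain
Python (no RDF library), assuming the parser hands back quads as
(graph, subject, predicate, object) tuples. That tuple shape and the
names partition and sources_of are my assumptions for illustration,
not how graphit.pl or Toby's parser actually work. The URIs are a
subset of the nyt-example.html output above.

```python
from collections import defaultdict

DC_SOURCE = "http://purl.org/dc/terms/source"
FOAF = "http://xmlns.com/foaf/0.1/"
BASE = "http://nyt.example.com/people/kelly_slater/"

# A few quads copied from the nyt-example.html partitioning above.
quads = [
    (BASE + "#from_freebase", BASE + "#id", FOAF + "age", "38"),
    (BASE + "#from_freebase", BASE + "#id", FOAF + "dateOfBirth", "1972-02-11"),
    (BASE + "#from_freebase", BASE + "#from_freebase", DC_SOURCE,
     "http://www.freebase.com/view/en/kelly_slater"),
    (BASE + "#catalog",
     "http://www.nytimes.com/2009/09/18/sports/18surfing.html",
     FOAF + "primaryTopic", BASE + "#id"),
]


def partition(quads):
    """Group triples into buckets keyed by their graph URI."""
    graphs = defaultdict(list)
    for graph, s, p, o in quads:
        graphs[graph].append((s, p, o))
    return dict(graphs)


def sources_of(graphs):
    """Map each graph URI to the dc:source values asserted about it."""
    sources = defaultdict(set)
    for triples in graphs.values():
        for s, p, o in triples:
            # Only count dc:source statements whose subject names a graph.
            if p == DC_SOURCE and s in graphs:
                sources[s].add(o)
    return dict(sources)


graphs = partition(quads)
for graph_uri, src in sources_of(graphs).items():
    print(graph_uri, "is sourced from", sorted(src))
```

With these quads only the #from_freebase bucket ends up with a source;
the #catalog bucket has none, which matches the "stuff we're just
telling you" reading.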

Now both of these pages could be lying or mistaken of course, as could
Freebase. The appeal I see with partitioning the RDFa into graph'd
chunks is that we can associate a dc:source with each bit of info. I'm
not entirely sure how OPMV fits in here, but that's not surprising as
I've plenty of reading left to do. So re named graph URIs, we'd
probably not want to use the proposed URIs directly when loading into
a quad store, but use some generated UUID or whatever instead, so that
mischievous names would be harmless. But this does seem to suggest
ways of pointing back down the chain to source files, and maybe also
detecting loops even? Easy to imagine a Wikipedia page acquiring a
'source' pointer to the NYT article, not realising that it was sourced
from Freebase which got it from Wikipedia in the first place.
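
Both ideas are easy enough to sketch, with the caveat that everything
below is assumption: the source pointers are held as a plain dict, the
short labels "wikipedia", "nyt" and "freebase" stand in for real URIs,
and rename_graphs and find_loop are invented names rather than
anything a quad store provides.

```python
import uuid


def rename_graphs(graph_uris):
    """Map each page-proposed graph URI to a freshly generated urn:uuid name."""
    return {uri: "urn:uuid:" + str(uuid.uuid4()) for uri in graph_uris}


def find_loop(sources, start):
    """Follow source pointers from start; return a loop as a list, else None.

    sources maps a document to the documents it claims as its sources.
    """
    path = []        # documents on the trail we are currently following
    on_path = set()  # same documents, for quick membership checks
    done = set()     # documents fully explored without finding a loop

    def visit(node):
        if node in on_path:
            # We came back to something already on the trail: a loop.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        path.append(node)
        on_path.add(node)
        for nxt in sources.get(node, ()):
            loop = visit(nxt)
            if loop:
                return loop
        path.pop()
        on_path.discard(node)
        done.add(node)
        return None

    return visit(start)


# The imagined Wikipedia -> NYT -> Freebase -> Wikipedia chain.
sources = {"wikipedia": ["nyt"], "nyt": ["freebase"], "freebase": ["wikipedia"]}
print(find_loop(sources, "wikipedia"))
print(rename_graphs(["http://nyt.example.com/people/kelly_slater/#from_freebase"]))
```

The renaming keeps a mapping back to the proposed URI, so the name a
page chose for its own graph is still available as data without being
trusted as the store's key.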

So this is all intriguing but also gives me the feeling it might be a
bit fragile...

</thinking_out_loud>

cheers,

Dan

ps. I had a similar experiment last year,
http://svn.foaf-project.org/foaftown/2009/headstream/readme.txt ...
which was about separating the things some social site says about the
user (and may have generated w/ stats, fact checked etc) from the
things they say about themselves. In the absence of named graph RDFa I
used SPARQL constructs to implement the partitioning. Kinda worked.
Received on Tuesday, 7 December 2010 16:06:49 UTC