Extracting verified/attributed claims via queries from Dan Brickley on 2010-02-17 (public-xg-socialweb@w3.org from February 2010)

From: Dan Brickley <danbri@danbri.org>
Date: Wed, 17 Feb 2010 18:56:12 +0100
To: dick@blame.ca
Cc: public-xg-socialweb@w3.org
Message-ID: <eb19f3361002170956w2a33b698sb48338a0fd2c6cf1@mail.gmail.com>

Hi Dick,

Just a quick note to pick up on one theme from today's call. You
touched several times on 'verified claims'.

I've been looking at something like this from an RDF perspective,
although there the concept is more 'attributed' than (necessarily)
'verified'. Although the core RDF data model puts everything in simple
flat naive triples, our data access and query spec, SPARQL, allows each
triple to be associated with a grouping context (a named graph). And the SPARQL
language allows query clauses to talk freely about the 'who said what'
bit, as well as the 'what they said' bit. See
http://www.w3.org/TR/rdf-sparql-query/#accessByLabel and
http://www.w3.org/TR/rdf-sparql-query/#accessingRdfGraphs to get a
rough idea.
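
As a rough illustration for anyone without a SPARQL engine to hand, here
is a minimal Python sketch of the same idea. It models a dataset as
(graph, subject, predicate, object) tuples, which is roughly what a
SPARQL GRAPH clause matches against. The graph names, the example URIs
and the helper names said_by and who_said are all made up for
illustration; none of them come from SPARQL or from any library.

```python
# Illustrative only: a dataset as quads, i.e. triples tagged with the
# name of the graph (the "grouping context") they were found in.
# Every URI below is a hypothetical example.
ALICE = "http://example.org/alice#me"
QUADS = [
    ("http://example.org/graphs/alice", ALICE, "foaf:name", "Alice"),
    ("http://example.org/graphs/alice", ALICE, "foaf:weblog", "http://example.org/alice/blog"),
    ("http://example.org/graphs/bob", ALICE, "foaf:name", "Alice B."),
]


def said_by(quads, graph):
    """Return the triples asserted in one named graph ('what they said')."""
    return [(s, p, o) for (g, s, p, o) in quads if g == graph]


def who_said(quads, triple):
    """Return the sorted names of all graphs containing a triple ('who said it')."""
    return sorted({g for (g, s, p, o) in quads if (s, p, o) == triple})
```

In SPARQL terms, said_by corresponds roughly to GRAPH <g> { ?s ?p ?o }
with the graph name fixed, and who_said to GRAPH ?g { ... } with the
triple fixed and the graph name left as a variable.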

So - having this machinery makes it attractive for us to go figure out
who-said-what with Web data. However, RDF's instance format, as well as
most other 'social Web' formats (microformats, portable contacts,
atom, vcard etc.) don't make it very explicit who said what, let alone
whether it has been verified. My little experiment here is to test the
idea that we can document the claim attributions retrospectively, by
writing per-dataset filters which keep just those claims that come
from a single source.

Two examples

1. Advogato user profiles - Bits of this come from the end user; bits
are the result of running Advogato trust algorithms over their entire
db
2. BBC Music metadata - Describes bands. Most but not all of the data
on the BBC Music Beta site is sourced from MusicBrainz. Since
MusicBrainz is wiki-like, and includes information about artists'
Myspace URIs (which are OpenIDs), this is worth knowing when using the
artist profile data from bbc.co.uk.

My working assumption is that a datasource needs to have an RDF
expression as triples before this method works. Once we have a triples
representation, we simply run a SPARQL CONSTRUCT query against it,
which will filter/mutate the input data, and emit only the bit that
corresponds to some nameable source. Some notes here:
http://svn.foaf-project.org/foaftown/2009/headstream/examples/advogato/readme.txt

Example:
http://svn.foaf-project.org/foaftown/2009/headstream/examples/advogato/connolly-adogato-foaf.rdf
is what Advogato publishes for Dan Connolly.

http://svn.foaf-project.org/foaftown/2009/headstream/examples/advogato/advogato2trust.rq
is a query that strips out everything except the bit that assigns his
URI to some trust group:

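PREFIX : <http://xmlns.com/foaf/0.1/>

# Emit only the person, their weblog and their trust group membership.
CONSTRUCT {
  ?x a :Person .
  ?x :weblog ?w .
  ?trustgroup :member ?x .
  ?trustgroup a :Group .    # typed here; the WHERE clause does not require it
}

# Match people who have a weblog and are a member of some group.
WHERE {
  ?x a :Person .
  ?x :weblog ?w .
  ?trustgroup :member ?x .
}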

And here is the resulting - much smaller - file:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://xmlns.com/foaf/0.1/"
    xml:base="http://www.advogato.org/person/connolly/">
  <Person rdf:about="file:///Users/danbri/working/foaftown/2009/headstream/examples/advogato/connolly-adogato-foaf.rdf#me">
    <weblog rdf:resource="diary.html"/>
  </Person>
  <Group rdf:about="../../ns/trust#Master">
    <member rdf:resource="file:///Users/danbri/working/foaftown/2009/headstream/examples/advogato/connolly-adogato-foaf.rdf#me"/>
  </Group>
</rdf:RDF>
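
To show what the filter does step by step, here is a minimal Python
sketch that approximates the CONSTRUCT query over a set of (subject,
predicate, object) tuples. It is not how a SPARQL engine works
internally, and the function name advogato_to_trust and the SAMPLE data
(all example.org URIs) are hypothetical, chosen only for illustration.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical input: one profile with a name, a weblog and a trust group.
ME = "http://example.org/person/someone/#me"
MASTER = "http://example.org/ns/trust#Master"
SAMPLE = [
    (ME, RDF_TYPE, FOAF + "Person"),
    (ME, FOAF + "weblog", "http://example.org/person/someone/diary.html"),
    (ME, FOAF + "name", "Some One"),
    (MASTER, FOAF + "member", ME),
]


def advogato_to_trust(triples):
    """Approximate the CONSTRUCT query: keep person, weblog and group membership."""
    triples = set(triples)
    people = {s for (s, p, o) in triples if p == RDF_TYPE and o == FOAF + "Person"}
    out = set()
    for (group, p, person) in triples:
        # ?trustgroup :member ?x, where ?x is also typed as a :Person
        if p != FOAF + "member" or person not in people:
            continue
        # ?x :weblog ?w ; a person with no weblog produces no output at all
        for (s, p2, weblog) in triples:
            if s == person and p2 == FOAF + "weblog":
                out.add((person, RDF_TYPE, FOAF + "Person"))
                out.add((person, FOAF + "weblog", weblog))
                out.add((group, FOAF + "member", person))
                out.add((group, RDF_TYPE, FOAF + "Group"))
    return out
```

Calling advogato_to_trust(SAMPLE) drops the name triple and keeps the
rest, plus the added statement that the trust group is a Group.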

I imagine publishers could post such files as a way of documenting in
a flexible fashion which bits of the graph of data they are creating,
and which are simply passed through from another source.

I don't know whose use cases this would fit, but wanted to make a
start on writing it up. This particular technique operates on squiggly
RDF graph data; presumably you could build something similar with
XSLT/XQuery in an XML context. Not sure what the equivalent in JSON
would look like; maybe something like
http://buzzword.org.uk/2008/jsonGRDDL/spec

The main difference from the 'verified' thing discussed today is the
addition of another explicit level of indirection: we aren't
separating data elements into those that are checked or not; instead
we're tagging each fragment of information with more info about its
source. This pushes the job of deciding which to rely upon down closer
to apps. Generic services can simply pass along the 'who said it'
stuff while withholding judgement on whether it is reliable...
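
A minimal Python sketch of that division of labour, assuming each claim
is a small record that carries the name of its source. The record
layout, the source names and the functions pass_along and accept are
all hypothetical; they only illustrate the idea and are not drawn from
any existing service.

```python
from collections import defaultdict

# Each claim names its source; nothing is marked as "verified".
# Sources, subjects and values are hypothetical.
CLAIMS = [
    {"source": "user", "subject": "ex:connolly", "property": "weblog", "value": "ex:diary"},
    {"source": "trust-algorithm", "subject": "ex:connolly", "property": "trust-group", "value": "Master"},
    {"source": "unknown-scraper", "subject": "ex:connolly", "property": "trust-group", "value": "Apprentice"},
]


def pass_along(claims):
    """A generic service: group claims by source without judging any of them."""
    by_source = defaultdict(list)
    for claim in claims:
        by_source[claim["source"]].append(claim)
    return dict(by_source)


def accept(claims, trusted_sources):
    """An application: keep only the claims from sources it chooses to rely on."""
    return [c for c in claims if c["source"] in trusted_sources]
```

Here pass_along keeps all three claims, including the two that disagree
about the trust group, and it is the application calling accept that
decides which sources count.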

cheers,

Dan

Received on Wednesday, 17 February 2010 17:56:50 UTC