Re: Provenance for section 3 in technologies.tex

On Thursday, June 26, 2003, at 02:11  PM, Butler, Mark wrote:

>
> Hi Dave
>
>> Non-standard extensions would be best avoided if you want
>> SIMILE to be a full
>> participant in the semantic web.
>
> But to take this back to my original suggestion does this apply to 
> quads? My
> understanding from Andy is that they are used by RDFStore and a number 
> of
> RSS processors, and from Jeremy that although Jena 2 does not have a 
> quads
> API it does actually use a quad data structure "under the hood". So 
> although
> they are non-standard at the moment, people are using them, so should 
> we
> really rule them out?

hi Mark,

in our interpretation of provenance/contexts in RDFStore we assumed 
that a statement represents a fact that is asserted as true in a 
certain context. This circumstance (e.g. space/temporal, situation or 
scope) where the statement has been stated represents “contextual” 
information about the statement [1][2]. For example, when triples are 
being added to a graph it is often useful to be able to track back 
where they came from (e.g. Internet source Web site or domain), how 
they were added, by whom, why, when (e.g. date), when they will expire 
(e.g. Time-To-Live) and so on. Such context (or provenance information) 
can be thought of as an additional and orthogonal dimension to the 
other 3 components. This concept is not part of the current RDF data 
model [3] and is referred to as “statement reification”. From the 
application developer's point of view there is a clear need for such 
primitive constructs to layer different levels of semantics on top of 
RDF which cannot be represented in the RDF triples space. Applications 
normally need to build meta-levels of abstraction over triples to 
reduce complexity and provide incremental and scalable access to 
information. For example, if a Web robot is processing and syndicating 
news coming from various on-line newspapers, there will be overlap. An 
application may decide to filter the news based not only on a timeline 
or some other property, but perhaps select sources providing only 
certain information with unique characteristics. This requires the 
flagging of triples as belonging to different contexts and then 
describing in the RDF itself the relationships between the contexts. At 
query time such information can then be used by the application to 
define a search scope to filter the results. Another common example of 
the usage of provenance and contextual information is digitally 
signing RDF triples to provide a basic level of trust over the 
Semantic Web. In that case triples could be flagged, for example, with 
a PGP key to uniquely identify the source and its properties. There have been 
several attempts [4][5][6][7] to formalize and use contexts and 
provenance information in RDF, but there is not yet common agreement 
on how to do it. It is also not completely clear how an application would 
benefit from this information. Jena2 also seems to be taking some steps 
in that direction.
Our approach to model contexts and provenance has been simpler and 
motivated by real-world RDF applications we have developed [8][9]. We 
found that an additional dimension to the RDF triple can be useful or 
even essential. Given that the usage of full-blown RDF reification can 
be cumbersome due to its verbosity and inefficiency, we developed a 
different modeling technique that flags or marks a given statement as 
belonging to one or more specific contexts.
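
To make the idea of flagging a statement with a context concrete, here is a minimal Python sketch. It is not the RDFStore Perl/C API: the names Quad, QuadStore, search and contexts_of, as well as the example.org URIs, are hypothetical and chosen only for illustration.

```python
from collections import namedtuple

# Hypothetical names throughout; this is an illustration, not the RDFStore API.
Quad = namedtuple("Quad", ["subject", "predicate", "obj", "context"])


class QuadStore:
    """Toy in-memory store where every statement carries a context."""

    def __init__(self):
        self._quads = set()

    def add(self, subject, predicate, obj, context):
        self._quads.add(Quad(subject, predicate, obj, context))

    def remove(self, subject, predicate, obj, context):
        self._quads.discard(Quad(subject, predicate, obj, context))

    def search(self, subject=None, predicate=None, obj=None, context=None):
        """Return matching quads; a component left as None is a wildcard."""
        wanted = (subject, predicate, obj, context)
        return sorted(
            q for q in self._quads
            if all(w is None or w == v for w, v in zip(wanted, q))
        )

    def contexts_of(self, subject, predicate, obj):
        """Return every context in which the same triple has been asserted."""
        return sorted(
            q.context for q in self._quads
            if q[:3] == (subject, predicate, obj)
        )


SOURCE_A = "http://example.org/sourceA"
SOURCE_B = "http://example.org/sourceB"

store = QuadStore()
store.add("http://example.org/news/1", "dc:title", "Headline", SOURCE_A)
store.add("http://example.org/news/1", "dc:title", "Headline", SOURCE_B)
store.add("http://example.org/news/2", "dc:title", "Other", SOURCE_B)

# Restrict the search scope to one context.
from_b = store.search(context=SOURCE_B)
# The same triple can belong to more than one context.
shared = store.contexts_of("http://example.org/news/1", "dc:title", "Headline")
```

Keeping the context as a fourth field means a triple asserted by two sources is stored as two distinct quads, so removing it from one context leaves the other assertion untouched.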

On the practical side, our Perl/C API allows adding, removing and 
searching triples in specific "spaces" or contexts and serializing them 
back as Quads (a simple extension to the N-Triples syntax). At the 
moment we are about to implement a serialization of contexts back to 
RDF/XML (as Jan also suggested) via the rdf:ID reification mechanism; at 
parse time we will just flag those triples (predicates) as "special" or 
asserted in a different context. In the past we used rdf:bagID to hack 
this functionality, but it has recently been dropped from the specs, as 
you probably noticed. At the RDQL query level we allow a 4th component, 
a URI (resource), on triple-patterns to specify/select the context. The 
nice part is that subsequent triple-patterns can refine and select the 
vars from that 4th component to "unify" descriptions of different 
levels.
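
As a rough illustration of serializing a statement as a quad, the Python sketch below writes three N-Triples-style terms followed by a fourth term for the context. The exact syntax RDFStore emits is not shown in this message, so the layout, the function names term and to_quad_line, and the URI detection heuristic are all assumptions.

```python
def term(value):
    """Render a URI as <uri>; render anything else as a quoted literal.

    Assumption: a value is treated as a URI if it starts with a known scheme.
    Only backslashes and double quotes are escaped in this sketch.
    """
    if value.startswith(("http://", "https://", "urn:")):
        return "<%s>" % value
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % escaped


def to_quad_line(subject, predicate, obj, context):
    """One statement per line: subject, predicate, object, then the context."""
    return " ".join(term(t) for t in (subject, predicate, obj, context)) + " ."


line = to_quad_line(
    "http://example.org/news/1",
    "http://purl.org/dc/elements/1.1/title",
    'A "quoted" headline',
    "http://example.org/sourceA",
)
print(line)
```

A parser that does not know about contexts could still recover the plain triple by ignoring the fourth term on each line.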

As an example, as presented at the WWW2003 devday, we have some demo 
queries using contexts available at:

http://demo.asemantics.com/rdfstore/www2003/

The example database contains scraped news from most Italian 
newspapers, where each channel and news item is put into a specific 
source context. This allows us to filter results by date and by source, 
avoiding overlaps and clashing URLs (e.g. some newspapers recycle 
the same URL every day but with different HTML content). In particular, 
look at the last two queries (number 9 and 10), which use contextual 
information at the RDQL level. The very last one is pretty cool to me: 
it describes the 4th context component with a dc:date and 
then joins it into the other triple space.
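
The join on the context component can be sketched in a few lines of Python. This is not RDQL syntax and does not reproduce demo queries 9 and 10; the functions match and query, the tuple-based patterns, the META context and all example.org URIs are hypothetical. Variables start with "?" and None means an unconstrained component.

```python
def is_var(term):
    """Variables are written with a leading question mark."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, quad, bindings):
    """Try to extend bindings so the pattern equals the quad; None on failure."""
    result = dict(bindings)
    for p, value in zip(pattern, quad):
        if p is None:                 # unconstrained component
            continue
        if is_var(p):
            if result.setdefault(p, value) != value:
                return None           # variable already bound to another value
        elif p != value:
            return None
    return result


def query(quads, patterns):
    """Join the patterns left to right, sharing variable bindings."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for quad in quads
            for extended in [match(pattern, quad, bindings)]
            if extended is not None
        ]
    return solutions


NEWS = "http://example.org/paper/today.html"    # URL recycled every day
DAY1 = "http://example.org/ctx/2003-05-20"
DAY2 = "http://example.org/ctx/2003-05-21"
META = "http://example.org/ctx/meta"            # where contexts are described

quads = [
    (NEWS, "dc:title", "First headline", DAY1),
    (NEWS, "dc:title", "Second headline", DAY2),
    (DAY1, "dc:date", "2003-05-20", META),
    (DAY2, "dc:date", "2003-05-21", META),
]

# The 4th component ?ctx of the first pattern is the subject of the second.
solutions = query(quads, [
    ("?item", "dc:title", "?title", "?ctx"),
    ("?ctx", "dc:date", "2003-05-21", None),
])
titles = [s["?title"] for s in solutions]
```

Because the two titles share one URL, only the context distinguishes them; binding ?ctx in the first pattern and constraining its dc:date in the second selects the statement asserted on the requested day.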

BTW: while at WWW2003 I had a chat with Matt Biddulph about his RSS 
codepiction code/demo and he seems to have similar problems and 
solutions, using Jena with reification to mimic contextual information. 
To me that means this aspect is going to be fundamental for the success 
of the whole Semantic Web and RDF systems.

but yes, all this is not "standard" :-)

hope this helps

all the best

Alberto

[1] Graham Klyne, 13-Mar-2002 “Circumstance, provenance and partial 
knowledge - Limiting the scope of RDF assertions” 
http://www.ninebynine.org/RDFNotes/UsingContextsWithRDF.html
[2] John F. Sowa, “Knowledge Representation: Logical, Philosophical, 
and Computational Foundations”, Brooks Cole Publishing Co., ISBN 
0-534-94965-7
[3] Patrick Hayes “RDF Semantics” (W3C Working Draft 23 January 2003) 
http://www.w3.org/TR/rdf-mt/
[4] Graham Klyne, 18 October 2000 “Contexts for RDF Information 
Modelling” http://public.research.mimesweeper.com/RDF/RDFContexts.html
[5] Seth Russell, 7 August 2002 “Quads” 
http://robustai.net/sailor/grammar/Quads.html
[6] T. Berners-Lee, Dan Connolly “Notation 3” 
http://www.w3.org/2000/10/swap/doc/Overview.html
[7] Dave Beckett, “Contexts Thoughts” 
http://www.redland.opensource.ac.uk/notes/contexts.html
[8] http://demo.asemantics.com/biz/isc/
[9] http://demo.asemantics.com/biz/lmn/




>
> I'd be interested in feedback here from Eric Miller and David Karger 
> also?
>
> thanks
>
> Mark

Received on Thursday, 26 June 2003 09:33:00 UTC