Re: Provenance for section 3 in technologies.tex

Alberto made a huge point:

> in our interpretation of provenance/contexts in RDFStore we assumed
> that a statement represents a fact that is asserted as true in a
> certain context. This circumstance (e.g. space/temporal, situation or
> scope) where the statement has been stated represents “contextual”
> information about the statement [1][2]. For example, when triples are
> being added to a graph it is often useful to be able to track back
> where they came from (e.g. Internet source Web site or domain), how
> they were added, by whom, why, when (e.g. date), when they will expire
> (e.g. Time-To-Live) and so on. Such context (or provenance information)
> can be thought of as an additional and orthogonal dimension to the
> other 3 components. This concept is not part of the current RDF data
> model [3] and is referred to as “statement reification”. From the
> application developer's point of view there is a clear need for such
> primitive constructs to layer different levels of semantics on top of
> RDF which cannot be represented in the RDF triples space....

JSE: The notion of preserving the context of a statement WITHOUT TRANSFORMING
THAT STATEMENT is critical for RDF application developers and I believe is
being overlooked. I believe RDF's current approach, which reifies the
statement, is artificially invasive and complex.

In the real world, statements will be conceptually contained, aggregated and
nested; it seems crazy that in order to deal with them in such a way, we must
artificially blow them apart.

For another argument about the need to easily nest (and how reification and
RDF/XML introduce unnecessary complexity) see:

http://purl.oclc.org/NET/RDF_M_S_Revisited  (PDF)

Given a statement like:

[s,p,o]

...provenance simply means we want the ability to make a statement ABOUT that
statement without changing that statement, as in:

[s1,p1,[s,p,o]]

...while preserving the intention of this second statement, which is for the
first triple to be the object of a second triple. If we have a quad store,
this looks like:

[i1,s,p,o]
[i2,s1,p1,i1]

...in which we are using the "4th element" as a statement identifier. We have
the ability to *explicitly* define context membership in the following way:

[i1,s1,p1,o1]
[i2,s2,p2,o2]
[i3,c1,p3,i1]
[i4,c1,p3,i2]

...in which subject c1 is the context identifier and p3 is a "contains"
predicate (jse:contains).
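
The quad layouts above can be sketched in a few lines of Python. This is only
an illustration, not RDFStore's API: the Quad tuple, the helper names
statements_about and members, and the extra statement i5 are assumptions made
up for the example; only jse:contains comes from the text.

```python
from typing import NamedTuple

class Quad(NamedTuple):
    sid: str  # statement identifier (the "4th element"), hypothetical field name
    s: str
    p: str
    o: str

CONTAINS = "jse:contains"

store = [
    Quad("i1", "s1", "p1", "o1"),
    Quad("i2", "s2", "p2", "o2"),
    Quad("i3", "c1", CONTAINS, "i1"),   # c1 explicitly contains statement i1
    Quad("i4", "c1", CONTAINS, "i2"),   # c1 explicitly contains statement i2
    Quad("i5", "sX", "pX", "i1"),       # assumed example: a statement ABOUT i1
]

def statements_about(quads, sid):
    """Statements whose object is the identifier of another statement."""
    return [q for q in quads if q.o == sid and q.p != CONTAINS]

def members(quads, context):
    """Statements that a context contains via jse:contains."""
    by_id = {q.sid: q for q in quads}
    return [by_id[q.o] for q in quads
            if q.s == context and q.p == CONTAINS and q.o in by_id]

print(members(store, "c1"))
print(statements_about(store, "i1"))
```

Note that [s1,p1,o1] is stored once and never rewritten; both the provenance
statement and the context membership refer to it only through i1.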

We can also define context membership *implicitly* as follows:

[c1,s1,p1,o1]
[c1,s2,p2,o2]
[c2,s3,p3,o3]
[c2,s4,p4,c1]

There are two contexts shown, each containing two arbitrary statements. The
first context c1 contains triples [s1,p1,o1] and [s2,p2,o2]. The context c2
contains [s3,p3,o3] and [s4,p4,c1]; this second statement happens to have as
its object c1, thus illustrating nesting.
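
The implicit form can be sketched the same way. Again this is a rough Python
illustration under my own assumptions (the function names triples_in and
expand are hypothetical, and treating "object equals a known context
identifier" as nesting is an application-level choice, not something RDF
defines):

```python
quads = [
    ("c1", "s1", "p1", "o1"),
    ("c1", "s2", "p2", "o2"),
    ("c2", "s3", "p3", "o3"),
    ("c2", "s4", "p4", "c1"),   # object is context c1: nesting
]

def triples_in(quads, context):
    """Triples flagged as belonging directly to one context."""
    return [(s, p, o) for c, s, p, o in quads if c == context]

def expand(quads, context, seen=None):
    """Quads of a context, followed by those of contexts nested as objects."""
    seen = set() if seen is None else seen
    if context in seen:          # guard against contexts that nest each other
        return []
    seen.add(context)
    contexts = {c for c, _, _, _ in quads}
    result = []
    for s, p, o in triples_in(quads, context):
        result.append((context, s, p, o))
        if o in contexts:
            result.extend(expand(quads, o, seen))
    return result

print(triples_in(quads, "c1"))
print(expand(quads, "c2"))
```

Here expand(quads, "c2") returns the two statements of c2 and then the two
statements of c1, without any of the four having been transformed.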

This example shows two different ways of constructing application-level
abstractions for containment, one explicit and one implicit, both leveraging
quads and neither one artificially trashing the contained statements...John

> ...Applications
> normally need to build meta-levels of abstraction over triples to
> reduce complexity and provide an incremental and scaleable access to
> information. For example, if a Web robot is processing and syndicating
> news coming from various on-line newspapers, there will be overlap. An
> application may decide to filter the news based not only on a timeline
> or some other property, but perhaps select sources providing only
> certain information with unique characteristics. This requires the
> flagging of triples as belonging to different contexts and then
> describing in the RDF itself the relationships between the contexts. At
> query time such information can then be used by the application to
> define a search scope to filter the results. Another common example of
> the usage of provenance and contextual information is digitally
> signing RDF triples to provide a basic level of trust over the
> Semantic Web. In that case triples could be flagged for example with a PGP
> key to uniquely identify the source and its properties. There have been
> several attempts [4][5][6][7] to formalize and use contexts and
> provenance information in RDF but there is not yet a common agreement
> on how to do it. It is also not completely clear how an application would
> benefit from this information. Jena2 also seems to be taking some steps in
> that direction.
> Our approach to model contexts and provenance has been simpler and
> motivated by real-world RDF applications we have developed [8][9]. We
> found that an additional dimension to the RDF triple can be useful or
> even essential. Given that the usage of full-blown RDF reification can
> be cumbersome due to its verbosity and inefficiency, we developed a
> different modeling technique that flags or marks a given statement as
> belonging to one or more specific contexts.
>
> On the practical side, our Perl/C API allows adding, removing and searching
> triples in specific "spaces" or contexts and serializing them back as
> Quads (a simple extension to the N-Triples syntax). At the moment we are
> about to implement a serialization of contexts back to RDF/XML (also as
> Jan suggested) via the rdf:ID reification stuff; at parse time we will
> just flag those triples (predicates) as "special" or asserted in a
> different context. In the past we used rdf:bagID to hack this
> functionality, but it has recently been dropped from the specs, as you
> probably noticed. At the RDQL query level we allow a 4-th component as a
> URI (resource) on triple-patterns to specify/select the context. The
> nice part of it is that subsequent triple-patterns can refine and
> select the vars from that 4-th component to "unify" descriptions of
> different levels.
>
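
To make the idea of a 4-th pattern component concrete, here is a toy matcher
in Python. It is not RDQL syntax and not the RDFStore API; the data, the
"?name" variable convention, and the functions match and query are all
assumptions invented for this sketch. It only shows how a variable bound in
the context position of one pattern can be reused as the subject of the next.

```python
def match(pattern, quad, bindings):
    """Unify one (s, p, o, context) pattern with a quad; '?x' terms are variables."""
    new = dict(bindings)
    for term, value in zip(pattern, quad):
        if term.startswith("?"):
            # bind the variable, or check it against an earlier binding
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new

def query(patterns, quads):
    """Join the patterns left to right, carrying variable bindings along."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [b for sol in solutions for quad in quads
                     if (b := match(pattern, quad, sol)) is not None]
    return solutions

# Hypothetical data: the same item URL appears in two source contexts,
# and each source context is itself described with a dc:date.
quads = [
    ("item1", "dc:title", "Headline A", "src:paperA"),
    ("item1", "dc:title", "Headline B", "src:paperB"),
    ("src:paperA", "dc:date", "2003-05-20", "meta"),
    ("src:paperB", "dc:date", "2003-05-21", "meta"),
]

result = query([("?item", "dc:title", "?title", "?ctx"),
                ("?ctx", "dc:date", "2003-05-21", "?m")], quads)
print(result)
```

The second pattern filters on the date of the context itself, so only the
title asserted in src:paperB survives the join.
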
> As an example, as presented at the WWW2003 devday, we have some demo
> queries using contexts available
>
> http://demo.asemantics.com/rdfstore/www2003/
>
> The example database contains scraped news from most Italian
> newspapers, where each channel and news item is put into a specific
> source context - this allows us to filter results by date and by source,
> avoiding overlaps and clashing of URLs (e.g. some newspapers recycle
> the same URL every day but with different HTML content). In particular
> look at the last two queries (number 9 and 10) using contextual
> information at the RDQL level - the very last one is pretty cool to me:
> it allows describing the 4-th context component with a dc:date and
> then joining it into the other triple space.
>
> BTW: while at www2003 I had a chat with Matt Biddulph about his RSS
> codepiction code/demo and he seems to have similar problems and
> solutions using Jena with reification to mimic contextual information -
> that means that this aspect is going to be fundamental for the success of
> the whole Semantic Web and RDF systems to me
>
> but yes, all this is not "standard" :-)
>
> hope this helps
>
> all the best
>
> Alberto
>
> [1] Graham Klyne, 13-Mar-2002 “Circumstance, provenance and partial
> knowledge - Limiting the scope of RDF assertions”
> http://www.ninebynine.org/RDFNotes/UsingContextsWithRDF.html
> [2] John F. Sowa, “Knowledge Representation: Logical, Philosophical,
> and Computational Foundations”, Brooks Cole Publishing Co., ISBN
> 0-534-94965-7
> [3] Patrick Hayes “RDF Semantics” (W3C Working Draft 23 January 2003)
> http://www.w3.org/TR/rdf-mt/
> [4] Graham Klyne, 18 October 2000 “Contexts for RDF Information
> Modelling” http://public.research.mimesweeper.com/RDF/RDFContexts.html
> [5] Seth Russell, 7 August 2002 “Quads”
> http://robustai.net/sailor/grammar/Quads.html
> [6] T. Berners-Lee, Dan Connolly “Notation 3”
> http://www.w3.org/2000/10/swap/doc/Overview.html
> [7] Dave Beckett, “Contexts Thoughts”
> http://www.redland.opensource.ac.uk/notes/contexts.html
> [8] http://demo.asemantics.com/biz/isc/
> [9] http://demo.asemantics.com/biz/lmn/
>
>
>
>
> >
> > I'd be interested in feedback here from Eric Miller and David Karger
> > also?
> >
> > thanks
> >
> > Mark
>
>
>

Received on Friday, 27 June 2003 12:04:31 UTC