Re: Web Semantics of Datasets (v0.2) from Steve Harris on 2011-10-10 (public-rdf-wg@w3.org from October 2011)

From: Steve Harris <steve.harris@garlik.com>
Date: Mon, 10 Oct 2011 16:42:57 +0100
To: Sandro Hawke <sandro@w3.org>
Cc: public-rdf-wg <public-rdf-wg@w3.org>
Message-Id: <7FDB5F36-6EB5-498E-99B1-A6D807F13168@garlik.com>
A bit of context: I'm being a bit difficult, and I recognise that our use-cases might not be common, but it would seem hypocritical to encourage the WG to recommend against something that we've been doing for a number of years, and that works well.

Anyway, comments inline… and apologies for the tl;dr factor :(

On 2011-10-10, at 15:37, Sandro Hawke wrote:

> On Mon, 2011-10-10 at 14:35 +0100, Steve Harris wrote:
>> On 2011-10-10, at 12:30, Sandro Hawke wrote:
>> 
>>> Here's some revised wording for the proposal, getting a bit closer to
>>> spec text.   It's still somewhat informal, and mixing normative and
>>> non-normative bits, and best-practice.   And it's not as clear as it
>>> should be about handling change over time.
>>> 
>>>   -- Sandro
>>> ===
>>> A dataset D is true iff (1) its default graph is true and (2) for
>>> every pair of <N,G> in D, N names something (a "resource", sometimes
>>> called a "g-box") which, at every time T in R, has G as its current
>>> state.
>> 
>> [ apologies in advance for everywhere I've confused a term in logic with an English-language term, it's really not my area of expertise ]
>> 
>> I'm not very comfortable with "its default graph is true". As previously mentioned, many systems default to having the default graph be the union of all named graphs (in our experience, at least, this turns out to be the most practical way to query SPARQL stores), and I doubt you can often determine truthfulness for all your named graphs, depending on what that implies.
>> 
>> Also, in general, I'm not that comfortable with anything that privileges the default graph in terms of "truth", especially as I don't really know what that means. It suggests rather a naïve view of trust, if that's the intent, and if not I'm not sure what the intent is.
>> 
>> It also raises the possibility of a "true" dataset becoming untrue through the use of SPARQL protocol parameters such as default-graph-uri, or the FROM keyword; cf. http://www.w3.org/TR/rdf-sparql-protocol/ §2.1.2.
> 
> I might not be using the term "true" in the technically correct way
> either.  I just mean it in the sense it's already in RDF.  Right now, we
> have some notion of an RDF graph being asserted / true / claimed, etc.
> To be hopelessly boring with my examples, when TimBL publishes at
> http://www.w3.org/People/BernersLee/card#i:

I see, though I'm not sure how that squares with the contents of the named graphs. Someone or something is asserting everything somewhere.

It feels a bit like there's an assumption of one dataset = one (software) system, which is often not true.
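
As a rough illustration of the merged-default-graph case mentioned earlier, here is a minimal Python sketch. It is not taken from any real store: the function name default_graph, the dict-of-sets representation, and the example.org URIs are all hypothetical, and unknown graph URIs are simply treated as empty. It only shows that what counts as "the default graph" depends on how the query is made, in the spirit of the protocol's default-graph-uri parameter.

```python
def default_graph(named_graphs, default_graph_uris=None):
    """Return the default graph a query would see.

    named_graphs: dict mapping graph URI -> set of (s, p, o) triples.
    default_graph_uris: optional list of graph URIs, playing the role of
    the protocol's default-graph-uri parameter. When it is absent, this
    sketch assumes the store falls back to the union of all named graphs.
    """
    if default_graph_uris is None:
        uris = list(named_graphs)
    else:
        uris = default_graph_uris
    merged = set()
    for uri in uris:
        # Simplification: a graph the store does not hold counts as empty.
        merged |= named_graphs.get(uri, set())
    return merged


# Hypothetical data: two named graphs in one store.
store = {
    "http://example.org/g1": {("ex:a", "ex:p", "ex:b")},
    "http://example.org/g2": {("ex:c", "ex:p", "ex:d")},
}
everything = default_graph(store)
only_g1 = default_graph(store, ["http://example.org/g1"])
```

Under these assumptions the same store yields two different default graphs, so a statement about "the default graph" being true has to say which one is meant.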

>   timbl:i foaf:name "Tim Berners-Lee"
> 
> he really means that's his name.   If he put a different name there, he
> would be in some sense lying or mistaken.

Well, unless he put "Sir Tim Berners-Lee", or some other variation - foaf:name is not very precisely defined. I'm nitpicking obviously, but it's no less true, as of some time in 2009.

> So, I'm proposing that when you publish/assert a dataset, you are also
> publishing/asserting the default graph.   And you are
> publishing/asserting the connections between some named gboxes and
> snapshots of their contents.  You are *not* asserting those
> contents.

OK, I guess this comes down to how strong "asserting" is. I think of it in similar terms to something I would put in INSERT INTO t VALUES(…), but that says more about my background than anything else.

> I can see how this might be a problem in your merged default graph case.
> And I can live with changing it, I guess, but I don't know how else to
> convey the metadata in a TriG document.    I think there has to be some
> way to say who the author is, etc.

Sure, but "who the author is" can also be subject to interpretation (30x, proxies, caches, man-in-the-middle attacks etc.). We have several different processes that look at HTML, text, etc. data and produce different data from it, depending on their focus, and how ambiguous the data is. In the general case there is no one "true" blob of metadata for a given g-snap.

> I suppose another solution would be to have a special META keyword in
> TriG, for another graph.   In SPARQL, it might go in the Service
> Description, I guess.

In the past we've appended #something to graph URIs to indicate that it's metadata, e.g.

timbl: {
  timbl:i foaf:name "Tim Berners-Lee" .
  …
}

timbl:#meta {
  <> dc:subject timbl: ;
     dc:date "2011-10-10T16:05:23Z"^^xsd:dateTime ;
     dc:creator <some-tool> .
  …
}

That's not what we do now, but not for any deep technical reason - it just turned out not to be a good model for our specific use. We now inject the metadata inline with the data, or alternatively we only gather metadata, depending on your view.

I don't regard timbl:#meta as being any more or less asserted than timbl:. One was fetched from the web (which is fallible, cf. DigiNotar), and the other was written as a side effect of its processing, by a tool.
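
The "#meta" convention above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool that was actually used: the names meta_graph_uri and store_with_meta are hypothetical, the dataset is a plain dict of sets, the metadata triples take the meta graph's own URI as their subject (one possible reading of "<>" in the example), and the timestamp is assumed to be in UTC.

```python
from datetime import datetime, timezone

DC = "http://purl.org/dc/elements/1.1/"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def meta_graph_uri(graph_uri):
    """Name the companion metadata graph by appending '#meta'.

    Assumes the graph URI has no fragment of its own.
    """
    if "#" in graph_uri:
        raise ValueError("graph URI already has a fragment: " + graph_uri)
    return graph_uri + "#meta"


def store_with_meta(dataset, graph_uri, triples, fetched_at, tool):
    """Write fetched triples and a separate metadata graph into dataset.

    dataset: dict mapping graph URI -> set of (s, p, o) triples.
    fetched_at: timezone-aware datetime, assumed to be in UTC.
    tool: identifier of whatever produced the metadata.
    """
    meta_uri = meta_graph_uri(graph_uri)
    dataset[graph_uri] = set(triples)
    stamp = fetched_at.strftime("%Y-%m-%dT%H:%M:%SZ")
    dataset[meta_uri] = {
        (meta_uri, DC + "subject", graph_uri),
        (meta_uri, DC + "date", (stamp, XSD_DATETIME)),
        (meta_uri, DC + "creator", tool),
    }
    return meta_uri


dataset = {}
card = "http://www.w3.org/People/BernersLee/card"
meta = store_with_meta(
    dataset,
    card,
    [(card + "#i", "http://xmlns.com/foaf/0.1/name", "Tim Berners-Lee")],
    datetime(2011, 10, 10, 16, 5, 23, tzinfo=timezone.utc),
    "some-tool",
)
```

Both graphs end up as ordinary entries in the same dict, which matches the point that neither is inherently more asserted than the other.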

I'm also not yet clear on how you would represent changes over time. The graph timbl: has the same "graph URI" on 2011-10-09 and 2011-10-10, but may well have different content. Some systems need to be able to represent that unambiguously.
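
One way to picture the requirement is a store that keeps every observed snapshot of a graph, keyed by observation time, so that one graph URI can map to different content on different days. The Python sketch below is hypothetical (the SnapshotStore class and its methods are invented for illustration) and is not a description of any existing system or of the proposal under discussion.

```python
from bisect import bisect_right
from datetime import datetime, timezone


class SnapshotStore:
    """Keep time-stamped snapshots (g-snaps) per graph URI."""

    def __init__(self):
        # graph URI -> (sorted list of times, list of snapshots in same order)
        self._history = {}

    def record(self, graph_uri, when, triples):
        """Record the content observed for graph_uri at time `when`."""
        times, snaps = self._history.setdefault(graph_uri, ([], []))
        i = bisect_right(times, when)
        times.insert(i, when)
        snaps.insert(i, frozenset(triples))

    def state_at(self, graph_uri, when):
        """Return the latest snapshot recorded at or before `when`, or None."""
        times, snaps = self._history.get(graph_uri, ([], []))
        i = bisect_right(times, when)
        return snaps[i - 1] if i else None


store = SnapshotStore()
card = "http://www.w3.org/People/BernersLee/card"
name = "http://xmlns.com/foaf/0.1/name"
store.record(card, datetime(2011, 10, 9, tzinfo=timezone.utc),
             [(card + "#i", name, "Tim Berners-Lee")])
store.record(card, datetime(2011, 10, 10, tzinfo=timezone.utc),
             [(card + "#i", name, "Sir Tim Berners-Lee")])
```

A lookup then needs both the graph URI and a time, which is exactly the extra piece of information a bare <N,G> pair does not carry.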

- Steve

>>> It follows from AWWW that if N is an IRI which can be dereferenced,
>>> a successful, correct dereference of N at any time T in R must yield
>>> a serialization ("representation") of G.
>>> 
>>> In order to know whether a dereference occurs at a time in R, it is
>>> useful to have R declared in the default graph of D, or in another
>>> nearby, easy-to-find data source.  Where possible, it is helpful to
>>> have R be All Time; that is, having N name a resource whose state,
>>> by definition, never changes.
>>> 
>>> In RDF data, N may be used (1) directly, to name the g-box,
>>> expressing things like the license that applies to its state, or who
>>> controls it; and (2) indirectly, to refer to G as the current state
>>> of the g-box.  Indirect reference can be used to express things
>>> about an RDF Graph (a "g-snap"), like that it was the graph some
>>> entity asserted at some time.  Indirection is done in the semantics
>>> of the predicates with which N is used.
>>> 
>>> When N is used indirectly, the reference to G only holds inside time
>>> range R, of course.  Care must be taken not to use N as if it
>>> necessarily referred to G, outside of R.  Since R is defined to be
>>> the same for all elements of D, indirect reference is safe in the
>>> default graph.   
>>> 

-- 
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
Received on Monday, 10 October 2011 15:43:28 UTC