Re: dataset semantics from Sandro Hawke on 2011-12-19 (public-rdf-wg@w3.org from December 2011)

From: Sandro Hawke <sandro@w3.org>
Date: Mon, 19 Dec 2011 00:26:39 -0500
To: Pat Hayes <phayes@ihmc.us>
Cc: David Wood <david@3roundstones.com>, RDF WG <public-rdf-wg@w3.org>
Message-ID: <1324272399.6252.1515.camel@waldron>
On Sat, 2011-12-17 at 09:58 -0600, Pat Hayes wrote:
> On Dec 16, 2011, at 11:43 PM, Sandro Hawke wrote:
> 
> > On Fri, 2011-12-16 at 22:47 -0600, Pat Hayes wrote:
> >> On Dec 16, 2011, at 10:21 PM, Sandro Hawke wrote:
> >> 
> >>> ... maybe I can figure out some TriG
> >>> entailment tests....    Like, does this TriG document / dataset:
> >>> 
> >>>       { <a> <b> <c> }
> >>> 
> >>> entail this RDF graph:
> >>> 
> >>>   <a> <b> <c>.
> >>> 
> >>> I think it should, so we can have metadata in TriG, but other people
> >>> have disagreed.   How should we gather test cases like this?
> >> 
> >> 
> >> FWIW, 'entailment' has a fairly precise meaning. A entails B when B is true whenever A is, or more precisely if, for every possible interpretation I, if A is true in I then B is true in I. So it only makes sense to speak of entailment when there is some notion of truth-in-an-interpretation to base it on. 
> > 
> > Yes, I know.
> 
> OK :-)
> > 
> >> So, what are the truth conditions for datasets? 
> > 
> > We haven't quite figured that out yet.   I'm proposing one part of that
> > is that a dataset being true implies its default graph is true.
> 
> Why just the default graph? Aren't queries also directed against the other graphs? Seems to me that the only thing that marks the default graph as being special is that it has no name, which has nothing to do with its truth or falsity.
> 
> BTW, what was the rationale for having a nameless graph in a dataset in the first place? Seems to me that the SPARQL design would be improved if all graphs were required to have some kind of name, and the query was obliged to use the name. After all, this is how the rest of the Web works. 

I wasn't there, but I think there were two very different use cases:

1.  The SPARQL endpoint is just used to query one graph.  That graph and
the endpoint are tightly bound; neither has a life without the other.
The graph has no need for its own identity apart from that endpoint --
it just lives inside that endpoint.   (Yes, it should have a URI, but at
that point, it seemed like its URI might well be the URI of the
endpoint.  They couldn't quite agree on that, though.)

2.  The SPARQL endpoint is a proxy for the Web.  When you query it,
you're querying the whole Web, or some portion of it.  The endpoint may
fetch as necessary, and cache; you can view it as a quick and easy way to
search the data on parts of the Web.

So, those two got merged into this one flexible "dataset" concept.
Along the way, though, there was enough ambiguity and flexibility
introduced that people found they could use the datasets as generic
key-value stores, where the value was a graph and the key was usually a
URI.   Somehow this setup got called "named graphs", which is why I
cringe when I hear the term.

> > 
> > The other part of the truth conditions has to do with the relationship
> > between the things named by the label URIs and the graphs they label.   
> > 
> > Unfortunately, I think we need to allow for several possible
> > relationships there, MAYBE even in the same dataset, which makes things
> > rather complicated.
> 
> Blech. Why do we NEED to do this? 

Well, Dan Brickley was arguing for this most strongly.  I wasn't
convinced.   I was thinking we'd show how to do even this, then maybe
simplify if it turns out not to be needed.    That may be a bad tactic,
since it's much easier to add functionality later than to remove it.

> > One example of the relationship is what I called graphState in a
> > different thread.  In that case, the dataset being true would imply that
> > for each <U,G> in the dataset, the state of the resource U is the graph
> > G.   (Here, I mean "state" and "resource" in exactly the REST sense.)
> 
> And that this graph is true? Ie, is the graph itself asserted when the dataset is asserted? 

(No, discussed in another email thread.)

> > Another example is an out of date version of graphState, maybe call it
> > graphStateWas.  In this case, the dataset being true would imply that
> > for each <U,G> in the dataset, the state of the resource U is, or used
> > to be, graph G.
> 
> Why would we need this? Surely when something is changed, it is no longer asserting what it did before the change. That is kind of the point of allowing change, seems to me.

I might not have framed that quite right.  I was aiming for this:

I fetch some resources, draw some conclusions, and want to be able to
publish them along with references to my sources.  Since I know the
sources can change, I want to make copies of all of them.  Then I'll
refer to the original sources, indicating the time I accessed them, and
pointing to the copy I'm maintaining of them, as I saw them at the time.

(Or it could be a copy some shared service is maintaining, like
archive.org.)
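
A minimal sketch of that bookkeeping, in Python. Everything here is
an assumption made for illustration: the names Snapshot and
still_current are hypothetical, the example.org URL is a placeholder,
and a graph is modelled as a plain set of string triples rather than
real RDF terms.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import FrozenSet, Tuple

Triple = Tuple[str, str, str]   # (subject, predicate, object), as plain strings


@dataclass(frozen=True)
class Snapshot:
    """A copy of what a source resource looked like when it was fetched."""
    source: str                  # U: the original resource being cited
    accessed: datetime           # when the copy was taken
    graph: FrozenSet[Triple]     # G: the state of U as seen at that time


def still_current(snapshot: Snapshot, live_graph: FrozenSet[Triple]) -> bool:
    """True if the source's present state still matches the saved copy."""
    return snapshot.graph == live_graph


# Hypothetical example: the source changes after the copy was made.
snap = Snapshot(
    source="http://example.org/source",
    accessed=datetime(2011, 12, 19, tzinfo=timezone.utc),
    graph=frozenset({("a", "b", "c")}),
)
changed = frozenset({("a", "b", "d")})
```

Under these assumptions the snapshot stays a valid record of what the
source "is, or used to be", whether or not still_current(snap, changed)
holds any more.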

> > 
> > Another example of the relationship is something I gather Cambridge
> > Semantics uses, which I'll call subjectOf.   (In one of their deployment
> > modes, triples are divided into two types, which I'll call A and B, based
> > on which predicate they use.  The dataset is constructed such that for
> > each <U, G> in the dataset, every type-A triple in G is of the form
> > { <U> ?P ?O }.  The type-B triples are a little more complicated.)  In
> > this case, the dataset being true would imply the dataset being
> > segmented in this complicated but useful way.   
> 
> With all respect to Cambridge Semantics, if they are the only user of this odd convention, then I really don't think we as a WG should even be considering standardizing it. Unless someone can make a case for why it is going to be generally useful.
> 
> And in any case, this sounds like a syntactic restriction rather than a semantic condition. Having the dataset be segmented is not going to alter the interpretations of any of the triples (is it?). So the semantics (and hence the entailments) can ignore this.

Well, if we're putting this whole concept into the subjectOf predicate,
isn't that considered part of the semantics of that predicate, rather
like a range restriction?

> > 
> > It's *rather* tempting to just use triples for this, making graphState,
> > graphStateWas, subjectOf, etc, be predicates.   That way the semantics
> > of datasets would be much simpler, with the complications bundled into
> > the semantics of those particular predicates. 
> > 
> > I guess I'm suggesting extending the definition of dataset to be a
> > default graph and rather than a set of pairs <U,G>, be a set of triples
> > <U, R, G>, where R is optional.  If R is omitted, you have the kind of
> > dataset we're used to now, where we have no idea what that relation is
> > supposed to be (unless the author tells us humans).
> 
> So I should interpret <U, R, G> to mean that the relation R holds between the resource U and the graph G, and U is *never* simply a name of the graph, is that right? That is we never have the graph  simply being the resource identified by the IRI ?

Well, if R is "=", then you do.   But you have to say that, explicitly.
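
To make the <U, R, G> idea concrete, here is a rough sketch in Python.
It is only an illustration under stated assumptions: the names
LabeledGraph, Dataset, and related_by are hypothetical, relation names
such as "graphState" are plain strings rather than real IRIs, and a
graph is modelled as a set of string triples.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Optional, Set, Tuple

Triple = Tuple[str, str, str]   # (subject, predicate, object), as plain strings
Graph = FrozenSet[Triple]       # an RDF graph, modelled as an immutable set of triples


@dataclass(frozen=True)
class LabeledGraph:
    label: str                       # U: the label URI
    graph: Graph                     # G: the graph
    relation: Optional[str] = None   # R: None means the relation is not stated


@dataclass
class Dataset:
    default_graph: Graph = frozenset()
    labeled: Set[LabeledGraph] = field(default_factory=set)

    def related_by(self, relation: Optional[str]) -> Set[Tuple[str, Graph]]:
        """Return the (U, G) pairs whose stated relation R equals `relation`."""
        return {(e.label, e.graph) for e in self.labeled if e.relation == relation}


# Hypothetical example data: one entry states its relation, one omits it.
g = frozenset({("a", "b", "c")})
ds = Dataset(
    labeled={
        LabeledGraph("http://example.org/u1", g, "graphState"),
        LabeledGraph("http://example.org/u2", g),  # R omitted: today's kind of dataset
    }
)
```

In this sketch ds.related_by(None) picks out the entries that behave
like today's <U, G> pairs, and nothing is related by "=" unless an
entry says so explicitly.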

> > 
> >> Can one assert a dataset (ie claim it to be true)? 
> > 
> > Yes.
> > 
> >> How does one do that? 
> > 
> > The same way you do with RDF.  It kind of depends on your application.
> > Maybe you publish it on the web; maybe you send it to some agent; maybe
> > you publish it and send the URL somewhere, etc.
> 
> And is this in fact done? Do people transmit SPARQL datasets around the Web? What would be a typical transaction involving a dataset? When it is done, what typically happens to the RDF triples in the graphs in the dataset? Do other applications extract them and mash them up with other RDF? Or are they always kept in their dataset 'context'? 

I don't know if anyone is doing this, and I rather doubt anyone is doing
it in a standardized manner.   Well, there's CKAN.   I haven't looked at
what's dataset-like there.   (VoID uses the word "dataset" to mean
g-box, so it may be a bit hard to tell.)  *shrug*

I think this kind of work, as with RDF, may need to be done rather
speculatively, because the benefits to adopting this technology *before*
it's a standard are so slight.   People exchange datasets inside their
company and with people they know, but without standards for what it
means, why would one publish them on the open Web?   

But, yeah, honestly, I have no idea.    I do know a lot of people asked
for a standard for "Named Graphs", and when I ask what they mean, I
understand them to be saying they want to be able to, in their RDF,
refer to other bits of RDF. 

   -- Sandro


> Pat
> 
> 
> > 
> >   -- Sandro
> > 
> > 
> > 
> > 
> 
> ------------------------------------------------------------
> IHMC                                     (850)434 8903 or (650)494 3973   
> 40 South Alcaniz St.           (850)202 4416   office
> Pensacola                            (850)202 4440   fax
> FL 32502                              (850)291 0667   mobile
> phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Monday, 19 December 2011 05:26:50 UTC