Re: dataset semantics

Sandro, I have the overall feeling from this thread that these issues are so vague and speculative, and so outside the scope of RDF itself or of what people are doing with it, that we should not even be discussing them here.  Details inline.

On Dec 18, 2011, at 11:26 PM, Sandro Hawke wrote:

> On Sat, 2011-12-17 at 09:58 -0600, Pat Hayes wrote:
>> On Dec 16, 2011, at 11:43 PM, Sandro Hawke wrote:
>> 
>>> On Fri, 2011-12-16 at 22:47 -0600, Pat Hayes wrote:
>>>> On Dec 16, 2011, at 10:21 PM, Sandro Hawke wrote:
>>>> 
>>>>> ... maybe I can figure out some TriG
>>>>> entailment tests....    Like, does this TriG document / dataset:
>>>>> 
>>>>>      { <a> <b> <c> }
>>>>> 
>>>>> entail this RDF graph:
>>>>> 
>>>>>  <a> <b> <c>.
>>>>> 
>>>>> I think it should, so we can have metadata in TriG, but other people
>>>>> have disagreed.   How should we gather test cases like this?
>>>> 
>>>> 
>>>> FWIW, 'entailment' has a fairly precise meaning. A entails B when B is true whenever A is, or more precisely if, for every possible interpretation I, if A is true in I then B is true in I. So it only makes sense to speak of entailment when there is some notion of truth-in-an-interpretation to base it on. 
>>> 
>>> Yes, I know.
>> 
>> OK :-)
>>> 
>>>> So, what are the truth conditions for datasets? 
>>> 
>>> We haven't quite figured that out yet.   I'm proposing one part of that
>>> is that a dataset being true implies its default graph is true.
>> 
>> Why just the default graph? Aren't queries also directed against the other graphs? Seems to me that the only thing that marks the default graph as being special is that it has no name, which has nothing to do with its truth or falsity.
>> 
>> BTW, what was the rationale for having a nameless graph in a dataset in the first place? Seems to me that the SPARQL design would be improved if all graphs were required to have some kind of name, and the query was obliged to use the name. After all, this is how the rest of the Web works. 
> 
> I wasn't there, but I think there were two very different use cases:
> 
> 1.  The sparql endpoint is just used to query one graph.  That graph and
> the endpoint are tightly bound; neither has a life without the other.
> The graph has no need for its own identity apart from that endpoint --
> it just lives inside that endpoint.   (Yes, it should have a URI, but at
> that point, it seemed like its URI might well be the URI of the
> endpoint.  They couldn't quite agree on that, though.)
> 
> 2.  The sparql endpoint is a proxy for the Web.  When you query it,
> you're querying the whole web, or some portion of it.  The endpoint may
> fetch as necessary, and cache; you can view it as a quick and easy way to
> search the data on parts of the Web.
> 
> So, those two got merged into this one flexible "dataset" concept.
> Along the way, though, there was enough ambiguity and flexibility
> introduced that people found they could use the datasets as generic
> key-value stores, where the value was a graph and the key was usually a
> URI.   Somehow this setup got called "named graphs", which is why I
> cringe when I hear the term.
> 
>>> 
>>> The other part of the truth conditions has to do with the relationship
>>> between the things named by the label URIs and the graphs they label.   
>>> 
>>> Unfortunately, I think we need to allow for several possible
>>> relationships there, MAYBE even in the same dataset, which makes things
>>> rather complicated.
>> 
>> Blech. Why do we NEED to do this? 
> 
> Well, Dan Brickley was arguing for this most strongly.

Hmm, I wonder what was going on in his mind at that point. 

>  I wasn't
> convinced.   I was thinking we'd show how to do even this, then maybe
> simplify if it turns out not to be needed.    That may be a bad tactic,
> since it's much easier to add functionality later than to remove it.
> 
>>> One example of the relationship is what I called graphState in a
>>> different thread.  In that case, the dataset being true would imply that
>>> for each <U,G> in the dataset, the state of the resource U is the graph
>>> G.   (Here, I mean "state" and "resource" in exactly the REST sense.)
>> 
>> And that this graph is true? Ie, is the graph itself asserted when the dataset is asserted? 
> 
> (No, discussed in another email thread.)
> 
>>> Another example is an out of date version of graphState, maybe call it
>>> graphStateWas.  In this case, the dataset being true would imply that
>>> for each <U,G> in the dataset, the state of the resource U is, or used
>>> to be, graph G.
>> 
>> Why would we need this? Surely when something is changed, it is no longer asserting what it did before the change. That is kind of the point of allowing change, seems to me.
> 
> I might not have framed that quite right.  I was aiming for this:
> 
> I fetch some resources, draw some conclusions, and want to be able to
> publish them along with references to my sources.  Since I know the
> sources can change, I want to make copies of all of them.  Then I'll
> refer to the original sources, indicating the time I accessed them, and
> pointing to the copy I'm maintaining of them, as I saw them at the time.
> 
> (Or it could be a copy some shared service is maintaining, like
> archive.org.)

This might be an interesting topic, but I don't see what it has to do with datasets particularly. And why keep the copy AND point to the original source? That seems like overkill. 

> 
>>> 
>>> Another example of the relationship is something I gather Cambridge
>>> Semantics uses, which I'll call subjectOf.   (In one of their deployment
>>> modes, triples are divided into two types, which I'll call A and B, based
>>> on which predicate they use.  The dataset is constructed such that for
>>> each <U, G> in the dataset, every type-A triple in G is of the form
>>> { <U> ?P ?O }.  The type-B triples are a little more complicated.)  In
>>> this case, the dataset being true would imply the dataset being
>>> segmented in this complicated but useful way.   
>> 
>> With all respect to Cambridge Semantics, if they are the only user of this odd convention, then I really don't think we as a WG should even be considering standardizing it. Unless someone can make a case for why it is going to be generally useful.
>> 
>> And in any case, this sounds like a syntactic restriction rather than a semantic condition. Having the dataset be segmented is not going to alter the interpretations of any of the triples (is it?). So the semantics (and hence the entailments) can ignore this.
> 
> Well, if we're putting this whole concept into the subjectOf predicate,
> isn't that considered part of the semantics of that predicate, rather
> like a range restriction?

Ah, you didn't say that subjectOf was an RDF property. That complicates things rather drastically, since now we have some RDF describing the syntactic properties of some other RDF. You might remember, we started to go there once before, and decided to not even try. (Reification?) I would still prefer to not even try to get this straight. 
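
To make the "syntactic restriction" reading concrete, here is a minimal Python sketch of the type-A check as it was described above, treating a dataset as a plain mapping from label IRI to a set of triples. The names `TYPE_A_PREDICATES` and `is_segmented`, and all the example.org IRIs, are assumptions made up for illustration; nothing here is claimed to reflect how Cambridge Semantics actually does it, and the type-B rule is omitted because it was not spelled out in this thread.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]      # (subject, predicate, object) as plain strings
Dataset = Dict[str, Set[Triple]]   # label IRI U -> graph G

# Hypothetical: which predicates count as "type A" is an assumption.
TYPE_A_PREDICATES = {"http://example.org/name", "http://example.org/age"}


def is_segmented(dataset: Dataset) -> bool:
    """Return True if, for each <U, G>, every type-A triple in G has subject U."""
    for label, graph in dataset.items():
        for subject, predicate, _obj in graph:
            if predicate in TYPE_A_PREDICATES and subject != label:
                return False
    return True


# A dataset that respects the convention, and one that does not.
good = {
    "http://example.org/alice": {
        ("http://example.org/alice", "http://example.org/name", "Alice"),
    }
}
bad = {
    "http://example.org/alice": {
        ("http://example.org/bob", "http://example.org/name", "Bob"),
    }
}
```

The check only looks at where triples sit relative to their label; it never consults an interpretation of any triple, which is the sense in which the condition looks syntactic rather than semantic.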

>>> It's *rather* tempting to just use triples for this, making graphState,
>>> graphStateWas, subjectOf, etc, be predicates.   That way the semantics
>>> of datasets would be much simpler, with the complications bundled into
>>> the semantics of those particular predicates. 

Now that I understand what you were saying, I disagree. It wouldn't be simpler.

>>> 
>>> I guess I'm suggesting extending the definition of dataset to be a
>>> default graph and, rather than a set of pairs <U,G>, a set of triples
>>> <U, R, G>, where R is optional.  If R is omitted, you have the kind of
>>> dataset we're used to now, where we have no idea what that relation is
>>> supposed to be (unless the author tells us humans).
>> 
>> So I should interpret <U, R, G> to mean that the relation R holds between the resource U and the graph G, and U is *never* simply a name of the graph, is that right? That is, we never have the graph simply being the resource identified by the IRI?
> 
> Well, if R is "=", then you do.   But you have to say that, explicitly.

How? RDF doesn't have an equality predicate. 
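
For reference, the proposed structure itself is easy enough to write down. The following is a rough Python sketch of a dataset as a default graph plus a collection of <U, R, G> entries with R optional. The class, field, and method names are assumptions for illustration, not anything taken from SPARQL or TriG, and the example.org IRIs (including the one standing in for graphState) are placeholders.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


@dataclass(frozen=True)
class Entry:
    label: str                       # U, the label IRI
    graph: Graph                     # G
    relation: Optional[str] = None   # R; None means "relation left unspecified"


@dataclass
class Dataset:
    default_graph: Graph = frozenset()
    entries: List[Entry] = field(default_factory=list)

    def entries_with_relation(self, relation: Optional[str]) -> List[Entry]:
        """Select the entries whose R is exactly the given relation (or None)."""
        return [e for e in self.entries if e.relation == relation]


g = frozenset({("http://example.org/a", "http://example.org/b", "http://example.org/c")})
ds = Dataset(
    entries=[
        Entry("http://example.org/g1", g),                                  # R omitted
        Entry("http://example.org/g2", g, "http://example.org/graphState"),  # R given
    ]
)
```

The sketch only records R as an opaque string. It says nothing about what any particular R means, and in particular it offers no way to state that U simply is G, which is exactly the part that would still need truth conditions.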

> 
>>> 
>>>> Can one assert a dataset (ie claim it to be true)? 
>>> 
>>> Yes.
>>> 
>>>> How does one do that? 
>>> 
>>> The same way you do with RDF.  It kind of depends on your application.
>>> Maybe you publish it on the web; maybe you send it to some agent; maybe
>>> you publish it and send the URL somewhere, etc.
>> 
>> And is this in fact done? Do people transmit SPARQL datasets around the Web? What would be a typical transaction involving a dataset? When it is done, what typically happens to the RDF triples in the graphs in the dataset? Do other applications extract them and mash them up with other RDF? Or are they always kept in their dataset 'context'? 
> 
> I don't know if anyone is doing this, and I rather doubt anyone is doing
> it in a standardized manner.   

Why are we even talking about this, then? 

> Well, there's CKAN.   I haven't looked at
> what's dataset-like there.   (VoID uses the word "dataset" to mean
> g-box, so it may be a bit hard to tell.)  *shrug*
> 
> I think this kind of work, as with RDF, may need to be done rather
> speculatively, because the benefits to adopting this technology *before*
> it's a standard are so slight.  

I confess I don't see what the reasons for doing it would be even if we did standardize it. 

>  People exchange datasets inside their
> company and with people they know, but without standards for what it
> means, why would one publish them on the open Web?   

Why would one publish a dataset rather than a graph? What can be said using a dataset that cannot be said using a named graph? 

> 
> But, yeah, honestly, I have no idea.    I do know a lot of people asked
> for a standard for "Named Graphs", and when I ask what they mean, I
> understand them to be saying they want to be able to, in their RDF,
> refer to other bits of RDF. 

That is fine. But that doesn't seem to have anything much to do with datasets. It certainly doesn't *require* datasets. Seems to me that this whole notion of a dataset is something that SPARQL introduced, and we could usefully leave it to SPARQL and just focus on RDF.

Pat

> 
>   -- Sandro
> 
> 
>> Pat
>> 
>> 
>>> 
>>>  -- Sandro
>>> 
>>> 
>>> 
>>> 
>> 
>> ------------------------------------------------------------
>> IHMC                                     (850)434 8903 or (650)494 3973   
>> 40 South Alcaniz St.           (850)202 4416   office
>> Pensacola                            (850)202 4440   fax
>> FL 32502                              (850)291 0667   mobile
>> phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes

Received on Monday, 19 December 2011 07:44:35 UTC