Re: three kinds of dataset from Charles Greer on 2012-03-06 (public-rdf-wg@w3.org from March 2012)

From: Charles Greer <cgreer@marklogic.com>
Date: Tue, 06 Mar 2012 08:34:20 -0800
To: Pat Hayes <phayes@ihmc.us>
CC: RDF-WG WG <public-rdf-wg@w3.org>
Message-ID: <4F563C8C.5000700@marklogic.com>
Thank you for taking the time to clarify and restate the positions.  I 
think you've done it well and the discussion of semantics for RDF graphs 
has been really interesting to me.  Your proposal of property extension 
is appealing...

I remain, I hope humbly, a proponent of keeping graph semantics out of 
RDF.  Why?  Two reasons.

First, to go back to something Guus said, the scope of this current WG 
mandate is limited, and digging into the semantics of quads is a big 
deal.  It might be prohibitively expensive from an administrative point 
of view.

Secondly, and to me most importantly, there's great value in the way 
that RDF is designed right now, and while it may be important to know 
whether or not a particular graph is consistent or has a context, it's 
crucial that RDF be able to express consistent graphs, inconsistent 
graphs, disjoint graphs or whatever other kind of node/edge/node you can 
think of.  It is this ability of RDF that is its greatest strength - we 
can fill a dataset with values and decide later on whether to close the 
world, assert consistency, or even treat a dataset as a bunch of 
disparate graphs.

Cases 2 and 3 look like expectations placed atop case 1.  That is, a 
set of triples already exists before we can add the semantics to 
invalidate particular triples in this context.  Even to ask 'is this a 
consistent set of triples' presupposes that some standard supports the 
ability to group semantically inconsistent triples.
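
As a rough sketch of that layering, here are the same stored triples from
Tim Lebo's example (quoted below) read three ways.  The function name, the
string encoding of triples, and the tiny disjointness check are illustrative
assumptions, not real OWL tooling; for case 2 I also assume the unlabelled
disjointness axiom holds in every context.

```python
RDF_TYPE = "rdf:type"
DISJOINT = "owl:disjointWith"

def inconsistent(triples):
    """Toy check: is some subject typed with two classes declared disjoint?"""
    disjoint = {(s, o) for s, p, o in triples if p == DISJOINT}
    disjoint |= {(o, s) for s, o in disjoint}  # disjointness is symmetric
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)
    return any((a, b) in disjoint
               for classes in types.values()
               for a in classes for b in classes)

# The data layer (case 1): plain sets of triples, grouped by label.
default_graph = {("prov:Entity", DISJOINT, "prov:Activity")}
named = {
    ":account_1": {(":entity", RDF_TYPE, "prov:Entity")},
    ":account_2": {(":entity", RDF_TYPE, "prov:Activity")},
}

# Case 1: labels are bookkeeping, so taking everything together is a union.
case1 = inconsistent(default_graph.union(*named.values()))

# Case 2: each labelled graph is only evaluated inside its own context.
case2 = any(inconsistent(default_graph | g) for g in named.values())

# Case 3: named graphs are mentioned, not asserted; only the rest is checked.
case3 = inconsistent(default_graph)

print(case1, case2, case3)
```

In each case the stored triples are identical; only the question asked of
them afterwards changes.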

The most valuable thing about RDF datasets in my experience has been 
that you can defer validation and schema until after the data exists, 
letting practitioners separate the management of data from the 
management of knowledge.  I think this amounts to a mandate to decouple 
the data layer from anything that smacks of schema.  Then again, now that 
I think about it, mistaken cardinality can really hurt.
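
A minimal sketch of what deferring that check might look like: the data is
loaded with no schema at all, and a cardinality rule is applied only later.
The ex: names, the rule table and the helper function are all hypothetical.

```python
from collections import defaultdict

def cardinality_violations(triples, max_card):
    """Return {(subject, property): count} where count exceeds the allowed maximum."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p in max_card:
            values[(s, p)].add(o)
    return {(s, p): len(objs)
            for (s, p), objs in values.items()
            if len(objs) > max_card[p]}

# Step 1: the data exists first, with no schema attached.
data = {
    ("ex:book1", "ex:isbn", "111"),
    ("ex:book1", "ex:isbn", "222"),   # a second, conflicting value
    ("ex:book2", "ex:isbn", "333"),
    ("ex:book1", "ex:title", "A"),
}

# Step 2: later, someone decides each book should have at most one ISBN.
rules = {"ex:isbn": 1}

violations = cardinality_violations(data, rules)
print(violations)
```

The mistaken value is still stored and queryable; it only shows up as a
problem once the rule is applied.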

I have no intention of dismissing the semantic work here, and the idea 
of extending the semantics of properties is very compelling.  It does, 
however, look big.  And when I see OWL properties used to illustrate the use 
cases, it seems to me that we're reaching too far.  I hope to be 
illuminated further,

Charles



On 03/05/2012 10:55 PM, Pat Hayes wrote:
> I've been trying to pull all these threads together. Seems to me that the use cases for quads/datasets fall into three main categories, which demand different semantic approaches if we are going to try to avoid interoperability confusion. (Now that I understand Antoine's proposal, I see how it manages to be a kind of weakest-possible-blanket-case which allows something like all three of these to kind of work, but I would argue that we can do better, because this fit-all approach doesn't really fit anything quite properly. More below.)
>
> Case 1. Datasets are collections of RDF graphs distinguished from one another by 'labels', used essentially as a bookkeeping device to distinguish one graph from another, to keep entailments from one graph distinguished from those of another, etc.  No actual semantic relationship is assumed to hold between a graph and its label, and each graph is a normal RDF graph to which the 2004 RDF semantics applies. There is no difference in meaning between a labelled graph and the same graph outside the dataset, without the label. No particular meaning is given to the idea of 'asserting' a dataset.
>
> Case 2. The graph labels in a dataset are presumed to indicate a context of some kind in which the labeled graph is understood to be true or to hold. To assert the dataset is to assert that each graph holds *in its context* but it may not do so outside the context, so no semantic relationship, eg of entailment, holds between the named graphs in a dataset and any unnamed graphs, even the same graph without its label. (Contexts might include time periods, locations, beliefs, sources, "Islands", etc.: anything which is thought of as influencing the truth of something expressed in RDF.)
>
> Case 3. The graph labels in a dataset are understood to be actual names of the graph they are associated with, ie to formally denote the graph, so that when used in RDF these labels refer to the actual graph. (Or maybe, to some larger graph of which the graph indicated is a part.) The labelled graphs are then essentially being mentioned rather than used, so that the dataset can be asserted without in any way asserting the component named graphs it contains. These named graphs are more like graph literals than a graph in an RDF graph document.  (There is also the idea that the label actually names a graph container whose state is initially the graph shown, and no doubt other variations on this theme are possible; let me lump all these together for now.)
>
> So, take Tim Lebo's example from David's recent email:
>
>> :account_1 {
>>      :entity a prov:Entity
>> }
>>
>> :account_2 {
>>      :entity a prov:Activity
>> }
>>
>> prov:Entity owl:disjointWith prov:Activity .
> and presume that we are accepting OWL semantics. Case 1 says: yes, these three are OWL-inconsistent taken together. (Of course it allows that you might not want to take them together, perhaps even that you should not take them together, but as far as what they mean, they are indeed mutually inconsistent.)  Case 2 says: no, these three are OWL-consistent, even taken together, because :entity can be a prov:Entity in one context and something else in a different context, and this is consistent. Case 3 also says they are consistent, but for a different reason: only the last triple is being asserted; the two named graphs don't say *anything* about :entity, only that certain graphs are named ":account_1" and ":account_2". (But if these graphs were to have their content exposed, eg by importing them using their names into a graph containing the third, then there would indeed be a good old 2004 inconsistency in that graph.)
>
> These really need different semantic treatments.
>
> I maintain that the first case does not need any changes to the 2004 semantics at all, and does not require that datastores be given any special semantics. In fact, it is better if they are not, as any semantic story beyond the 2004 account of graph meanings will be harmful to some application or other. Graph names here are purely an organizing and record-keeping device, and can be freely used in any way, and nothing is changed about RDF by any such use. For example, it would be fine to decide that a graph-label association was local to a datastore, on this view.
>
> The third case is closest to the original Bizer et al. named graph proposal, and supports the same kind of graphs-as-resources thinking, in which the URI of a graph document is seen as identifying the graph just as URIs identify Web pages and the like. Graph labels here have global scope, and one can treat a graph label as the name of the graph in a very strong sense, use that URI in RDF to refer to the graph (or maybe to the graph container, or maybe to either, etc.: again, let me ignore this complication for the present.) To assert a datastore is a kind of graph baptism: publishing the datastore assigns a global name to the graph, and requires that satisfying interpretations respect this naming. (The semantic conditions are in the original paper, but in essence they are that an interpretation I satisfies
>   label {graph}
> just when I(label) = graph.  I'm tempted to say, "duh.")
>
> The second case is the tricky one, because it has the label actually changing the meaning of the triples in the graph. If we are to claim that graphs can be true in one context but not in another, then we have to change the 2004 semantics somehow in order to provide for this context sensitivity. This is where Antoine's approach and mine differ. His proposal allows the *meanings of URIs* to change with context (as well as being the names of the contexts themselves); mine only allows relations to have an extra parameter. The 'context' is then this extra parameter which allows truth values of triples to appear to change, by treating them as quads; but the URIs remain having a global meaning.  Antoine's semantics requires adding a context mapping to interpretations, so that every URI defines a potentially different interpretation context for every other URI; mine requires allowing the EXT mapping on RDF properties to admit triples as well as pairs. Neither of them changes current RDF graph meanings, but they extend this to datasets differently. Mine is a semantic extension to RDF, while Antoine's is a kind of semantic un-extension: it gives a weaker meaning than the RDF semantics does.
>
> OK, more later. I just wanted to get these distinctions out into the open. My main point is that these are *different*, and to suggest that we should provide ways to distinguish them. One way, for example, might be to give TriG Antoine's semantics (which does not overly interfere with the first case) and to give N-Quads my semantics, and think of (or choose) another syntax for the third case. (BTW, in my earlier email I suggested the use of the + instead of dot to distinguish the 'contextual' case from the plain RDF triple case. This makes sense in my proposal, but AFAIKS not in Antoine's. It allows case 2 to be mixed with case 1 as two kinds of data in a single dataset. Maybe this much flexibility is overkill, however.)
>
> There are many other issues, like how to distinguish graphs from graph containers; whether we are naming/labeling the graph shown or some other, larger, graph; whether it is good to use RDF to itself describe the pragmatic or semantic alternatives;  and how to combine these various senses if we need to. But I think at least keeping them separate is a useful way to move forward and avoid some of the, er, philosophical disputes.
>
> Pat
>
> ------------------------------------------------------------
> IHMC                                     (850)434 8903 or (650)494 3973
> 40 South Alcaniz St.           (850)202 4416   office
> Pensacola                            (850)202 4440   fax
> FL 32502                              (850)291 0667   mobile
> phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
>


-- 
Charles Greer
Senior Engineer
MarkLogic Corporation
charles.greer@marklogic.com
Phone: +1 707 408 3277
www.marklogic.com

Received on Tuesday, 6 March 2012 16:34:53 UTC