three kinds of dataset

Ive been trying to pull all these threads together. Seems to me that the use cases for quads/datasets fall into three main categories, which demand different semantic approaches if we are going to try to avoid interoperability confusion. (Now I understand Antoine's proposal I see how it manages to be a kind of weakest-possible-blanket-case which allows something like all three of these to kind of work, but I would argue that we can do better, because this fit-all approach doesn't really fit anything quite properly. More below.)

Case 1. Datasets are collections of RDF graphs distinguished from one another by 'labels', used essentially as a bookkeeping device to distinguish one graph from another, to keep entailments from one graph distinguished from those of another, etc..  No actual semantic relationship is assumed to hold between a graph and its label, and each graph is a normal RDF graph to which the 2004 RDF semantics applies. There is no difference in meaning between a labelled graph and the same graph outside the dataset, without the label. No particular meaning is given to the idea of 'asserting' a dataset. 

Case 2. The graph labels in a dataset are presumed to indicate a context of some kind in which the labeled graph is understood to be be true or to hold. To assert the dataset is to assert that each graph holds *in its context* but it may not do so outside the context, so no semantic relationship, eg of entailment, holds between the named graphs in a dataset and any unnamed graphs, even the same graph without its label. (Contexts might include timeperiods, locations, beliefs, sources, "Islands", etc..: anything which is thought of as influencing the truth of something expressed in RDF.) 

Case 3. The graph labels in a dataset are understood to be actual names of the graph they are associated with, ie to formally denote the graph, so that when used in RDF these labels refer to the actual graph. (Or maybe, to some larger graph of which the graph indicated is a part.) The labelled graphs are then essentially being mentioned rather than used, so that the dataset can be asserted without in any way asserting the component named graphs it contains. These named graphs are more like graph literals than a graph in an RDF graph document.  (There is also the idea that the label actually names a graph container whose state is initially the graph shown, and no doubt other variations on this theme are possible; let me lump all these together for now.)

So, take Tim Lebo's example from David's recent email:

> :account_1 {
>     :entity a prov:Entity
> }
> 
> :account_2 {
>     :entity a prov:Activity
> }
> 
> prov:Entity owl:disjointWith prov:Activity .

and presume that we are accepting OWL semantics. Case 1 says: yes, these three are OWL-inconsistent taken together. (Of course it allows that you might not want to take them together, perhaps even that you should not take them together, but as far as what they mean, they are indeed mutually inconsistent.)  Case 2 says: no, these three are OWL-consistent, even taken together, because :entity can be a prov:Entity in one context and something else in a different context, and this is consistent. Case 3 also says they are consistent, but for a different reason: only the last triple is being asserted; the two named graphs don't say *anything* about :entity, only that certain graphs are named ":account_1" and ":account_2". (But if these graphs were to have their content exposed, eg by importing them using their names into a graph containing the third, then there would indeed be a good old 2004 inconsistency in that graph.)

These really need different semantic treatments. 

I maintain that the first case does not need any changes to the 2004 semantics at all, and does not require that datastores be given any special semantics. In fact, it is better if they are not, as any semantic story beyond the 2004 account of graph meanings will be harmful to some appllication or other. Graph names here are purely an organizing and record-keeping device, and can be freely used in any way, and nothing is changed about RDF by any such use. For example, it would be fine to decide that a graph-label association was local to a datastore, on this view. 

The third case is closest to the original Bizer et. al. named graph proposal, and supports the same kind of graphs-as-resources thinking, in which the URI of a graph document is seen as identifying the graph just as URIs identify Web pages and the like. Graph labels here have global scope, and one can treat a graph label as the name of the graph in a very strong sense, use that URi in RDF to refer to the graph (or maybe to the graph container, or maybe to either, etc..: again, let me ignore this complication for the present.) To assert a datastore is a kind of graph baptism: publishing the datatore assigns a global name to the graph, and requires that satisfying interpretations respect this naming. (The semantic conditions are in the original paper, but in essence they are that an interpretation I satisfies
 label {graph} 
just when I(label) = graph.  I'm tempted to say, "duh.")

The second case is the tricky one, because it has the label actually changing the meaning of the triples in the graph. If we are to claim that graphs can be true in one context but not in another, then we have to change the 2004 semantics somehow in order to provide for this context sensitivity. This is where Antoine's approach and mine differ. His proposal allows the *meanings of URIs* to change with context (as well as being the names of the contexts themselves); mine only allows relations to have an extra parameter. The 'context' is then this extra parameter which allows truthvalues of triples to appear to change, by treating them as quads; but the URIs remain having a global meaning.  Antoine's semantics requires adding a context mapping to interpretations, so that every URI defines a potentially different interpretation context for every other URI; mine requires allowing the EXT mapping on RDF properties to admit triples as well as pairs. Neither of them change current RDF graph meanings, but they extend this to datasets differently. Mine is a semantic extension to RDF, while Antoine's is a kind of semantic un-extension: it gives a weaker meaning than the RDF semantics does. 

OK, more later. I just wanted to get these distinctions out into the open. My main point is that these are *different*, and to suggest that we should provide ways to distinguish them. One way, for example, might be to give TriG Antoine's semantics (which does not overly interfere with the first case) and to give N-Quads my semantics, and think of (or choose) another syntax for the third case. (BTW, in my earlier email I suggested the use of the + instead of dot to distinguish the 'contextual' case from the plain RDF triple case. This makes sense in my proposal, but AFAIKS not in Antoine's. it allows case 2 to be mixed with case 1 as two kinds of data in a single dataset. Maybe this much flexibility is overkill, however.)

There are many other issues, like how to distinguish graphs from graph containers; whether we are naming/labeling the graph shown or some other, larger, graph; whether it is good to use RDF to itself describe the pragmatic or semantic alternatives;  and how to combine these various senses if we need to. But I think at least keeping them separate is a useful way to move forward and avoid some of the, er, philosophical disputes. 

Pat

------------------------------------------------------------
IHMC                                     (850)434 8903 or (650)494 3973   
40 South Alcaniz St.           (850)202 4416   office
Pensacola                            (850)202 4440   fax
FL 32502                              (850)291 0667   mobile
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes

Received on Tuesday, 6 March 2012 06:56:29 UTC