RDF dataset semantics again from Antoine Zimmermann on 2012-08-20 (public-rdf-wg@w3.org from August 2012)

From: Antoine Zimmermann <antoine.zimmermann@emse.fr>
Date: Mon, 20 Aug 2012 16:02:47 +0200
To: RDF WG <public-rdf-wg@w3.org>
Message-ID: <50324387.90906@emse.fr>
Dear all,


==Post scriptum:==
Sorry for the long email.
*In summary:*  I describe three different families of dataset semantics, 
I argue that there are important use cases for each of them, and I would 
like all of them to be standardised, with a mechanism to declare which 
semantics is assumed when exchanging datasets. There are more arguments 
on this at the end if you want to skip the discussion of the semantics.
====End of PS=====


I come back to the topic of formal semantics for RDF datasets. I can see 
that there are two issues that are almost orthogonal:

  1. how the semantics of the triples inside the named graphs work.
  2. how the graph "names" relate to the graph inside the (name,graph) 
pairs.


To discuss this, I'll use the following example (do not worry about the 
meaning of the classes and properties; I am just trying to make an 
example that looks a little realistic):


# == EXAMPLE STARTS HERE ==
:year1960  dc:date  "1960"^^xsd:gYear;  :endorsed  true .
:year2000  dc:date  "2000"^^xsd:gYear;  :endorsed  true .
:year2012  dc:date  "2012"^^xsd:gYear;  :endorsed  true .
:myth  :endorsed  false .

:year1960 {
   ex:MarilynMonroe  a  ex:LivingPerson .
   ex:LivingPerson  owl:disjointWith  ex:DeadPerson .
}
:year2000 {
   ex:MarilynMonroe  a  ex:DeadPerson .
   ex:DeadPerson  owl:disjointWith  ex:LivingPerson .
}
:year2012 {
   ex:MarilynMonroe  a  ex:DeceasedPerson .
   ex:DeceasedPerson  owl:equivalentClass  ex:DeadPerson .
}
:myth {
   ex:MarilynMonroe  ex:livesIn  ex:desertIsland .
   ex:livesIn  rdfs:domain  ex:LivingPerson .
}
# == EXAMPLE ENDS HERE ==


Wrt item 1 above, there are essentially 3 cases:

  a) The dataset is simply an RDF graph whose triples have been 
partitioned. An interpretation of that dataset is an 
interpretation of the graph made of all the triples found in all the 
named graphs and the default graph. Depending on what is decided about 
item 2 above, there can be additional semantic constraints on what the 
graph IRIs denote, but there could be no constraints either, so items 1 
and 2 are essentially orthogonal issues in this case.
Applications use the partitioning mechanism as they wish, e.g., for 
optimisation, for documentation...
If such is the semantics of datasets, then the example is inconsistent 
(the merged graph types ex:MarilynMonroe with two classes that are 
declared disjoint), so it entails all possible datasets.
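
A minimal sketch of case a, assuming triples are modelled as plain Python 
tuples, that the example contains no blank nodes, and that the disjointness 
property intended in the example is owl:disjointWith. The function names 
(merge, disjointness_clashes) are hypothetical and only illustrate why the 
merged example is inconsistent; this is not a full OWL consistency check.

```python
# Case a: the dataset is read as one merged RDF graph.
TYPE = "rdf:type"
DISJOINT = "owl:disjointWith"

# Two of the named graphs from the example, as sets of (s, p, o) tuples.
dataset = {
    ":year1960": {
        ("ex:MarilynMonroe", TYPE, "ex:LivingPerson"),
        ("ex:LivingPerson", DISJOINT, "ex:DeadPerson"),
    },
    ":year2000": {
        ("ex:MarilynMonroe", TYPE, "ex:DeadPerson"),
        ("ex:DeadPerson", DISJOINT, "ex:LivingPerson"),
    },
}


def merge(named_graphs, default_graph=frozenset()):
    """Union of the default graph and every named graph (no blank nodes)."""
    merged = set(default_graph)
    for triples in named_graphs.values():
        merged |= triples
    return merged


def disjointness_clashes(graph):
    """Return (individual, class1, class2) for each violated disjointness."""
    types = {(s, o) for (s, p, o) in graph if p == TYPE}
    clashes = set()
    for (c1, p, c2) in graph:
        if p == DISJOINT:
            for (individual, cls) in types:
                if cls == c1 and (individual, c2) in types:
                    clashes.add((individual, c1, c2))
    return clashes


merged = merge(dataset)
clashes = disjointness_clashes(merged)
```

Each named graph taken alone yields no clash; the clash only appears once 
the graphs are merged, which is what makes the whole dataset inconsistent 
under this reading.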


  b) The dataset is interpreted in the same way as an RDF graph, where 
the default graph must be true and the <name,graph> pairs are 
interpreted as assertions that relate the name to the graph itself. The 
actual relationship is to be determined, but what matters here is the 
syntax of the graph. It matters that the term ex:DeceasedPerson is used, 
not that the person denoted by ex:MarilynMonroe is dead.
This is essentially the "quoting" semantics. The entailments depend on 
what the relationship between the graph IRI and the graph is, but a 
typical case is when the graph IRI denotes the graph, in which case the 
example does not entail:

:year2012 {
   ex:MarilynMonroe  a  ex:DeadPerson .
}

nor does it entail:

:myth {
   ex:MarilynMonroe  a  ex:LivingPerson .
}

In this case, no conclusions are ever drawn from any assertion put inside 
a named graph.
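
A minimal sketch of case b, under the assumption that the graph IRI denotes 
exactly the graph it is paired with, so that a conclusion about a named 
graph only follows when the same set of triples is literally present. The 
function name quoting_entails is hypothetical, blank nodes and the default 
graph are ignored, and a looser IRI-to-graph relationship would call for a 
different test.

```python
# Case b: named graphs are treated as quoted syntax, never reasoned over.
TYPE = "rdf:type"

dataset = {
    ":year2012": frozenset({
        ("ex:MarilynMonroe", TYPE, "ex:DeceasedPerson"),
        ("ex:DeceasedPerson", "owl:equivalentClass", "ex:DeadPerson"),
    }),
    ":myth": frozenset({
        ("ex:MarilynMonroe", "ex:livesIn", "ex:desertIsland"),
        ("ex:livesIn", "rdfs:domain", "ex:LivingPerson"),
    }),
}


def quoting_entails(premise, conclusion):
    """True iff every (name, graph) pair of the conclusion appears verbatim
    in the premise. No inference is applied inside the graphs."""
    return all(
        name in premise and premise[name] == graph
        for name, graph in conclusion.items()
    )


# The two non-entailments mentioned in the text.
dead_in_2012 = {
    ":year2012": frozenset({("ex:MarilynMonroe", TYPE, "ex:DeadPerson")}),
}
living_in_myth = {
    ":myth": frozenset({("ex:MarilynMonroe", TYPE, "ex:LivingPerson")}),
}

entails_dead_in_2012 = quoting_entails(dataset, dead_in_2012)      # False
entails_living_in_myth = quoting_entails(dataset, living_in_myth)  # False
```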


  c) Each named graph describes a world according to the graph IRI. In 
the example, the world according to :myth is one where ex:MarilynMonroe 
is living somewhere. What matters is the truth of the assertions rather 
than the fact that the term "deceased" or "dead" was used.
So one can draw the conclusion that:
  - *in :year1960*, ex:MarilynMonroe is not a ex:DeadPerson;
  - *in :year2012*, ex:MarilynMonroe is a ex:DeadPerson
etc.
In this case, the possibilities for the relationship between the graph 
IRI and the graph are more limited than in case b. For instance, if the 
IRI must be interpreted as the graph itself, then a lot of inferences 
are prevented.
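
A minimal sketch of case c, assuming each named graph is closed under 
inference separately and that conclusions never cross graph boundaries. 
Only two rules are implemented (rdfs:domain and type propagation across 
owl:equivalentClass), so negative conclusions such as "not an 
ex:DeadPerson in :year1960" are outside its scope. The function name 
close is hypothetical.

```python
# Case c: each named graph is reasoned over in its own context.
TYPE = "rdf:type"
DOMAIN = "rdfs:domain"
EQUIV = "owl:equivalentClass"

dataset = {
    ":year2012": {
        ("ex:MarilynMonroe", TYPE, "ex:DeceasedPerson"),
        ("ex:DeceasedPerson", EQUIV, "ex:DeadPerson"),
    },
    ":myth": {
        ("ex:MarilynMonroe", "ex:livesIn", "ex:desertIsland"),
        ("ex:livesIn", DOMAIN, "ex:LivingPerson"),
    },
}


def close(graph):
    """Fixpoint of two rules over a single graph."""
    closed = set(graph)
    while True:
        new = set()
        for (s, p, o) in closed:
            if p == DOMAIN:
                # (x s y) and (s rdfs:domain o) give (x rdf:type o)
                new |= {(x, TYPE, o) for (x, q, _y) in closed if q == s}
            elif p == EQUIV:
                # Instances of either equivalent class belong to the other.
                new |= {(x, TYPE, o) for (x, q, c) in closed
                        if q == TYPE and c == s}
                new |= {(x, TYPE, s) for (x, q, c) in closed
                        if q == TYPE and c == o}
        if new <= closed:
            return closed
        closed |= new


# One closure per graph IRI; nothing is shared between contexts.
closures = {name: close(graph) for name, graph in dataset.items()}

dead_in_2012 = ("ex:MarilynMonroe", TYPE, "ex:DeadPerson") in closures[":year2012"]
living_in_myth = ("ex:MarilynMonroe", TYPE, "ex:LivingPerson") in closures[":myth"]
```

Under this reading both checks succeed, in contrast with the quoting 
semantics, while nothing derived in :year2012 leaks into :myth.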



I can see use cases for each of these semantics.
  a- If one is managing data that are verified facts, then one would 
like all of the triples to be true. Yet, one may still have reasons to 
split the data into different parts, allowing users to query them 
separately with the SPARQL GRAPH keyword.
  b- For a Semweb search engine exchanging the dump of its crawl, it 
makes sense to have an accurate "quote" of what has been crawled.
  c- For situations involving the temporal evolution of facts, the 
integration of variously trusted sources, tracking the provenance of 
inferred knowledge, etc.


I find it odd that semantics b is retained as the only valid one in the 
"RDF graph identification" proposal. It sweeps away several Priority 
A use cases, and some of the Priority B ones too.

Also, the condition ∀i: I(ui) = Gi is problematic. At first, it seems 
natural to say that the graph IRI RDF-denotes the graph. But:

http://www.w3.org/2011/rdf-wg/meeting/2011-04-14#resolution_1

"RESOLVED: Named Graphs in SPARQL associate IRIs and graphs *but* they 
do not necessarily "name" graphs in the strict model-theoretic sense. A 
SPARQL Dataset does not establish graphs as referents of IRIs (relevant 
to ISSUE-30)".

I know this resolution is about SPARQL datasets, and it does not 
necessarily apply to whatever structure we come up with in RDF, but 
one of the Priority A use cases is to be able to dump a SPARQL store. 
With this resolution, there is apparently a clash between the use case 
requirement and the semantic condition.


My proposal is to define several recommended semantics and allow the 
concrete syntax to declare in a document what semantics is assumed when 
exchanging a dataset.
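
A minimal sketch of how a consumer might read such a declaration, assuming 
it is carried as a triple in the default graph. The property 
ex:datasetSemantics and the three semantics identifiers are hypothetical 
placeholders, not terms defined by any specification or proposal.

```python
# Hypothetical vocabulary for declaring the assumed dataset semantics.
SEMANTICS_PROPERTY = "ex:datasetSemantics"
MERGED = "ex:MergedGraphSemantics"     # case a
QUOTING = "ex:QuotingSemantics"        # case b
CONTEXTUAL = "ex:ContextualSemantics"  # case c
KNOWN_SEMANTICS = {MERGED, QUOTING, CONTEXTUAL}


def declared_semantics(default_graph):
    """Return the single semantics declared in the default graph."""
    declared = {o for (_s, p, o) in default_graph if p == SEMANTICS_PROPERTY}
    if len(declared) != 1:
        raise ValueError(
            "expected exactly one declaration, found %d" % len(declared)
        )
    (semantics,) = declared
    if semantics not in KNOWN_SEMANTICS:
        raise ValueError("unknown dataset semantics: %s" % semantics)
    return semantics


# A default graph that declares the quoting semantics for its document.
default_graph = {
    ("<>", SEMANTICS_PROPERTY, QUOTING),
    (":myth", ":endorsed", "false"),
}

chosen = declared_semantics(default_graph)
```

A consumer would then pick its entailment procedure based on the returned 
value and refuse datasets whose declaration is missing or ambiguous.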

I find this idea appealing because it is in line with the fact that 
information carried by HTTP is accompanied by a self description of how 
it should be understood. For instance, we have MIME types, we have 
<!DOCTYPE> declarations, etc. Since RDF is not a purely syntactic 
data structure, it makes sense that it carries with it a reference to the 
semantics it uses.
The practice of referencing the MIME type, charset, doctype, schema, 
etc. has been a key enabler of interoperability on the Web. Why not 
extend the pattern to the formal semantics?
BTW, SPARQL services have a way to tell what inference regime they 
support, and SPARQL queries have a way to ask for a particular regime. I 
claim that my proposal is simply in agreement with already accepted 
notions in the SPARQL world.


Best,
-- 
Antoine Zimmermann
ISCOD / LSTI - Institut Henri Fayol
École Nationale Supérieure des Mines de Saint-Étienne
158 cours Fauriel
42023 Saint-Étienne Cedex 2
France
Tél:+33(0)4 77 42 66 03
Fax:+33(0)4 77 42 66 66
http://zimmer.aprilfoolsreview.com/
Received on Monday, 20 August 2012 14:03:15 UTC