Re: RDF dataset semantics again from Ivan Herman on 2012-08-20 (public-rdf-wg@w3.org from August 2012)

From: Ivan Herman <ivan@w3.org>
Date: Mon, 20 Aug 2012 18:45:02 +0200
To: Antoine Zimmermann <antoine.zimmermann@emse.fr>
Cc: RDF WG <public-rdf-wg@w3.org>
Message-Id: <544A4FF1-EF8D-428C-AE62-10EBE31BFAC8@w3.org>
Antoine,

Thanks.

Let me try to separate three issues here.

1. Many semantics 

It is clear that there are various 'semantics' that can be attached to datasets, and they all have their particular difficulties. The operative word here is 'various'...

Because there are many of them, the next question is who chooses among them, and how. The current document does refer to one alternative that was discussed in the working group, namely attaching types to the graph names. I.e., your first example could be achieved by adding

:year1960 rdf:type ex:MergeSemantics
:year2000 rdf:type ex:MergeSemantics
etc.

to the default graph, and defining that semantics along the lines of extending the default graph with a merge of all those graphs and then letting the traditional semantics apply (we did discuss this approach at some point, if you remember).
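
As a rough illustration of that reading, here is a minimal sketch in Python using plain tuples rather than any RDF library. The function name apply_merge_semantics and the dataset layout are assumptions made up for this example, ex:MergeSemantics is only the illustrative class used above, and the blank-node renaming that a real RDF merge requires is left out.

```python
# Sketch of the "typing" approach: named graphs whose name is typed as
# ex:MergeSemantics in the default graph are merged into the default graph.
# Triples are plain (subject, predicate, object) tuples of strings.
# ASSUMPTION: ex:MergeSemantics is a hypothetical class, and blank nodes
# are ignored (a real merge would have to standardise them apart).

RDF_TYPE = "rdf:type"
MERGE_SEMANTICS = "ex:MergeSemantics"


def apply_merge_semantics(default_graph, named_graphs):
    """Return the default graph extended with every named graph whose
    name is declared, in the default graph, to be an ex:MergeSemantics."""
    merged = set(default_graph)
    for name, graph in named_graphs.items():
        if (name, RDF_TYPE, MERGE_SEMANTICS) in default_graph:
            merged |= graph
    return merged


default_graph = {
    (":year1960", RDF_TYPE, MERGE_SEMANTICS),
    (":year2000", RDF_TYPE, MERGE_SEMANTICS),
}
named_graphs = {
    ":year1960": {("ex:MarilynMonroe", RDF_TYPE, "ex:LivingPerson")},
    ":year2000": {("ex:MarilynMonroe", RDF_TYPE, "ex:DeadPerson")},
    ":myth": {("ex:MarilynMonroe", "ex:livesIn", "ex:desertIsland")},
}

extended = apply_merge_semantics(default_graph, named_graphs)
```

The traditional RDF semantics would then be applied to the extended graph; the untyped :myth graph stays out of it.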

I am not 100% sure this typing is the perfect solution, but it is a relatively clear way of doing it. Another option that was discussed was to add Turtle-style declarations to TriG instead

@semantics :year1960 ex:MergeSemantics

but that would be very syntax specific, which is not that good either.

Reading your mail, I saw that you refer (if I understand you well) to using the HTTP return header to indicate the required semantics for the dataset. This may actually be a workable approach, in theory, though I am not sure what would be done with local files. Also, we run into one of the negative aspects of the httpRange-14 story: end users may not necessarily have the knowledge and/or the authorization to set the HTTP return header. That would be a serious obstacle.

(B.t.w., you claim that "SPARQL queries have a way to ask for a particular regime". Is that correct? AFAIK the choice of the entailment regime in SPARQL is out of band; a SPARQL endpoint may publicize different URIs for the different regimes. That barely works for the general case.)

Bottom line: yes, we may have several semantics, but how do we choose between them? Is the typing, though not perfect, good enough for now? 

2. Choice of default

Because it seems to be hard to choose a particular semantics, I personally believe we should have a default one that would be as minimal as possible. I realize that we could go one step further and not define anything at all but, I must admit, I would feel uncomfortable with that. This would require, at the minimum, that at least some *possible* semantics be properly defined and, at the moment, we seem to have difficulties even defining the quoting semantics formally (and we may decide to drop the formalism altogether). I am not optimistic that we could give a comprehensive set of properly defined semantics; at this moment I would be happier defining a very core one, and defining a mechanism whereby communities may define the semantics they would/could use.

3. What is the default

The gut feeling we had (or some of us had) was that the quoting semantics seems to be the simplest one, hence taking it as a basis. We may of course be wrong, but I have the impression that any choice has its downsides. Note that by choosing the quoting semantics your second example would indeed have no consequence at all (because there is no default graph), but I take that as a feature, not a bug: it means that one can adopt more demanding semantic approaches without violating anything.
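
To make the "no consequence" point concrete, here is a small Python sketch of one possible reading of the quoting semantics, namely the one where the graph IRI denotes the graph itself. Everything in it is an assumption made for illustration: the function name quoting_entails, the representation of a dataset as a (default graph, named graphs) pair, the restriction to ground triples, and the approximation of entailment on the default graph by a plain subset test. It is not the formal definition from any WG document.

```python
# Sketch of dataset entailment under a "quoting" reading.
# ASSUMPTIONS: a dataset is (default_graph, named_graphs); graphs are sets of
# (s, p, o) string tuples without blank nodes; the graph IRI denotes the graph
# itself, so a named graph is only "entailed" if it is quoted verbatim.


def quoting_entails(d1, d2):
    """Return True if dataset d1 entails dataset d2 under this reading."""
    default1, named1 = d1
    default2, named2 = d2
    # Default graph: simple entailment on ground triples is a subset test.
    if not default2 <= default1:
        return False
    # Named graphs: the same name must be paired with the very same triples.
    return all(
        name in named1 and named1[name] == graph
        for name, graph in named2.items()
    )


year2012 = {
    ("ex:MarilynMonroe", "rdf:type", "ex:DeceasedPerson"),
    ("ex:DeceasedPerson", "owl:equivalentClass", "ex:DeadPerson"),
}
dataset = (set(), {":year2012": year2012})

# An OWL reasoner working inside :year2012 would derive this triple,
# but a quoted graph is not opened up for inference.
inferred = (set(), {":year2012": {("ex:MarilynMonroe", "rdf:type", "ex:DeadPerson")}})

same_quote = quoting_entails(dataset, dataset)
derived = quoting_entails(dataset, inferred)
```

Under these assumptions same_quote is True and derived is False: nothing is concluded from triples that sit inside a named graph, which leaves room for a more demanding semantics to add conclusions later without contradicting anything.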

Cheers

Ivan


On Aug 20, 2012, at 16:02 , Antoine Zimmermann wrote:

> Dear all,
> 
> 
> ==Post scriptum:==
> Sorry for the long email.
> *In summary:*  I describe 3 different families of dataset semantics, I argue that there are important use cases for each of them, and I'd like all of these semantics to be standardised, with a mechanism to describe which semantics is assumed when exchanging datasets. There are more arguments on this at the end if you want to skip the discussion on the semantics.
> ====End of PS=====
> 
> 
> I come back to the topic of formal semantics for RDF datasets. I can see that there are two issues that are almost orthogonal:
> 
> 1. how the semantics of the triples inside the named graphs work.
> 2. how the graph "names" relate to the graph inside the (name,graph) pairs.
> 
> 
> To discuss this, I'll use the following example (do not worry about the meaning of the classes and properties, I am just trying to make an example that looks a little realistic):
> 
> 
> # == EXAMPLE STARTS HERE ==
> :year1960  dc:date  "1960"^^xsd:gYear;  :endorsed  true .
> :year2000  dc:date  "2000"^^xsd:gYear;  :endorsed  true .
> :year2012  dc:date  "2012"^^xsd:gYear;  :endorsed  true .
> :myth  :endorsed  false .
> 
> :year1960 {
>  ex:MarilynMonroe  a  ex:LivingPerson .
>  ex:LivingPerson  owl:disjointWith  ex:DeadPerson .
> }
> :year2000 {
>  ex:MarilynMonroe  a  ex:DeadPerson .
>  ex:DeadPerson  owl:disjointWith  ex:LivingPerson .
> }
> :year2012 {
>  ex:MarilynMonroe  a  ex:DeceasedPerson .
>  ex:DeceasedPerson  owl:equivalentClass  ex:DeadPerson .
> }
> :myth {
>  ex:MarilynMonroe  ex:livesIn  ex:desertIsland .
>  ex:livesIn  rdfs:domain  ex:LivingPerson .
> }
> # == EXAMPLE ENDS HERE ==
> 
> 
> Wrt item 1 above, there are essentially 3 cases:
> 
> a) The dataset is simply an RDF graph whose triples have been partitioned. An interpretation of that dataset is an interpretation of the graph made of all the triples found in all the named graphs and the default graph. Depending on what is decided about item 2 above, there can be additional semantic constraints wrt what the graph IRIs denote, but there could be no constraint either, so items 1 and 2 are essentially orthogonal issues in this case.
> Applications use the partitioning mechanism as they wish, e.g., for optimisation, for documentation...
> If such is the semantics of datasets, then the example is inconsistent, so it entails all possible datasets.
> 
> 
> b) The dataset is interpreted in the same way as an RDF graph, where the default graph must be true and the <name,graph> pairs are interpreted as assertions that relate the name to the graph itself. The actual relationship is to be determined, but what matters here is the syntax of the graph. It matters that the term ex:DeceasedPerson is used, not that the person denoted by ex:MarilynMonroe is dead.
> It is essentially the "quoting" semantics. The entailments depend on what is the relationship between the graph IRI and the graph, but a typical case is when the graph IRI denotes the graph, in which case, the example does not entail:
> 
> :year2012 {
>  ex:MarilynMonroe  a  ex:DeadPerson .
> }
> 
> neither does it entail:
> 
> :myth {
>  ex:MarilynMonroe  a  ex:LivingPerson .
> }
> 
> In this case, no conclusions are ever drawn from any assertion put inside a named graph.
> 
> 
> c) Each named graph describes a world according to the graph IRI. In the example, the world according to :myth is that ex:MarilynMonroe is living somewhere. What matters is the truth of the assertions rather than the fact that the term "deceased" or "dead" was used.
> So one can draw the conclusion that:
> - *in :year1960*, ex:MarilynMonroe is not a ex:DeadPerson;
> - *in :year2012*, ex:MarilynMonroe is a ex:DeadPerson
> etc.
> In this case, the possibilities for the relationship between the graph IRI and the graph are more limited than in the other case. For instance, if the IRI must be interpreted as the graph itself, then it prevents a lot of inferences.
> 
> 
> 
> I can see use cases for each of these semantics.
> a- If one is managing data that are verified facts, then one would like all of the triples to be true. Yet, one may still have reasons to split the data into different parts, allowing users to query them separately with SPARQL GRAPH keywords.
> b- for a Semweb search engine exchanging the dump of its crawl, it makes sense to have an accurate "quote" of what has been crawled.
> c- for situations involving temporal evolution of facts, integration of variously trusted sources, tracking provenance of inferred knowledge, etc...
> 
> 
> I find it odd that semantics b is retained as the only valid one in the "RDF graph identification" proposal. It's sweeping away several Priority A use cases, and some of the Priority B ones too.
> 
> Also, the condition ∀i: I(ui) = Gi is problematic. At first, it seems to be natural to say that the graph IRI RDF-denotes the graph. But:
> 
> http://www.w3.org/2011/rdf-wg/meeting/2011-04-14#resolution_1
> 
> "RESOLVED: Named Graphs in SPARQL associate IRIs and graphs *but* they do not necessarily "name" graphs in the strict model-theoretic sense. A SPARQL Dataset does not establish graphs as referents of IRIs (relevant to ISSUE-30)".
> 
> I know this resolution is about SPARQL datasets, and it's not necessarily applying to whatever structure we come up with in RDF, but one of the Priority A use cases is to be able to dump a SPARQL store. With this resolution, there is apparently a clash between the use case requirement and the semantic condition.
> 
> 
> My proposal is to define several recommended semantics and allow the concrete syntax to declare in a document what semantics is assumed when exchanging a dataset.
> 
> I find this idea appealing because it is in line with the fact that information carried by HTTP is accompanied by a self description of how it should be understood. For instance, we have MIME types, we have <!DOCTYPE> declarations, etc. Since RDF is not a purely syntactical datastructure, it makes sense that it carries with it a reference to the semantics it uses.
> Such practices of referencing the MIME type, charset, doctype, schema, etc have been a key enabler of interoperability on the Web. Why not extend the pattern to the formal semantics?
> BTW, SPARQL services have a way to tell what inference regime they support, and SPARQL queries have a way to ask for a particular regime. I claim that my proposal is simply in agreement with already accepted notions in the SPARQL world.
> 
> 
> Best,
> -- 
> Antoine Zimmermann
> ISCOD / LSTI - Institut Henri Fayol
> École Nationale Supérieure des Mines de Saint-Étienne
> 158 cours Fauriel
> 42023 Saint-Étienne Cedex 2
> France
> Tél:+33(0)4 77 42 66 03
> Fax:+33(0)4 77 42 66 66
> http://zimmer.aprilfoolsreview.com/
> 


----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Monday, 20 August 2012 16:45:26 UTC