Re: RDF dataset semantics again from Ivan Herman on 2012-08-21 (public-rdf-wg@w3.org from August 2012)

From: Ivan Herman <ivan@w3.org>
Date: Tue, 21 Aug 2012 11:39:24 +0200
To: Antoine Zimmermann <antoine.zimmermann@emse.fr>
Cc: RDF WG <public-rdf-wg@w3.org>
Message-Id: <E99B59F4-5DC8-442D-9885-4459945145B1@w3.org>
On Aug 21, 2012, at 11:25 , Antoine Zimmermann wrote:

> Ivan,
> 
> 
> Le 20/08/2012 18:45, Ivan Herman a écrit :
>> Antoine,
>> 
>> Thanks.
>> 
>> I try to separate three issues here.
>> 
>> 1. Many semantics
>> 
>> It is clear that there are various 'semantics' that can be attached
>> to datasets, and they all have their particular difficulties. The
>> operative word here is 'various'...
>> 
>> Because there are many of those, the next question is who chooses
>> among those and how.
> 
> Are there really *many* of those?

Well... wasn't it you (or was it Pierre-Antoine?) who came up, on some wiki page, with around five? 

> 
> 
>> The current document does refer to one
>> alternative that was discussed in the working group, namely to attach
>> types to the graph names. Ie, your first example could be achieved by
>> adding
>> 
>> :year1960 rdf:type ex:MergeSemantics .
>> :year2000 rdf:type ex:MergeSemantics etc.
> 
> This is possible but it assumes that there is a strong connection between what the graph IRI denotes and the graph itself, at least at the place (context) where these triples are asserted.
> 
> In any case, my suggestion is indeed to have a syntactic indicator (like, e.g., ex:MergeSemantics) of what the semantics is. However, I'm not sure we have use cases that would mix several semantics in the same dataset (that is, I'd rather have the semantics fixed on a per-dataset basis than on a per-named-graph basis).
> 
> 

Yes, that is a good point. It would rather point in the direction of a file-level directive. I think what we have to decide first is that we should have some indication somewhere, and worry about the syntactic details later.

The problem with a directive is that it cannot be expressed in RDF. Ie, it becomes syntax dependent and that is a major drawback in my view.


>> to the default graph, defining that semantics along the lines of
>> extending the default graph by a merge of all graphs and let the
>> traditional semantics go (we did discuss this approach at some point,
>> if you remember).
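
A minimal sketch of the typing approach described above, under stated assumptions: a dataset is modelled as a set of (subject, predicate, object) string triples for the default graph plus a dict of named graphs, there are no blank nodes (so a merge is a plain set union), and `effective_default_graph` is a hypothetical helper name. `ex:MergeSemantics` is the illustrative class from the example, not a defined vocabulary term.

```python
RDF_TYPE = "rdf:type"
MERGE_SEMANTICS = "ex:MergeSemantics"  # illustrative class IRI from the thread, not a standard term


def effective_default_graph(default_graph, named_graphs):
    """Extend the default graph with every named graph whose name is typed ex:MergeSemantics."""
    merged_names = {
        s for s, p, o in default_graph
        if p == RDF_TYPE and o == MERGE_SEMANTICS
    }
    result = set(default_graph)
    for name in sorted(merged_names):
        # A name typed in the default graph but absent from the dataset contributes nothing.
        result |= named_graphs.get(name, set())
    return result


default_graph = {
    (":year1960", RDF_TYPE, MERGE_SEMANTICS),
    (":year2000", RDF_TYPE, MERGE_SEMANTICS),
}
named_graphs = {
    ":year1960": {("ex:MarilynMonroe", RDF_TYPE, "ex:LivingPerson")},
    ":year2000": {("ex:MarilynMonroe", RDF_TYPE, "ex:DeadPerson")},
    ":myth": {("ex:MarilynMonroe", "ex:livesIn", "ex:desertIsland")},
}
extended = effective_default_graph(default_graph, named_graphs)
```

Here only the two typed graphs flow into the default graph; the untyped :myth graph stays quoted. Because the indicator is itself a set of triples, it survives any serialization, which is the property a syntax-level directive would lack.
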
>> 
>> I am not 100% sure this typing is perfect and good, but it is a
>> relatively clear way of doing this. Other options that were discussed
>> were to add turtle-style declarations to TriG instead
>> 
>> @semantics :year1960 ex:MergeSemantics
>> 
>> but that would be very syntax specific, which is not that good
>> either.
>> 
>> Reading your mail I saw you refer (if I understand well what you say)
>> to the usage of the HTTP return header to indicate the required
>> semantics for the dataset. This may actually be a working approach,
>> in theory, though I am not sure what would be done with local files.
>> Also, we run into one of the negative aspects of the HTTPRange14 story:
>> end-users may not necessarily have the knowledge and/or the
>> authorization to set the HTTP return header. That would be a serious
>> obstacle.
> 
> I was referring to MIME type declaration as an analogy, along with other ways of declaring how the content of resources should be understood on the Web. Taking this for granted, I was arguing that it's only natural to provide the means to declare how to understand a dataset. This should not be put in the HTTP header.
> 

O.k. So if it is not the HTTP header then we fall back to the previous mechanisms...

> 
>> (B.t.w., you claim that "SPARQL queries have a way to ask for a
>> particular regime". Is that correct? AFAIK the choice of the
>> entailment regime in SPARQL is out of band; a SPARQL endpoint may
>> publicize different URI-s for the different regimes. That barely
>> works for the general case.)
> 
> Yes, sorry. It's the SPARQL service that describes what regime it uses for each of its endpoints. Then the users are required to use the appropriate endpoint if they want a particular regime (and if it's available).
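
To make the out-of-band selection concrete, here is a rough sketch, assuming service descriptions have already been flattened to string triples. The `sd:` property names and the entailment-regime IRIs come from the SPARQL 1.1 Service Description vocabulary; the endpoint URLs, the `_:plain` / `_:rdfs` labels, and the helper `endpoints_for` are made up for illustration.

```python
SD = "http://www.w3.org/ns/sparql-service-description#"
ENT = "http://www.w3.org/ns/entailment/"

# Hypothetical service descriptions; the endpoint URLs are invented for this sketch.
DESCRIPTIONS = {
    ("_:plain", SD + "endpoint", "http://example.org/sparql"),
    ("_:plain", SD + "defaultEntailmentRegime", ENT + "Simple"),
    ("_:rdfs", SD + "endpoint", "http://example.org/sparql-rdfs"),
    ("_:rdfs", SD + "defaultEntailmentRegime", ENT + "RDFS"),
}


def endpoints_for(regime, triples):
    """Return the endpoint URLs of the services that advertise the given default regime."""
    services = {
        s for s, p, o in triples
        if p == SD + "defaultEntailmentRegime" and o == regime
    }
    return sorted(
        o for s, p, o in triples
        if p == SD + "endpoint" and s in services
    )


rdfs_endpoints = endpoints_for(ENT + "RDFS", DESCRIPTIONS)
owl_endpoints = endpoints_for(ENT + "OWL-Direct", DESCRIPTIONS)
```

The query itself carries no regime; the client picks the endpoint whose description matches, and gets nothing if no service advertises the regime it wants.
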
> 
> 
>> Bottom line: yes, we may have several semantics but how do we choose
>> between them? Is the typing, though not perfect, good enough for
>> now?
> 
> Like SPARQL service description, it's the one who provides the data/service that tells the world what semantics it assumes. But just as one can take a SPARQL service description from the Web, modify it and republish it somewhere with false information about entailment regimes, one can also, in principle, change the indicator for the semantics.
> 
> 
>> 2. Choice of default
>> 
>> Because it seems to be hard to choose a particular semantics, I
>> personally believe we should have a default one that would be as
>> minimal as possible. I realize that we could go one step further and
>> not define anything at all but, I must admit, I would feel
>> uncomfortable with that. This would require, at the minimum, that at
>> least some *possible* semantics were properly defined and, at the
>> moment, we seem to have difficulties even to define the quoting
>> semantics formally (and we may decide to drop the formalism
>> altogether). I am not optimistic that we could give a comprehensive
>> set of properly defined semantics; at this moment I would be happier
>> defining a very core one, and defining a mechanism whereby communities
>> may define the semantics they would/could use.
>>
>> 
>> 3. What is the default
>> 
>> The gut feeling we had (or some of us had) was that the quoting
>> semantics seems to be the simplest one hence taking that as a basis.
>> We may of course be wrong, but I have the impression that any choice
>> has its down sides. Note that by choosing the quoting semantics your
>> second example would indeed have no consequence at all (because there
>> is no default graph) but I take that as a feature not a bug: it means
>> that one can adopt more demanding semantic approaches without
>> violating anything.
> 
> Yes, there should be a default. The default should be the least constraining (the one that entails the least). In that case, it could be the "quoting semantics", provided that the meaning of the graph IRI is undetermined. All the other semantics are more constraining, leading to more entailments, so they could be made as extensions of the "quoting semantics" indeed.
> 
> 


Antoine, I have the impression that we are actually in agreement. The document we have put forward has two essential points:

- we would have a default semantics in the form of the quoting semantics (and whether the mathematical formalism is the right one or not, and whether we need a mathematical formalism at all, become secondary issues)
- we need some extension points, ie, a mechanism whereby the author/owner of the data can convey the information if another semantics is to be used. At this moment I do not see _any_ such mechanism that would be perfect; the typing approach that is in the document is suboptimal, but maybe suboptimal is good enough.

And I have the feeling that we actually agree on these.

Where we may have a disagreement is whether this WG should formally define other semantics beyond the quoting one. I am not against it, actually, but I would prefer _not_ to take a decision on that but, rather, finalize the rest, then look at the calendar and the available manpower, and make a decision at that point only. (And, I do not hide that, currently I do not believe this WG will have the time and energy to do it. But I would be very pleased if proven wrong.)

Cheers

Ivan



> -AZ
> 
>> 
>> Cheers
>> 
>> Ivan
>> 
>> 
>> On Aug 20, 2012, at 16:02 , Antoine Zimmermann wrote:
>> 
>>> Dear all,
>>> 
>>> 
>>> ==Post scriptum:== Sorry for the long email. *In summary:*  I
>>> describe 3 different families of datasets semantics, I argue that
>>> there are important use cases for each of them, I'd like that all
>>> semantics are standardised with a mechanism to describe what
>>> semantics is assumed when exchanging datasets. There are more
>>> arguments on this at the end if you want to skip the discussion on
>>> the semantics. ====End of PS=====
>>> 
>>> 
>>> I come back to the topic of formal semantics for RDF datasets. I
>>> can see that there are two issues that are almost orthogonal:
>>> 
>>> 1. how the semantics of the triples inside the named graphs work.
>>> 2. how the graph "names" relate to the graph inside the
>>> (name,graph) pairs.
>>> 
>>> 
>>> To discuss this, I'll use the following example (do not bother about the
>>> meaning of the classes and properties, I just try to make an
>>> example that looks a little realistic):
>>> 
>>> 
>>> # == EXAMPLE STARTS HERE ==
>>> :year1960  dc:date  "1960"^^xsd:gYear;  :endorsed  true .
>>> :year2000  dc:date  "2000"^^xsd:gYear;  :endorsed  true .
>>> :year2012  dc:date  "2012"^^xsd:gYear;  :endorsed  true .
>>> :myth  :endorsed  false .
>>>
>>> :year1960 {
>>>   ex:MarilynMonroe  a  ex:LivingPerson .
>>>   ex:LivingPerson  owl:disjointWith  ex:DeadPerson .
>>> }
>>> :year2000 {
>>>   ex:MarilynMonroe  a  ex:DeadPerson .
>>>   ex:DeadPerson  owl:disjointWith  ex:LivingPerson .
>>> }
>>> :year2012 {
>>>   ex:MarilynMonroe  a  ex:DeceasedPerson .
>>>   ex:DeceasedPerson  owl:equivalentClass  ex:DeadPerson .
>>> }
>>> :myth {
>>>   ex:MarilynMonroe  ex:livesIn  ex:desertIsland .
>>>   ex:livesIn  rdfs:domain  ex:LivingPerson .
>>> }
>>> # == EXAMPLE ENDS HERE ==
>>> 
>>> 
>>> Wrt item 1 above, there are essentially 3 cases:
>>> 
>>> a) The dataset simply is an RDF graph where the triples have been
>>> simply partitioned. An interpretation of that dataset is an
>>> interpretation of the graph made of all the triples found in all
>>> the named graphs and the default graph. Depending on what is
>>> decided about item 2 above, there can be additional semantic
>>> constraint wrt what the graph IRIs denote, but there could be no
>>> constraint either, so item 1 and 2 are essentially orthogonal
>>> issues in this case. Applications use the partitioning mechanism as
>>> they wish, e.g., for optimisation, for documentation... If such is
>>> the semantics of datasets, then the example is inconsistent, so it
>>> entails all possible datasets.
>>> 
>>> 
>>> b) The dataset is interpreted in the same way as an RDF graph,
>>> where the default graph must be true and the <name,graph> pairs are
>>> interpreted as assertions that relate the name to the graph itself.
>>> The actual relationship is to be determined, but what matters here
>>> is the syntax of the graph. It matters that the term
>>> ex:DeceasedPerson is used, not that the person denoted by
>>> ex:MarilynMonroe is dead. It is essentially the "quoting"
>>> semantics. The entailments depend on what is the relationship
>>> between the graph IRI and the graph, but a typical case is when the
>>> graph IRI denotes the graph, in which case, the example does not
>>> entail:
>>> 
>>> :year2012 { ex:MarilynMonroe  a  ex:DeadPerson . }
>>> 
>>> neither does it entail:
>>> 
>>> :myth { ex:MarilynMonroe  a  ex:LivingPerson . }
>>> 
>>> In this case, no conclusions are ever drawn from any assertion put
>>> inside a named graph.
>>> 
>>> 
>>> c) Each named graph describes a world according to the graph IRI.
>>> In the example, the world according to :myth is that
>>> ex:MarilynMonroe is living somewhere. What matters is the truth of
>>> the assertions rather than the fact that the term "deceased" or
>>> "dead" was used. So one can draw the conclusion that:
>>> - *in :year1960*, ex:MarilynMonroe is not a ex:DeadPerson;
>>> - *in :year2012*, ex:MarilynMonroe is a ex:DeadPerson
>>> etc. In this case, the possibilities for what's the relationship
>>> between the graph IRI and the graph are more limited than in the
>>> other case. For instance, if the IRI must be interpreted as the
>>> graph itself, then it prevents a lot of inferences.
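
The three readings a), b) and c) can be compared on the Marilyn Monroe example with a toy sketch. It is only an illustration under loud assumptions: the named graphs are plain sets of string triples, the default-graph metadata (dc:date, :endorsed) is left out, there are no blank nodes, and `closure` and `is_consistent` are hypothetical helpers that hand-roll just three rules (rdfs:domain, owl:equivalentClass, owl:disjointWith). It is not an RDFS or OWL reasoner.

```python
# Toy illustration only: three hand-written rules, not a real RDFS/OWL reasoner.
TYPE, DOMAIN = "rdf:type", "rdfs:domain"
DISJOINT, EQUIV = "owl:disjointWith", "owl:equivalentClass"

NAMED_GRAPHS = {
    ":year1960": {
        ("ex:MarilynMonroe", TYPE, "ex:LivingPerson"),
        ("ex:LivingPerson", DISJOINT, "ex:DeadPerson"),
    },
    ":year2000": {
        ("ex:MarilynMonroe", TYPE, "ex:DeadPerson"),
        ("ex:DeadPerson", DISJOINT, "ex:LivingPerson"),
    },
    ":year2012": {
        ("ex:MarilynMonroe", TYPE, "ex:DeceasedPerson"),
        ("ex:DeceasedPerson", EQUIV, "ex:DeadPerson"),
    },
    ":myth": {
        ("ex:MarilynMonroe", "ex:livesIn", "ex:desertIsland"),
        ("ex:livesIn", DOMAIN, "ex:LivingPerson"),
    },
}


def closure(graph):
    """Apply the rdfs:domain and owl:equivalentClass rules until nothing new appears."""
    triples = set(graph)
    while True:
        new = set()
        for s, p, o in triples:
            if p == DOMAIN:
                # (x s y) and (s rdfs:domain o)  =>  (x rdf:type o)
                new |= {(x, TYPE, o) for x, q, _ in triples if q == s}
            elif p == EQUIV:
                # Equivalent classes share their instances, in both directions.
                new |= {(x, TYPE, o) for x, q, c in triples if q == TYPE and c == s}
                new |= {(x, TYPE, s) for x, q, c in triples if q == TYPE and c == o}
        if new <= triples:
            return triples
        triples |= new


def is_consistent(graph):
    """False if some resource is typed with two classes declared disjoint."""
    closed = closure(graph)
    return not any(
        (x, TYPE, o) in closed
        for s, p, o in closed if p == DISJOINT
        for x, q, c in closed if q == TYPE and c == s
    )


dead = ("ex:MarilynMonroe", TYPE, "ex:DeadPerson")
living = ("ex:MarilynMonroe", TYPE, "ex:LivingPerson")

# a) partition reading: one big graph (no blank nodes, so the merge is a union)
merged_consistent = is_consistent(set().union(*NAMED_GRAPHS.values()))

# b) quoting reading: only the literal content of each graph counts
quoted_2012_has_dead = dead in NAMED_GRAPHS[":year2012"]
quoted_myth_has_living = living in NAMED_GRAPHS[":myth"]

# c) contextual reading: each graph is closed under inference on its own
context_2012_has_dead = dead in closure(NAMED_GRAPHS[":year2012"])
context_myth_has_living = living in closure(NAMED_GRAPHS[":myth"])
each_graph_consistent = all(is_consistent(g) for g in NAMED_GRAPHS.values())
```

Under a) the merged graph is inconsistent, since the same resource ends up typed with two disjoint classes. Under b) neither of the two derived triples is present, because nothing is inferred inside a quoted graph. Under c) both are derived while each graph remains consistent on its own.
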
>>> 
>>> 
>>> 
>>> I can see use cases for each of these semantics.
>>> a- If one is managing data that are verified facts, then one would
>>> like all of the triples to be true. Yet, there are still reasons to
>>> split the data into different parts, allowing users to query them
>>> separately with SPARQL GRAPH keywords.
>>> b- For a Semweb search engine exchanging the dump of its crawl, it
>>> makes sense to have an accurate "quote" of what has been crawled.
>>> c- For situations regarding temporal evolution of facts, integration
>>> of variously trusted sources, tracking provenance of inferred
>>> knowledge, etc.
>>> 
>>> 
>>> I find it odd that semantics b is retained as the only valid one in
>>> the "RDF graph identification" proposal. It's sweeping away several
>>> Priority A use cases, with some of the Priority B too.
>>> 
>>> Also, the condition ∀i: I(ui) = Gi is problematic. At first, it
>>> seems to be natural to say that the graph IRI RDF-denotes the
>>> graph. But:
>>> 
>>> http://www.w3.org/2011/rdf-wg/meeting/2011-04-14#resolution_1
>>> 
>>> "RESOLVED: Named Graphs in SPARQL associate IRIs and graphs *but*
>>> they do not necessarily "name" graphs in the strict model-theoretic
>>> sense. A SPARQL Dataset does not establish graphs as referents of
>>> IRIs (relevant to ISSUE-30)".
>>> 
>>> I know this resolution is about SPARQL datasets, and it's not
>>> necessarily applying to whatever structure we come up with in RDF,
>>> but one of the Priority A use cases is to be able to dump a SPARQL
>>> store. With this resolution, there is apparently a clash between
>>> the use case requirement and the semantic condition.
>>> 
>>> 
>>> My proposal is to define several recommended semantics and allow
>>> the concrete syntax to declare in a document what semantics is
>>> assumed when exchanging a dataset.
>>> 
>>> I find this idea appealing because it is in line with the fact that
>>> information carried by HTTP is accompanied by a self description of
>>> how it should be understood. For instance, we have MIME types, we
>>> have <!DOCTYPE> declarations, etc. Since RDF is not a purely
>>> syntactical datastructure, it makes sense that it carries with it a
>>> reference to the semantics it uses. Such practices of referencing
>>> the MIME type, charset, doctype, schema, etc have been a key
>>> enabler of interoperability on the Web. Why not extend the pattern
>>> to the formal semantics? BTW, SPARQL services have a way to tell
>>> what inference regime they support, and SPARQL queries have a way
>>> to ask for a particular regime. I claim that my proposal is
>>> simply in agreement with already accepted notions in the SPARQL
>>> world.
>>> 
>>> 
>>> Best,
>>> --
>>> Antoine Zimmermann
>>> ISCOD / LSTI - Institut Henri Fayol
>>> École Nationale Supérieure des Mines de Saint-Étienne
>>> 158 cours Fauriel
>>> 42023 Saint-Étienne Cedex 2
>>> France
>>> Tél:+33(0)4 77 42 66 03
>>> Fax:+33(0)4 77 42 66 66
>>> http://zimmer.aprilfoolsreview.com/
>>> 
>> 
>> 
>> ----
>> Ivan Herman, W3C Semantic Web Activity Lead
>> Home: http://www.w3.org/People/Ivan/
>> mobile: +31-641044153
>> FOAF: http://www.ivan-herman.net/foaf.rdf
>> 
>> 
>> 
>> 
>> 
>> 
> 
> -- 
> Antoine Zimmermann
> ISCOD / LSTI - Institut Henri Fayol
> École Nationale Supérieure des Mines de Saint-Étienne
> 158 cours Fauriel
> 42023 Saint-Étienne Cedex 2
> France
> Tél:+33(0)4 77 42 66 03
> Fax:+33(0)4 77 42 66 66
> http://zimmer.aprilfoolsreview.com/


----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Tuesday, 21 August 2012 09:39:48 UTC