Re: dataset semantics from Pat Hayes on 2011-12-21 (public-rdf-wg@w3.org from December 2011)

From: Pat Hayes <phayes@ihmc.us>
Date: Wed, 21 Dec 2011 11:29:37 -0600
To: Antoine Zimmermann <antoine.zimmermann@emse.fr>
Cc: Richard Cyganiak <richard@cyganiak.de>, public-rdf-wg@w3.org
Message-Id: <091321C2-A15C-49A9-893F-C9A842DDFF19@ihmc.us>
On Dec 21, 2011, at 4:26 AM, Antoine Zimmermann wrote:

> 
> 
> Le 21/12/2011 07:27, Pat Hayes a écrit :
>> 
>> 
> > [skip]
>> 
>>> The meaning of an IRI is constrained by the triples in the graph in
>>> which it occurs.
>> 
>> That can be understood in two ways. One of them is correct, but
>> irrelevant to the discussion here; the other is relevant, but then
>> the claim is profoundly and dangerously wrong.
>> 
>> The first sense is, that the meaning of an IRI is determined (perhaps
>> in part) by what assertions are made using it, ie in RDF terms, by
>> what RDF graphs it occurs in. Yes, I think that is basically correct,
>> although it might be better to say, it is determined by the totality
>> of all documents in which it occurs. [1] However, with this sense, we
>> must take into account *all* the graphs in which the IRI occurs, or
>> at any rate all those which we trust or accept. (I know this gets
>> into the issue of how to adjudicate such trust, but let me leave that
>> aside for now. It is orthogonal to the present point.) So when we
>> look at two sources, both trusted, both using the IRI in question,
>> then both of them constrain the meaning of the IRI. It is the same
>> IRI in both (or perhaps many) graphs, not a different IRI in each
>> separate graph.
> 
> To assign a trust measure, you need to be able to identify the set of triples (or the source) that you trust, and for this you need to compartmentalize the triples into different boxes, some of which should not influence the knowledge in the rest.

You need to keep them separate while you are evaluating them for trustworthiness. (And this evaluation might be more nuanced than trusting a whole source, e.g. it might involve trusting some statements but not others, etc.) Yes, of course.

> But there may be boxes that you equally trust but that are in disagreement. In which case, what do you do? I say that you simply separate reasoning in different boxes, which is formalised by distinct interpretations of distinct "named" graphs.

You may separate reasoning into different boxes, yes. This is EXACTLY what the current RDF semantics supports. No change is required at all. Whatever triples you put into a given 'box' and then try to draw conclusions from, those are the ones currently constraining the meaning of the URIs in them. One does NOT get this behavior by having different interpretations, simultaneously, for different sets of triples. 

>> The second sense I can understand what you are saying here is exactly
>> this idea, that one and the same IRI might have a meaning in one
>> graph and a different meaning in a different graph. (Perhaps indeed,
>> that it *must* have a different meaning in a different graph? Note,
>> this is not the same as saying that one graph may be more trusted
>> than the other.) Taken to an extreme, this amounts to the claim that
>> each IRI has a whole spectrum of meanings, determined by the graph in
>> which it appears, and hence that every occurrence of it in a
>> different graph is, in effect, a distinct IRI.  It is difficult to
>> emphasise the extent to which this idea is wrong. It would follow for
>> example that RDF from two different graphs can never be combined to
>> draw any conclusion that could not be derived from one of the graphs
>> alone, that merging two graphs is never a valid operation, and many
>> other consequences that seem to me to be somewhat insane. But in any
>> case, even if we ignore RDF and its semantics, the whole Web is
>> predicated on the basic idea of IRIs as *global* identifiers, which
>> mean (in whatever sense of 'mean' one cares to adopt) the same thing,
>> wherever they are used. (Of course, reality is often scruffier than
>> the idealized design models described by our specifications, but at
>> least the specifications make this basic assumption.)
> 
> After making a carefully thought proposal, being told that it is immensely wrong is somewhat painful.

I understand. However, it is worth checking carefully to see if one has made a mistake. I've made plenty of them myself.

>  You seem to pretend that the dataset proposal is changing the semantics and behaviour of RDF.

Yes, it is. No pretense. The fact that you keep talking about 'context' is a signal, in fact, as this idea is completely absent from the 2004 specs. 

> Again (how many times should I repeat?), it does not change anything in RDF itself.

I disagree. It does change it. It would, for example, make current reasoners invalid (if they were to attempt to draw conclusions using triples from more than one graph in a dataset.)

> Everything that is valid in RDF 2004 is valid again. Since datasets were introduced in SPARQL, people haven't stopped merging RDF graphs, although they could always put different RDF graphs in different boxes.

They always could. 

> But with this proposal, you can control what you merge and what you don't merge, in a standardised structure.

You always could control that. You seem to be under an impression (?) that the 2004 semantics *forces* one to merge graphs (??). But model theory never forces anything: it only tells you what the entailment consequences are when you do various things. You will I think find that the word MUST is not used in the 2004 semantics document anywhere.

> 
> 
>>> Go online, and look at what you find:
>>> 
>>> http://www.emse.fr/~zimmermann/data4pat1.rdf
>>> 
>>> This URL leads to a document where the
>>> IRI <http://www.ihmc.us/groups/phayes/> denotes the number 1.
>> 
>> No. It leads to a document where the assertion is made that I, Pat
>> Hayes, am identical to the number 1. This assertion is, I am pleased
>> to report, false. Nevertheless, that is what the document says. If,
>> on the other hand, the IRI in question were interpreted as you say,
>> then it would be true, but vacuous, since it would be asserting that
>> 1=1.
> 
> What makes you think this URI identifies you?  If it was presented to me independently of the triples, I would have said it identifies a web page.

Um... perhaps I should have said, my web page. Sorry.

>  What makes you think it does not denote number 1?

Because a web page is not a number. Necessarily not a number, so I know this from first principles and do not need to investigate further.

> 
> What a URI denotes is a matter of opinion, and it's certainly decided by the one who publishes the triples (yes, it could be otherwise). But a system is not going to probe people's opinion on billions of URIs.  The only thing that a system can rely on is what is said in the triples.

No, the system can have some knowledge of the world built into it. It can for example know that 1 =/= 2. Only an extremely naive system would simply believe every triple it finds on the Web.

> And the triples say it's number 1.
> 
> 
>>> 
>>> Now, go to:
>>> 
>>> http://www.emse.fr/~zimmermann/data4pat1.rdf
>>> 
>>> In this document, the same IRI denotes number 2.
>> 
>> Again, no. It still denotes me, as it did in the first graph, but
>> this graph says that I am identical to the number 2. Taken together,
>> these have the entailment (in OWL) that the number 1 equals the
>> number 2. Which I hope we all agree is probably not the case;
>> nevertheless, they do indeed entail that, taken together. Whereas, if
>> that URI meant what you claim, these two graphs would have no
>> inferential connection with one another at all, since
>> the <http://www.ihmc.us/groups/phayes/> in the first one would refer
>> to something different from the <http://www.ihmc.us/groups/phayes/>
>> in the second one.
>> 
>>> 
>>> Eventually, a web crawler will index these two documents and
>>> without context, it won't do anything useful.
>> 
>> Hopefully, it might detect the inconsistency. I have no idea what
>> help "context" would be. (I'm not even sure what you mean by the word
>> in this, er, context.)
> 
> It detects the inconsistency. Then what?

That, frankly, is not my problem. The issue I want to clarify is that there is an actual inconsistency there to be detected. That is all the semantics' job is, to clarify and codify the grounds for such inconsistencies. With your multiple-interpretations semantics, there is no inconsistency. 
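
To make "an actual inconsistency there to be detected" concrete, here is a minimal sketch. It is not part of the 2004 specs or of any reasoner; it assumes triples are plain Python tuples, that the only inference modelled is owl:sameAs being an equivalence relation, and that the numbers 1 and 2 are known to be distinct. The names g1, g2, same_as_classes and is_consistent are hypothetical.

```python
# Hypothetical sketch: triples are plain tuples; only owl:sameAs is modelled.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
PAT = "http://www.ihmc.us/groups/phayes/"

g1 = {(PAT, SAME_AS, 1)}  # first document: the IRI is sameAs the number 1
g2 = {(PAT, SAME_AS, 2)}  # second document: the IRI is sameAs the number 2


def same_as_classes(triples):
    """Group terms into the equivalence classes induced by owl:sameAs."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(s)] = find(o)

    classes = {}
    for term in list(parent):
        classes.setdefault(find(term), set()).add(term)
    return list(classes.values())


def is_consistent(triples):
    """Assumed built-in knowledge: distinct numbers are never identical."""
    for cls in same_as_classes(triples):
        numbers = {t for t in cls if isinstance(t, int)}
        if len(numbers) > 1:
            return False
    return True


print(is_consistent(g1), is_consistent(g2), is_consistent(g1 | g2))
```

Each graph passes the check by itself; only the merge g1 | g2 fails it, and it fails only because the IRI is treated as the same term in both graphs.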

> Forget about "context". The crawler puts the two documents in different "named graphs". Then, it depends.

Right. This is a very interesting and complicated area, of course, and I don't want to imply otherwise. But the semantics' role is only to sit, passively, while you do all that clever stuff, and tell you when you get any inconsistencies and what is entailed by whatever triples or graphs you happen to be considering at the time. I firmly believe that you don't WANT to have an altered semantics underlying all this truth-detecting machinery: you want the basic semantics that the 2004 specs provide. Nothing in that semantics even suggests that you cannot consider a set of triples in isolation from other sets of triples, and ask what it entails by itself.

> You can simply use SPARQL, with or without OWL inference regime, and get useful answers FROM a/some particular graph(s). If the crawler is Sindice's, reasoning over the first document will make a merge (yes, RDF merge) of what it gets from <http://www.ihmc.us/groups/phayes/> (which it may have crawled already and put in the appropriate "named" graph) and what it gets from <http://www.w3.org/2002/07/owl#sameAs>, and materialise inferences on that. The result is consistent, within its well delimited box. Same for the second document: it results in consistent inferences, inside the delimited box. Other reasoners may do it differently, but what is important is that different RDF graphs are interpreted differently.

NO. The fact that you consider them in isolation for inference purposes does NOT mean they are interpreted differently. If they were interpreted differently, then they would not have the same entailments when you DO put them together. Merging would cease to be a semantically valid operation. 

> 
> 
>>> 
>>> Then go get:
>>> 
>>> http://www.emse.fr/~zimmermann/data4pat.rdf
>>> 
>>> Now, this document says that all IRIs denote the same thing.
>> 
>> It says that, indeed, and that is obviously false. It has a whole
>> host of very silly entailments. I haven't checked, but I bet it is
>> formally inconsistent, and that an OWL-Full reasoner would find a
>> contradiction quite rapidly. (An OWL-DL reasoner will spit it out at
>> parse time as illegal.) It is often the case that asserting
>> something obviously false entails a great deal of other nonsense.
>> So?
> 
> So the one giant graph composed of all triples is mostly wrong, contradictory, outdated, etc etc.

Yes, it is indeed. Remember, semantics does not guarantee correctness. It only tells you how to relate truth to a set of circumstances, or a possible world.

> People who want to do practical things with RDF don't want a URI to be defined by all documents that contain this URI.

I said, by all the documents *that you accept or trust*. Once you have decided what to "believe", then all of that is what constrains the meaning, for you. But if several sources all use one IRI, they are all talking about the same thing, even if what they say about it, taken together, is nonsensical or contradictory. If they weren't, there would be no contradictions in this mess in the first place. 

> And it is not just a question of having sufficient trust. Sometimes, contradictory documents are equally useful and trusted, and one doesn't want to hand-pick the right and wrong documents/triples. It has to be done in a systematic way, and one needs a structure to keep these things in.

OK, fine. The semantics does not address such matters. But what it does address is what it even means to say that there are contradictions at all. Consider a dataset containing two graphs which are mutually contradictory. For that sentence to even make sense, the graphs in it must be interpreted together. In your proposed multi-interpretation semantics, two graphs in a dataset cannot *ever* be contradictory. 
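
A rough sketch of that last point, under loudly stated assumptions: the dataset is a plain dict from hypothetical graph names to sets of tuples, a "contradiction" means only that one subject is directly asserted owl:sameAs two distinct numbers, and "a separate interpretation per graph" is crudely approximated by tagging each subject with the name of its graph. None of this is taken from the dataset proposal's actual formalization.

```python
from itertools import combinations

SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
PAT = "http://www.ihmc.us/groups/phayes/"

# Hypothetical dataset: graph name -> set of (subject, predicate, object).
dataset = {
    "g1": {(PAT, SAME_AS, 1)},
    "g2": {(PAT, SAME_AS, 2)},
}


def clashes(triples):
    """Subjects directly asserted sameAs two or more distinct numbers."""
    values = {}
    for s, p, o in triples:
        if p == SAME_AS and isinstance(o, int):
            values.setdefault(s, set()).add(o)
    return {s for s, nums in values.items() if len(nums) > 1}


def contradictory_pairs(ds, shared=True):
    """Pairs of graph names whose merge contains a clash."""

    def view(name):
        if shared:
            # One interpretation: an IRI is the same term in every graph.
            return ds[name]
        # Crude stand-in for per-graph interpretations: each graph gets
        # its own private copy of every subject.
        return {((name, s), p, o) for s, p, o in ds[name]}

    return [
        (a, b)
        for a, b in combinations(sorted(ds), 2)
        if clashes(view(a) | view(b))
    ]


print(contradictory_pairs(dataset))                # shared interpretation
print(contradictory_pairs(dataset, shared=False))  # per-graph copies
```

With the shared reading, the pair g1, g2 is reported as contradictory; with per-graph copies of the IRI, no pair can be reported at all, whatever the graphs say.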

> 
>> 
>>> As, according to you, this thing is independent of the context, we
>>> can stop making reasoners :)
>> 
>> I can't even understand what this is supposed to mean, so I fail to
>> follow your intended point.
> 
> I mean, if the Web of Linked Data is to be interpreted as a single giant graph, then reasoners are trivial to implement: all possible triples are entailed by this graph.

Ah, I see. But reasoners are typically used to draw conclusions from part of the giant graph. 

> 
>> 
>> Pat
> 
> Now, let us get back to the original point: the thread is called "dataset semantics". In this thread, we assume that we have the notion of dataset, so there is no need arguing against that here. The question is whether we want the WG to define a formal semantics for it.

It already has a formal semantics, according to the 2004 specs. Nothing you say suggests any reason to change that. We just have to clarify what the semantic role of the graph 'labels' is, especially in the case where the default graph contains metadata.

> 
> Solution 1: we do not define a formal semantics. If this is what you advocate, then people can interpret URIs in a multitude of ways, which contradicts your arguments. So you are probably against this.
> 
> Solution 2: we propose a formal semantics. If we do that, the semantics must not conflict with widely deployed systems. We know that a common practice is to use the primary topic of an RDF document as the "name" of a "named" graph.

That is one common practice, yes. There are also others, apparently. 

> We have to live with this.

Agreed. With all of them :-)

> Another thing is that we know that merging all the "named" graphs in systems that grab data from all over the Web inevitably leads to an inconsistent graph. It is common that systems consider that the default graph entails all the "named" graphs (merge-as-default), but the opposite may also be true (default is universal truth). Considering these constraints, I made a proposal.
> If you don't like it, what's YOUR proposal?

I see no reason to change the semantic interpretation of the graphs at all. As you say, they may be mutually inconsistent. That is exactly what the 2004 semantics says, also. What change is needed here?

Pat

> 
> 
> Best,
> AZ.
> 
>> 
>> [1] However, this idea is by no means universally accepted. David
>> Booth, for example, has argued at length that the meaning of any IRI
>> should be determined by a single 'definitional' graph published by
>> the owner of the IRI. Others have said that the meaning is determined
>> by the intentions of the owner of the IRI, whether or not that
>> intention is made manifest in any Web source. And there are many
>> other positions out there.
>> 
> 
> -- 
> Antoine Zimmermann
> ISCOD / LSTI - Institut Henri Fayol
> École Nationale Supérieure des Mines de Saint-Étienne
> 158 cours Fauriel
> 42023 Saint-Étienne Cedex 2
> France
> Tél:+33(0)4 77 42 83 36
> Fax:+33(0)4 77 42 66 66
> http://zimmer.aprilfoolsreview.com/
> 
> 

------------------------------------------------------------
IHMC                                     (850)434 8903 or (650)494 3973   
40 South Alcaniz St.           (850)202 4416   office
Pensacola                            (850)202 4440   fax
FL 32502                              (850)291 0667   mobile
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Wednesday, 21 December 2011 17:30:27 UTC