Re: RDF Semantics - Intuitive summary needs to be scoped to interpretations (ISSUE-149) from David Booth on 2013-10-28 (www-archive@w3.org from October 2013)

From: David Booth <david@dbooth.org>
Date: Mon, 28 Oct 2013 18:09:06 -0400
To: Pat Hayes <phayes@ihmc.us>
CC: Antoine Zimmermann <antoine.zimmermann@emse.fr>, www-archive <www-archive@w3.org>, "Peter F. Patel-Schneider" <pfpschneider@gmail.com>, Ivan Herman <ivan@w3.org>, Sandro Hawke <sandro@w3.org>
Message-ID: <526EE082.3010209@dbooth.org>
Hi Pat,

On 10/20/2013 04:31 AM, Pat Hayes wrote:
> David, greetings.
>
> Most of what you write in this message is completely uncontroversial
> and I would entirely agree with it. Rather than respond point by
> point, let me try to summarize.
>
> 1. People who publish RDF (or indeed any other) content may have
> different ideas about what IRIs mean, and the readers or users of
> this data may also have different ideas about what the IRIs mean.
> Call this "mismatch".

Yes.  Furthermore: (a) this is unavoidable in the long run; and (b) this 
can be modeled very nicely by the idea that they have different 
interpretations in mind.

> 2. Even when the publishers and users of RDF
> share a common understanding of what IRIs mean, the actual RDF will
> not be enough to formally pin down this mutual understanding, so that
> the RDF (considered in isolation from other possible sources of
> meaning) will be satisfied by 'nonstandard' interpretations which do
> not conform to this shared mutual understanding. Call this
> "underdetermination".

True enough, but: (a) underdetermination is not what I'm mainly talking 
about; (b) I think "unintended" would be a more accurate 
characterization than "nonstandard"; and (c) the notion that there is a 
common, pre-existing "mutual understanding" of what IRIs mean is 
perilous, because a central problem in this whole business is the 
problem of how IRIs are supposed to *become* associated with their 
intended denotations.  In any case, when this occurs we can say that the 
intended interpretations of a graph are a proper subset of the 
satisfying interpretations.
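
A toy illustration of the subset relationship just described. It is a sketch under stated assumptions, not the normative RDF semantics: the universe contains only two hypothetical resources, the only IRI considered is a hypothetical ex:Everest, and an "interpretation" is reduced to a pair (what ex:Everest denotes, which resources count as ex:Mountain). The graph under consideration is the single triple "ex:Everest rdf:type ex:Mountain".

```python
from itertools import chain, combinations

# Hypothetical two-element universe of resources (an assumption for illustration).
UNIVERSE = ("everest", "k2")


def subsets(items):
    """Return every subset of items as a frozenset."""
    return [
        frozenset(combo)
        for combo in chain.from_iterable(
            combinations(items, size) for size in range(len(items) + 1)
        )
    ]


# A toy interpretation: (denotation of ex:Everest, extension of ex:Mountain).
interpretations = [(den, ext) for den in UNIVERSE for ext in subsets(UNIVERSE)]


def satisfies(interpretation):
    """True if the toy interpretation makes 'ex:Everest rdf:type ex:Mountain' true."""
    denotation, mountains = interpretation
    return denotation in mountains


satisfying = [i for i in interpretations if satisfies(i)]

# The author's intended interpretations: those in which ex:Everest denotes Everest.
intended = [i for i in satisfying if i[0] == "everest"]
```

In this toy setting the graph alone cannot rule out the interpretations in which ex:Everest denotes K2, so the intended interpretations are a proper subset of the satisfying ones.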

Finally, in the spirit of disallowing any "then a miracle occurs" steps,
http://blog.stackoverflow.com/wp-content/uploads/then-a-miracle-occurs-cartoon.png
we can reasonably assume that the importance of the difference between 
the intended interpretations and the satisfying interpretations is 
minimal, because: (a) for scalability in the Semantic Web URI 
definitions must rely on description rather than ostension;
http://en.wikipedia.org/wiki/Ostensive_definition
and (b) at least in principle, anything that can be described in, say, 
English prose could instead be described in RDF.

> 3. In some cases, the difference referred to in
> (1) may be so great that different pieces of published content are
> mutually inconsistent. Let me call this "divergence".

Yes, #1 leads to divergence.

> 4. It is also
> possible that two publishers of RDF content might have perfectly
> aligned notions of what all the IRIs mean, but simply disagree
> concerning the facts. Call this "disagreement".

Yes.

>
> I have deliberately avoided the word "ambiguity", because it is
> ambiguous. You and I agreed long ago that RDF – probably all data on
> the Web – is inherently ambiguous in the strict sense that it does
> not pin down a unique satisfying interpretation, ie it is
> underdetermined. We agreed that some of the TAG publications on
> "uniqueness of identification" were conceptually faulty in the way
> they were worded, since they seem to suggest that this unachievable
> goal is necessary to Web operation.

Right.  And it is also not necessary to Semantic Web operation.  Do we 
agree on that as well?  (That may take more explanation.)

> Underdetermination is indeed
> inevitable. But "ambiguity" can be taken to imply mismatch, and this
> is *not* inevitable. And even a mismatch does not inevitably lead to
> divergence, or to any detectable inconsistencies between different
> usages of an IRI.

What do you mean by inevitable?  I agree that at any point in time, 
there is not necessarily a mismatch or inconsistency.  But AFAICT, the 
trend is inevitably *toward* mismatch as more statements are published, 
assuming that: (a) parties publish data independently (without knowledge 
of each other); and (b) the URI definition is not continually modified 
to track newly published data that uses the URI.  Do you agree?  If not, 
how do you think divergence can be avoided?

>
> Divergence and disagreement are formally indistinguishable: they both
> give rise to contradictions. For example, Alice publishes "Everest was
> first climbed in 1953", Bob publishes "Everest was first climbed in
> 1954", and with enough extra stuff about uniqueness of dates of first
> climbs, we can derive a formal contradiction, let us suppose. Now, it
> might be that Bob is using "Everest" to refer to K2, in which case we
> have divergence; or he might just be wrong about the date Hillary and
> Tenzing made their historic climb, in which case we have a
> disagreement. In the first case, both Alice and Bob have their facts
> straight, but they are struggling over the referent of a name; in the
> second case, Alice is right and Bob is wrong, but at least they both
> know what they are talking about. Model-theoretic semantics isn't
> able to usefully distinguish these two cases: all it can tell us is
> that the things that Alice and Bob actually publish are (with some
> extra assumptions) mutually inconsistent, for some reason. It does
> not tell us what the reason is.

Agreed.  But there is an important practical difference between 
divergence and disagreement, because if one can determine that the 
contradiction is due to divergence, and the two source graphs of data 
were kept separate, then both graphs can still be used by "splitting" 
the resource identity to use two different URIs for the different 
notions that are denoted in the two graphs.  In contrast, if the problem 
was disagreement, then the user of those graphs will want to decide 
which one is correct and discard the other as erroneous, or at least 
discard the erroneous assertion.

Here's a little more on what I mean by "splitting":
http://dbooth.org/2010/ambiguity/paper.html#splitting
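
A minimal sketch of the mechanics of splitting, using the Everest example. The assumptions are loud ones: graphs are plain sets of (subject, predicate, object) tuples, the URIs are hypothetical, and a simple "functional property" check stands in for the "extra stuff about uniqueness of dates" needed to derive a contradiction. It is not taken from the paper linked above.

```python
# Hypothetical URIs, chosen only for illustration.
EVEREST = "http://example.org/Everest"
FIRST_CLIMBED = "http://example.org/firstClimbed"

# The two source graphs are kept separate, as the text requires.
alice = {(EVEREST, FIRST_CLIMBED, "1953")}
bob = {(EVEREST, FIRST_CLIMBED, "1954")}


def functional_conflicts(graph, functional_props):
    """Return (subject, predicate) pairs that have more than one distinct value."""
    values = {}
    for s, p, o in graph:
        if p in functional_props:
            values.setdefault((s, p), set()).add(o)
    return {key for key, objs in values.items() if len(objs) > 1}


def split_identity(graph, uri, replacement):
    """Return a copy of graph with every occurrence of uri renamed to replacement."""
    def rename(term):
        return replacement if term == uri else term

    return {(rename(s), rename(p), rename(o)) for s, p, o in graph}


# Merging the graphs as published yields a contradiction.
conflicts_before = functional_conflicts(alice | bob, {FIRST_CLIMBED})

# If the cause is judged to be divergence, rename the URI in Bob's graph only.
bob_split = split_identity(bob, EVEREST, EVEREST + "#bob")
conflicts_after = functional_conflicts(alice | bob_split, {FIRST_CLIMBED})
```

After the split both graphs can be used together. Had the cause been disagreement instead, renaming would only hide the error, and the consumer would still need to decide which assertion to discard.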

>
> So, to sum up: published RDF content typically (perhaps always) has
> many satisfying interpretations, ie it underdetermines its intended
> meaning. Also, RDF from multiple sources may be mutually
> inconsistent, ie be such that no interpretation satisfies it all.
> There can be several reasons for this, including divergence of
> intended meanings of IRIs and simple factual disagreements. But note
> that when an inconsistency is detectable between what Alice and Bob
> publish, then *something* is not right about that mutual publication.
> Either they disagree about the facts of the matter, or they disagree
> about what IRIs denote, or they have mutually incompatible ways of
> describing the world. I do not mean to imply that one of them is
> wrong and the other right (though that may be likely), only that they
> do actually in some way clash in what they are saying. As a consumer
> of their data, I would be obliged to choose between them, to make
> decisions about what to accept and what to reject.

No, not quite.  If the problem is disagreement then yes, you would have 
to choose between the source graphs.  But if the problem is divergence 
then you have to do some more work -- resource identity splitting -- but 
can still use both source graphs after splitting.  This is an important 
difference that is lost if one lumps disagreement and divergence together.

>
> The intuitive picture (not part of the normative semantics document,
> but intended to be understood by readers) is that the actual world
> being described by RDF data is itself one of the interpretations,

That strikes me as a naive, misleading and not very helpful intuition to 
promote, because: (a) RDF data does not generally describe the real 
world, it describes a particular *conceptualization* of the real world 
-- an *approximation* that is suitable for certain purposes; (b) it 
implicitly lumps divergence in with disagreement; and (c) it minimizes 
the relevance of multiple interpretations.  It also more subtly places 
the focus on real world truth instead of usefulness, and IMO that is the 
wrong engineering criterion to use.  Real world truth is a means to an 
end -- not the end itself.  The important criterion is *usefulness*.

An example I've often used to illustrate this is map data that models 
the world as flat.  (I'm using the word "model" here in the generic 
English or computer science sense -- not in the model theory sense.) 
Clearly the real world is not flat, i.e., a 2D conceptualization of the 
world clearly is not the real world, so in a strict sense the data may 
be "wrong".  And for applications such as calculating rocket 
trajectories or airplane flight paths, such data may be completely 
unusable.  But for automobile navigation purposes, it may be good 
enough, and far simpler -- and thus *better* -- than "correct" 3D data. 
One may claim that such 2D data does not inherently need to be 
"wrong", if one carefully crafts the data and semantic claims about it, 
and that may be true.  But bending over backward to craft the data 
that way, just so that it won't cause semantic contradictions when used 
in applications for which it was **not intended**, has a cost also.  And 
while it is certainly nice when authors craft their data to be usable in 
applications far outside of the data's target application domain, I do 
not believe that we should shame authors who fail to do so.  I think it 
is much more important that we: (a) encourage people to publish RDF data 
at all; (b) help the Semantic Web community understand how multiple 
interpretations provide a useful way to think about data that is 
inconsistent, when merged, due to divergence; and (c) help them learn 
how to deal with it.

Finally, (rhetorically) what does it even mean to say that one of the 
interpretations is the real world?  Let's take as an example, 
http://example.org/toucan , which Ian Davis has used both to denote a 
web page and a toucan:
http://blog.iandavis.com/2010/11/04/is-303-really-necessary/
Before anyone complains about that example, please note that it is 
really just equivalent to a case of divergence, so if you don't like 
that particular example we could choose another, but the analysis would 
be exactly the same.

Some applications ("web-page applications") care about web pages and 
assume interpretations in which http://example.org/toucan maps to a web 
page and has web page-ish properties.  In those applications we can 
imagine that the URI maps to the real world notion of a specific web 
page (whatever that means).  These applications do not detect any 
inconsistencies in the data because they don't employ any assertions 
about birds or the idea that birds are disjoint from web pages.

Other applications ("bird applications") may care about birds, and 
assume interpretations in which that URI maps to a toucan in the real 
world.  These applications also do not detect any inconsistencies, for 
similar reasons.

And still other applications ("bird-and-web-page applications") may care 
about both birds and web pages, employ additional data about birds and 
web pages -- including an assertion that says that they are disjoint -- 
and hence may find that URI unusable (unless they split it), because it 
conflates the toucan with the web page, and thus causes a logical 
contradiction.
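
The three classes of application can be sketched as follows. This is only an illustration under stated assumptions: graphs are sets of (subject, predicate, object) tuples, the class names ex:WebPage and ex:Bird are hypothetical, and the only "reasoning" performed is a check for a resource typed with two classes declared disjoint.

```python
TOUCAN = "http://example.org/toucan"
TYPE = "rdf:type"

# Hypothetical data used by the different applications.
page_data = {(TOUCAN, TYPE, "ex:WebPage")}
bird_data = {(TOUCAN, TYPE, "ex:Bird")}

# Only the bird-and-web-page applications employ this disjointness assertion.
disjoint_pairs = {frozenset({"ex:WebPage", "ex:Bird"})}


def disjointness_violations(graph, disjoint):
    """Return subjects typed with both members of some disjoint class pair."""
    types = {}
    for s, p, o in graph:
        if p == TYPE:
            types.setdefault(s, set()).add(o)
    return {
        s for s, classes in types.items()
        if any(pair <= classes for pair in disjoint)
    }


# Web-page and bird applications: no disjointness assertion, so nothing is detected.
web_page_app = disjointness_violations(page_data, set())
bird_app = disjointness_violations(bird_data, set())

# Bird-and-web-page applications: both data sets plus the disjointness assertion.
combined_app = disjointness_violations(page_data | bird_data, disjoint_pairs)
```

Only the combined application finds the URI unusable as published. Splitting the URI in one of the two data sets before merging would remove the violation.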

Which, if any, of the interpretations that these applications use are 
the **real world**?  Probably not the interpretations used by either the 
bird applications or the web-page applications.  Possibly one that 
is used by the bird-and-web-page applications.  But more likely *none* 
of them: most likely *all* of these applications assume interpretations 
that, when you dig deep enough to examine, correspond only 
*approximately* to the real world, but in fact differ from the real 
world in ways that would be revealed by the addition of more facts -- 
facts that those applications don't use or care about, and that may not 
even yet be known to science.

> and
> that the bare word "truth" – as when we might say, yes it is *true*
> that Everest was first climbed in 1953 – refers to this real world,
> but uses the same recursive analysis of how truth is determined from
> a bare interpretation mapping – the same "truth conditions". Such a
> picture is an integral part of how to relate the model theory to
> other semantic conditions on RDF, such as those arising from
> connections between RDF data and natural language texts or images.
> But as I say, this is not part of the normative RDF semantics, which
> is solely concerned with defining entailment relationships between
> RDF graphs.
>
> OK so far? Because all of this is how the RDF semantics views the
> world of RDF Web publication. I have used the terms 'satisfy',
> 'interpretation' and 'inconsistent' here exactly as they are defined
> in the formal semantics.

Yes, excellent.

>
> Now, you seem to want to insist that there is something else, some
> other way to use the formal semantic machinery, which somehow goes
> beyond or provides some kind of alternative to this picture. Can you
> say what it is, without using meaningless rhetoric such as
> "single-interpretation assumption" or "agnostic" ? What is this
> "other valid way" to think about the RDF semantics?

Sorry if those phrases sound meaningless to you.  I suspect there are at 
least a few others who understand them, but I suppose one person's 
useful insight is another person's meaningless rhetoric, so I'll try to 
find other phrasings that I hope will be more helpful to you.

The other way to think of the RDF Semantics is in terms of *multiple* 
interpretations, instead of attempting to assume or impose a single 
"real world" interpretation.  By this I mean, for example, that:

  - Two different graph authors may have different sets of intended 
interpretations in mind when they publish their RDF graphs, and the same 
URI may indeed denote different resources in those interpretations. 
This of course is not desirable, but it is inevitable, and it reflects 
the actual state of affairs far better than naively assuming that graph 
authors all have the same real world interpretation in mind.

  - Those RDF graphs may be useful -- and work fine -- for different 
classes of applications that (in essence) assume different 
interpretations.  I.e., different applications have different 
conceptualizations of the world; those conceptualizations correspond to 
interpretations.

  - The most accurate way to understand a graph is to interpret it in 
the way that the author intended it to be interpreted.  Since we have no 
other reliable way of knowing what that might be, we can assume that the 
author's intended interpretations for a graph are a subset of the 
graph's **satisfying interpretations**.  I.e., we take the graph's 
meaning at face value, rather than attempting to interpret it according 
to some hidden, assumed "real world" interpretation.

  - The most sensible answer to the question "What resource does URI U 
denote in graph G?" would be either "whatever it denotes in G's 
satisfying interpretations" or "whatever it denotes in the author's 
intended interpretations", but *not* "whatever it denotes in the 'real 
world'".  The "real world" interpretation is largely irrelevant -- both 
to the formal semantics and to understanding how the Semantic Web 
*actually* works.

Some benefits of looking at the formal semantics this way: (a) it 
corresponds more closely to actual practice than assuming that all 
authors are talking about the same real world; (b) it helps to explain 
the difference between divergence and disagreement; (c) it helps to 
explain how to deal with divergence when it happens; (d) it allows more 
data to be recognized as useful (even if it isn't 100% "correct"), 
because it allows more graphs to be treated as true, whereas if we think 
in terms of a single, real world interpretation, then nearly every RDF 
graph would be false, and false graphs aren't very useful, because they 
entail everything; (e) it provides a formal framework for understanding 
the fact that different applications care about different resource 
identity distinctions (and this may give rise to divergence); (f) it at 
last provides a sensible way to formalize and understand the 
httpRange-14 issue as a case of divergence; and (g) it provides a useful 
and practical foundation for understanding the practical use of 
owl:sameAs, rather than simply lamenting its supposed "abuse".  Is this 
enough?

Is this making any more sense to you?   Have I explained myself in 
sufficient detail, or do you still think that "David . . . does not 
properly understand the intuitive foundations of semantics" and my 
points are mere "inanity", as you previously concluded?
http://lists.w3.org/Archives/Public/public-rdf-wg/2013Oct/0079.html

(And BTW, posting such libelous remarks in a forum to which I am unable 
to respond -- since I cannot post to that list -- was extremely unfair, 
and rather upsetting to see.  I have a great deal of respect for your 
insight and contributions, and I do not appreciate being maligned that 
way, even if you do find my points frustratingly difficult to understand 
sometimes.)

And do you *still* think I merely need to go read a book on model 
theory, or have we now (I hope) got past that?  If not, what aspects of 
model theory do you still think I misunderstand?  I've found your 
explanations excellent, BTW, but I wouldn't expect you to personally 
explain everything that you think I need to know.  I'd be happy to read 
up further on specific aspects that you think are critical to this 
discussion.

The bottom line here is that some of the statements -- and intuition -- 
in the existing RDF drafts are just plain *wrong* and need to be 
corrected.  In particular, the statement in RDF Concepts that says "IRIs 
have global scope: Two different appearances of an IRI denote the same 
resource" is just factually *wrong*.   IRIs are indeed *intended* to 
everywhere denote the same resource, and that's a perfectly good goal, 
even if it is inherently unachievable.  But it is a *goal* -- it is not 
the reality.

Best wishes,
David
Received on Monday, 28 October 2013 22:09:35 UTC