Re: The resource identity guessing game vs. bounded ambiguity [was Re: Comments on "SPARQL 1.1 Uniform HTTP Protocol for Managing RDF Graphs"] from Nathan on 2011-03-26 (public-awwsw@w3.org from March 2011)

From: Nathan <nathan@webr3.org>
Date: Sat, 26 Mar 2011 22:58:57 +0000
To: David Booth <david@dbooth.org>
CC: Pat Hayes <phayes@ihmc.us>, Tim Berners-Lee <timbl@w3.org>, Kjetil Kjernsmo <kjekje@ifi.uio.no>, SW-forum Web <semantic-web@w3.org>, Jonathan Rees <jar@creativecommons.org>, AWWSW TF <public-awwsw@w3.org>
Message-ID: <4D8E6FB1.7010608@webr3.org>
Hi David :)

(aside: cc'd to AWWSW TF for archiving as highly relevant to the work we 
do there)

David Booth wrote:
> Hi Nathan,
> 
> Excellent and very insightful analysis!  The "giant, global 
> graph with unique identities" approach that you describe 
> is fine for some limited application areas, such as:
> 
>  - within a relatively small, controlled environment; or
> 
>  - with applications that are willing to assume the risk of
>    unstable definitions.
> 
> But it is not sufficient as a general approach at web scale.
> 
> The reason, in essence, is that it sets up an endless guessing
> game between a URI's owner and its users: the URI owner
> thinks of a unique resource, but provides a definition that
> only gives hints about it, and the users of that URI must
> guess its identity.  Each time the URI owner updates the
> definition to add more hints, some of those users discover
> that they guessed wrong, and, through no fault of their own,
> their work is no longer consistent with the URI's definition.

Unless some out-of-band communication has gone on, of course: meaning is 
given by everything from comments in a file to use in human-to-human 
communication.

> I'll explain in more detail, but the explanation involves
> multiple steps, so bear with me.
> 
> 1. I assume you mean this "giant, global graph" to be
> consistent, since otherwise it would be meaningless.

I doubt you'd ever know if it was consistent or inconsistent tbh, since 
the odds are extremely low that you'd ever see it all, and the 
practicalities involved in getting it all would mean that it had been 
invalidated by time (well, changed) by the time you had it, and even 
then, you'd never know if there was some bit of it that you missed!

> Incidentally, I've been referring to this as myth #2:
> http://dbooth.org/2010/ambiguity/paper.html#myth2 .
> 
> But how on earth could we expect to know what that giant,
> global graph should be?  Obviously we cannot assume that
> it consists of the merge of *all* RDF graphs, since that
> would clearly be an inconsistent mess.  On the web, anyone
> can say anything about anything, and much of it is rubbish.
> So we cannot, in advance, *assume* that we have such a graph
> and use that as the basis for showing how a "unique identity"
> approach works based on that assumption.
> 
> Instead, we need to go in the opposite direction: start with
> two graphs that we *can* assume are (individually) consistent,
> and then *merge* them to come incrementally closer to that
> idealized, giant, global graph.  As with proof by induction,
> if we can show that an approach to resource identity works
> for *one* small graph, *and* we can show how it works when
> two graphs are merged, then we have shown how it can work
> on increasingly larger graphs.  Thus, in the limit as time t
> goes to infinity we would reach nirvana, where all knowledge
> of the universe has been formally encoded, and there is only
> one, unique interpretation of the graph: every URI uniquely
> identifies exactly one resource. ;)
> 
> I imagine this was the intent behind your idealized "giant,
> global graph", so now let's proceed in this direction.

That would be ideal, yes, and I hope it's the intention of most of us!

> 2. To avoid vagueness, and to prevent the possibility of any
> hidden "then a miracle occurs" step,
> http://star.psy.ohio-state.edu/coglab/Miracle.html 
> let us assume that the resource definition is provided only in
> RDF -- not natural language.  This assumption seems reasonable
> because: (a) RDF definitions facilitate machine processing,
> which is the whole point of using RDF to begin with; and (b)
> in principle any natural language definition could be expressed
> in RDF.

Perhaps the above is the critical difference in viewpoints: one 
supposes the above to be a constraint (or something that should be 
accepted as vital), and the other supposes that meaning is given by more 
than just RDF; it's also given by natural language, communication and 
out-of-band knowledge, and what we do must consider that "real world" 
communication too. Maybe, I'm not sure.

> 3.  Now suppose that a URI owner, Oliver, mints a URI u that is
> intended to uniquely identify a particular resource that he has
> in mind -- Nathan's TV.  As we know already, it is not possible
> for Oliver to describe this resource unambiguously, so as a
> simple example, let us assume that he (initially) provides a
> definition containing only the following assertions in graph gd:
> 
>   # Oliver's definition of <u> -- graph gd
>   <u> a :TV .
>   <u> :hasOwner :Nathan .
> 
> 4. Next, an RDF statement author, Alice, uses Oliver's URI to
> publish a new RDF graph, ga:
> 
>   # Alice's graph ga
>   <u> :alphaMax 27 .
>   . . .
> 
> Since <u> is supposed to identify a unique resource globally,
> Alice would like to verify that the resource she *thinks* <u>
> is supposed to identify is the right one, that is, determine
> whether her new RDF graph, ga, would give the URI the same
> resource identity that Oliver has in mind.  But given only the
> URI's resource definition (graph gd), how can Alice possibly
> determine this?
> 
> Clearly it isn't reasonable on web scale to expect Alice to
> personally ask Oliver for clarification.

Perhaps it should be made possible where reasonable to do so :) could be 
most useful.

> So, barring magic
> or miracles, the best Alice can do is to merge her graph ga
> with Oliver's resource definition gd and check for consistency.
> But, even if the merge is consistent, that does *not* indicate
> that Alice's graph ga actually *does* use the URI to denote the
> exact same resource that Oliver intended.  It only indicates
> that it *could*: the merge admits at least one satisfying
> interpretation.

Assuming that <u> is the same identifier in both ga and gd ...

If it does not indicate that <u> is used to denote the same resource 
in both graphs, how would you ever know? In the merged graph, how would 
you know which statements came from where? Surely that means that in the 
following (merged) graph all three statements might be about three 
different resources?

  <u> a :TV .
  <u> :hasOwner :Nathan .
  <u> :alphaMax 27 .

If we work on that assumption, and all graphs we ever encounter may be a 
merge of other graphs, then why have names at all?
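
To make the "which statements came from where" question concrete, here 
is a minimal sketch in Python. It is only an illustration: triples are 
modelled as plain tuples, and the source labels "oliver" and "alice" and 
the helper sources_of are assumptions made for this example, not part of 
RDF or of any library.

```python
# Hypothetical illustration: triples are plain (subject, predicate, object) tuples.
gd = {("u", "rdf:type", ":TV"), ("u", ":hasOwner", ":Nathan")}  # Oliver's definition
ga = {("u", ":alphaMax", 27)}                                    # Alice's graph

# A plain merge is a set union: the result no longer records who said what.
merged = gd | ga

# Keeping the source alongside each triple (a "quad") preserves that information.
quads = {("oliver", *t) for t in gd} | {("alice", *t) for t in ga}


def sources_of(triple, quads):
    """Return the set of sources that asserted the given triple."""
    return {src for (src, s, p, o) in quads if (s, p, o) == triple}
```

With only the merged set there is nothing left to ask about origin; with 
the quads, sources_of(("u", ":alphaMax", 27), quads) still points back 
at Alice.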

> In other words, all that Alice can determine is that the
> cloud of possible resources that <u> *might* identify
> in gd and ga overlaps, as illustrated in Figure 18 here:
> 
> 
> 5. Note that Alice's graph contains an assertion that makes
> further assumptions about the identity of <u>.  In essence,
> she has made a *guess* about the true, unique identity of <u>.
> This is normal: *anything* that Alice's graph may say about
> <u> that is not already entailed by Oliver's definition runs
> the risk of being "wrong" when Oliver tightens his definition.
> And it is likely that Alice *will* make statements about <u>,
> because, after all, she has chosen to use <u> in her graph
> for a reason.

Perhaps if they had namespaced their names to (oliver, tv) and (alice, 
tv), and then Alice had said that the two names referred to the same 
thing, this mess could be avoided (or at least cleaned up), since Alice 
could later remove that equivalence assertion.

> To phrase this in terms of the RDF Semantics, Alice's
> statements add constraints that reduce the set of satisfying
> interpretations.  For example, in this case Alice has eliminated
> all possible interpretations in which the thing's alpha --
> characterized by a :alphaMax and :alphaMin -- is greater
> than 27.
> 
> 6. Next, a different RDF statement author, Bob, 
> publishes a different graph gb using Oliver's URI:
> 
>   # Bob's graph gb
>   <u> :alphaMin 43 .
> 
> Bob and Alice know nothing of each other's work.  Bob makes
> the same consistency checks that Alice made, and his graph is
> also consistent with Oliver's definition.
> 
> 7. Next, Charlie wishes to merge Alice's graph ga with Bob's
> graph gb, but since (we'll assume) something's alpha value
> cannot have both a maximum of 27 and a minimum of 43, he finds
> that the merge is inconsistent.  What can Charlie do?
> 
> Charlie cannot convince either Alice or Bob to "fix" their
> data, because neither of them sees a problem with their data.
> In theory Charlie could first try to convince Oliver to tighten
> up the definition of <u>, and *then* he might convince Alice or
> Bob -- whoever had guessed wrong about the alpha value -- to fix
> his/her data, but this is not feasible to expect at web scale.

In all honesty David, I'm wondering why Alice and Bob are making 
statements about my TV without ever having seen it.

To jump right to the core of this though, without the notion of who is 
saying what in RDF, I don't see how this can work (well). There will 
always be inconsistent data, and different view points, and it's up to 
us to figure out what we're believing to be true and what we are not; 
and that we cannot do by blindly merging assertions, rather we need to 
consider different combinations and come to our own conclusions about 
what is true and what is not, who to trust, and who not for whatever 
purpose.
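
As a rough sketch of what "considering different combinations" could 
look like for the alpha example, assuming each source is reduced to the 
bounds it states on the alpha value of <u>. The dictionary layout and 
the function names are hypothetical, and the consistency rule (no stated 
minimum may exceed a stated maximum) is the assumption David makes in 
step 7, not a general RDF consistency check.

```python
from itertools import combinations

# Hypothetical encoding: each source contributes bounds on the alpha value of <u>.
# Property names and values are taken from the example graphs in this thread.
graphs = {
    "oliver": [],                  # gd says nothing about alpha
    "alice": [(":alphaMax", 27)],  # ga
    "bob": [(":alphaMin", 43)],    # gb
}


def is_consistent(sources, graphs):
    """A combination is consistent if the largest stated minimum does not
    exceed the smallest stated maximum."""
    maxima = [v for s in sources for (p, v) in graphs[s] if p == ":alphaMax"]
    minima = [v for s in sources for (p, v) in graphs[s] if p == ":alphaMin"]
    if not maxima or not minima:
        return True
    return max(minima) <= min(maxima)


def consistent_combinations(graphs):
    """List every non-empty combination of sources whose merge is consistent."""
    names = sorted(graphs)
    return [
        combo
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
        if is_consistent(combo, graphs)
    ]
```

Here every combination that contains both "alice" and "bob" is rejected, 
while each of them remains consistent with "oliver" on its own. Which of 
the surviving combinations to believe is still a matter of trust, not 
something the check can decide.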

> Probably the best that Charlie can do is to either: (a)
> make his own guess about whether to side with Alice or Bob,
> and manually discard some of the other's assertions;
> or (b) split the identity of <u>, as described in
> http://dbooth.org/2010/ambiguity/paper.html#splitting

or (c) not offer propositions blindly about things he doesn't know (you 
must understand what something is pretty well to offer such a 
proposition as alphaMin).

> Observation: At web scale we cannot expect RDF statement authors
> to be able to influence other people's URI definitions or RDF
> data, but statement authors still need to be able to make RDF
> statements using other people's URIs.

This can be broken down considerably though; it depends very much on how 
the other person's URI is being used. For example, the URI of a property, 
the URI of a class, the URI of a color and countless more may all be 
used in the p or o positions of statements pretty reliably. One has to 
be pretty sure of what a URI refers to when using it in the s position 
though!

> 8. Now let's consider what happens when Oliver *does* decide
> to refine his definition, since this is the only way he can
> hint at the unique identity of <u>, and the objective is to
> continually tighten our definitions until we reach nirvana.  :) 
> Oliver adds the following triple to his definition, gd2:
> 
>   # Oliver's new definition of <u> -- graph gd2
>   <u> a :TV .
>   <u> :hasOwner :Nathan .
>   <u> :alphaMax 32 .
> 
> Through no fault of Bob, Oliver has just broken Bob's graph gb,
> because gb is now inconsistent with Oliver's new definition,
> gd2.  Regardless of the fact that Bob's graph gb may contain
> valuable information, it is now clear that <u> cannot identify
> the same resource in gb as it does in gd2.

I'm not sure that is clear, more clear is that Bob did not know what he 
was speaking about when he made his statement...

> Furthermore, if we play this through farther, the more Oliver's
> definition of <u> is updated and tightened to more precisely
> identify the true resource that Oliver intended, the more it
> becomes inconsistent with existing graphs that used <u>.

used <u> in the subject position to make statements about something they 
did not know, that is... How often, and who is actually doing this?

> Finally, since Oliver's definition itself may have used other
> URIs whose definitions may change, Oliver would likely be forced
> to rewrite it *differently* -- not just tighten it -- when
> some of those definitions change and it becomes inconsistent,
> thus breaking Alice and Bob's graphs in a different way.

Alice and Bob's graphs are not "broken" though. (1) They were describing 
something they did not know and were not in a position to be describing; 
(2) the merge of the graphs has inconsistencies.

Typically this would lead me to believe (rightfully) that Alice and Bob 
didn't have a clue what they were talking about, and I'd reject their 
offered assertions.

> In essence then, the very process that was intended to bring us
> closer to the goal of a giant, global graph is the same process
> that causes instability, and the more we advance toward that
> goal, the more instability we create.
> 
> This kind of instability may be manageable in a small, closed
> environment where you can control all of the definitions and
> keep them all in sync.  And it may also be an acceptable risk
> to *some* applications.  But it is not a workable approach at
> web scale for applications that need a more stable foundation.

Sorry, but that doesn't follow for me. All that needs to happen here is:

1) namespace names to include the "source" or voice of who's making the 
statements ( Oliver , tv ) / <olivers-description#tv>

2) don't make bold statements about something when you do not know what 
that thing is

3) be prepared to have different / conflicting / inconsistent 
information about things when you merge graphs from different sources, 
and be prepared to work out what you "believe" and what you don't (in 
the context of the current question you're trying to answer of course).
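
A minimal sketch of points 1 and 3, assuming names are modelled as 
(source, local name) pairs and equivalence between names is stored as 
separate, retractable assertions. The data layout and the function names 
are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical model: a name is a (source, local_name) pair, e.g. ("oliver", "tv").
statements = [
    (("oliver", "tv"), "rdf:type", ":TV"),
    (("oliver", "tv"), ":hasOwner", ":Nathan"),
    (("alice", "tv"), ":alphaMax", 27),
]

# Equivalence assertions are kept separately so they can be retracted later.
same_as = {frozenset({("alice", "tv"), ("oliver", "tv")})}


def equivalent_names(name, same_as):
    """Return every name reachable from `name` through equivalence assertions."""
    seen = {name}
    frontier = [name]
    while frontier:
        current = frontier.pop()
        for pair in same_as:
            if current in pair:
                for other in pair - seen:
                    seen.add(other)
                    frontier.append(other)
    return seen


def statements_about(name, statements, same_as):
    """Collect statements whose subject is `name` or an asserted equivalent."""
    names = equivalent_names(name, same_as)
    return [(s, p, o) for (s, p, o) in statements if s in names]
```

While the equivalence is asserted, asking about ("oliver", "tv") also 
returns Alice's alphaMax statement; calling statements_about with an 
empty same_as set returns only Oliver's two statements, so the two 
descriptions can be separated again without editing either graph.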

> 9. What is the alternative?  For semantic web architecture
> to work at web scale, I see no option but to acknowledge the
> essential ambiguity of resource identity, precisely *bound*
> that ambiguity with URI definitions (a/k/a URI declarations),
> and learn to live with it.  Each definition will be precise
> *enough* for some applications even as it is ambiguous for
> others.

Vastly different view points and approaches it seems! Or perhaps we see 
the problem differently and thus have different solutions (which we 
reasonably would, if we've been seeing the problem differently, or 
working from different sides of it).

Interesting for sure.

Best,

Nathan

> Specifically, instead of assuming that a URI definition is
> an incomplete description of a globally unique resource,
> assume that the definition is the *complete* description
> of the resource: the definition is all you get, and *any*
> interpretation that is consistent with it is legitimate.
> 
> This permits an application to know just enough about a URI's
> resource identity to get its job done, while providing a stable
> foundation for RDF authors.
> 
>      ------------------
> 
> A few more inline comments below . . .
> 
> On Wed, 2011-03-23 at 01:12 +0000, Nathan wrote: 
>> Hi Pat,
>>
>> Here's how I see it (discussing things we can't see again).
>>
>> On a universal scale (as in giant global graph) we have a set of nodes, 
>> each node is associated with one or more unique names, and one or more 
>> propositions. Each node can be seen as having a 1-1 relation with a 
>> single distinct thing (whether real or abstract), and the set of 
>> propositions bound to that node can be seen as characterizing (not 
>> defining) the thing which the node is related to. Exactly what those 
>> propositions characterize is open to interpretation, and when you're 
>> only working with subsets of the global graph (as is the norm) what the 
>> node is interpreted as characterizing gets increasingly less specific 
>> and ever more ambiguous.
>>
>> If we split the previous paragraph in half, then by looking at only the 
>> first half we can argue that each name has at most one referent, and 
>> each thing can have multiple names (a many-1 relation). If we look at 
>> the second half then we can argue that each name can have multiple 
>> referents, and each thing multiple names (a many-many relation).
>>
>> An application may not need to consider or know every property of a 
>> thing to answer the question it is being asked, and may not need to (or 
>> be able to) make distinctions between unique things.
>>
>> So, to what does a name refer?
>>
>> To me it is important to view each name as having at most one referent, 
>> then if you tell me that you interpret the name as referring to 
>> something else, I can offer some more propositions and refine my 
>> description, in order that we may collectively describe the world and 
>> hopefully start to understand each thing.
> 
> If you add those propositions to your existing URI definition then you
> risk breaking downstream applications that used your URI.  This may be
> the policy that you want, and if so it is important to publish your
> change policy, so that others can choose whether to accept this risk.
> But for more stability, you can instead mint a new URI with a tighter
> definition.  There is a trade-off between the two policies.
> 
> David Booth
> 
>> So, whilst I understand that the distinctions don't always matter, and 
>> that it's generally nigh on impossible to define a thing unambiguously, 
>> I still feel it is critically important to view each name as having a 
>> single referent, and to view each name as identifying a unique thing, 
>> unless told otherwise (by proposition or inference).
>>
>> in-line:
>>
>> Pat Hayes wrote:
>>> On Mar 20, 2011, at 10:30 AM, Nathan wrote:
>>>> This is why we couple descriptions to names, to give an indication of what we are using a name to refer to, sure our descriptions may be ambiguous and open to refinement, but our names are not; because we are not using simple string token names "everest" or "lightbulb", we're using distinct URIs.
>>> So, are you saying it is the *syntax* of URIs which gives them this magical quality? So one gets unambiguous reference by putting a colon in the name somewhere?  OK, forgive my sarcasm: but if this is not what you are saying, just what ARE you saying, that gives URIs this amazing ability to reach out into the world and seize upon their single unique referent?
>> The point I was trying to make (badly) was twofold:
>>
>> 1: Rather than saying "when I say X I mean this" and "when you say X you 
>> mean that" (where this != that), as humans with limited vocabulary often 
>> do, we can instead use URIs, which give us a wider vocabulary and greater 
>> opportunity to have one or more unique names for each referent.
>>
>> 2: The magical quality is in the specs and a social agreement, that we 
>> will typically consider each URI as having at most one referent, thus 
>> allowing us to say that each URI unambiguously identifies a single 
>> thing; even when the interpreted characterization of that thing is 
>> ambiguous.
>>
>>> [snip]
>>>> So, I have to conclude that the names aren't ambiguous here
>>> What would lead you to that conclusion? I don't see that you have argued for it anywhere. Like TimBL's claim, it seems to be a matter of W3C Dogma rather than an actual observation or even a rationally defended position. And as it is radically false, and indeed in many cases *provably* false, it seems rather obtuse to be defending it with so slender an excuse or argument. 
>> Hopefully the above helps explain my own personal thinking on it, well, 
>> as well as I can understand things given my limited knowledge.
>>
>> Best,
>>
>> Nathan
>>
>>
>>
>
Received on Saturday, 26 March 2011 23:00:22 UTC