RE: Syntax vs Semantics vs XML Schema vs RDF Schema vs QNames vs URIs (was RE: Using urn:publicid: for namespaces) from Patrick.Stickler@nokia.com on 2001-08-16 (www-rdf-interest@w3.org from August 2001)

From: <Patrick.Stickler@nokia.com>
Date: Thu, 16 Aug 2001 14:54:14 +0300
To: dallsopp@signal.dera.gov.uk
Cc: www-rdf-interest@w3.org
Message-ID: <2BF0AD29BC31FE46B78877321144043114BF90@trebe003.NOE.Nokia.com>
> -----Original Message-----
> From: ext David Allsopp [mailto:dallsopp@signal.dera.gov.uk]
> Sent: 16 August, 2001 13:24
> To: Stickler Patrick (NRC/Tampere)
> Cc: www-rdf-interest@w3.org
> Subject: Re: Syntax vs Semantics vs XML Schema vs RDF Schema vs QNames
> vs URIs (was RE: Using urn:publicid: for namespaces)
> 
> 
> 
> 
> Patrick.Stickler@nokia.com wrote:
> > 
> > > > rather than a single opaque URI identifier.
> > >
> > > But this is just querying - you have to do that anyway to find
> > > out what the "opaque URI" actually is.
> > 
> > Why would you need to find out what a URI "is"? Do you
> > mean dereferencing it? Surely dereferencing of URIs is not
> > required for any kind of RDF based inferencing.
> 
> Ok, I'm confused as to what you meant - you appeared to be saying that
> it was difficult to refer to the anonymous resource because you would
> have to use its 'surrounding' related nodes to identify it.  My point
> was that even if the node does have an opaque URL, if the data are at a
> remote site or agent you have no idea what that URL string is, and have
> to form a query in order to find out.  This query would of course use
> the related nodes, so the situation is no different.

But surely such dereferencing of resource URIs is separate from
any inferences being made against a given knowledge base. An agent might
query an RDF knowledge base (agent) to get the URI of a resource from
which it might be able to retrieve more knowledge, which could
be syndicated into the RDF knowledge base, but such retrieval and
syndication of knowledge is outside the scope of basic query
and inference processes operating on the knowledge base proper.

Right?
 
> If on the other hand you have accessed and parsed that RDF locally, you
> will have generated a local ID for that resource and can refer to it
> using that ID.
> 
> > Even if some application may wish to dereference a URI for
> > some purpose, that URI is not a "URI" per se to RDF, it is
> > simply an opaque universal identifier, no?
> 
> Yes; I wasn't suggesting dereferencing.

Great. Didn't think you were ;-)

> > > John --hasFather--> [] --age--> 84
> > >
> > > John --hasFather--> [] --age--> 84
> > >
> > > compared with
> > >
> > > John --hasFather--> randomgenid0123456789 --age--> 84
> > >
> > > John --hasFather--> randomgenid9876543210 --age--> 84
> > >
> > > where [] represents an anonymous node.
> > >
> > > The point is that we don't know the name of John's father, so
> > > assigning him a random name makes our life harder, not easier,
> > > since everybody necessarily assigns him a _different_ random name.
> > 
> > But this is exactly my point. There is no such thing as an anonymous
> > node! It always gets a randomly generated system identifier!
> 
> So what? In principle, the system can keep track of which nodes are in
> fact anonymous and distinguish them from the others.

Eh? Fair enough, in principle, but is that a requirement of the RDF spec?

Can you tell me of any RDF engines that not only differentiate
anonymous nodes from named ones but also disregard their identities
during inference? (I'd love to know of them, so I can use them.)

And regardless, if it is not a requirement of the standard, then that
means that I can't have portable, generic SW agents that can rely on
such interpretation and treatment of anonymous nodes by any arbitrary
RDF engine which is compliant with the standard.
 
> [Aside: perhaps this is rather like the Robinson Crusoe story, where he
> meets a foreigner on his island; not knowing his name, and having met
> him on a Friday, he calls him "Man Friday". The man presumably has a
> real name, but we don't know it - we have to call him _something_, but
> we acknowledge it isn't really his name.]

Well, I'd still say that "Man Friday" is a name, even if it isn't
what he would consider his "primary" name.

E.g. "New York City", "The Big Apple", etc. ;-)
 
> > So if I get the same statement twice (e.g. it happens to be defined
> > redundantly in two disparate sources) then a given system will
> > assign *different* system identities to each anonymous node
> > for each essentially equivalent statement.
> 
> Not necessarily - we have the option of keeping track of which
> resources are anonymous and handling them specially if we want.

See above...
 
> > Would it not be far better to have a "variable" for an anonymous
> > node which is based on the fusion of the subject and predicate
> > identities? Thus rather than the current practice where
> > 
> >  John --hasFather--> [] --age--> 84
> >  John --hasFather--> [] --age--> 84
> > 
> > results in
> > 
> >  [John, hasFather, gen123]
> >  [gen123, age, 84]
> >  [John, hasFather, gen456]
> >  [gen456, age, 84]
> 
> This is what tends to happen, but we can in principle detect the
> anonymous nodes and do more intelligent merging.

But at what point should such merging be done transparently and in a
consistent fashion by all conformant applications?
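To make the duplication concrete, here is a minimal Python sketch of what
a store might do today and of one possible merging heuristic. Everything
in it is an assumption made for illustration: the names load and
merge_anonymous, the "genN" identifiers, and the rule that two anonymous
nodes are merged when they have identical incoming and outgoing arcs.
Nothing here is required by the RDF spec, which is exactly the problem.

```python
from itertools import count
from typing import Dict, Iterator, List, Set, Tuple

Triple = Tuple[str, str, str]
# One source statement: subject, predicate, properties of the anonymous object.
Statement = Tuple[str, str, Dict[str, str]]


def load(source: List[Statement], store: Set[Triple],
         anon: Set[str], ids: Iterator[int]) -> None:
    """Add statements to the store, minting a fresh system identifier
    (hypothetical "genN" form) for every anonymous node encountered."""
    for subject, predicate, props in source:
        node = f"gen{next(ids)}"
        anon.add(node)  # remember that this identifier is system generated
        store.add((subject, predicate, node))
        for prop, value in props.items():
            store.add((node, prop, value))


def merge_anonymous(store: Set[Triple], anon: Set[str]) -> Set[Triple]:
    """Collapse anonymous nodes that have exactly the same incoming and
    outgoing arcs. Single pass only: it assumes the neighbours of an
    anonymous node are named nodes or literals."""
    def signature(node: str):
        incoming = frozenset((s, p) for s, p, o in store if o == node)
        outgoing = frozenset((p, o) for s, p, o in store if s == node)
        return incoming, outgoing

    representative: Dict[object, str] = {}
    rename: Dict[str, str] = {}
    for node in sorted(anon):
        # The first node seen with a given signature stands for all of them.
        rename[node] = representative.setdefault(signature(node), node)
    return {(rename.get(s, s), p, rename.get(o, o)) for s, p, o in store}


# The same statement arriving from two disparate sources.
source = [("John", "hasFather", {"age": "84"})]
store: Set[Triple] = set()
anon: Set[str] = set()
ids = count(1)
load(source, store, anon, ids)
load(source, store, anon, ids)
merged = merge_anonymous(store, anon)
```

After the two loads the store holds four triples using gen1 and gen2;
after merging it holds two. Whether, when, and by what rule such a merge
happens is left to each implementation in this sketch.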
 
> > which is *not* what was intended; we instead could get
> > 
> >  [John, hasFather, rdf:anonymous:(John)(hasFather)]
> >  [rdf:anonymous:(John)(hasFather), age, 84]
> > 
> > with neither redundancy nor irreconcilable equivalence, and
> > where the implicit but regular (not system dependent) identity of
> > an anonymous node is defined in terms of a special RDF specific
> > URI scheme and sub-type for anonymous nodes.
> 
> I have no objection to explicit identification of anonymous nodes, but I
> don't think your suggested scheme solves the problem yet (nice idea
> though...):
> 
> John --hasFather--
>                   |
>                   [] --age--> 84
>                   |
> Jim --hasBrother--
> 
> What's the URI of the anonymous node here? If I add more triples
> pointing to it, then what?
> 
> [Actually there may be a wider problem here as I don't think that graph
> can be serialized in XML RDF with an anonymous node 8-) So some explicit
> identification scheme may be needed...]

But this is really the crux of my "beef" with anonymous nodes. Even if
the creator of the above knowledge doesn't know the "official" or "primary"
name of that resource, it needs to give it *some* shared name, or it has
to have a way to define equivalence between two implicit but regular
identities of that same resource. I.e.
 
   [John, hasFather, x]   
   [x, age, 84]
   [Jim, hasBrother, x]   
or
   [John, hasFather, rdf:anonymous:(John)(hasFather)]   
   [rdf:anonymous:(John)(hasFather), age, 84]
   [Jim, hasBrother, rdf:anonymous:(Jim)(hasBrother)]   
   [rdf:anonymous:(Jim)(hasBrother), age, 84]
   [rdf:anonymous:(John)(hasFather), 
    daml:equivalentTo, 
    rdf:anonymous:(Jim)(hasBrother)]

The benefits of such a standardized, consistent representation for
anonymous nodes are that (a) every application will use the same
identity, so no system specific identifiers, (b) equivalences are
explicit between different implicit identities of the same anonymous
node, so inference can exploit them, (c) anonymous nodes have legal
URIs for identity.
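The second form above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the function names anon_id
and canonicalize are hypothetical, the "rdf:anonymous:" scheme is the
proposal from this message and not an existing standard, and the
equivalences are passed in as plain pairs rather than being read from
daml:equivalentTo triples.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def anon_id(subject: str, predicate: str) -> str:
    """Build the regular, system-independent identity of an anonymous
    node from its subject and predicate (proposed scheme, not a standard)."""
    return f"rdf:anonymous:({subject})({predicate})"


def canonicalize(triples: Iterable[Triple],
                 equivalences: Iterable[Tuple[str, str]]) -> Set[Triple]:
    """Rewrite triples so that all identities declared equivalent share
    one representative, using a small union-find structure."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the lexicographically smaller name so that every
            # application picks the same representative.
            if rb < ra:
                ra, rb = rb, ra
            parent[rb] = ra

    for a, b in equivalences:
        union(a, b)
    return {(find(s), p, find(o)) for s, p, o in triples}


father = anon_id("John", "hasFather")
brother = anon_id("Jim", "hasBrother")
triples = [
    ("John", "hasFather", father),
    (father, "age", "84"),
    ("Jim", "hasBrother", brother),
    (brother, "age", "84"),
]
result = canonicalize(triples, [(father, brother)])
```

Two applications that load the same statement compute the same anon_id,
so no merging is needed in that case, and the explicit equivalence
reduces the four triples above to three about a single node.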

Of course, banning anonymous nodes and forcing the use of explicit
resource identity, even if not a primary identity, makes things a lot
easier and simpler, even if it means a lot more names -- and in fact
seems to be mandatory for either serializations or triples defining
shared anonymous nodes.

> [neat encoding of statements]

I thought it was rather nice myself, thanks ;-)
 
> > Thus, the issue is not really so much about anonymous nodes but
> > that they are in fact *not* anonymous within a given system, being
> > given unique and disjunct identities -- nor are they really anonymous
> > in the conceptual graph, as they represent a single actual resource
> > having an implicit identity based on their context within a statement
> > (which all nodes have, even if given an explicit URI identity).
> 
> They are anonymous in the syntax, and have a temporary name in
> implementations (although one could probably come up with an
> implementation where they were treated specially and so only really had
> a memory address or something).
> 
> Does something have to have a name in order to be distinct? I don't see
> that it does - as we said before, it can be identified by its
> surroundings.

Right, but for all practical purposes, identifying something
by its surroundings and naming something are really equivalent
in function (though the names might get rather long if the
context is very complex).

> Generating a name such as "thingNextToFoo" is just a
> convenience for this identification.
> I do believe that 'anonymous' nodes are different to others in that the
> name is _only_ a convenience, and could be changed at random without
> affecting anything (in principle - provided the change is distributed
> appropriately!).

Do you mean that the anonymous node doesn't correspond to a specific
resource? Or just that the system-generated internal identity of that
node could be changed? I of course fully agree with the latter.
 
> I guess the difference is that the name can be removed from any given
> graph WITHOUT LOSS OF INFORMATION, (only loss of convenience). Removing
> any other name changes the graph, by removing information.

Unless that name is referenced somewhere.
 
Regards,

Patrick

--
Patrick Stickler                      Phone:  +358 3 356 0209
Senior Research Scientist             Mobile: +358 50 483 9453
Software Technology Laboratory        Fax:    +358 7180 35409
Nokia Research Center                 Video:  +358 3 356 0209 / 4227
Visiokatu 1, 33720 Tampere, Finland   Email:  patrick.stickler@nokia.com
Received on Thursday, 16 August 2001 07:54:22 UTC