Re: Syntax vs Semantics vs XML Schema vs RDF Schema vs QNames vs URIs (was RE: Using urn:publicid: for namespaces) from Thomas B. Passin on 2001-08-18 (www-rdf-logic@w3.org from August 2001)

From: Thomas B. Passin <tpassin@home.com>
Date: Sat, 18 Aug 2001 01:22:27 -0400
To: <www-rdf-logic@w3.org>
Message-ID: <008f01c127a5$c5a81a60$7cac1218@reston1.va.home.com>
[David Allsopp]
>
> Finally, this hinders you in identifying equivalent resources when you
> merge data from two sources, because the two sources have to assign a
> pseudo-unique URI to the resource; a third party can't then determine
> that the nodes really refer to the same thing, e.g:
>
> John --hasFather--> [] --age--> 84
>
> John --hasFather--> [] --age--> 84
>
> compared with
>
> John --hasFather--> randomgenid0123456789 --age--> 84
>
> John --hasFather--> randomgenid9876543210 --age--> 84
>
> where [] represents an anonymous node.
>
> The point is that we don't know the name of John's father, so assigning
> him a random name makes our life harder, not easier, since everybody
> necessarily assigns him a _different_ random name.

This apparently simple example actually illustrates how hard it can be to
merge data from different sources.  For this example to work the way David
seems to want, the system that performs the merging has to know something
more about the "hasFather" property.  Suppose, for example, that the two
sources were like this instead:

John --hasFather--> [] --age--> 84

John --hasFather--> [] --age--> 86

Now the fact that the node [] is not labeled doesn't help us know how to
merge these triples.  We don't know whether one or both of the triples is
wrong, or whether John really has two fathers of different ages.

In the same way, in David's original version, it could be that John has two
fathers of the same age, and each data source happens to know about one of
them. The only way we could resolve this question is to know that there is a
constraint such that a person can only have one father.  But if we know
that, then David's second (labeled) form can also be resolved.

My conclusion is that this supposed advantage of having unlabeled nodes is
actually no advantage at all - labeled or unlabeled, we still need to know
something else about the predicate in question (besides its label) to be
able to merge data from separate data sources.  I suppose this would be
considered part of the semantics of persons and of the hasFather predicate.
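
To make the point concrete, here is a minimal sketch of a merge that
depends on knowing that hasFather is single-valued.  Everything in it is
an assumption made for illustration: triples are plain Python tuples,
FUNCTIONAL and merge() are hypothetical names, and only one level of
aliasing is handled.  It is not how any particular RDF processor works.

```python
# Hypothetical sketch: a triple is a (subject, predicate, object) tuple.
# Assumption: FUNCTIONAL lists the predicates known to be single-valued.
FUNCTIONAL = {"hasFather"}


def merge(source_a, source_b, functional=FUNCTIONAL):
    """Merge two triple lists, unifying the objects of functional predicates."""
    triples = list(source_a) + list(source_b)
    alias = {}  # node label -> label it has been unified with
    seen = {}   # (subject, predicate) -> first object seen for that pair
    for s, p, o in triples:
        if p in functional:
            first = seen.setdefault((s, p), o)
            if first != o:
                # Same subject, single-valued predicate: the two object
                # nodes must be the same resource, whatever their labels.
                alias[o] = first
    # Rewrite every triple using the unified labels (one level only).
    return {(alias.get(s, s), p, alias.get(o, o)) for s, p, o in triples}


anonymous = merge(
    [("John", "hasFather", "_:a"), ("_:a", "age", 84)],
    [("John", "hasFather", "_:b"), ("_:b", "age", 84)],
)
labeled = merge(
    [("John", "hasFather", "randomgenid0123456789"),
     ("randomgenid0123456789", "age", 84)],
    [("John", "hasFather", "randomgenid9876543210"),
     ("randomgenid9876543210", "age", 84)],
)
```

Under that constraint both the anonymous and the randomly labeled data
collapse to the same two triples.  Called with an empty set of functional
predicates, the same data stays as four triples in either form, and with
ages 84 and 86 the unified father node ends up with two age values, which
exposes the conflict without resolving it.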

Now if John did indeed have two fathers after all, the situation would be
more complex, because on merging the two data sources, our processor would
have to convert the hasFather statements into some kind of a container,
which itself would apparently be an anonymous node.  It would be something
like this (with apologies for the undefined terms):

subj                    property      object
John                    hasFathers    []
[]                      typeOf        container
[]                      RDF_1         randomgenid0123456789
randomgenid0123456789   typeOf        malePerson
[]                      RDF_2         randomgenid9876543210
randomgenid9876543210   typeOf        malePerson
randomgenid0123456789   age           84
randomgenid9876543210   age           86

Without the labels randomgenid0123456789, randomgenid9876543210 it would be
much harder to make this set of assertions, if it were possible at all.
Certainly we couldn't enter this data into a relational database, or write
it down on paper in a table like this, since we would have to use three
different "[]" symbols that looked the same but referred to three different
resources.  To be consistent, of course, I should have used a generated id
instead of [] for the container as well.
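
As a rough sketch of the consistent version, the table can be built with a
generated id for every node, the container included.  The names gen_id()
and fathers_container() are hypothetical, and the property names
(hasFathers, typeOf, RDF_1, ...) are the same undefined terms used in the
table above.

```python
import itertools

# Assumption: a simple process-wide counter is enough to make labels unique.
_counter = itertools.count()


def gen_id(prefix="genid"):
    """Return a fresh label that no other node in this process shares."""
    return f"{prefix}{next(_counter)}"


def fathers_container(person, ages):
    """Build the rows of the table, one generated label per node."""
    container = gen_id()
    rows = [
        (person, "hasFathers", container),
        (container, "typeOf", "container"),
    ]
    for i, age in enumerate(ages, start=1):
        father = gen_id()
        rows.append((container, f"RDF_{i}", father))
        rows.append((father, "typeOf", "malePerson"))
        rows.append((father, "age", age))
    return rows


rows = fathers_container("John", [84, 86])
```

Because each of the three generated nodes carries its own label, every row
says unambiguously which resource it is about, which is what three
identical "[]" symbols could not do.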

Indeed, if you think of a set of triples as rows in a relational database
table, what would the primary key of that table be?  The only sensible
answer is that the primary key must be the combination of the subject and
predicate.  The "semantics" discussed above, such as the one-father
constraint, could then be considered a kind of "business rule", to use an
expression from a different domain.
Taking this relational database viewpoint, each node must necessarily have a
value (or label), but it may be that a particular implementation could hide
the label, or exclude it from serialization.
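
The relational view can be tried out directly.  The following is only a
sketch under stated assumptions: it uses Python's standard sqlite3 module
with an in-memory database, the table layout and the helpers make_store()
and try_insert() are hypothetical, and the (subject, predicate) key is the
choice argued for above, not something RDF itself requires.

```python
import sqlite3


def make_store():
    """Create an in-memory triple table keyed on (subject, predicate)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples ("
        " subject   TEXT NOT NULL,"
        " predicate TEXT NOT NULL,"
        " object    TEXT NOT NULL,"
        " PRIMARY KEY (subject, predicate))"
    )
    return conn


def try_insert(conn, subject, predicate, obj):
    """Insert one triple; return False if the primary key rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO triples VALUES (?, ?, ?)",
                (subject, predicate, obj),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = make_store()
first = try_insert(conn, "John", "hasFather", "randomgenid0123456789")
second = try_insert(conn, "John", "hasFather", "randomgenid9876543210")
age = try_insert(conn, "randomgenid0123456789", "age", "84")
```

Here the second hasFather row is rejected by the key, so the one-father
rule is enforced by the table itself.  Every column is NOT NULL, so each
node needs some label before it can be stored at all, and a source that
really did record two fathers would have to go through a container node as
in the earlier table.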

Cheers,

Tom P
Received on Saturday, 18 August 2001 01:19:36 UTC