- From: Thomas B. Passin <tpassin@home.com>
- Date: Sat, 18 Aug 2001 01:22:27 -0400
- To: <www-rdf-logic@w3.org>
[David Allsopp]

> Finally, this hinders you in identifying equivalent resources when you
> merge data from two sources, because the two sources have to assign a
> pseudo-unique URI to the resource; a third party can't then determine
> that the nodes really refer to the same thing, e.g:
>
> John --hasFather--> [] --age--> 84
>
> John --hasFather--> [] --age--> 84
>
> compared with
>
> John --hasFather--> randomgenid0123456789 --age--> 84
>
> John --hasFather--> randomgenid9876543210 --age--> 84
>
> where [] represents an anonymous node.
>
> The point is that we don't know the name of John's father, so assigning
> him a random name makes our life harder, not easier, since everybody
> necessarily assigns him a _different_ random name.

This apparently simple example actually illustrates how hard it can be to merge data from different sources. For this example to work the way David seems to want, the system that performs that merging of data has to know something more about the "hasFather" property. Suppose, for example, that the two sources were like this instead:

John --hasFather--> [] --age--> 84

John --hasFather--> [] --age--> 86

Now the fact that the node [] is not labeled doesn't help us know how to merge these triples. We don't know whether one or both of the triples is wrong, or whether John really has two fathers of different ages. In the same way, in David's original version, it could be that John has two fathers of the same age, and each data source happens to know about one of them.

The only way we could resolve this question is to know that there is a constraint such that a person can only have one father. But if we know that, then David's second (labeled) form can also be resolved.

My conclusion is that this supposed advantage of having unlabeled nodes is actually no advantage at all - labeled or unlabeled, we still need to know something else about the predicate in question (besides its label) to be able to merge data from separate data sources.
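To make the point concrete, here is a rough Python sketch of such a merge. It is only an illustration under my own assumptions: the names Blank, merge and functional are made up for this example, and the "one father only" rule is passed in by hand as a set of single-valued properties. It does not try to detect conflicting literals such as two different ages.

```python
class Blank:
    """An unlabeled node: two Blank objects are equal only if identical."""


def merge(triples, functional=()):
    """Merge triples, unifying objects of single-valued (functional) properties.

    `functional` is a hypothetical set of property names known to allow
    at most one value per subject, e.g. {"hasFather"}.
    """
    alias = {}  # node -> node it has been unified with

    def find(node):
        while node in alias:
            node = alias[node]
        return node

    changed = True
    while changed:
        changed = False
        seen = {}  # (subject, property) -> first object met
        for s, p, o in triples:
            s, o = find(s), find(o)
            if p not in functional:
                continue
            first = seen.setdefault((s, p), o)
            if first != o:
                alias[o] = first  # same subject, single-valued: same node
                changed = True
    return {(find(s), p, find(o)) for s, p, o in triples}


# David's two forms: unlabeled nodes versus randomly generated labels.
b1, b2 = Blank(), Blank()
unlabeled = [("John", "hasFather", b1), (b1, "age", 84),
             ("John", "hasFather", b2), (b2, "age", 84)]
labeled = [("John", "hasFather", "randomgenid0123456789"),
           ("randomgenid0123456789", "age", 84),
           ("John", "hasFather", "randomgenid9876543210"),
           ("randomgenid9876543210", "age", 84)]

print(len(merge(unlabeled)), len(merge(labeled)))    # 4 4
print(len(merge(unlabeled, {"hasFather"})),
      len(merge(labeled, {"hasFather"})))            # 2 2
```

With no constraint supplied, both forms stay as four separate triples. Once hasFather is declared single-valued, both collapse to two triples, whether the father node carried a generated label or none at all.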
I suppose this would be considered part of the semantics of persons and of the hasFather predicate. The "semantics" could also be considered to be a kind of "business rule", to use an expression from a different domain.

Now if John did indeed have two fathers after all, the situation would be more complex, because on merging the two data sources, our processor would have to convert the hasFather statements into some kind of a container, which itself would apparently be an anonymous node. It would be something like this (with apologies for the undefined terms):

subj                   property    object
John                   hasFathers  []
[]                     typeOf      container
[]                     RDF_1       randomgenid0123456789
randomgenid0123456789  typeOf      malePerson
[]                     RDF_2       randomgenid9876543210
randomgenid9876543210  typeOf      malePerson
randomgenid0123456789  age         84
randomgenid9876543210  age         86

Without the labels randomgenid0123456789 and randomgenid9876543210 it would be much harder to make this set of assertions, if it were possible at all. Certainly we couldn't enter this data into a relational database, or write it down on paper in a table like this, since we would have to use three different "[]" symbols that looked the same but referred to three different resources. To be consistent, of course, I should have used a generated id instead of [] for the container as well.

Indeed, if you think of a set of triples as being rows in a relational database table, then ask what would be the primary key of that table? The only sensible answer is that the primary key must be the combination of the subject and predicate.

Taking this relational database viewpoint, each node must necessarily have a value (or label), but it may be that a particular implementation could hide the label, or exclude it from serialization.

Cheers,

Tom P
Received on Saturday, 18 August 2001 01:19:36 UTC