Re: RDF Keys, or why RDF is lousy at metadata annotations from Bob MacGregor on 2004-03-16 (www-rdf-comments@w3.org from January to March 2004)

From: Bob MacGregor <macgregor@ISI.EDU>
Date: Tue, 16 Mar 2004 00:14:40 -0800
To: Pat Hayes <phayes@ihmc.us>
Cc: www-rdf-comments@w3.org
Message-Id: <6.0.3.0.2.20040315224518.01c71008@tnt.isi.edu>
Hi Pat,

Since virtually every point you raise can be argued successfully, I will
break my vow of silence and answer them:

At 12:57 PM 3/15/2004, Pat Hayes wrote:

>Hi Bob
>
>I am sympathetic, and I understand you don't want to get into a long debate;
>but I don't know what you are talking about.
>
>>It's claimed that RDF and OWL are really great because they
>>facilitate making metadata assertions.
>
>Is it? By whom? My understanding is that they are intended to be languages 
>for putting ontologies on the Web.

That's very unfortunate.  I would say that ontologies are one of the less
important uses of RDF (OWL I find much less significant than RDF, so its
intended use is of less interest to me).  RDF is a very good interlingua for
representing information from other sources, and, modulo the shortcomings I
detailed earlier, it's very good for annotating other data.  I find both of
those uses quite compelling, and I'm not alone in making those claims.

>>The reality is that the
>>RDF and OWL standards do a lousy job at supporting metadata
>>annotations for (at least) two reasons.
>>
>>To do a good job of annotating (attaching metadata) to something, you need
>>    (1) to reify the something, and
>>    (2) the URI needs to be globally unique, and it needs to be "repeatable".
>
>OK, granted (though why do you need to reify it? Isn't referring to it good
>enough?). Both of these issues seem to be concerned with how to attach 
>URIs to the somethings, which indeed is an issue that the RDF/OWL specs 
>mostly do not address, I think because it was outside the charter of the 
>relevant working groups. (OWL does talk some about giving URI names to OWL 
>ontologies, but not in any detail, and not about any other kind of naming.)

Attaching URIs to "the somethings" did appear to be outside of the charter,
but it has been the source of almost unending debate within some RDF e-mail
groups.

>>By now it's well understood that RDF statement reification is a loser,
>>because it chooses too small a grain size.
>
>Well, I tend to agree, but that seems irrelevant to your concern, since 
>RDF reification was only ever intended to reify RDF itself. I take it that 
>you are interested in using RDF/OWL to give metadata for things other than 
>RDF/OWL (?? Or are you just concerned with doing a better job than RDF 
>reification for using RDF to describe other RDF?? If so, join the club :-)
>
>>  The best remedy is to
>>add contexts and quads to RDF.  If we did that, then (1) is taken
>>care of.  However, that's a subject for a different e-mail.
>>
>>Here I'm really addressing the Bnode problem.
>
>Eh?? What bnode problem? (Did you just change the subject entirely, or is 
>there a problem relating bnodes to URI attachment and reification?)

Bnodes relate to URI attachment.  There are two kinds of bnodes:  The first
are equivalent to skolems.  The second represent unnamed entities, such as
the majority of entities referenced by tags in XML specifications.  If these
entities had names, they would obey the unique name assumption (i.e., they
would be recognized as disjoint).  Unfortunately, there are no provisions in
RDF for representing the second kind of bnode.  However, a large number of
RDF applications use bnodes to represent them, for lack of an alternative.
This is indeed unfortunate, and indicates a poor match between RDF and
real-world applications (since something like 99% of all bnodes are of the
second type).

If we had keys, plus a scheme for encoding keys within URIs, then we could
solve a significant portion of the bnode problem.

>>The proper scope
>>for a bnode is the model that it belongs to.
>
>The graph it occurs in, yes.
>
>>That means that
>>to reference a resource/entity outside of the model, you need
>>something other than a bnode
>
>No, that does not follow at all. The bnode has a syntactic scope - it is 
>essentially a bound variable -  but the entity or entities REFERRED TO by 
>the bnode aren't scoped or limited in any way. Bnodes can refer to 
>anything in the universe of discourse.

Sorry, I misspoke.  I should have said, in order to reference the node itself
outside of the model, you need something other than a bnode.

>Did you mean that in order to provide a name for something that can be 
>used outside the graph, you should not use a bnode as a 'label'? That 
>would be correct, but I don't see why you call this a "problem". It seems 
>kind of dumb on the face of it to use a blank node as a label. That isn't
>what blank nodes were ever intended to do.
>
>>--you need a resource with a globally
>>unique URI.
>
>Right. But all URIs are globally unique, so you just need to have a URI 
>attached to a resource.

Nope, you need more than that.  It's necessary that the URI you pick can be
correctly interpreted to denote whatever real world entity it refers to,
independent of which model you are in.  For example, suppose I want to refer
to the state of the SS. Lexington on Jan 4, 2002, and suppose the facts about
that ship are recorded somewhere in an XML file.  I can't just give it the
URI "...#ship42" and expect to do anything useful with that URI, because I
don't have a W3C-sanctioned means of relating "...#ship42" to a particular
entry (tag) in the XML data source.

>>It's easy, but quite useless for annotation purposes, to use
>>"gensym" URIs, where you generate a unique URI on the fly, because
>>the next time you load the model, you get a different URI.
>
>But they are both unique. I think you mean, a globally unique and stable URI.

I mean, every time I load the data source, the denotation of the URI is the
same.  That's presumably easy to guarantee if the data source is an RDF file
or database, but not easy if it's some other kind of source.  (Note, because
of the paucity of RDF data, a vanishingly small percentage of the RDF data I
deal with comes from RDF sources).

>There is absolutely no way to ensure that anything (data or otherwise) has 
>only a single URI which refers to it.

I never said that an entity could not have more than one URI.  I want a URI
that always denotes the same entity (that's not controversial) AND I want a
computational scheme for determining what that entity is, given the URI.

>>In other
>>words, the URI is not "repeatable".
>
>Well, the old URI still works, right? (If it worked before, what made it 
>stop working because some other URI got generated??)

It's easy to have URIs come out of a non-RDF data source that can't be "put
back in".  If an RDF URI represents a tagged item in an XML file, it's
impossible to determine WHICH tag that was, unless you use a non-trustworthy
hack like sequence numbers (e.g., Ship42 denotes the same thing as the 42nd
occurrence of a ship tag within the RDF file) or unless you use something
dependable to make the back link, like a key.
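
To make the contrast concrete, here is a minimal Python sketch of the three
naming strategies mentioned so far: gensym, sequence number, and key.  The
namespace, the "registry" property, and the sample records are all made up
for illustration; nothing here is defined by RDF or by any W3C specification.

```python
import uuid
from urllib.parse import quote

# Hypothetical namespace, used only for this illustration.
BASE = "http://example.org/ship/"

def gensym_uri() -> str:
    """Mint a fresh URI on every call: unique, but different on each load."""
    return BASE + uuid.uuid4().hex

def sequence_uri(index: int) -> str:
    """Name a record by its position in the source (the 'Ship42' hack)."""
    return f"{BASE}Ship{index + 1}"

def key_uri(record: dict, key_props: tuple) -> str:
    """Name a record using only the values of its key properties."""
    parts = [quote(str(record[p]), safe="") for p in sorted(key_props)]
    return BASE + "/".join(parts)

# Invented sample data; "registry" plays the role of the key.
ships = [
    {"registry": "R-001", "name": "Lexington", "status": "at sea"},
    {"registry": "R-002", "name": "Yorktown", "status": "in port"},
]

# A second load of the same source, with the records in a different order
# and one non-key attribute updated.
reloaded = [dict(ships[1]), dict(ships[0], status="in port")]
```

Under these assumptions, the gensym URI changes on every call and the
sequence URI changes when the source is reordered, while the key-based URI
for the "R-001" record is the same on both loads, even though its status
attribute was updated in between.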

>This seems to be the core issue. (?) That is, you want a way to 
>(automatically?) generate URIs which can be used to name data (and hence 
>to anchor metadata) stably, so that once created, they remain attached to 
>the data they were created for. OK, good idea. But I don't think this is a 
>job for RDF/OWL (and it wasn't in their charter): I think it has to do 
>with URI deployment more generally.

I agree, it should apply more generally.  However, if keys are an integral
part of the solution (which I claim to be the case), then only languages with
some semblance of a semantics are likely candidates (i.e., this tends to
leave XML out of contention).

>The fact is, as I tried to tell the W3C Tech Plenary meeting in Boston 
>last year, the Web as a whole has no way to give a name to ANYTHING. It 
>has no protocols for baptism or naming: it just muddles along by relying 
>on HTTP protocols and a few other essentially transmission protocols for 
>locating things in networks, and pretends that it is assigning referents. 
>But in fact, nothing on the Web really assigns referents at all.

Very true, from a logical standpoint.  However, it's easy to write
applications that use URIs that work very effectively.  If the job were
uniformly easy, I would have no complaints, but some URIs are much harder to
produce than others.  For example, it takes a bit of effort to produce a URI
that denotes a row in a database.  My preference would be to attempt a
partial remedy (knowing that all attempts will be partial at best).

>>  To achieve repeatability, some misguided
>>proposals suggest concatenating or hashing all of the values of attributes
>>of a resource to create a repeatable URI.  If you do that, the attributes
>>of a resource cannot be updated.
>>
>>The right solution is to generate a URI based on a minimal set of attribute
>>values that can be guaranteed not to change.
>
>Where are those attributes to be found, in general? A URI can identify (= 
>be the name of, refer to) anything. Most things don't have unchanging 
>attributes.

Social security numbers don't change.  Timestamps don't change.  Geospatial
coordinates of fixed objects don't change.  The left-to-right or
front-to-back order of similar components on a piece of hardware often
doesn't change.  Perhaps most things in nature don't have unchanging
attributes, but a rather significant percentage of physical artifacts have
fixed, or nearly fixed attributes.  Perhaps you have heard of relational
databases, which assume that everything has a key, which by definition has
attributes that do not change.  Sorry to be a bit flippant, but your last
statement seems completely disconnected from digital reality.  Perhaps
because Florida is so far South :)

>>This is the definition of a
>>"key" (or primary key).  Proper database tables have primary key definitions.
>>XML items don't have keys, but they should.   And RDF classes should define
>>keys.
>
>I presume you mean RDFS classes (and OWL classes?). I don't see how this 
>would be possible in general. Are you saying that all classes should 
>provide keys? But classes might contain anything. What about classes of 
>things like abstractions, or classes of entities that have only a 
>transient existence? What about rdf:Resource?  What about entities that 
>are not known to be in any particular class? What about entities that are 
>in more than one class?

I'm saying that SOME classes should provide keys.  Others won't.  If A and B
are classes that both happen to have keys, and if entity E has type A and
type B, then E will have two keys.

>>Defining a key on a class C in RDF is very simple.  A key consists of a
>>set of properties--the order doesn't matter.  If P1 and P2 are properties
>>that define a key for C, then we can invent a new property 
>>"hasKeyProperty" and
>>make two statements:
>>
>>C hasKeyProperty P1 .
>>C hasKeyProperty P2 .
>>
>>and we're done.
>
>Not quite. That does not refer to a set of properties. You cannot assume 
>that because
>
>C hasKeyProperty P19 .
>
>has not been asserted, that it will not be true. In general in RDF there 
>is no way to 'close off' any collection other than by using the collection 
>vocabulary.

Actually I HAD forgotten that.  This point probably deserves a bit more
discussion, but from a practical standpoint, it may not matter if the
collection is closed or not.  I would have to reinterpret my semantics as
something like "C hasKeyProperty P1" means that there exists a key for C that
includes as one of its components the property P1.  So in my example above, I
might not be able to assume, based on RDF semantics, that P1 and P2 both
belonged to the same key.  However, in practice, I could safely make that
assumption.
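
As an illustration of that practical reading, the following Python sketch
collects whatever "hasKeyProperty" statements have been asserted for each
class and treats the result as the key.  The triples, class names, and
property names are invented, and "hasKeyProperty" is the hypothetical
predicate proposed here, not an existing RDF or OWL term.

```python
from collections import defaultdict

# Triples as plain (subject, predicate, object) strings; all names invented.
triples = [
    ("Ship", "hasKeyProperty", "registry"),
    ("Ship", "hasKeyProperty", "flagState"),
    ("Port", "hasKeyProperty", "locode"),
    ("Ship", "rdfs:label", "Ship"),
]

def asserted_key_properties(triples):
    """Group the objects of hasKeyProperty statements by class.

    Under RDF's open-world reading this is only the set asserted so far;
    treating it as the complete key is the practical assumption above.
    """
    keys = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "hasKeyProperty":
            keys[subject].add(obj)
    return dict(keys)

keys = asserted_key_properties(triples)
```

Since the properties are gathered into a set, the order of the statements
does not matter, which matches the description of a key as an unordered set
of properties.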

If this sounds a bit less than airtight, I suggest you take another look at
your semantics for reified RDF statements, which embody a far more
significant set of disclaimers.

>But in any case, I fail to see how this provides what you want. Suppose 
>this triple/s is/are asserted: how does that give you an actual key? For 
>example, you might not know any values for those properties.

Databases usually consider the values of key columns to be required
attributes.  But, if there is a null value in a key position, then you can't
construct a URI for that resource -- you are back to using a bnode.
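
A small Python sketch of that fallback, under assumed names (the namespace,
the "_:bN" label format, and the helper node_for are all hypothetical):

```python
import itertools
from urllib.parse import quote

BASE = "http://example.org/row/"      # hypothetical namespace
_bnode_ids = itertools.count(1)       # local counter for blank node labels

def node_for(record: dict, key_props) -> str:
    """Return a key-based URI, or a fresh blank node label if any key value is null."""
    values = [record.get(p) for p in sorted(key_props)]
    if any(v is None for v in values):
        # Incomplete key: no repeatable URI can be built, so use a bnode.
        return f"_:b{next(_bnode_ids)}"
    return BASE + "/".join(quote(str(v), safe="") for v in values)
```

With a complete key the function returns the same URI on every call; with a
missing key value it returns a new blank node label each time, which is
exactly the non-repeatable situation that keys are meant to avoid.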

>>Now we have the foundation needed to synthesize
>>unique and repeatable URIs.
>
>I do not follow you. In what way does this provide a basis for 
>synthesizing URIs ? And how does membership in a class provide any 
>guarantee of repeatability? There is no implication that a class must 
>contain permanent things, or even that the class itself has any permanent 
>or lasting existence: consider for example classes which are themselves 
>only mentioned in passing, such as OWL restriction classes

Existence is not relevant to the discussion.

Some RDF classes correspond to tables in a relational database.  Some RDF
properties correspond to columns in a relational database.  Borrow the
semantics from a database book about primary keys and then reverse the
mapping.  Then assume that each set of key values will be embedded in the
corresponding URI using an invertible encoding scheme.  This seems obvious to
me, but maybe it's not obvious to everyone.
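
One possible shape for such an invertible encoding, as a Python sketch.  The
URI layout (class name, then "property=value" pairs separated by ";") and the
namespace are assumptions made for this example only; the proposal itself
leaves the encoding scheme open.

```python
from urllib.parse import quote, unquote

BASE = "http://example.org/key/"  # hypothetical namespace

def encode_key(class_name: str, key_values: dict) -> str:
    """Embed a class name and its key property values in a URI."""
    if not key_values:
        raise ValueError("a key needs at least one property value")
    # Sort so that the same key always yields the same URI, and
    # percent-encode everything so ';', '=' and '/' stay unambiguous.
    pairs = sorted(key_values.items())
    segment = ";".join(
        f"{quote(prop, safe='')}={quote(value, safe='')}" for prop, value in pairs
    )
    return f"{BASE}{quote(class_name, safe='')}/{segment}"

def decode_key(uri: str):
    """Recover (class name, key values) from a URI built by encode_key."""
    if not uri.startswith(BASE):
        raise ValueError("not a key-based URI in this namespace")
    class_part, _, segment = uri[len(BASE):].partition("/")
    values = {}
    for pair in segment.split(";"):
        prop, _, value = pair.partition("=")
        values[unquote(prop)] = unquote(value)
    return unquote(class_part), values
```

Going from the URI back to the key values is what makes the "back link" to a
database row or XML item possible: decode_key(encode_key(c, k)) returns the
original class name and key values, as long as the values are strings.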

>>Hence, my proposal for the follow-on to the current RDF is to define
>>a new predicate equivalent to "hasKeyProperty".
>
>If I follow this proposal, how does
>
>C  rdf2:hasKeyProperty  P .
>
>differ from
>
>P rdfs:domain C .
>P rdf:type owl:FunctionalProperty .
>?

For non-composite keys, no difference.

>
>>Let me address two possible objections. One is that there may exist
>>more than one set of properties that defines a key for a given class.
>>The same is true for database systems, but they have wisely chosen
>>to identify one key as "primary" and declare that one as "the" key.
>
>That is the least of the problems. The most serious one is that many 
>(most?) classes will have no key properties.  Others are: given an 
>individual, how to choose which class to use to identify it; how the key 
>properties of a derived class are supposed to relate to those of the 
>deriving properties; how to handle owl:sameAs (ie equality) in a key-based 
>class framework; and how to relate this use of URIs to other constraints 
>on URI usage.

Having no key for a class is not a problem (why should it be?).

Given an individual, how to choose which class to use to identify it --
probably you have to pick arbitrarily among those classes which have keys.
In that case, all choices are good.

Derived classes may or may not pose a problem.  If they are indeed a problem,
we might decide that they don't have to have keys.

If two entities E1 and E2 both have type C, and they have identical attributes
for the key properties for C, then  E1 owl:sameAs E2.   And vice-versa.
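
The forward direction of that rule (same class, same key values, therefore
owl:sameAs) can be sketched in Python as follows.  The entity ids, property
names, and the helper infer_same_as are invented for illustration, and the
sketch does not cover the "vice-versa" direction.

```python
from collections import defaultdict
from itertools import combinations

def infer_same_as(entities: dict, key_props) -> list:
    """Return sorted pairs of entity ids that share all key property values.

    `entities` maps an entity id to a dict of property values; all entities
    are assumed to have the same class, whose key is `key_props`.
    """
    ordered_props = sorted(key_props)
    groups = defaultdict(list)
    for entity_id, props in entities.items():
        if any(props.get(p) is None for p in ordered_props):
            continue  # incomplete key: nothing can be inferred
        groups[tuple(props[p] for p in ordered_props)].append(entity_id)
    pairs = []
    for ids in groups.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)

# Invented sample data for a class whose key is the single property "registry".
entities = {
    "E1": {"registry": "R-001", "name": "Lexington"},
    "E2": {"registry": "R-001", "name": "Lady Lex"},
    "E3": {"registry": "R-002", "name": "Yorktown"},
    "E4": {"name": "unknown"},
}
same_as = infer_same_as(entities, {"registry"})
```

Here E1 and E2 agree on the key and so would be identified with each other,
even though their non-key "name" values differ; E4 has no key value and is
left alone.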


>>Second, the strategy for forming a unique URI based on a set
>>of key values is left open.  It would be REALLY useful if the
>>committee also tackled this problem.
>
>Way beyond the scope of an RDF WG. This is a TAG-level issue.

That may be so.  However, I deliberately confined my suggestion to keys, since
that is not a TAG-level issue (IMHO).

>>Note: I'm posting this to RDF comments because I'm not soliciting
>>debate on this issue.  Rather, I would like to see it added to the
>>issues list for the next RDF committee.
>
>I'm not empowered to do anything about this, but IMO this topic isn't even 
>in the scope of an RDF group. It is both too special and too general for 
>consideration only within the context of RDF.

Interfacing RDF with XML and databases has the possibility of developing into
a huge application area.  Products are already coming out that assume that
keys exist for the RDF resources representing translations of XML tags or
rows in a database table.  If the RDF group says that keys don't belong
there, and the OWL group says that keys don't belong there, then vendors will
each invent their own standards and their own semantics, yielding just the
kind of babel that ontologies are supposed to circumvent.  It's amazing to me
that the committee can devote enormous resources to something as useless as
reified statements, and ignore (thus far) something considered by the
database community to be fundamental and absolutely essential.

It does occur to me that adding keys to RDF would be anathema from one
standpoint:  Currently, it is not possible to produce a set of RDF statements
that contains a contradiction.  If RDF had keys, then clashes would be
possible.  I'm not sure if the guarantee to be contradiction-free is
considered to be a virtue or an accident.

Cheers, Bob
Received on Tuesday, 16 March 2004 03:16:57 UTC