Meaning of URIRefs (new test case, comments on Concepts draft) from Sandro Hawke on 2002-10-24 (www-rdf-comments@w3.org from October to December 2002)

From: Sandro Hawke <sandro@w3.org>
Date: Thu, 24 Oct 2002 13:05:25 -0400
To: www-rdf-comments@w3.org
Message-Id: <200210241705.g9OH5P121350@wadimousa.hawke.org>

***** 1. New Introduction and Summary

In the editor's draft of RDF-CONCEPTS [0], you've added a lot of text
about the meaning of a URIRef coming from the web-content available at
its URI-part. It's an excellent and much-needed addition.

I want to underscore how important it is by pointing out that
social meaning is self-reinforcing. If people start to doubt the
importance of using URIRefs as they are defined (and begin to
experiment with their own incompatible meanings), the RDF specs are
likely to lose any authority in the matter. People need tremendous
confidence in the language in which they write their contracts if
they are to be held to those contracts. There must be very little
window for people to argue about what the definition of "is" is.

With that in mind, and with an eye towards prospects of automated
reasoning, I'd like to propose this test case:

<?xml version="1.0"?>
<!DOCTYPE rdf:RDF [
<!ENTITY animals "http://www.w3.org/2002/10/meaning/animals">
<!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns">
]>
<rdf:RDF xmlns:rdf="&rdf;#"
xmlns:animals="&animals;#">
<rdf:Description rdf:ID="spot">
<rdf:type rdf:resource="&animals;#Dog" />
</rdf:Description>
</rdf:RDF>

(I moved the hash-mark out of the entity for reasons which will be
clear later.)

This parses as:

_:x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/10/meaning/animals#Dog> .

and it should entail

_:x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/10/meaning/animals#Mammal> .

How? Because the document at "http://www.w3.org/2002/10/meaning/animals"
says that #Dog is rdfs:subClassOf #Mammal.
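
A minimal sketch of that inference in Python, assuming triples are plain tuples and assuming the "animals" document contains just the one subClassOf triple described above (nothing is actually fetched; the function name is illustrative only):

```python
# Sketch only: triples are plain tuples, and the contents of the
# "animals" document are assumed here, not fetched from the web.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
ANIMALS = "http://www.w3.org/2002/10/meaning/animals"

# The single triple the test case parses to.
document = {("_:x", RDF_TYPE, ANIMALS + "#Dog")}

# Assumed content of the document at the animals URI.
animals_doc = {(ANIMALS + "#Dog", SUBCLASS_OF, ANIMALS + "#Mammal")}


def subclass_closure(graph):
    """Apply 'x type C, C subClassOf D => x type D' until nothing new appears."""
    inferred = set(graph)
    changed = True
    while changed:
        changed = False
        for (s, p, o) in list(inferred):
            if p != RDF_TYPE:
                continue
            for (c, q, d) in list(inferred):
                if q == SUBCLASS_OF and c == o and (s, RDF_TYPE, d) not in inferred:
                    inferred.add((s, RDF_TYPE, d))
                    changed = True
    return inferred


goal = ("_:x", RDF_TYPE, ANIMALS + "#Mammal")
merged = subclass_closure(document | animals_doc)
print(goal in merged)                      # True: with the animals document
print(goal in subclass_closure(document))  # False: without it
```

The second print shows why the question matters: the test-case document alone does not yield the Mammal triple; it only follows once the content at the URI part is taken as asserted.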

Let me back up a little and clarify: we have three kinds of
entailment:

(1) RDF simple entailment, as in the MT [2], which says
things like every RDF graph entails its subgraphs.
This kind of entailment pays no attention to URIRefs.
(2) Entailment with the "rdf" and "rdfs" vocabulary terms
reserved, as in MT [2].
(3) Entailment where every URIRef is constrained in meaning
according to the web content available at its URI part.
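
To make "URI part" concrete, here is a small Python sketch that strips the fragment from each URIRef and collects the documents a graph thereby names. The helper names are hypothetical, and the sketch assumes triples of plain strings with blank nodes written as "_:name" and no literals:

```python
from urllib.parse import urldefrag


def uri_part(uriref):
    """Return the URIRef with its fragment identifier (if any) removed."""
    return urldefrag(uriref).url


def definitional_documents(triples):
    """Collect the URI parts of every URIRef used in the triples.

    Assumes every term is either a blank node ("_:...") or a URIRef;
    literals are not handled in this sketch.
    """
    docs = set()
    for triple in triples:
        for term in triple:
            if term.startswith("_:"):
                continue  # blank nodes name no document
            docs.add(uri_part(term))
    return docs


triples = [
    ("_:x",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://www.w3.org/2002/10/meaning/animals#Dog"),
]
for doc in sorted(definitional_documents(triples)):
    print(doc)
```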

Of course DAML+OIL defines its own entailment, as does OWL, as do my
various layered logic languages [6], but these should all be seen as
special cases of (3). The terms used by Dublin Core, RSS, Creative
Commons, and various other efforts may not define their meanings with
model theories or first-order axioms, but their terms are also
carefully defined, and in some cases their misuse would be
intolerable (and in the case of CC, perhaps even illegal!).

Type (2) entailment above should also be subsumed into type (3), by
putting normative pointers at the rdf and rdfs namespace addresses to
the appropriate Recs (when the Recs happen). In fact, the MT should
be more clear in distinguishing between (1), (2), and (3). (2) should
probably be in a separate document. Perhaps (1) and (3) should also
be separated, but they remain to describe the meaning inherent in all
RDF documents, regardless of any URIRefs which occur in them.

The point here is that an RDF document must be taken to assert the
truth of all the documents it names in the URI parts of its
node-labeling URIRefs. If those documents are available to a reader,
and the reader is capable of understanding them, the reader is fully
entitled to infer facts from the conjunction of the author's documents
and all the definitional documents. Moreover, the reader can
attribute these conclusions to the author; the author is responsible
for choosing terms (e.g. comic, clown) whose definitions he accepts.

There are many more details, below. I first approached this topic
without noticing the new text in the editor's draft, and spent more
time arguing why using the URI for the semantics was important. I'm
going to leave that text here, because some people are still probably
not convinced. If you are convinced, feel free to skip sections 3
and 4.

***** 2. A Few Notes on RDF-CONCEPTS [0]

I think you overplay the difference between formal and natural languages in
2.3.3 in the example with

B:oneOfThem rdfs:comment "This means the same as rdfs:subClassOf".

If we take rdfs:comment to provide normative natural language
information about the subject (and if it doesn't we need some other
property which does), then in fact C is still to blame for the insult
to C:JohnSmith. The failure of RDFS class reasoning to reach the
insult does not mean the insult is not style-3 entailed, in this case
via B:oneOfThem.

I think 2.3.4 is wrong: the predicate needs no special status. The
situation you're trying to prevent here is prevented by accepting the
namespace/URI owner as authoritative in defining the terms there.
(see my definition of definition in section 5.y).

Section 2.3.5 is also misleading: there is RDF-Simple-Entailment ("1"
above) and RDF-URI-Based-Entailment ("3" above), and that pretty much
covers it. At some URIs (e.g. OWL, RDF/RDFS, LX) you should find
appeals to natural language and/or mathematical definitions which are
not directly usable by machines, but the terms defined there can be
used to define other terms in a way which *is* amenable to automated
reasoning. One could try to distinguish between natural language
definitions and formal language definitions, but I'm not sure how that
would help, since automated reasoners vary so much in what kind of
formal languages they can handle.

***** 3. Older Introduction

If I receive and believe an RDF document, D, saying that D:spot has
rdf:type animals:Dog, and the animals schema says that animals:Dog is
a subclass of animals:Mammal, would it be right of me to infer that
D:spot has rdf:type animals:Mammal?

Your answer might be "never", "sometimes", or "always." If you say
"never," then I think you've missed the point of RDF and XML, with all
these URIs and namespaces. If you say "sometimes," then we need to
talk about the qualities of those times. If you say "always", we have
some consequences which might be problematic. (I will argue that the
correct answer is "always" and that the problems are manageable.)

In any case, I don't think the current working drafts are clear on
this issue. RDF-CONCEPTS section 2.3 [1] suggests to me the answer is
probably "always" and RDF-MT section 1.2 [2] says "sometimes" and that
it depends which vocabulary you are reserving. Such an answer from
the MT, while true in a sense, is fairly useless. I need to know when
I'm entitled to make the Dogs-are-Mammals inference, and I don't think
out-of-band negotiation of the "reserved" vocabulary for each RDF
document is practical.

I'd like to apologize for raising this issue so late in the process,
but my understanding of it has only become clear in the past week.
Previously, I had some vague notion that we could "float" the meaning
of RDF identifiers, but I no longer think that is practical. I am
indebted to Pat Hayes, Jeff Heflin, David Booth, Larry Masinter, Dan
Connolly, and especially Tim Berners-Lee for recent conversations
helping me understand these issues (even when they disagreed with me).

Last week at the DAML-PI meeting [3], TimBL said that we are not ready
to "float the currency" of identifier meanings yet, and won't be for
perhaps fifty years. For now, he argued, we need to stay on the gold
standard, where namespace owners have the non-negotiable right to
dictate the meanings of the terms in their namespace. This is like
the US Government saying a US "dollar" is worth 1/35th of a Troy ounce
of gold; it defines the US dollar in terms of other well-known
concepts. This makes sense when introducing a term; it makes less
sense when everyone has developed a strong sense of what the term
means. Tim's point, I think, was that we're a long way from computers
being able to navigate in a world of vague meaning.

***** 4. Argument For Entailment

Let's return to my Dog/Mammal example. Let's bind the namespace
"animals" to "http://www.w3.org/2002/10/meaning/animals#". The
document at that address (without the hash) is some RDF saying in RDFS
that animals:Cat is, in fact, a subclass of animals:Mammal.

Does this mean that the triple
_:x rdf:type animals:Cat.
entails
_:x rdf:type animals:Mammal.
?

There are some issues here about connectivity, trust, and
change-over-time, but let's defer them for the moment. Assume a
static, always connected, always trustworthy web.

Now, I claim that (following the "gold standard") the second triple
follows logically from the first. The author of the first chose to
use the "animals" namespace, and by doing so acknowledged the
definitions therein. The author could have used some other namespace,
or no namespace, but chose to use "animals" (by which I mean the
longer URI above). The author almost certainly chose to use the
"animals" namespace so that others, doing later queries or merges,
would connect his expressions with other expressions about animals.
He wanted us to be able to infer that _:x was a mammal.

Did he want us to follow the gold standard, or did he want us to have
to think carefully about which definition of animals to use? He
probably wanted us to use the gold standard, to use the definitions at
the namespace address, because otherwise there's a chance we'd believe
some foolish claim about cats being fishes, and totally misunderstand
him.

So yes, granted the issues about connectivity, trust, and
change-over-time, the above entailment should hold. Now, let's
address those issues:

***** 5. Answers to Problems

1. Connectivity. Connectivity does not affect entailment. Whether
or not someone can get a copy of the "animals" definition document
does not change the fact that that document is the primary source
for the definitions of all the terms in the animals: namespace.
If you can't fetch the definitions, then your knowledge of the
terms is incomplete and your reasoning about them will be
incomplete. Incomplete reasoning can be a problem, but it's
hardly a new problem or one which only arises when we bring in
connectivity issues. If you can't fetch the document (and don't
have a current cached copy) then you know that you're missing some
information. The monotonicity guarantee of RDF, however, allows
you to proceed with your partial information, which might be good
enough.

2. Trust (except for change-over-time). This gold standard means
that the claims of an RDF document (which [1] says should have
legal weight) depend on the contents of other documents. This is
more stable than saying such claims depend on social consensus,
but it still involves trust. If I say my dog has rdf:type
animals:Dog and the animals document says that an animals:Dog was
once kicked by Ebenezer Scrooge, can I really be held to be saying
that Scrooge committed such an act? I think so; I haven't found a
solid line marking the parts of a definition which have bearing
solely on other things. Perhaps the animals document means the
Scrooge clause to be the necessary and sufficient condition for
doghood! So, a bit hesitantly, we have to say that all statements
in the definition document are asserted by any use of terms from
the document.

We can address the Scrooge issue by saying that using terms from a
document is a lot like signing it. Don't do it unless you have
read the document and agree with it. Of course you need to do
this recursively, following the definitions of any terms it uses.
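
The recursive step can be sketched as a simple graph walk. This Python sketch assumes a hard-coded, hypothetical table of which URIRefs each document uses (the "http://example.org/mydoc" entry and the table contents are illustrative, not real web content):

```python
from urllib.parse import urldefrag

# Hypothetical, hard-coded view of the web: document URI -> URIRefs used in it.
USES = {
    "http://example.org/mydoc": {
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
        "http://www.w3.org/2002/10/meaning/animals#Dog",
    },
    "http://www.w3.org/2002/10/meaning/animals": {
        "http://www.w3.org/2000/01/rdf-schema#subClassOf",
    },
}


def documents_to_review(start):
    """Return every definitional document reachable from `start`.

    Documents missing from USES are treated as using no further terms.
    """
    seen = set()
    pending = [start]
    while pending:
        doc = pending.pop()
        if doc in seen:
            continue
        seen.add(doc)
        for uriref in USES.get(doc, ()):
            pending.append(urldefrag(uriref).url)
    seen.discard(start)  # the author's own document is not a definition
    return seen


for doc in sorted(documents_to_review("http://example.org/mydoc")):
    print(doc)
```

Under these assumptions the author of mydoc would need to read three documents: the rdf namespace document, the animals document, and (via animals) the rdfs namespace document.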

x. (x is for extra) This brings up the issue of URIRefs "grounding
out" in natural language text (which may well make use of
mathematical notation). Our "animals" document constrains the
meaning of animals:Dog (very slightly) by using the term
rdfs:subClassOf. That term needs to be constrained by the
document at the rdfs namespace [4], which it sort of is. To
follow the gold standard, that document must make normative
reference to "http://www.w3.org/TR/rdf-schema/" which it currently
does not. (We could exempt RDF and RDFS from this policy,
understanding that their meanings are acknowledged by the very use
of the RDF/XML data format. There is little reason for this
special dispensation.)

I don't see a proper way in the current spec to make this kind of
normative reference from an RDF/XML document to a human-readable
one. Perhaps it is sufficient for an rdfs:comment or
dc:description to claim, in its natural-language text, that it is
in fact normative. That's a little loopy, but natural language
can probably handle it. Better would be to make sure the RDFS
namespace document said that rdfs:comment contained true
natural-language statements about the subject.

3. Change-over-time is a special case of the "stewardship" issues. It
doesn't necessarily involve time; it's possible for a web server
to offer one definition document to people who seem to be in France
and another to people who seem to be in England.

Stewardship issues arise often: should one define one's input as
being Unicode 3.2 characters, or as being whatever character set
is the latest approved by the Unicode Consortium? Do you
advertise your program as running on "OS Version 9.1" or "OS
Version 9.1 or later"? It all depends on whether you trust the
stewardship of the organization which controls the underlying
components.

The solutions here are typical security solutions, because these
are fairly typical security problems. How do you know the
definitions are the same ones you agreed to? (Secure hash
functions are a good approach.) If you agreed to ongoing updates
by some steward, how do you know the updates are actually coming
from that entity? (Public keys are a good approach.)

There's some interesting engineering to do here. The simplest
solution would be for each RDF document to give the SHA1 checksums
of each of its namespace documents -- if a checksum is missing or
does not match, the definition is considered to be unfetchable.
Something like:

<?xml version="1.0"?>
<!DOCTYPE rdf:RDF [
<!ENTITY animals "http://www.w3.org/2002/10/meaning/animals">
<!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns">
]>
<rdf:RDF xmlns:rdf="&rdf;#"
xmlns="&animals;#">
<rdf:Description rdf:ID="spot">
<rdf:type rdf:resource="&animals;#Dog" />
</rdf:Description>
<rdf:Description rdf:about="&animals;">
<rdf:sha1>953365afbc5c24ecfe590c350ab1345bee2f7aee</rdf:sha1>
</rdf:Description>
</rdf:RDF>

It's not pretty; maybe someone has some better ideas.

The meaning of the SHA1 triple is a little tricky. It does not
mandate importing the URI's contents as one might imagine, because
(I propose) the RDF Specs already mandate it. Rather, it *allows*
it. Without the SHA1 triple, you would know there was some web
content which gave you further true information about the subject
at hand, but you would not be allowed to read it. With the SHA1
triple, if the content matches, you can go ahead and read and use
the additional content. Perhaps authors who don't want to bother
with SHA1 could add an alternate triple saying, in effect, I trust
any definitions you get from the URIs I use. (This might
be sufficient in any RDF document which is not cryptographically
signed.)
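
A minimal Python sketch of the reader's side of this check, assuming the definition content has already been fetched as bytes and the declared checksum has been read from the author's document (the function name and the sample content are hypothetical):

```python
import hashlib


def usable_definition(content, declared_sha1):
    """Return the content if it matches the declared checksum, else None.

    None models "unfetchable": a missing or non-matching checksum means
    the reader proceeds without the definition.
    """
    if declared_sha1 is None:
        return None  # no SHA1 triple: not allowed to read the content
    actual = hashlib.sha1(content).hexdigest()
    if actual != declared_sha1.lower():
        return None  # content changed since the author agreed to it
    return content


# Illustrative content only; not the real animals document.
content = b"animals:Dog rdfs:subClassOf animals:Mammal ."
declared = hashlib.sha1(content).hexdigest()

print(usable_definition(content, declared) is not None)
print(usable_definition(content + b" ", declared) is not None)
print(usable_definition(content, None) is not None)
```

Treating both failure cases the same way as an unfetchable document keeps the reader on the partial-information path described under Connectivity.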

Eric Prud'hommeaux suggested the use of an HTTP header could allow
several documents to be served from the same URI, distinguished by
the SHA1 hash sent in the header. This would allow author and
namespace owner to negotiate (at read-time) on the exact
definition text to use, facilitating migration and
content-negotiation. This might be a nice feature, but it's not
necessary. This proposal, as is, works for entirely static
definitions, which is all we really need. (Since I calculated the
above checksum, I've twice resisted the urge to change the
definition in minor ways.) If we want to allow continuing
stewardship, some additional mechanism (such as a public key & URI
in the hashed static document) will be needed.

Another approach is my sdh proposal [5], but that's a bigger
change for RDF, and is not necessary if the official definition of
RDF is updated to include these semantics, under which the
definitions of terms are considered to be asserted.

y. (why not add an extra (rather philosophical) point?) I've been a
little vague about what a "definition" is. I mean a "definition"
to be some declarative statement which uses the term and is true
only for certain meanings of that term. An asserted (included,
imported) definition thus limits the possible valid
interpretations (models) of statements which use the term.

A "strong" definition is a work of art which constrains
interpretation to the point where no observable differences
emerge. For artificial terms, even stronger "perfect" definitions
can be written. These are definitions in the mathematical sense,
"Let us define f to be...". Compared to that, natural language
definitions and ontologies are usually mere descriptions. Still,
I call them definitional documents in accordance with their intent
and common usage.

Definitions do not have to be perfect, or even strong, of course.
They can be "thin" ontologies like my Dog/Mammal one, which merely
offer a little helpful description. The essence of the gold
standard is that, no matter whether a definition is thin, strong,
or perfect, you at least know which one everyone is supposed to
use.

***** 6. Older Conclusion

I've tried hard to be clear and concise here, and I apologize for any
failures. I understand you're working under a looming deadline, but
this issue is crucial to address as soon as possible, in this version
of RDF. I don't think this is a change in the basic intent of RDF,
but if you Recommend the MT in its current form, you will have given RDF
URIRefs only floating semantics.

I doubt the change from floating semantics back to namespace-document
semantics can be made compatibly. With floating semantics, people and
machines reading RDF are required to use their own judgement in
deciding which definitions to use. Once they start doing that,
authors will become used to it, and will no longer be obligated to
adhere to original definitions. Obligations cannot be imposed
retroactively (in this kind of a free environment), so if
namespace-document semantics are added later, they will have to be
added in a language which is marked as having different semantics.
But the difference is easy to miss; it amounts to "now you
have to use the terms as defined!", and if there's a reasonable doubt
about whether authors understand this change, then they really have no
obligation (such as might stand up in court), and the change has not
actually been made.

Since floating semantics are not amenable to automated reasoning, if
you pass on this issue now, you will have kept RDF (in its present
form and probably all similar future forms) from being a viable
Semantic Web language. That would be unfortunate.

If there is any further way I can assist in this matter, please let me
know.

-- sandro http://www.w3.org/People/Sandro/

[0] http://www.ninebynine.org/wip/RDF-concepts/2002-10-18/rdf-concepts.html
[1] http://www.w3.org/TR/rdf-concepts/#section-Meaning
[2] http://www.w3.org/TR/rdf-mt/#urisandlit
[3] http://www.daml.org/meetings/2002/10/pi/
[4] http://www.w3.org/2000/01/rdf-schema
[5] http://www.w3.org/2002/09/sdh/
[6] http://www.w3.org/2002/08/LX

Received on Thursday, 24 October 2002 13:05:55 UTC