RDF's Mixed-Mode Identifiers

Let me try again to explain what I now think is broken in RDF's use of
URI-References (and how to fix it with very little pain).  Forgive me
for starting with the obvious stuff, but it seems necessary.

1.  Addressable Locations

The web is a distributed system in which computer systems cooperate to
present users with discrete chunks of individually addressable
information, usually called "web pages".  Each chunk is maintained
(or all-too-often fails to be maintained!) at a virtual location; that
location has an address which people use to access the information.

Pointing to locations is what makes the web work: people click on
underlined text or icons to go there.  They can read an address on the
side of a bus or a carton of milk, type it in, and see whatever the
location's owner has published there.  Search engines and web catalogs
can scan and index the sites, by their addresses, and then help people
find the pages they want.

2.  Fragment Addresses

In presenting information on the web, authors and web designers face a
choice about chunk-size.  If they make the chunks too big, how can
they point people to the right information?  If an event is being
advertised, the address should lead directly to information about the
event, not to an overwhelming list of events.  If the chunks are
too small, on the other hand, each page will be unable to convey a
coherent concept: users arriving at the page need to know some
background, and search engines need enough content to perform proper
indexing.

One web feature to help with this dilemma is the "fragment" address: at
   http://www.w3.org/TR/REC-xml 
one can see the XML specification, while at 
   http://www.w3.org/TR/REC-xml#sec-pi
one can see the part of the specification covering XML "processing
instructions."  The "#sec-pi" part tells your browser that after
fetching the information, it should jump to the part labeled "sec-pi"
in the internal markup.  If the information is presented in small
enough chunks, fragment addresses are superfluous, but with big
documents (like the XML spec!) they can be very useful.
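
A minimal Python sketch of that split, using the standard library's
urldefrag.  The function name split_fragment is made up for
illustration; the point is only that the location goes to the server
while the fragment is resolved by the client.

```python
from urllib.parse import urldefrag

def split_fragment(address):
    """Split a web address into (location, fragment).

    The location is what gets fetched; the fragment is interpreted by
    the client after the content has arrived.
    """
    location, fragment = urldefrag(address)
    return location, fragment

location, fragment = split_fragment("http://www.w3.org/TR/REC-xml#sec-pi")
print(location)  # the part that is fetched
print(fragment)  # the label the browser jumps to
```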

3.  Identifying Things

In RDF (and other knowledge representation languages) we want to
formally convey information about all sorts of things: people, places,
times, mathematical functions, numbers, emotions, qualities, prices,
and (of course) books.  We also want to talk about web sites and web
pages.  How should we use the web's existing infrastructure to help us
identify all the things we want to talk about?

3.1.  Non-clickable links

The simplest answer is to use strings which look like web addresses
but don't really lead to web pages.  These could be UUIDs, tag: URIs,
or even http URLs which are not properly served.  All of these
approaches let people generate a string with confidence that no one
else will accidentally generate the same string; strings like this
serve to unambiguously identify things, but the connection between the
string and the thing is not expressed in the web.  This works (and was
the first approach I liked), but it does not really use the web.
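
As a rough illustration, here is how such strings might be minted in
Python.  The function names, the authority "example.org", and the
specific part "w3c" are assumptions invented for this sketch; nothing
here makes the resulting strings resolvable on the web.

```python
import uuid

def mint_uuid_identifier():
    """Return a fresh 'urn:uuid:...' string that nobody else is likely to mint."""
    return uuid.uuid4().urn

def mint_tag_identifier(authority, date, specific):
    """Build a tag: URI of the form tag:authority,date:specific."""
    return f"tag:{authority},{date}:{specific}"

print(mint_uuid_identifier())
print(mint_tag_identifier("example.org", "2002-12-23", "w3c"))
```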

3.2.  Reusing the Fragment Syntax

Another approach is to generalize the fragment syntax.  The semantics
of address#fragment are not fully specified in existing standards,
mostly because the meaning of a fragment depends logically on the
language in which the information-chunk is being conveyed.  Pointing
into a text document is different from pointing into an audio
recording or a 3-D image.  To leave the door open for new formats, RFC
2396 says the semantics of an address with a fragment part depend on
the media-type of the content served at that address.

This open door allows us to define an RDF media type (application/rdf+xml) 
where "fragments" are not fragments, but rather arbitrary things.  When
we say "http://www.w3.org/1999/02/22-rdf-syntax-ns#type" we do not
mean some part of the document at that address; we mean some abstract
concept of a type-relation, because that document is an RDF one.

Do we need to know the media type of the document?  Some people say
not, that the use of that string as an RDF node or arc label is not
governed by RFC 2396; RDF stands on its own and can use URI-like
strings in its own way.  This may work, but as with UUIDs, it fails to
use the web very well.  Moreover, it dilutes the power of URIs: any
string on the planet which starts with "http://" and does not work as
a web address is a wasted opportunity for communication and another
chance to confuse and disappoint people.   We can do better.

Reusing the fragment syntax also causes a few technical problems.
What happens if the content at the given address is NOT only
application/rdf+xml?  Maybe that's just a misconfigured system, but it
could be a useful one.  I think it would be nice for existing browsers
to get human-readable HTML at the same address where an RDF-capable
client gets its information.  Like other forms of content negotiation,
this allows all the forms of addressing (links, advertising, search
engines, etc) to index the information itself, regardless of its
presentation format.
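
A simplified sketch of the server-side choice, in Python.  It is an
assumption-laden toy, not a full HTTP implementation: it understands
q-values but not wildcards such as */*, and the function name and the
decision to fall back to HTML are choices made for this example.

```python
def choose_representation(accept_header,
                          available=("text/html", "application/rdf+xml")):
    """Pick which media type to serve at one address, given an Accept header."""
    preferences = []
    for part in accept_header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0]
        quality = 1.0  # default preference when no q-value is given
        for param in fields[1:]:
            if param.startswith("q="):
                quality = float(param[2:])
        preferences.append((quality, media_type))
    # Highest quality first; the sort is stable, so header order breaks ties.
    for quality, media_type in sorted(preferences, key=lambda p: -p[0]):
        if quality > 0 and media_type in available:
            return media_type
    return available[0]  # assumed fallback: human-readable HTML

print(choose_representation("text/html,application/xhtml+xml"))
print(choose_representation("application/rdf+xml"))
```

An ordinary browser and an RDF-capable client would thus use the same
address and receive different presentations of the same information.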

Even if people choose not to use content negotiation, there is still a
strong and growing view that the fragment syntax is used for, well...,
fragments.  RDF/XML documents are XML documents, and IMHO the XML
community rightly expects XML's fragment syntax (XPointer) to work
consistently.  As with media types, fragment-syntax reuse may skirt
the letter of the law here, since foo#bar only means the XML
element with the ID "bar" when served with certain media types, and
the rdf:ID attribute isn't really an XML ID, but it seems to me that
RDF/XML is running (unnecessarily!) against the spirit of both XML and
web addressing.

3.3.  Using Descriptive Web-Content

A third approach is to say that when a web page is about one thing, we
can use the page's address as a kind of identifier for that thing.
If you visit
   http://www.w3.org/Consortium/
you'll see it is clearly a page about the W3C.  We can use that to
identify the W3C itself, calling the W3C "the subject of
http://www.w3.org/Consortium/". 

This is not the same as saying "http://www.w3.org/Consortium/ is a
Consortium."  That's like pointing at a photographic image of the
Eiffel Tower and telling someone "that's the Eiffel Tower!": it works
perfectly well with humans, but it introduces more ambiguity than we
want in machine processing.  (Some humans, of course, might take the
opportunity to be pedantic and point out "No, it's a PICTURE of the
Eiffel Tower."  Some of us try hard not to be like that.)

This approach to identification makes excellent use of the billions of
existing web pages and the pointers to them throughout our world.
Here we say that IF there's a page which has a single, conspicuous
subject, we can use that page to help identify the thing.  If we want
an identifier for something, we can find a page, make a page, or even
just allocate an address for the page.

If someone sees such an identifier, their browser stands a good
chance of explaining to them what object is being identified AND
telling them some useful information about it.  (Using content
negotiation or a page of mixed HTML and RDF, I would hope the web
server would communicate its information to an RDF-aware application in
RDF/XML via that same address.)

This approach uses existing web pages, existing search engines,
existing retrieval mechanisms, and existing social practice to
strongly connect identifiers, the things they identify, and
information about the identified things.

4.  Node Labels (Subject, Container, Overloaded, and Distinguished)

The challenge to using descriptive web-content to identify things is
that we risk confusing the page with its subject.   If we just label
an RDF node "http://www.w3.org/Consortium/" who knows if we are
talking about a web location or an industry consortium?  

I suggest that ideally we would have two kinds of labels, which I'll
call "Subject" and "Container" labels.  A node with the Subject label
of "http://www.w3.org/Consortium/" represents a consortium; we would
expect to see arcs from it saying, perhaps, that its director is Tim
Berners-Lee.  A node with a Container label with the same text
represents the web location itself, a container for some information;
from it we might find arcs saying its last-modified date was "Wed, 13
Nov 2002 21:57:38 GMT".

As I read the working drafts and look at current usage, RDF currently
has neither subject nor container labels.  I see two interpretations
for what it has now, which I call "overloaded" and "distinguished"
labels.  

A node with the "overloaded" label "http://www.w3.org/Consortium/"
represents both a web location AND an industry consortium.  This use
is both absurd and generally workable.  It works because the RDF arcs
to and from the node are likely to treat it as one or the other; it
being both will never be noticed by most users.  My strongest argument
against this practice is that it flies in the face of accepted system
design methods: one should classify objects in the problem domain
according to their qualities as perceived by the people who work with
them.  The idea that something could be both a web site and a
consortium hardly seems natural to my small, biased, expert sample
(myself).

A node with the "distinguished" label "http://www.w3.org/Consortium/"
represents a web location.  Distinguished labels are considered to be
subject labels when they contain a "#" character and container labels
when they do not.  I think the vast majority of deployed RDF uses
distinguished labels.  The likely trouble spots are when the RDF graph
is conveying information about web page fragments (eg EARL, and
Annotea in some versions) and when RDF authors have chosen not to use
fragment syntax for abstract concepts (eg Dublin Core).  The DC
situation is somewhat eased by considering RDF to use only subject
labels for arcs (but distinguished labels for nodes).
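
The distinguished-label rule is simple enough to state as a one-line
test.  The Python sketch below uses a made-up function name; the
Dublin Core address is included as an example of a trouble spot, where
an abstract concept is classified as a container because its label has
no "#".

```python
def distinguished_kind(label):
    """Classify a node label under the 'distinguished' reading:
    labels containing '#' are subject labels, all others are container labels."""
    return "subject" if "#" in label else "container"

for label in (
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://www.w3.org/Consortium/",
    "http://purl.org/dc/elements/1.1/title",  # abstract concept, yet no '#'
):
    print(distinguished_kind(label), label)
```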

5.  Delabeling   (x:uriRef and x:primarySubject)

So how can we talk, in RDF, about web-page fragments and about things
which are the subject of an entire web page?  Distinguished labels
don't allow either of these.  Talking about fragments is important in
at least Annotea and EARL; talking about the subjects of entire web
pages allows vastly more of the existing web to be used for
identification purposes.

The most practical approach, I think, is to extend the concept of node
delabeling.

Conventional node delabeling turns 
   <foo> <bar> <baz>.
into something like:
   _:a <bar> _:b.
   _:a x:identifier "foo".
   _:b x:identifier "baz".

where x:identifier is a property of something linking it to a string
which is an unambiguous identifier for it. 
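
A small Python sketch of conventional delabeling, under the assumption
that a graph is just a list of (subject, predicate, object) tuples of
strings.  The function name and the "_:b0", "_:b1" naming scheme are
invented for this example.

```python
from itertools import count

def delabel(triples):
    """Replace each node label with a blank node, and record the
    original label through an x:identifier arc."""
    blank_ids = {}
    counter = count()

    def blank_for(label):
        # Reuse the same blank node every time a label reappears.
        if label not in blank_ids:
            blank_ids[label] = f"_:b{next(counter)}"
        return blank_ids[label]

    result = [(blank_for(s), p, blank_for(o)) for s, p, o in triples]
    for label, blank in blank_ids.items():
        result.append((blank, "x:identifier", label))
    return result

for triple in delabel([("foo", "bar", "baz")]):
    print(triple)
```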

We need to extend this to handle our different kinds of labels.  I
suggest x:uriRef for container labels and x:primarySubject (which
points the other way) for subject labels.   Thus:

    _:SomePageAboutW3C x:uriRef "http://www.w3.org/Consortium/".
and
    _:SomePageAboutW3C x:primarySubject _:W3C.

If we want to reverse the direction of x:primarySubject, we could
perhaps call it x:descriptionURIRef, but I'm not as fond of that
name.   x:uriRef is an owl:InverseFunctionalProperty while
x:primarySubject is an owl:FunctionalProperty.
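
To show how the two properties keep the page and its subject apart,
here is a Python sketch using the same tuple representation of
triples.  The function name subject_of and the properties ex:director
and ex:lastModified are hypothetical, chosen only to echo the W3C
example from section 4.

```python
def subject_of(address, page, thing):
    """Triples saying that `thing` is the primary subject of the page
    found at `address`."""
    return [
        (page, "x:uriRef", address),
        (page, "x:primarySubject", thing),
    ]

triples = subject_of("http://www.w3.org/Consortium/", "_:page", "_:w3c")
# Statements about the consortium hang off the subject node ...
triples.append(("_:w3c", "ex:director", "Tim Berners-Lee"))
# ... while statements about the web location hang off the page node.
triples.append(("_:page", "ex:lastModified", "Wed, 13 Nov 2002 21:57:38 GMT"))

for triple in triples:
    print(triple)
```

With this arrangement the address appears only once, as a string, and
no node has to be both a web location and a consortium.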

6.  Conclusion

It might be nice to have subject and container labeling throughout
RDF, but it's late in the game for that.  Instead, RDF Core should:

  1.  Define RDF labeling as being distinguished labeling
      for nodes and subject labeling for arcs.
 
  2.  Define/recommend x:uriRef for talking about fragments of web
      pages (and whole web pages, when desired).

  3.  Define/recommend x:primarySubject for talking about things
      which are the subject of an entire web page (and fragments, when
      desired).

People should then use them.  :-)

[I'm sending this to rdf-interest instead of rdf-comments because I'd rather
get any bugs in this proposal worked out by interested parties before
bothering the WG.]

    -- sandro

Received on Monday, 23 December 2002 18:02:27 UTC