Using HTTP URIs *directly* to identify people/cars/etc is wrong from Sandro Hawke on 2003-02-08 (www-archive@w3.org from February 2003)

From: Sandro Hawke <sandro@w3.org>
Date: Sat, 08 Feb 2003 13:42:06 -0500
To: www-archive@w3.org
Message-Id: <200302081842.h18Ig7A01874@wadimousa.hawke.org>
Have I got this right yet?

   In the view of RFC 2396 and RFC 2616, each URI directly identifies
   one thing, its identified "resource".  These specifications keep
   the definition of "resource" and the nature of the identification
   relationship very abstract in order to keep the field open for
   unforeseeable future protocols and applications.  While it may have
   been tempting to define URIs more narrowly, saying perhaps that
   they directly point to living documents, such an approach might
   have prohibited novel applications such as those involving
   streaming media, mobile code, cookies, and web services.  So an
   abstract definition was used and the web has kept evolving.

   Unfortunately, the abstractness in the definition of "resource" has
   led some people to think it was reasonable to identify people,
   products, organizations, physical objects, etc., with HTTP URIs.
   RFC 2396 makes it clear that URIs in general can be used like this,
   but RFC 2616 and the HTTP protocol are not meant to be used this
   way.  HTTP URIs are intended for use with the HTTP protocol, which
   is a particular data transfer protocol.  The reason to use an HTTP
   URI is that it can be used with the HTTP protocol.

   It is tempting to say that an HTTP URI like
   "http://www.w3.org/People/EM" can identify a person.  In a loose,
   natural language sense, this string does identify Eric Miller.
   Similarly, in a loose natural language sense, the MIME entity
   returned in a successful HTTP GET transaction "represents" Eric
   Miller.  It has his picture, and Merriam-Webster's first definition of
   "representation" is "an artistic likeness or image".  But these
   meanings of "identify" and "represent" are not in the technical
   sense meant by the HTTP specifications.

   The temptation is strong, because if you identify Eric with a URI
   like "http://www.w3.org/People/EM", you can easily use HTTP to get
   information about him.  By the same token, if you identify the RDF 
   type property with a URI like
   "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", you can easily
   use HTTP to get ontological information about it.

   Unfortunately, this use of URIs is not in keeping with existing web
   technologies.  Everything from bookmark editors and search engines
   to human institutions like ad agencies treat HTTP URIs as
   identifying something like a source of information, a virtual place
   to visit, or something you can talk to.  These uses of URIs are at
   the heart of HTTP -- we wanted to use HTTP URIs for identifying
   Eric Miller and the RDF type property so we could easily obtain
   some information about them! -- and they are fundamentally different
   from the use of URIs to identify things like physical objects.

   The dictionary notion of representation might tempt people to argue
   that even if "http://www.w3.org/People/EM" does not identify Eric
   Miller, at least "http://www.w3.org/People/EM/e_miller.jpg" does.
   Clearly, that URI leads to a picture which represents him.  But
   this approach breaks down when you consider that HTTP GET of
   "http://www.w3.org/Team/EM/s000782" also gives you a picture of
   him.  They both represent Eric, but they are different pictures.
   They show him in different poses, and (as you may have noticed)
   they are published on the web differently; access to the latter one
   is tightly controlled.  If the resource identified by each of these
   URIs was Eric Miller, we could not use the URIs as identifiers to
   talk about the differences here.  The correct view is that the URIs
   identify systems which offer a pattern of HTTP responses; the
   second resource uses access control while the first does not.
   Those systems and those responses bear an interesting relationship
   to Eric Miller, but they are themselves things worthy of discussion
   and so of being identified.

   TimBL has argued that this view is correct for non-fragment URIs,
   but that URIs with a fragment part are different.  I disagree,
   because I think HTTP fragment URIs, even though unused in HTTP
   itself, still identify information sources.  We often footnote
   discussions with fragment URIs to say, in effect, my point is
   supported by _this_ _part_ of some document.  That part of the
   document is a source of information, like a whole document, which
   may be bookmarked, linked-to, indexed, and even used in
   advertisements.

   So how can we identify Eric Miller and still have ready access to
   his web page?  The answer is that when we use a URI as a name for
   something, we should be clear whether we mean it to operate
   directly as a web address or indirectly as (in topic maps
   terminology) a subject indicator.  When a URI appears as an xmlns
   value, an HTML profile identifier, an HTTP extensions identifier
   (RFC 2774), or as an RDF predicate, it clearly is operating as a
   subject indicator.  We know this because, among other things, the
   application operates normally even when HTTP access to the resource
   is impossible.  Another sign is that implementations compare such
   URIs on a character-by-character basis, not even folding case in
   the scheme name.
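
   A rough sketch of that last sign, in Python.  The function names
   are made up for illustration, and the web-address normalization is
   deliberately minimal; the point is only that a subject indicator is
   compared as an opaque string, while a web address may reasonably be
   normalized first.

```python
from urllib.parse import urlsplit, urlunsplit


def same_subject_indicator(a: str, b: str) -> bool:
    # Opaque, character-by-character comparison: no case folding,
    # not even in the scheme name.
    return a == b


def same_web_address(a: str, b: str) -> bool:
    # Hypothetical minimal normalization: treat the scheme and the
    # authority part as case-insensitive, leave everything else alone.
    def normalize(uri: str) -> str:
        parts = urlsplit(uri)
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path, parts.query, parts.fragment))

    return normalize(a) == normalize(b)
```

   Under this sketch, "HTTP://www.w3.org/1999/xhtml" and
   "http://www.w3.org/1999/xhtml" are the same web address but two
   different subject indicators.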

   In fact, about the only time this dual use of URIs as web addresses
   and subject indicators is even noticeable is in RDF node labels.
   When we have RDF triples like this example in the RDF Primer [1]

   <http://www.example.org/index.html>
                <http://purl.org/dc/elements/1.1/creator>
                                <http://www.example.org/staffid/85740> .  

   we can guess the first URI is being used as a web address while
   the third is being used as a subject indicator, but such a
   determination is not always possible.  I suspect, given its PICS
   heritage, that in early uses of RDF the node labels were always
   intended as web addresses -- this was information about the
   relationships between web pages -- but I don't know.  Somewhere
   along the line, people got tempted by the wording of RFC 2396 and
   their own desire to make a more useful system, and they started
   to lose the distinction.

   It has been suggested that type inference can serve to
   disambiguate triples like the one above.  If the range of
   dc:creator were Person, then we would know
   "http://www.example.org/staffid/85740" was being used as a subject
   indicator.  That might work, sometimes.  But type inference cannot
   always help.  Imagine a work of art, a sculpture with a URL
   engraved in its base.  At that address, the sculptor maintains a
   website about the work.  If that URL is
   "http://www.example.org/index.html", then does the above triple
   tell us about the creator of the sculpture or the creator of the
   website?  If we defined dc:websiteCreator, which could only be used
   to tell us about the creator of a website, we would be in the same
   mess if we came across a website about a website.  In some cases,
   no amount of information about a URI can tell us whether, in a
   given occurrence in an RDF triple, it is meant to be used as a
   subject indicator or a web address.
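
   The limit of type inference can be sketched in a few lines of
   Python.  The class names and the helper below are hypothetical,
   chosen only to mirror the examples above: knowing the inferred
   class rules out the web-address reading for a Person, but not for
   anything that is itself an information source.

```python
WEB_ADDRESS = "web address"
SUBJECT_INDICATOR = "subject indicator"

# Hypothetical classes whose members can themselves be reached by HTTP.
INFORMATION_SOURCE_CLASSES = {"Website", "WebPage"}


def possible_modes(inferred_class):
    """Return the readings of a URI node label that remain possible
    once the class of the thing it names has been inferred
    (None means nothing could be inferred)."""
    if inferred_class is None:
        return {WEB_ADDRESS, SUBJECT_INDICATOR}
    if inferred_class in INFORMATION_SOURCE_CLASSES:
        # The URI may address this website directly, or it may address
        # a website *about* the website that is meant.
        return {WEB_ADDRESS, SUBJECT_INDICATOR}
    # A person or a sculpture cannot be what an HTTP GET talks to.
    return {SUBJECT_INDICATOR}
```

   Here possible_modes("Person") leaves one reading, while
   possible_modes("Website") still leaves both, which is the
   website-about-a-website case.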

   RDF documentation should be clear: it should use the word
   "resource" only when talking about the thing immediately identified
   in an RFC 2396/2616 sense.  A physical object cannot be an HTTP
   resource, in this sense.  It could be a resource using some other
   URI scheme/URN NID, like urn:oid, urn:uuid, or tag:.

   In terms of the actual syntax and semantics, some solutions for
   RDF include:

     - In the abstract syntax, say that URIs label nodes in one of two
       ways (web address and subject indicator); in the concrete
       syntax imagine rdf:about and rdf:resource being combined into
       one linking attribute, but then split that into rdf:webAddress
       and rdf:subjectIndicator.  (Or perhaps something like
       aboutWebPage, aboutIndicatedSubject, identifiedResource, and
       indicatedSubject.   The names will take a little work.)

     - Alternatively, deprecate URI node labels in one or both modes
       of identification, while introducing RDF properties webAddress
       and subjectIndicator.

     - A third option is my http://www.w3.org/2002/12/rdf-identifiers
       proposal where *in* *RDF* the "#" is seen as a flag indicating
       which style of URI use is intended.   This is a
       backwards-compatibility hack to avoid needing to change or
       deprecate current uses.
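
   As a sketch of what the first and third options might look like in
   an implementation, here is a small Python model.  All names are
   hypothetical.  The direction of the "#" flag (fragment present
   means subject indicator) is an assumption made for the example;
   this message does not spell it out.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    WEB_ADDRESS = "webAddress"
    SUBJECT_INDICATOR = "subjectIndicator"


@dataclass(frozen=True)
class NodeLabel:
    # A node label is the URI *plus* the way it is being used, so the
    # same URI in two modes gives two distinct labels.
    uri: str
    mode: Mode


def label_from_hash_flag(uri: str) -> NodeLabel:
    # Assumed reading of the "#" flag: a fragment marks a subject
    # indicator, its absence marks a plain web address.
    mode = Mode.SUBJECT_INDICATOR if "#" in uri else Mode.WEB_ADDRESS
    return NodeLabel(uri, mode)
```

   With explicit modes, the sculpture and the website about it can
   share "http://www.example.org/index.html" and still be distinct
   nodes; with the "#" flag, existing RDF stays as written and the
   mode is read off the URI itself.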

   
[1] http://www.w3.org/TR/rdf-primer/#rdfmodel
Received on Saturday, 8 February 2003 13:43:50 UTC