Re: URIs / URLs from Charles McCathieNevile on 2001-04-10 (www-rdf-interest@w3.org from April 2001)

From: Charles McCathieNevile <charles@w3.org>
Date: Tue, 10 Apr 2001 10:17:36 -0400 (EDT)
To: Lee Jonas <ljonas@acm.org>
cc: <www-rdf-interest@w3.org>
Message-ID: <Pine.LNX.4.30.0104070944530.23694-100000@tux.w3.org>

On Fri, 6 Apr 2001, Lee Jonas wrote:

  Yet the task of keeping your own links current is a real problem,
  particularly for sites with large numbers of them such as search engines.
  URLs are not without their problems.

Hmm. Search engines don't break URIs as a rule - they have well defined URIs
such as http://search.example.org/find?you+me with well defined semantics
such as "The ten links in the database that the search criteria propose as
most relevant to the terms 'you' and 'me'". Search engines don't keep a whole
lot of old state information, so they don't tend to make available a URI that
means "The ten links most relevant to the terms 'you' and 'me' according to
the state of the database at time 19930405210456+1000" but there is no reason
why they can't do that very happily using existing URI syntax.
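
As a rough sketch of that point in Python: both a "current results" URI and a
"results as of a given time" URI fit in ordinary URI syntax. The `q` and `at`
parameter names and the `search_uri` helper are assumptions made up for this
illustration; no real search engine is claimed to support them.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "http://search.example.org/find"

def search_uri(terms, at=None):
    """Build a search URI; `at` (hypothetical) pins the database state."""
    params = [("q", " ".join(terms))]
    if at is not None:
        params.append(("at", at))
    # urlencode escapes the '+' in the timestamp so it survives a round trip.
    return BASE + "?" + urlencode(params)

current = search_uri(["you", "me"])
pinned = search_uri(["you", "me"], at="19930405210456+1000")

# The two URIs are distinct names with distinct meanings.
print(current)
print(parse_qs(urlsplit(pinned).query))
```

The two strings name different things: one is "whatever is most relevant now",
the other is "what was most relevant at that instant".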

The problem search engines have is that other people publish URIs thinking
they are temporary locations, not identifiers, so they change them. And break
the part of the web that relied on them.

  I dread to think how many man hours and CPU cycles have been spent around
  the globe ensuring links to other pages still work.  Additionally, the web
  is always 'current'.  There is no formal concept of version, the W3C and
  most other self-respecting organisations implement the necessary measures
  themselves (for representing Working Draft versions) in an ad hoc fashion.

A universal identifier syntax should not include a specification of version
information - versioning is complex and use-dependent, whereas being able to
give a name (any name) to something is the sort of simple, powerful tool that
makes things like the Web work.

  The notion of URLs identifying representations seems a little trite to me.
  It indicates the nature of the true problem, without fully addressing it: a
  resource at the end of a location is not consistent over time.

True. But a resource that has an identifier can reasonably be expected to keep
that identifier. After all, I am not the same person I was two years ago, but
'Charles Cavendish McCathieNevile' is generally reckoned to be an appropriate
identifier for the collection of skin and bones that writes this. In order to
disambiguate this for people who have fairly common names that can't be
distinguished by a computer system we can have schemes like
mailto:me@aaronsw.com where we rely on a few simple features to keep this
disambiguated.

On the other hand, the identifier
"Resident of 14 Fulkerson st #2 Cambridge MA 02141 USA" used to refer to me
and doesn't any more. That doesn't make it a bad identifier, just one with
different semantics. http://www.w3.org and http://www.w3.org/Overview.html
have different semantics too.

  This is for at least two good reasons: resources evolve, and resources move
  / disappear / or worse, a second resource ousts the first at a particular
  location.

This is simply a problem of bad implementation. See below for a further
discussion.

  The first issue could have been addressed more formally (and hence
  consistently) with a simple versioning scheme.  This would have alleviated
  the problem of instantly breaking third party links (or invalidating
  metadata semantics) when you change a resource.  Yes your links must change
  to reflect new versions of things you reference, but these changes could be
  a graceful migration, not an abrupt crash.

Systems such as cvsweb have a concept of versioning because they want to
publish versions. Things like http://www.w3.org/WAI/AU do not readily
identify versions because the publisher does not want to publish different
versions. Our metadata systems for identifying what kind of resource a given
URI is are still remarkably primitive, and are very important.

  The second is the main bugbear of using a resource's location to identify
  it.  This phenomenon is well known in distributed object technology.
  Superior solutions leave the actual resolution of an object's location to
  some distributed service when a client wants to interact with it.

For example we could break a resource into 3 parts, and leave resolution of
the first to an international, network wide service like DNS, the second to a
local system like a web server, with the resolution of the final part up to
the client...  Seriously, I think that you are confusing the resolution of a
resource with the naming of a resource.
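
To make the three-part split concrete, here is a minimal Python sketch using
the standard `urllib.parse.urlsplit`. The `resolution_steps` helper and its
dictionary keys are names invented for this illustration; the URI is the
rdf:type one quoted later in this message.

```python
from urllib.parse import urlsplit

def resolution_steps(uri):
    """Show which party resolves which part of an http: URI."""
    parts = urlsplit(uri)
    return {
        "dns": parts.hostname,        # resolved by the network-wide service
        "server": parts.path or "/",  # resolved by the local web server
        "client": parts.fragment,     # resolved by the client after retrieval
    }

steps = resolution_steps("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")
for who, part in steps.items():
    print(who, "->", part)
```

The name is the whole string; the split only describes how one particular
protocol happens to resolve it.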

  These are compounded with the fact that the resource can be one of many
  formats and there is no clear way to distinguish them from the URL itself.  A
  resource such as http://mydomain/mypic.png may safely be assumed to be a png
  graphic, but what about the resource at the end of http://mydomain/mydir/ ?
  Mime types have become pervasive for identifying a resource's type, yet URLs
  predate MIME by years.  If you want to know its type you have to make a
  request to some server process.

The assertion is false. It is only Windows that thinks it is a valid
statement. There are ways of stating in RDF that a resource has a given MIME
type, or even can have several MIME types - for example a jpeg image might be
available as image/jpg or as text/rdf according to what you are actually
trying to get. There are unique identifiers for each of these forms, and
there are URIs for the resource that has different forms. (This is how the
W3C site is set up, anyway, but it can be done on all kinds of systems). A
similar issue arises with language negotiation.
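
A simplified Python sketch of that arrangement: one generic URI, plus one
specific URI per form, chosen from the client's Accept header. The
example.org URIs, the `VARIANTS` table and the `negotiate` helper are
hypothetical, the type strings are copied as written above, and the sketch
ignores q-values and wildcards that real HTTP negotiation would honour.

```python
# Hypothetical variant table: generic URI -> {MIME type: specific URI}.
VARIANTS = {
    "http://example.org/photo": {
        "image/jpg": "http://example.org/photo.jpg",
        "text/rdf": "http://example.org/photo.rdf",
    }
}

def negotiate(generic_uri, accept_header):
    """Return (mime_type, specific_uri) for the first listed type we have."""
    variants = VARIANTS[generic_uri]
    for item in accept_header.split(","):
        # Drop any parameters such as ";q=0.8" and surrounding spaces.
        mime = item.split(";")[0].strip()
        if mime in variants:
            return mime, variants[mime]
    return None  # nothing acceptable; a server would answer 406

print(negotiate("http://example.org/photo", "text/rdf, image/jpg"))
```

The generic URI keeps identifying "the photo" whichever form is served.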

  It may be that URLs have been sufficient so far because the Internet is
  predominantly for human consumption, and identifying abstract resources that
  cannot be retrieved electronically has been pretty pointless to date.
  However, aspects of the semantic web vision (plus other technologies like
  WebServices, i.e. .Net?) could add a couple of new issues to the pot:

  1) It may become more common to reason about abstract resources whose
  identifiers may not be readily representable as a location.  It would be
  better to identify these with a URN.  Hence URNs may be more widely used
  than at present.

URI does not specify a location - HTTP specifies a way of retrieving
something that has a URI. And the syntax allows pretty much anything to be
named (incidentally so does the syntax for rational numbers, although it is
even less user-friendly). The problem is not a technical one but a social one
of convincing people not to change the meaning of identifiers. Using a new
set of identifiers simply passes the buck again.

  2) Encouraging reuse of RDF schema resources is antagonised by the need for
  schema owners to update them.  Currently the proposed RDFS solution is very
  MS COMesque - every schema update is a new resource that SHOULD be constant
  and available in perpetuity.  IMHO this approach is inferior to specifying a
  common versioning scheme, which when coupled with a standard change
  management process, allows related schemas to migrate and evolve in tandem.
  After all, that's exactly what happens with dependencies between software
  components now.

It is perfectly possible for RDF to describe complex versioning, such as the
fact that some parts of one schema haven't changed but others have, and even
complex inter-relationships that are beyond the average linear versioning
system.
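
As an illustration only, here is a small Python sketch that emits per-term
statements (as subject, predicate, object tuples) saying which parts of a
schema changed between two versions. The two schemas, the namespaces and the
`unchangedFrom` / `revises` / `addedIn` property names are all hypothetical;
they are not taken from RDFS or any published vocabulary.

```python
# Hypothetical schema versions: term name -> definition text.
V1 = {"title": "A name for the resource", "author": "Who wrote it"}
V2 = {"title": "A name for the resource", "author": "Who created it",
      "date": "When it was made"}

NS1 = "http://example.org/schema/v1#"
NS2 = "http://example.org/schema/v2#"
VOC = "http://example.org/versioning#"  # made-up vocabulary

def version_triples(old, new, old_ns, new_ns):
    """Describe each term of `new` relative to `old` as a triple."""
    triples = []
    for term in sorted(new):
        subject = new_ns + term
        if term not in old:
            triples.append((subject, VOC + "addedIn", new_ns))
        elif old[term] == new[term]:
            triples.append((subject, VOC + "unchangedFrom", old_ns + term))
        else:
            triples.append((subject, VOC + "revises", old_ns + term))
    return triples

triples = version_triples(V1, V2, NS1, NS2)
for t in triples:
    print(t)
```

Each term carries its own history, which a single linear version number
cannot express.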

  3) Data quality will be poorer if it is hard for software to detect a
  resource change.  Transience is bad news if you are going to store facts
  about something that subsequently changes.

Yes

  What the solution to all this is I don't know.  I just can't help feeling
  that as the semantic web progresses things are about to get a lot more
  complicated unless these issues are addressed.

The existing web suffers because people change the meaning of URIs (for
example from "my homepage" to "there is no resource here - 404..."). Dealing
with the existing problem will help the old world of the web as well as the
new, and allow us to integrate it nicely. Shifting the new world to a
different space fragments it but doesn't of itself provide a solution.

(Just my .02 <insert currency>)

cheers

Charles McCN

  Regards

  Lee



  Dan Connolly <connolly@w3.org> wrote:

  Pierre-Antoine CHAMPIN wrote:
  [...]
  > HTML and PDF version available at
  >   http://www710.univ-lyon1.fr/~champin/urls/

  This document promulgates a number of myths about
  Web Architecture...

    "However, other W3C recommendations use URLs to identify
    namespaces [6,8,11]. Those URLs do not locate the
    corresponding namespaces (which are abstract things and
    hence not locatable with the HTTP protocol), "

  I don't see any justification for the claim
  that namespaces are disjoint from HTTP resources.
  One of the primary motivations for the
  XML namespaces recommendation is to make the
  Web self-describing: each document carries the
  identifiers of the vocabularies/namespaces it's written in;
  the system works best when you can use such
  an identifier to GET a specification/description
  of the vocabulary.

    "For example, the URI of the type property of RDF is
    http://www.w3.org/1999/02/22-rdf-syntax-ns#type. As a
    matter of fact, the property itself is not located by that
    URL: its description is. "

  Again, I see no justification for the claim that this
  identifier doesn't identify a property.

    "URLs are transient

    That means they may become invalid after a certain period
    of time [9]. "

  That's a fact of life in a distributed system. URNs may
  become invalid after a period of time too. It's true
  of all URIs. URIs mean what we all agree that they mean.
  Agreement is facilitated by a lookup service like HTTP.
  In practice, URIs are quite reliable:
  6% linkrot according to http://www.useit.com/alertbox/980614.html
  and I think the numbers get better when you measure
  per-request rather than per-link,
  since popular pages are maintained
  more actively than average.

  Unless urn: URIs provide value
  that http: URIs do not, they won't be deployed.
  I think the fact that

   (a) urn:'s have been standardized
   (IETF Proposed Standard, that is) since 1995
   (b) support for them is available in popular browsers
   and has been for several generations
  and yet
   (c) still their use is negligible

  speaks for itself. They don't provide any value.
  Naming is a social contract, and the http: contract
  works and the urn: contract doesn't.


    "In the immediate interpretation, a URL identifies the
    resource retrieved through it."

  to be precise: it identifies the resource accessed
  thru it. In the general case, you can't retrieve
  a resource, but only a representation of one.
  Other text that makes this error includes:

    "... the retrieved resource ..."

  Another falsehood:

    "Contrarily to URLs, URNs (Uniform Resource Names) are
    designed to persistently identify a given resource."

  URIs in general are designed to persistently identify
  a given resource. Especially HTTP URIs.


  I recommend a series of articles by the designer
  of URIs, HTTP, and HTML to clarify a number of
  these myths:

    World Wide Web Design Issues
    http://www.w3.org/DesignIssues/

  esp

    The Web Model: Information hiding and URI syntax (Jan 98)
    http://www.w3.org/DesignIssues/Model

    The Myth of Names and Addresses
    http://www.w3.org/DesignIssues/NameMyth

    Persistent Domains - an idea for persistence of URIs (2000/10)
    http://www.w3.org/DesignIssues/PersistentDomains

    (Hmm... this one is an interesting idea, but I think freenet:
     might be easier to deploy.)

  and regarding the intent of the design of namespaces, see:

    cf Web Architecture: Extensible Languages
    W3C Note 10 Feb 1998
    http://www.w3.org/TR/NOTE-webarch-extlang


-- 
Charles McCathieNevile    http://www.w3.org/People/Charles  phone: +61 409 134 136
W3C Web Accessibility Initiative     http://www.w3.org/WAI    fax: +1 617 258 5999
Location: 21 Mitchell street FOOTSCRAY Vic 3011, Australia
(or W3C INRIA, Route des Lucioles, BP 93, 06902 Sophia Antipolis Cedex, France)
Received on Tuesday, 10 April 2001 10:18:00 UTC