RE: URIs / URLs

Charles McCathieNevile [mailto:charles@w3.org] wrote:
>
>On Fri, 6 Apr 2001, Lee Jonas wrote:
>
>  Yet the task of keeping your own links current is a real problem,
>  particularly for sites with large numbers of them such as search engines.
>  URLs are not without their problems.
>
>Hmm. Search engines don't break URIs as a rule - they have well defined
>URIs such as http://search.example.org/find?you+me with well defined
>semantics such as "The ten links in the database that the search criteria
>propose as most relevant to the terms 'you' and 'me'". Search engines don't
>keep a whole lot of old state information, so they don't tend to make
>available a URI that means "The ten links most relevant to the terms 'you'
>and 'me' according to the state of the database at time
>19930405210456+1000" but there is no reason why they can't do that very
>happily using existing URI syntax.
>
>The problem search engines have is that other people publish URIs thinking
>they are temporary locations, not identifiers, so they change them. And
>break the part of the web that relied on them.
>

Your second para is the problem I am referring to: maintaining links to
other people's resources.  Search engines in particular maintain a lot of
links to other people's resources.


>  I dread to think how many man hours and CPU cycles have been spent around
>  the globe ensuring links to other pages still work.  Additionally, the
>  web is always 'current'.  There is no formal concept of version, the W3C
>  and most other self-respecting organisations implement the necessary
>  measures themselves (for representing Working Draft versions) in an ad
>  hoc fashion.
>
>A universal identifier syntax should not include a specification of version
>information - versioning is complex and use-dependent, whereas being able
>to give a name (any name) to something is the sort of simple, powerful tool
>that makes things like the Web work.
>

Versioning does not have to be complex (e.g. resource modified number,
resource appended-to number, resource revision number), though it can be use
dependent.  It would have been nice if resource version and format were
formally captured in URI protocols in the same way that 'port' is.  I.e. it
is not part of the identifier (so different versions / formats are not
distinct resources); a specific version / format can be specified; the
default version is the latest.
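
A minimal sketch of this idea in Python.  The ResourceRef class and its
'version' and 'fmt' fields are hypothetical illustrations, not part of any
URI specification; the example.org URL is a placeholder.  The point is only
that the qualifiers travel with a request while identity is decided by the
identifier alone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResourceRef:
    """A reference to a resource plus optional request qualifiers.

    'version' and 'fmt' are hypothetical qualifiers. Like a port, they
    affect how the request is made but not which resource is meant.
    """
    identifier: str
    version: Optional[int] = None  # None means "the latest version"
    fmt: Optional[str] = None      # None means "the server's default format"

    def same_resource(self, other: "ResourceRef") -> bool:
        # Identity compares the identifier only; qualifiers are ignored.
        return self.identifier == other.identifier


latest = ResourceRef("http://www.example.org/spec")
draft3_pdf = ResourceRef("http://www.example.org/spec", version=3,
                         fmt="application/pdf")

# Two different requests, one resource.
print(latest.same_resource(draft3_pdf))
```

Metadata keyed on the identifier alone would then apply to every version
and format, while a statement about one specific version could carry the
qualifiers as well.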

This is fairly academic as URI protocols are well established as is.

>  The notion of URLs identifying representations seems a little trite to
>  me.  It indicates the nature of the true problem, without fully
>  addressing it: a resource at the end of a location is not consistent over
>  time.
>
>True. But a resource that has an identifier can reasonably be expected to
>keep that identifier. After all, I am not the same person I was two years
>ago, but 'Charles Cavendish McCathieNevile' is generally reckoned to be an
>appropriate identifier for the collection of skin and bones that writes
>this. In order to disambiguate this for people who have fairly common names
>that can't be distinguished by a computer system we can have schemes like
>mailto:me@aaronsw.com where we rely on a few simple features to keep this
>disambiguated.
>
>On the other hand, the identifier
>"Resident of 14 Fulkerson st #2 Cambridge MA 02141 USA" used to refer to me
>and doesn't any more. That doesn't make it a bad identifier, just one with
>different semantics. http://www.w3.org and http://www.w3.org/Overview.html
>have different semantics too.
>

I would argue, strictly speaking, that mailto:charles@w3.org identifies your
mailbox at the W3C, not the 'collection of skin and bones' named Charles
Cavendish McCathieNevile.

I have two mailboxes I use frequently, mailto:ljonas@acm.org &
mailto:lee.jonas@cakehouse.co.uk.  I'd rather not have to pick one and only
one to identify me.  Furthermore, what happens when you leave the W3C?
mailto:charles@w3.org becomes invalid (assuming the W3C does not recycle it
for use by another Charles, heaven forbid!) and your identity changes to
mailto:charles@whatever.com.  All previous references to
mailto:charles@w3.org are now broken, or worse mailto the wrong person.

Now, the 'mailto:charles@w3.org' mailbox can be manipulated electronically -
i.e. directly over the Internet by some software process - whereas 'Charles
Cavendish McCathieNevile' cannot.

How can we programmatically differentiate resources that are accessible
electronically and those that are not?  One suggestion is to use URLs for
the former, URNs for the latter.
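
Under that convention the check becomes a test on the URI scheme.  A
minimal sketch using Python's standard urllib.parse; the function name
is_urn and the sample URIs are illustrative assumptions, and the rule "urn:
means not electronically accessible" is the proposal here, not an existing
standard.

```python
from urllib.parse import urlparse


def is_urn(uri: str) -> bool:
    """Return True when the URI uses the 'urn' scheme."""
    return urlparse(uri).scheme.lower() == "urn"


for uri in ("http://www.w3.org/",
            "mailto:charles@w3.org",
            "urn:isbn:0451450523"):
    kind = "name only" if is_urn(uri) else "expected to be accessible"
    print(uri, "->", kind)
```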

At the moment, according to stats from DanC, 6% (approx 1 in 20) of URLs are
not vancable / dereferenceable.  I would imagine that most of these are just
broken.  If the W3C encourages use of URLs for resources that are not
electronically accessible (e.g. non-existent schemas, abstract concepts,
individuals, etc) then this will only increase.  At what point does the
perversion of using locations to identify unlocatable resources become
unacceptable?  When 20% (1 in 5), 30% (approx 1 in 3), 40% (2 in 5) of URLs
are not vancable?

>  This is for at least two good reasons: resources evolve, and resources
>  move / disappear / or worse, a second resource ousts the first at a
>  particular location.
>
>This is simply a problem of bad implementation. See below for a further
>discussion.
>

To which implementation are you referring?  The general implementation of
URLs or people's management of their own domain spaces?  As you are
defending URLs as they currently stand, I assume you mean people's
management of their own domain spaces.

IMHO, for the most part, the URL implementation is good.  

>  The first issue could have been addressed more formally (and hence
>  consistently) with a simple versioning scheme.  This would have
>  alleviated the problem of instantly breaking third party links (or
>  invalidating metadata semantics) when you change a resource.  Yes your
>  links must change to reflect new versions of things you reference, but
>  these changes could be a graceful migration, not an abrupt crash.
>
>Systems such as cvsweb have a concept of versioning because they want to
>publish versions. Things like http://www.w3.org/WAI/AU do not readily
>identify versions because the publisher does not want to publish different
>versions. Our metadata systems for identifying what kind of resource a
>given URI is are still remarkably primitive, and are very important.
>

Yes, and a scheme like the one I outlined above would allow you to pick and
choose whether you wanted to publish different versions or not, with the
added benefit that versioning is consistent across the Web, and different
versions of the same resource are not represented using different
identifiers (merely requested a la 'port' semantics), hence they are easier
to deal with programmatically.

>  The second is the main bugbear of using a resource's location to identify
>  it.  This phenomenon is well known in distributed object technology.
>  Superior solutions leave the actual resolution of an object's location to
>  some distributed service when a client wants to interact with it.
>
>For example we could break a resource into 3 parts, and leave resolution of
>the first to an international, network wide service like DNS, the second to
>a local system like a web server, with the resolution of the final part up
>to the client...  Seriously, I think that you are confusing the resolution
>of a resource with the naming of a resource.
>

You are describing the steps undertaken for accessing a resource in general.
The part I am referring to is how the local system locates a resource
(part 2 of the process you outline): via a location (e.g. file path), or by
looking up a locally unique id to determine a location (e.g. file path), if
any.

Consider DNS.  This does the same task for addressing machines.  An
identifier, 'www.w3.org', is dynamically resolved by a distributed
resolution service, the 'Domain Name System', to an address,
'193.51.208.68', which tells you how to access a resource: the hardware to
direct http/ftp/telnet/etc requests to.

My suggestion was to extend the principle beyond addressing hardware to
addressing Internet resources in general.  As it turns out, a group of
American Publishing Organisations got together to do just that!  See
http://www.doi.org.
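
The indirection can be sketched in a few lines of Python.  This is an
in-memory stand-in for a resolution service, not a model of DNS or DOI; the
Resolver class, the 'doc:annual-report' identifier and the file paths are
all hypothetical.

```python
from typing import Dict, Optional


class Resolver:
    """Maps stable identifiers to current locations (illustrative only)."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def register(self, identifier: str, location: str) -> None:
        # Registering again simply records the new location.
        self._locations[identifier] = location

    def resolve(self, identifier: str) -> Optional[str]:
        # None means the identifier has no location (not locatable).
        return self._locations.get(identifier)


resolver = Resolver()
resolver.register("doc:annual-report", "/srv/files/2001/report.html")
print(resolver.resolve("doc:annual-report"))

# The resource moves: the table changes, the identifier does not.
resolver.register("doc:annual-report", "/srv/archive/reports/2001.html")
print(resolver.resolve("doc:annual-report"))
```

Third parties that stored only the identifier are unaffected by the move,
which is the property being argued for here.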

>  These are compounded with the fact that the resource can be one of many
>  formats and there is no clear way to distinguish them from the URL
>  itself.  A resource such as http://mydomain/mypic.png may safely be
>  assumed to be a png graphic, but what about the resource at the end of
>  http://mydomain/mydir/ ?  Mime types have become pervasive for
>  identifying a resource's type, yet URLs predate MIME by years.  If you
>  want to know its type you have to make a request to some server process.
>
>The assertion is false. It is only Windows that thinks it is a valid
>statement. There are ways of stating in RDF that a resource has a given
>MIME type, or even can have several MIME types - for example a jpeg image
>might be available as image/jpg or as text/rdf according to what you are
>actually trying to get. There are unique identifiers for each of these
>forms, and there are URIs for the resource that has different forms. (This
>is how the W3C site is set up, anyway, but it can be done on all kinds of
>systems). A similar issue arises with language negotiation.
>

Strictly speaking, I agree.  However, it is done this way by many real world
applications, and not just on Windows.

With Windows you can also include both KDE and GNOME - two UNIX Window
Management systems.  Both are primarily available on Linux, and I believe
Sun have announced that GNOME will become the main Window Manager on
Solaris.

What is the mime type associated with an HTTP request to get
http://mydomain/mydir/ ? It could be a directory listing (text/html) or the
http request could be processed by a 'default handler' that could return a
representation in any arbitrary mime type.
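
The difference between the two example URLs shows up in any tool that
guesses a type from the URL text alone.  A small sketch using Python's
standard mimetypes module, which looks only at the extension and makes no
request to the server:

```python
import mimetypes

for url in ("http://mydomain/mypic.png", "http://mydomain/mydir/"):
    # guess_type returns a (type, encoding) pair; type is None when the
    # URL carries no recognisable extension.
    guessed, _encoding = mimetypes.guess_type(url)
    print(url, "->", guessed)
```

The first URL is guessed as image/png from its extension; the second has
nothing to go on, so the only way to learn its type is to ask the server.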

I consider using unique identifiers for different formats inferior to a
single identifier with a way to request the resource in different formats.
Consider making (third-party) RDF Statements about that image.  

The way it is:
I have to group the image/jpg & text/rdf resources together and make
statements about the group.  Note this does not indicate that they are in
fact different formats of the same thing, merely that they are treated
collectively for the purpose of stating facts about them.  What happens when
an image/png format of the same image is added?  My metadata now makes
statements about the image/jpg and text/rdf formats, but not the new
image/png format.

The way it could be:
I can make statements about a resource regardless of its available formats /
versions, or else make statements about specific formats / versions, the
choice is mine.

>  It may be that URLs have been sufficient so far because the Internet is
>  predominantly for human consumption, and identifying abstract resources
>  that cannot be retrieved electronically has been pretty pointless to
>  date.  However, aspects of the semantic web vision (plus other
>  technologies like WebServices, i.e. .Net?) could add a couple of new
>  issues to the pot:
>
>  1) It may become more common to reason about abstract resources whose
>  identifiers may not be readily representable as a location.  It would be
>  better to identify these with a URN.  Hence URNs may be more widely used
>  than at present.
>
>URI does not specify a location - HTTP specifies a way of retrieving
>something that has a URI. And the syntax allows pretty much anything to be
>named (incidentally so does the syntax for rational numbers, although it is
>even less user-friendly). The problem is not a technical one but a social
>one of convincing people not to change the meaning of identifiers. Using a
>new set of identifiers simply passes the buck again.
>

A URI is the abstract concept of 'identifier'.  It is either a URL
(identifier by location) or a URN (identifier by name).  I am claiming that
the need to refer to abstract concepts not locatable on the Internet will
probably increase as the semantic web progresses.

I agree that people must be convinced not to change the meaning of
identifiers (URIs).  This applies to both URLs and URNs.  The point of using
URNs is to keep URLs pure for locating resources that, in an ideal world,
will always be electronically accessible.  I.e. keeping unvancable URL links
down to 6%.

Is encouraging dilution of URL semantics to identify abstract resources W3C
policy?  Is it a commonly held view among W3C staff that this is a good
thing?  As well as encouraging people not to change the meaning of URIs, I
suggest the W3C should also encourage people to use appropriate identifiers,
i.e. choose between URLs and URNs - whichever is most appropriate - and not
always use URLs.

The only reason I can think the W3C would want to do otherwise is to avoid
URI scheme fragmentation.  However, I don't believe this should be much of
an issue.  User agent support would be a big factor in the URI scheme
publishers choose.  It is likely that the vast majority will stick to
schemes that are readily understandable by current popular user agent
software.  Or at least user agent software used by their target audiences.

>  2) Encouraging reuse of RDF schema resources is antagonised by the need
>  for schema owners to update them.  Currently the proposed RDFS solution
>  is very MS COMesque - every schema update is a new resource that SHOULD
>  be constant and available in perpetuity.  IMHO this approach is inferior
>  to specifying a common versioning scheme, which when coupled with a
>  standard change management process, allows related schemas to migrate and
>  evolve in tandem.  After all, that's exactly what happens with
>  dependencies between software components now.
>
>It is perfectly possible for RDF to describe complex versioning, such as
>the fact that some parts of one schema haven't changed but others have, and
>even complex inter-relationships that are beyond the average linear
>versioning system.
>

I am not sure that the level of sophistication you suggest is warranted.  I
would suggest that version information be associated with the schema in its
entirety - the same way that versioning for W3C Working Drafts and RFCs is
done.  I.e. associate version with the resource, not fragments within the
resource.

In the absence of native versioning support in URLs, a nice touch would be
to include metadata in RDF schemas to indicate which resource supersedes /
is-superseded-by and obsoletes / is-obsoleted-by other RDF schemas.
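
A minimal sketch of how software might use such metadata, assuming each
schema records at most one 'is-superseded-by' link.  The mapping, the
latest_version function and the example.org schema URLs are hypothetical;
no existing RDF vocabulary is implied.

```python
from typing import Dict


def latest_version(schema: str, superseded_by: Dict[str, str]) -> str:
    """Follow 'is-superseded-by' links until a schema has no successor."""
    seen = set()
    # The 'seen' set guards against accidental cycles in the metadata.
    while schema in superseded_by and schema not in seen:
        seen.add(schema)
        schema = superseded_by[schema]
    return schema


superseded_by = {
    "http://www.example.org/schema/v1": "http://www.example.org/schema/v2",
    "http://www.example.org/schema/v2": "http://www.example.org/schema/v3",
}

print(latest_version("http://www.example.org/schema/v1", superseded_by))
```

A client holding a reference to an old schema could then migrate to the
current one at its own pace rather than finding the old reference broken.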

>  3) Data quality will be poorer if it is hard for software to detect a
>  resource change.  Transience is bad news if you are going to store facts
>  about something that subsequently changes.
>
>Yes
>
>  What the solution to all this is I don't know.  I just can't help feeling
>  that as the semantic web progresses things are about to get a lot more
>  complicated unless these issues are addressed.
>
>The existing web suffers because people change the meaning of URIs (for
>example from "my homepage" to "there is no resource here - 404...").
>Dealing with the existing problem will help the old world of the web as
>well as the new, and allow us to integrate it nicely. Shifting the new
>world to a different space fragments it but doesn't of itself provide a
>solution.
>
>(Just my .02 <insert currency>)
>

Broken URIs will always be a fact of life and nothing is going to change
that.  

No one can claim that existing web technology is perfect.  Inhibiting
evolution leads to stagnation or, worse, to diluting existing concepts in an
attempt to "shoehorn" new requirements into old solutions that are not
entirely suitable.

I am keen to see URLs remain pure - locating electronically accessible
resources, not abstract concepts and unlocatable resources.  That way the
existing web will continue to have a low level of 'linkrot' and be far
better for it.

>cheers
>
>Charles McCN
>
>  Regards
>
>  Lee



  Dan Connolly <connolly@w3.org> wrote:

  Pierre-Antoine CHAMPIN wrote:
  [...]
  > HTML and PDF version available at
  >   http://www710.univ-lyon1.fr/~champin/urls/

  This document promulgates a number of myths about
  Web Architecture...

    "However, other W3C recommendations use URLs to identify
    namespaces [6,8,11]. Those URLs do not locate the
    corresponding namespaces (which are abstract things and
    hence not locatable with the HTTP protocol), "

  I don't see any justification for the claim
  that namespaces are disjoint from HTTP resources.
  One of the primary motivations for the
  XML namespaces recommendation is to make the
  Web self-describing: each document carries the
  identifiers of the vocabularies/namespaces it's written in;
  the system works best when you can use such
  an identifier to GET a specification/description
  of the vocabulary.

    "For example, the URI of the type property of RDF is
    http://www.w3.org/1999/02/22-rdf-syntax-ns#type. As a
    matter of fact, the property itself is not located by that
    URL: its description is. "

  Again, I see no justification for the claim that this
  identifier doesn't identify a property.

    "URLs are transient

    That means they may become invalid after a certain period
    of time [9]. "

  That's a fact of life in a distributed system. URNs may
  become invalid after a period of time too. It's true
  of all URIs. URIs mean what we all agree that they mean.
  Agreement is facilitated by a lookup service like HTTP.
  In practice, URIs are quite reliable:
  6% linkrot according to http://www.useit.com/alertbox/980614.html
  and I think the numbers get better when you measure
  per-request rather than per-link,
  since popular pages are maintained
  more actively than average.

  Unless urn: URIs provide value
  that http: URIs do not, they won't be deployed.
  I think the fact that

   (a) urn:'s have been standardized
   (IETF Proposed Standard, that is) since 1995
   (b) support for them is available in popular browsers
   and has been for several generations
  and yet
   (c) still their use is negligible

  speaks for itself. They don't provide any value.
  Naming is a social contract, and the http: contract
  works and the urn: contract doesn't.


    "In the immediate interpretation, a URL identifies the
    resource retrieved through it."

  to be precise: it identifies the resource accessed
  thru it. In the general case, you can't retrieve
  a resource, but only a representation of one.
  Other text that makes this error includes:

    "... the retrieved resource ..."

  Another falsehood:

    "Contrarily to URLs, URNs (Uniform Resource Names) are
    designed to persistently identify a given resource."

  URIs in general are designed to persistently identify
  a given resource. Especially HTTP URIs.


  I recommend a series of articles by the designer
  of URIs, HTTP, and HTML to clarify a number of
  these myths:

    World Wide Web Design Issues
    http://www.w3.org/DesignIssues/

  esp

    The Web Model: Information hiding and URI syntax (Jan 98)
    http://www.w3.org/DesignIssues/Model

    The Myth of Names and Addresses
    http://www.w3.org/DesignIssues/NameMyth

    Persistent Domains- an idea for persistence of URIs(2000/10)
    http://www.w3.org/DesignIssues/PersistentDomains

    (Hmm... this one is an interesting idea, but I think freenet:
     might be easier to deploy.)

  and regarding the intent of the design of namespaces, see:

    cf Web Architecture: Extensible Languages
    W3C Note 10 Feb 1998
    http://www.w3.org/TR/NOTE-webarch-extlang


-- 
Charles McCathieNevile    http://www.w3.org/People/Charles  phone: +61 409 134 136
W3C Web Accessibility Initiative     http://www.w3.org/WAI    fax: +1 617 258 5999
Location: 21 Mitchell street FOOTSCRAY Vic 3011, Australia
(or W3C INRIA, Route des Lucioles, BP 93, 06902 Sophia Antipolis Cedex, France)

Received on Wednesday, 11 April 2001 07:41:45 UTC