RE: [BioRDF] URI Resolution from Booth, David (HP Software - Boston) on 2007-02-02 (public-semweb-lifesci@w3.org from February 2007)

From: Booth, David (HP Software - Boston) <dbooth@hp.com>
Date: Fri, 2 Feb 2007 00:18:55 -0500
To: "Jonathan Rees" <jonathan.rees@gmail.com>, "public-semweb-lifesci" <public-semweb-lifesci@w3.org>
Cc: "Susie Stephens" <susie.stephens@oracle.com>
Message-ID: <EBBD956B8A9002479B0C9CE9FE14A6C201FD7663@tayexc19.americas.cpqcorp.net>
Here is the full text of the draft at
http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup/Documents?action=AttachFile&do=get&target=getting-information.txt
so that people can easily comment on specific portions by hitting
"reply".  (But please edit your reply to include only the portions
relevant to your comment!)
[[
	 URI Resolution: Finding Information About a Resource
	   Jonathan Rees, Alan Ruttenberg, Matthias Samwald

 Version status: Very rough, still somewhat in outline form.  Not yet
 reviewed by AR or MS.  I will convert this to HTML and format it as a
 W3C "Interest Group Technical Note" when it is closer to being done.



Problem statement

Problem: An application has its hands on a URI, and needs to learn
more about the resource named by the URI.  Two kinds of information
are important: particular representations of the resource, if the
resource is an information resource; and RDF-encoded information about
the resource, regardless of whether the resource is or is not an
information resource.

(Recall that according to HTTP dogma, "resource" is an abstract
notion; a GET request returns a representation of an information
resource, not the resource itself.)

For a given resource R, often an information resource Q is available
that holds the information about R that the application needs
(possibly along with other information).  A common case is a
self-describing resource, i.e. R = Q.  Without much loss of
generality, we can take the finding-information problem to be that of
going from R's URI to the resolvable URL of an information resource Q
that either describes, or is, R.  Other methods of obtaining
information about R (such as a SOAP call or SPARQL query) either can
be cast as HTTP requests or are so similar to HTTP requests that they
do not introduce significant new issues.

As the process of finding and using information (including resource
representations) is automated, performance is often a serious issue;
some URL's for appropriate information resources will be served
quickly enough, and others won't.  There may be other constraints
dictating whether a URL is suitable for use, such as security
properties of the network link used to fetch the representation.

So here is the concise statement of what I'll call (somewhat
misleadingly, but for reasons of inertia) the URI Resolution Problem:

    Given a URI for a resource R, obtain if possible a URL for an
    information resource Q that provides desired information about R
    and/or a representation of R, such that Q may be accessed using a
    communication link that has adequate performance and privacy.
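
As a rough illustration of the selection step in this statement, here
is a minimal Python sketch.  The Candidate record, its fields, and the
thresholds are hypothetical names invented for the example; how the
candidate URLs and their link properties are discovered is exactly the
open question and is not addressed here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record: one URL at which Q might be fetched."""
    url: str
    latency_ms: float  # assumed: measured or estimated round-trip time
    secure: bool       # assumed: True if the link meets the privacy need


def choose_url(candidates: Iterable[Candidate],
               max_latency_ms: float,
               require_secure: bool) -> Optional[str]:
    """Return the fastest candidate URL that satisfies both constraints,
    or None when no candidate is adequate ("obtain if possible")."""
    adequate = [c for c in candidates
                if c.latency_ms <= max_latency_ms
                and (c.secure or not require_secure)]
    if not adequate:
        return None
    return min(adequate, key=lambda c: c.latency_ms).url
```

Returning None rather than raising keeps the "if possible" part of the
statement explicit: the caller decides what to do when no adequate URL
exists.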

This problem has nothing specifically to do with HCLS, except that we
seem to be the ones suffering the most pain from it.  It also has
little to do with the semantic web per se, except to the extent that
the information we want to use is encoded in RDF.

Example:

    An RDF file is composed using URL's that all resolve nicely.  When
    years later someone tries to use the file, some of these same
    URL's are broken due to acquisitions, web site reorganizations,
    and changes of administration.  All the linked resources are
    available, just under different URL's.  How to make the user's
    application work without having to rewrite the RDF?


Why is this hard?

  - Non-URL URI (scheme not understood by applications, e.g. info:,
    mailto:)

  - Broken link:
    . server gone or renamed
    . resource gone or renamed

  - Not-so-good URL:
    . communication link too slow
    . communication link not secure

  - Not-so-good content behind the URL:
    . resource R has no usable representation (e.g. not RDF)
    . R is too big
    . response to request is not a representation of the intended
      resource (e.g. http://www.ihmc.us/users/phayes/PatHayes)
    . R doesn't contain the information about R that's needed by the
      application ("metadata" exists but is elsewhere)


What is the received wisdom?

  - Don't mint non-URL URI's. (TimBL)
      [good as far as it goes, but we may not be in a position to
      choose]

  - Mint URL's whose hostname specifies a long-lived server that will
    maintain the resource at the given URL in perpetuity.  (Publishers,
    libraries, and universities are in good positions to do this.)
      [good as far as it goes, but user may not be in control, or may
      find quality name management to be beyond his/her grasp]

  - Use a web cache such as Apache or Squid, and a proxy configuration
    on the client, to deliver the correct content when a URL is
    presented that can't or shouldn't be used directly.
    (Dan Connolly)
      [this is a possible solution... see below]

  - Use LSID's.  LSID resolvers are very similar to web caches in that
    an intermediate server is deployed to map URIs.
      [requires maintenance of an LSID resolver; not all problematic
      URI's are LSID's]

  - If the type of the representation is unusable, use content
    negotiation and/or GRDDL to get a representation of a usable type.
      [can Alan say more about why he dislikes content negotiation?]

  - If the server replies 303 See Other, follow the link in the
    response to get information about the resource.
      [obscure hack but worth a try]
      (see http://www.w3.org/2001/tag/issues.html#httpRange-14)

  - To relate a non-information-resource to information about it,
    mint URI's of the form http://example.org/foo#bar to name the
    resource, with the convention that the URI http://example.org/foo
    will name an information resource that describes it.
      [obscure hack, probably too late to take hold, e.g.
      ontology http://xmlns.com/foaf/0.1/ doesn't use #]
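
The last two conventions above (303 See Other and hash URI's) can be
sketched in Python using only the standard library.  This is an
illustrative sketch, not a reference implementation: the function
names are hypothetical, and the HTTP status and Location header are
passed in as plain values so that no network access is needed.

```python
from typing import Optional
from urllib.parse import urldefrag, urljoin


def describing_uri(uri: str) -> Optional[str]:
    """For a URI such as http://example.org/foo#bar, return the
    fragment-free URI that, by the hash convention, names the describing
    information resource.  Return None when there is no fragment."""
    base, fragment = urldefrag(uri)
    return base if fragment else None


def next_uri(request_uri: str, status: int,
             location: Optional[str]) -> Optional[str]:
    """On 303 See Other, return the URI given in the Location header,
    resolved against the request URI in case it is relative.
    Any other status yields None; the caller handles those cases."""
    if status == 303 and location:
        return urljoin(request_uri, location)
    return None
```

For instance, describing_uri applied to http://example.org/foo#bar
gives http://example.org/foo, while the FOAF namespace URI has no
fragment and gives None, which is the limitation noted in the bracketed
comment.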


What would a good solution be like?

Observation: We need information in order to find information.

  - Knowledge about how to resolve a URI ('resolution information')
    will often be idiosyncratic to a particular point of use, and will
    usually be found in the hands of the individual users who care
    most about it.

  - The people who have resolution information aren't necessarily
    server or web cache administrators.  [Client side is important.
    LSID resolvers and web caches are not very appropriate, and
    reliance on them will hinder advancement of SW.]

  - Resolution information changes all the time.  [Submitting a work
    request to a server administrator is not practical, even when there
    is a server and an administrator.]

  - There will inevitably be a way (or some ways) to express
    resolution information to the software that's able to use it.

  - Users will want to use the same resolution information with
    multiple applications.

  - Users will want to share resolution information with one another
    in various ways (email, inclusion in documents / systems, etc).


Received languages for configuring existing URI mappers/resolvers
include Apache configuration files (e.g. the RewriteRule directive),
Squid configuration files, and LSID resolver configuration files [need
to research these].
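
To give a feel for what such a mapping language does, here is a small
Python sketch of ordered pattern/replacement rules, loosely in the
spirit of a rewrite directive.  It is not Apache, Squid, or LSID
configuration syntax; the rule table, hostnames, and function name are
hypothetical placeholders.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical, user-maintained rules: (pattern, replacement) pairs,
# tried in order.  The hosts are placeholders, not real services.
RULES: List[Tuple[str, str]] = [
    (r"^http://old\.example\.org/data/(.*)$",
     r"http://mirror.example.net/archive/\1"),
    (r"^info:example/(.*)$",
     r"http://resolver.example.net/info/\1"),
]


def rewrite(uri: str, rules: List[Tuple[str, str]] = RULES) -> Optional[str]:
    """Return the URL produced by the first matching rule,
    or None if no rule applies to the URI."""
    for pattern, replacement in rules:
        match = re.match(pattern, uri)
        if match:
            # expand() substitutes captured groups such as \1
            return match.expand(replacement)
    return None
```

Because the rules are plain data, a user could keep them in a file and
share them by email, which is the property the observations above ask
for.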



Proposal: A URI resolution ontology.

The premise is that we're dealing with semantic web applications, and
we think RDF is a good knowledge representation language, so let's use
RDF to represent resolution information.

  - Kinds of information that could be represented using such an
    ontology:

    . InformationResource vs. NotAnInformationResource

    . Lifetime expectation information, e.g. doesn't change, expires;
      cf. HTTP Cache-Control: and other headers

    . Retrieval methods: direct; URI transformation; SPARQL; web
      service

    . Client-side content-type awareness and "content negotiation"
      (choice among variants)

    . Properties: Version description, DC, MD5, ...

    . Relations among resources: e.g. relate resource to information
      resource that describes it

  - Don't share bare URI's; provide resolution information.  You get
    to choose whether the resolution information resides inside the
    document that uses the URI, or is carried independently of that
    document.

  - OWL can express rich properties and relations, e.g. resolution
    policies that apply to all objects of a given type.

  - OWL makes application of resolution tactics automatic,
    predictable, uniform (across applications), and error-free.

  - OWL-based resolution information could be used directly by an
    application, by a client-side web cache (e.g. local Squid
    installation), or by a shared web cache.

  - Disadvantage: you need an OWL engine to interpret resolution
    information represented in this way, and not all applications have
    an OWL engine.  [so why not get one and link it in?]
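
As a very rough sketch of how an application might consume resolution
information expressed as triples, consider the Python below.  Everything
in it is an assumption made for illustration: the vocabulary terms
(describedBy, mirroredAt) and all URI's are hypothetical, since no
ontology has been defined yet, and plain tuples with direct lookup stand
in for an RDF store.  No OWL reasoning is performed.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical vocabulary; the draft does not define these terms.
RES = "http://example.org/resolution#"
DESCRIBED_BY = RES + "describedBy"
MIRRORED_AT = RES + "mirroredAt"

# Resolution information a user might carry alongside a document.
INFO: Set[Triple] = {
    ("http://example.org/protein/P1", DESCRIBED_BY,
     "http://example.org/doc/P1.rdf"),
    ("http://example.org/doc/P1.rdf", MIRRORED_AT,
     "http://mirror.example.net/P1.rdf"),
}


def objects(info: Set[Triple], subject: str, predicate: str) -> List[str]:
    """All objects of triples matching subject and predicate, sorted."""
    return sorted(o for s, p, o in info if s == subject and p == predicate)


def resolve(info: Set[Triple], uri: str) -> Optional[str]:
    """Follow describedBy if present, then prefer a mirror if present.
    Return None when the triples say nothing about the URI."""
    described = objects(info, uri, DESCRIBED_BY)
    target = described[0] if described else uri
    mirrors = objects(info, target, MIRRORED_AT)
    if mirrors:
        return mirrors[0]
    return target if described else None
```

A policy such as "this rule applies to all objects of a given type"
would need real inference and is beyond what this lookup can express.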


[Discussion - how would we develop and deploy such a thing?]


[Related issue: versioning.]

    

See Tim's slides and other documents for his take on URI's, e.g.

http://dig.csail.mit.edu/2007/Talks/0108-swuri-tbl/


Acknowledgments: Chris Hanson, Tim Berners-Lee, Dan Connolly
]]


David Booth, Ph.D.
HP Software
dbooth@hp.com
Phone: +1 617 629 8881
  

> -----Original Message-----
> From: public-semweb-lifesci-request@w3.org 
> [mailto:public-semweb-lifesci-request@w3.org] On Behalf Of 
> Jonathan Rees
> Sent: Thursday, February 01, 2007 6:21 PM
> To: Susie Stephens
> Cc: public-semweb-lifesci
> Subject: Re: [BioRDF] URI Resolution
> 
> 
> Sorry, I should have changed the subject line. Please reply to this
> message, not the previous one, so that the thread gets properly
> threaded.
> 
> On 2/1/07, Jonathan Rees <jonathan.rees@gmail.com> wrote:
> > As promised, here's a draft of a document about what we've been
> > calling the "URI resolution" problem, building on Alan's presentation
> > at the Amsterdam F2F. It's obviously not finished but comments are
> > welcome.
> >
> > http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup/Documents?action=AttachFile&do=get&target=getting-information.txt
> >
> > Jonathan
> >
> 
>
Received on Friday, 2 February 2007 05:23:12 UTC