RE: "canonical" URIs from David Orchard on 2002-03-19 (www-tag@w3.org from March 2002)

From: David Orchard <david.orchard@bea.com>
Date: Tue, 19 Mar 2002 10:04:42 -0800
To: <www-tag@w3.org>
Message-ID: <034801c1cf8e$8a622cc0$420ba8c0@beasys.com>
TAG members,

I don't see URI comparison officially listed as a TAG issue.  I'd like
Joseph/Stephen's issue added to the TAG issues list.

Equivalence rules for URIs are defined by the URI scheme.  HTTP has a
section on URI comparison.

However, XML does not have a default comparison function for the XML Schema
anyURI data type.  I think a reasonable approach would be to say that the
default comparison function for anyURI is to use the HTTP URI comparison
algorithm, but that it is overridable by any scheme.
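
A rough sketch of that idea, in Python. The names any_uri_equal,
http_style_equal and SCHEME_COMPARATORS are invented for illustration and come
from no spec; the "x-exact" scheme is hypothetical. The HTTP rules covered here
are only the case-insensitive scheme and host, the default port, and the empty
path; percent-encoding equivalence is left out.

```python
from typing import Callable, Dict
from urllib.parse import urlsplit

Comparator = Callable[[str, str], bool]

# Assumed default ports; only used to treat ":80" and "no port" alike.
DEFAULT_PORTS = {"http": 80, "https": 443}


def _http_key(uri: str):
    """Reduce a URI to the parts that HTTP-style comparison looks at."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()               # scheme: case-insensitive
    host = (parts.hostname or "").lower()       # host: case-insensitive
    port = parts.port or DEFAULT_PORTS.get(scheme)  # missing port == default
    path = parts.path or "/"                    # empty path == "/"
    # Path, query and fragment stay case-sensitive.
    return (scheme, host, port, path, parts.query, parts.fragment)


def http_style_equal(a: str, b: str) -> bool:
    """Default comparison: HTTP-style rules applied to any URI."""
    return _http_key(a) == _http_key(b)


# Hypothetical registry: a scheme may replace the default comparison.
SCHEME_COMPARATORS: Dict[str, Comparator] = {}


def any_uri_equal(a: str, b: str) -> bool:
    """Compare two anyURI values, letting the scheme override the default."""
    scheme_a = urlsplit(a).scheme.lower()
    scheme_b = urlsplit(b).scheme.lower()
    if scheme_a != scheme_b:
        return False
    compare = SCHEME_COMPARATORS.get(scheme_a, http_style_equal)
    return compare(a, b)
```

A scheme that wants strict string equality would register its own function,
e.g. SCHEME_COMPARATORS["x-exact"] = lambda a, b: a == b, and everything else
would fall through to the HTTP-style default.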

Cheers,
Dave


> -----Original Message-----
> From: www-tag-request@w3.org
> [mailto:www-tag-request@w3.org]On Behalf Of
> Joseph Reagle
> Sent: Tuesday, February 19, 2002 11:40 AM
> To: www-tag@w3.org
> Cc: PhillipHallam-Baker; xme; Merlin Hughes; duerst@w3.org
> Subject: Re: "canonical" URIs
>
>
>
> Stephen has asked an interesting question below that I expect will be
> important to any activity that uses URIs as identifiers in the context of
> a semantic/security application: when are two URI variants considered
> identical?
>
> My first impulse was to check the XML namespace spec, "[Definition:] URI
> references which identify namespaces are considered identical when they are
> exactly the same character-for-character." [a]
>
> [a] http://www.w3.org/TR/REC-xml-names/
>
> However, this could benefit from further specificity. What about the
> following sort of issues?
>
>   The URI attribute identifies a data object using a URI-Reference,
>   as specified by RFC2396 [URI]. The set of allowed characters for
>   URI attributes is the same as for XML, namely [Unicode]. However,
>   some Unicode characters are disallowed from URI references
>   including all non-ASCII characters and the excluded characters
>   listed in RFC2396 [URI, section 2.4]. However, the number sign (#),
>   percent sign (%), and square bracket characters re-allowed in RFC 2732
>   [URI-Literal] are permitted. Disallowed characters must be escaped as
>   follows: ...
>   http://www.w3.org/TR/2002/REC-xmldsig-core-20020212/#sec-URI
>
> I spoke to TimBL briefly about the question; he enumerated many of the
> places one might look for equivalence in the "URI stack" *while* stating
> that clearly one wouldn't want to address all these layers, given the
> complexity and processing required:
>   URI spec
> 	string = string
>   HTTP DNS
> 	W3.org = w3.org
>   DNS LOOKUP
> 	www.w3.org   <-- CNAME --  w3.org
>   HTTP REDIRECT
> 	/foo --REDIRECT--> /foo/
>   RDF
> 	/foo = /bar
>
> Consequently, character by character comparison is probably the most
> straightforward approach -- assuming one addresses the character encoding
> issues well.
>
> Stephen is presently using "absolute URIs" with RFC2396 equivalence (see
> below). This seems fairly straightforward as well -- though it says, "if
> the URI is case insensitive ..." I think it might be useful to specify
> whether case *is* relevant or not for that app. Any thoughts?
>
> Also, my broader question to the TAG is, does this seem like a worthwhile
> issue to address for all of our specifications? I expect the
> validation/augmentation of URIs of type anyURI in schema might also be
> relevant to this question but haven't thought about it too carefully.
>
> [1] On Thursday 14 February 2002 06:01, Stephen Farrell wrote:
> > ...
> > The OASIS security committee's [1] SAML spec [2] is about access
> > control. One of its messages is of the form "can fred see
> > http://foo.com/stuff" with a minimal answer being "yes/no".
> >
> > Now, we're trying to figure a good way to tell implementors not
> > to fall for the following scenario:
> >
> > Q: "can fred see http://foo.com/stuff" A: no
> > Q: "can fred see HTTP://Foo.COM:80/stuff" A: no
> > Q: "can fred see http://foo.com/otherstuff/../stuff" A: yes
> >
> > Which involves us in giving some guidance for a "canonical
> > form" of URI, at least for the de-referencable via HTTP
> > URLs.
> >
> > My best bet so far is the following:
> >
> >    By the "canonical form" of a URI we mean an absolute URI (i.e. no
> >    relative URIs) which is the shortest of all the equivalent URI
> >    strings, where URI equivalence is defined according to [RFC2396].
> >    For example, the URI "http://foo.com:80/go/../go/to/" is not in
> >    canonical form, but "http://foo.com/go/to" is in canonical form.
> >    Note that if a URI is partly or entirely case-insensitive, then
> >    there will be more than one "canonical form" for that URI such
> >    that a case sensitive matching rule would consider that the
> >    strings differ (e.g. "HTTP://Foo.cOm/go/to" is "another" canonical
> >    form of the URL above).
> >
> >
> > Ta,
> > Stephen.
> >
> > [1] http://www.oasis-open.org/committees/security/
> > [2] http://www.oasis-open.org/committees/security/docs/draft-sstc-core-25.pdf
> > [RFC2396] ftp://ftp.isi.edu/in-notes/rfc2396.txt
>
> --
>
> Joseph Reagle Jr.                 http://www.w3.org/People/Reagle/
> W3C Policy Analyst                mailto:reagle@w3.org
> IETF/W3C XML-Signature Co-Chair   http://www.w3.org/Signature/
> W3C XML Encryption Chair          http://www.w3.org/Encryption/2001/
>
>
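
For Stephen's scenario quoted above, here is a minimal sketch of one possible
canonicalisation step for HTTP URLs, in Python. It is an assumption about what
"canonical form" could mean in practice, not text from RFC2396 or the SAML
draft: lower-case the scheme and host, drop the default port, and collapse "."
and ".." path segments. The function names are invented for illustration; any
userinfo is dropped, percent-encoding is ignored, and a trailing slash is kept
as written.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default ports for the schemes of interest.
DEFAULT_PORTS = {"http": 80, "https": 443}


def remove_dot_segments(path: str) -> str:
    """Collapse "." and ".." segments of an absolute path."""
    segments = path.split("/")
    out = []
    for i, seg in enumerate(segments):
        last = i == len(segments) - 1
        if seg == ".":
            if last:
                out.append("")      # "/a/." keeps its trailing slash
            continue
        if seg == "..":
            if len(out) > 1:        # never climb above the root
                out.pop()
            if last:
                out.append("")      # "/a/.." becomes "/"
            continue
        out.append(seg)
    return "/".join(out)


def canonical_http_uri(uri: str) -> str:
    """Return one candidate canonical string for an absolute HTTP URL."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = remove_dot_segments(parts.path) or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))
```

An access check that compared canonical_http_uri(requested) against
canonical_http_uri(protected) would give the same answer to all three of
Stephen's queries, since each reduces to the same string. Lower-casing the
scheme and host also picks a single representative among the several
"canonical forms" his definition allows.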
Received on Tuesday, 19 March 2002 16:40:59 UTC