Re: "canonical" URIs from Joseph Reagle on 2002-02-19 (www-tag@w3.org from February 2002)

From: Joseph Reagle <reagle@w3.org>
Date: Tue, 19 Feb 2002 14:39:59 -0500
To: www-tag@w3.org
Cc: Phillip Hallam-Baker <pbaker@verisign.com>, xme <stephen.farrell@baltimore.ie>, Merlin Hughes <merlin@baltimore.ie>, duerst@w3.org
Message-Id: <200202191939.OAA10552@tux.w3.org>
Stephen has asked an interesting question below that I expect will be 
important to any activity that uses URIs as identifiers in the context of 
a semantic/security application: when are two URI variants considered 
identical?

My first impulse was to check the XML namespace spec, "[Definition:] URI 
references which identify namespaces are considered identical when they are 
exactly the same character-for-character." [a] 

[a] http://www.w3.org/TR/REC-xml-names/

However, this could benefit from further specificity. What about the 
following sort of issues?

  The URI attribute identifies a data object using a URI-Reference,
  as specified by RFC2396 [URI]. The set of allowed characters for 
  URI attributes is the same as for XML, namely [Unicode]. However,
  some Unicode characters are disallowed from URI references
  including all non-ASCII characters and the excluded characters
  listed in RFC2396 [URI, section 2.4]. However, the number sign (#),
  percent sign (%), and square bracket characters re-allowed in RFC 2732
  [URI-Literal] are permitted. Disallowed characters must be escaped as
  follows: ...
  http://www.w3.org/TR/2002/REC-xmldsig-core-20020212/#sec-URI

I spoke briefly with TimBL about the question. He enumerated many of the 
places one might look for equivalence in the "URI stack" *while* stating 
that one clearly wouldn't want to address all of these layers, given the 
complexity and processing required:
  URI spec
      string = string
  HTTP DNS
      W3.org = w3.org
  DNS LOOKUP
      www.w3.org   <-- CNAME --  w3.org
  HTTP REDIRECT
      /foo --REDIRECT--> /foo/
  RDF
      /foo = /bar

Consequently, character by character comparison is probably the most 
straightforward approach -- assuming one addresses the character encoding 
issues well. 
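
To make concrete what character-by-character comparison does and does not 
treat as identical, here is a minimal Python sketch. The function names and 
example strings are my own illustrative assumptions, not taken from any 
spec, and NFC normalization stands in for just one of the possible 
"character encoding issues":

```python
import unicodedata


def identical(a: str, b: str) -> bool:
    """Strict comparison: equal only if character-for-character the same."""
    return a == b


def identical_after_nfc(a: str, b: str) -> bool:
    """Same comparison, but after Unicode NFC normalization (an assumption:
    one way to keep composed/decomposed spellings from comparing unequal)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Variants from the "URI stack" list are all distinct under strict comparison.
print(identical("http://W3.org/", "http://w3.org/"))                    # False
print(identical("http://example.org/foo", "http://example.org/foo/"))   # False

# The same visible text, spelled with different code point sequences.
composed = "http://example.org/caf\u00e9"      # e-acute as one code point
decomposed = "http://example.org/cafe\u0301"   # 'e' plus combining accent
print(identical(composed, decomposed))            # False
print(identical_after_nfc(composed, decomposed))  # True
```

The point is only that strict comparison is cheap and predictable, but two 
strings a human would call "the same" can still differ unless the encoding 
question is settled first.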

Stephen is presently using "absolute URIs" with RFC2396 equivalence (see 
below). This seems fairly straightforward as well, though his text says 
"if a URI is partly or entirely case-insensitive ..." without settling the 
matter. I think it might be useful to specify whether case *is* relevant 
or not for that app. Any thoughts?
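
As a rough sketch of the kind of normalization Stephen's text (quoted 
below) seems to imply for HTTP URLs, something like the following Python 
might do. This is not his or SAML's algorithm; the names `canonicalize`, 
`remove_dot_segments`, and `DEFAULT_PORTS` are hypothetical, and the 
choices (lowercase scheme and host, drop the default port, collapse "." 
and ".." segments, leave path case alone) are my assumptions:

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: only these schemes have a default port worth dropping.
DEFAULT_PORTS = {"http": 80, "https": 443}


def remove_dot_segments(path: str) -> str:
    """Collapse '.' and '..' segments of an absolute path."""
    segments = path.split("/")
    out = []
    for i, seg in enumerate(segments):
        is_last = i == len(segments) - 1
        if seg in (".", ".."):
            # '..' drops the previous segment, but never the leading root.
            if seg == ".." and len(out) > 1:
                out.pop()
            if is_last:
                out.append("")  # keep the directory-style trailing slash
        else:
            out.append(seg)
    return "/".join(out)


def canonicalize(uri: str) -> str:
    """Return a normalized form of an absolute URI (sketch only).

    Limitations: userinfo is dropped, IPv6 literals are not handled,
    and the path's case is left untouched since it may be significant.
    """
    parts = urlsplit(uri)
    if not parts.scheme or not parts.netloc:
        raise ValueError("only absolute URIs are handled in this sketch")
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = remove_dot_segments(parts.path)
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


# The three questions from Stephen's scenario map to one string.
for q in ("http://foo.com/stuff",
          "HTTP://Foo.COM:80/stuff",
          "http://foo.com/otherstuff/../stuff"):
    print(canonicalize(q))
```

An access-control check would then compare the canonicalized strings 
character by character. Note that this sketch keeps a trailing slash 
("/go/to/" stays "/go/to/"), and whether that should be significant is 
exactly the sort of thing the app would need to state.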

Also, my broader question to the TAG is, does this seem like a worthwhile 
issue to address for all of our specifications? I expect the 
validation/augmentation of URIs of type anyURI in schema might also be 
relevant to this question, but I haven't thought about it too carefully.

[1] On Thursday 14 February 2002 06:01, Stephen Farrell wrote:
> ...
> The OASIS security committee's [1] SAML spec [2] is about access
> control. One of its messages is of the form "can fred see
> http://foo.com/stuff" with a minimal answer being "yes/no".
>
> Now, we're trying to figure a good way to tell implementors not
> to fall for the following scenario:
>
> Q: "can fred see http://foo.com/stuff" A: no
> Q: "can fred see HTTP://Foo.COM:80/stuff" A: no
> Q: "can fred see http://foo.com/otherstuff/../stuff" A: yes
>
> Which involves us in giving some guidance for a "canonical
> form" of URI, at least for URLs that are de-referencable via
> HTTP.
>
> My best bet so far is the following:
>
>    By the "canonical form" of a URI we mean an absolute URI (i.e. no
>    relative URIs) which is the shortest of all the equivalent URI
>    strings, where URI equivalence is defined according to [RFC2396].
>    For example, the URI "http://foo.com:80/go/../go/to/" is not in
>    canonical form, but "http://foo.com/go/to" is in canonical form.
>    Note that if a URI is partly or entirely case-insensitive, then
>    there will be more than one "canonical form" for that URI such
>    that a case sensitive matching rule would consider that the
>    strings differ (e.g. "HTTP://Foo.cOm/go/to" is "another" canonical
>    form of the URL above).
>
>
> Ta,
> Stephen.
>
> [1] http://www.oasis-open.org/committees/security/
> [2]
> http://www.oasis-open.org/committees/security/docs/draft-sstc-core-25.pdf
> [RFC2396] ftp://ftp.isi.edu/in-notes/rfc2396.txt

-- 

Joseph Reagle Jr.                 http://www.w3.org/People/Reagle/
W3C Policy Analyst                mailto:reagle@w3.org
IETF/W3C XML-Signature Co-Chair   http://www.w3.org/Signature/
W3C XML Encryption Chair          http://www.w3.org/Encryption/2001/
Received on Tuesday, 19 February 2002 14:40:06 UTC