Re: "canonical" URIs

Stephen has asked an interesting question below that I expect will be 
important to any activity that uses URIs as identifiers in the context of 
a semantic/security application: when are two URI variants considered 
equivalent? [1]

My first impulse was to check the XML namespace spec, which says: 
"[Definition:] URI references which identify namespaces are considered 
identical when they are exactly the same character-for-character." [a] 
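
As a minimal sketch of what that rule implies (Python; the function name 
and the example.org URIs are hypothetical, chosen only for illustration), 
strings that many people would call "the same URI" are not identical:

```python
def namespace_uris_identical(a: str, b: str) -> bool:
    """Identity in the namespace sense: exact character-for-character match."""
    return a == b


base = "http://example.org/ns"  # hypothetical namespace URI

# Same characters: identical.
print(namespace_uris_identical(base, "http://example.org/ns"))
# Scheme differs only in case: not identical under this rule.
print(namespace_uris_identical(base, "HTTP://example.org/ns"))
# "%6E" is an escaped "n": not identical under this rule either.
print(namespace_uris_identical(base, "http://example.org/%6Es"))
```

No unescaping, case folding, or resolution of relative references happens 
here; that is what makes the rule cheap and also what makes it strict.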


However, this could benefit from further specificity. What about issues 
of the following sort?

  The URI attribute identifies a data object using a URI-Reference,
  as specified by RFC2396 [URI]. The set of allowed characters for 
  URI attributes is the same as for XML, namely [Unicode]. However,
  some Unicode characters are disallowed from URI references
  including all non-ASCII characters and the excluded characters
  listed in RFC2396 [URI, section 2.4]. However, the number sign (#),
  percent sign (%), and square bracket characters re-allowed in RFC 2732
  [URI-Literal] are permitted. Disallowed characters must be escaped as
  follows: ...

I spoke to TimBL briefly about the question. He enumerated many of the 
places one might look for equivalence in the "URI stack" *while* stating 
that one clearly wouldn't want to address all these layers, given the 
complexity and processing required:
  URI spec
	string = string
	/foo --REDIRECT--> /foo/
	/foo = /bar

Consequently, character by character comparison is probably the most 
straightforward approach -- assuming one addresses the character encoding 
issues well. 
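
To illustrate the encoding caveat, here is a rough sketch (Python; the 
helper name and the example URI are hypothetical). It assumes the 
comparison is made on decoded characters rather than on raw bytes, and 
shows that Unicode normalization forms can still make "equal looking" 
strings differ:

```python
import unicodedata


def same_characters(a: bytes, a_encoding: str, b: bytes, b_encoding: str) -> bool:
    """Compare two byte strings character by character after decoding each."""
    return a.decode(a_encoding) == b.decode(b_encoding)


uri = "http://example.org/caf\u00e9"  # hypothetical, ends in e-acute

utf8_bytes = uri.encode("utf-8")
latin1_bytes = uri.encode("latin-1")

# The raw bytes differ, but the decoded characters are the same.
print(utf8_bytes == latin1_bytes)
print(same_characters(utf8_bytes, "utf-8", latin1_bytes, "latin-1"))

# The decomposed form (e + combining accent) is a different character
# sequence, so a plain comparison treats it as a different URI.
decomposed = unicodedata.normalize("NFD", uri)
print(decomposed == uri)
```

Whether an application should normalize before comparing is exactly the 
kind of thing a spec would need to state; the sketch does not take a 
position on it.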

Stephen is presently using "absolute URIs" with RFC2396 equivalence (see 
below). This seems fairly straightforward as well -- though it says, "if 
the URI is case insensitive ..." I think it might be useful to specify 
whether case *is* relevant or not for that app. Any thoughts?
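
For concreteness, one possible reading of that equivalence is sketched 
below (Python, using the standard urllib.parse module). The function name 
and the table of default ports are my own assumptions, not anything from 
Stephen's draft: the scheme and host are lowercased, a port equal to the 
scheme's default is dropped, and the path is left case sensitive.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default ports; an application would list the schemes it cares about.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_for_comparison(uri: str) -> str:
    """Lowercase scheme and host, drop a default port, keep the rest as-is.

    Sketch only: userinfo and IPv6 literals are not handled.
    """
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    return urlunsplit((scheme, netloc, parts.path, parts.query, parts.fragment))


print(normalize_for_comparison("HTTP://Foo.COM:80/stuff"))
print(normalize_for_comparison("http://foo.com:8080/stuff"))
print(normalize_for_comparison("http://foo.com/Stuff"))
```

With this reading, "HTTP://Foo.COM:80/stuff" and "http://foo.com/stuff" 
compare equal after normalization, while "/Stuff" and "/stuff" stay 
distinct.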

Also, my broader question to the TAG is: does this seem like a worthwhile 
issue to address for all of our specifications? I expect the 
validation/augmentation of URIs of type anyURI in schema might also be 
relevant to this question, but I haven't thought about it too carefully.

[1] On Thursday 14 February 2002 06:01, Stephen Farrell wrote:
> ...
> The OASIS security committee's [1] SAML spec [2] is about access
> control. One of its messages is of the form "can fred see
>" with a minimal answer being "yes/no".
> Now, we're trying to figure a good way to tell implementors not
> to fall for the following scenario:
> Q: "can fred see" A: no
> Q: "can fred see HTTP://Foo.COM:80/stuff" A: no
> Q: "can fred see" A: yes
> Which involves us in giving some guidance for a "canonical
> form" of URI, at least for the de-referencable via HTTP
> URLs.
> My best bet so far is the following:
>    By the "canonical form" of a URI we mean an absolute URI (i.e. no
>    relative URIs) which is the shortest of all the equivalent URI
>    strings, where URI equivalence is defined according to [RFC2396].
>    For example, the URI "" is not in
>    canonical form, but "" is in canonical form.
>    Note that if a URI is partly or entirely case-insensitive, then
>    there will be more than one "canonical form" for that URI such
>    that a case sensitive matching rule would consider that the
>    strings differ (e.g. "HTTP://Foo.cOm/go/to" is "another" canonical
>    form of the URL above).
> Ta,
> Stephen.
> [1]
> [2]
> [RFC2396]


Joseph Reagle Jr.       
W3C Policy Analyst      
IETF/W3C XML-Signature Co-Chair
W3C XML Encryption Chair

Received on Tuesday, 19 February 2002 14:40:06 UTC