URI Canonicalization

Notwithstanding the issue around IRI -> URI mapping just raised, I've 
been working on the URI Canonicalization section. Looking at the URI 
spec [1] and a Note prepared in 2001 by Mark Nottingham [2] (which 
inspired a lot of Jo Rabin's work on grouping in the XG [3]), I propose 
the following, which is an extension of what was discussed in Darmstadt:

==== BEGINS ===

Before any URI matching can take place, the following canonicalization 
rules, which are consistent with RFC3986 [URIS], should be applied to 
the candidate resource's URI.

Percent-encoding triplets should be converted into their respective 
characters (e.g. %3A should be converted to ':', %2F to '/', etc.). 
N.B. The hexadecimal digits are case-insensitive, so %3a and %3A are 
equivalent.
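
As a quick illustration of the decoding step, here is a minimal sketch 
in Python. It uses the standard library's urllib.parse.unquote as a 
stand-in; the proposed text does not prescribe any particular 
implementation, and the variable names are only for illustration.

```python
from urllib.parse import unquote

# The same URI with upper-case and lower-case hexadecimal digits.
upper = unquote("http%3A%2F%2Fwww.example.com%2Ffoo")
lower = unquote("http%3a%2f%2fwww.example.com%2ffoo")

print(upper)           # http://www.example.com/foo
print(upper == lower)  # True
```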

Any plus signs (+) should be replaced with blank spaces.

If the scheme is absent, default to http, i.e. www.example.com becomes 
http://www.example.com

The scheme and host are case insensitive but the canonical form of both 
is lower case. Therefore, these components in the candidate URI should 
be converted to lower case.

Trailing '.' characters in the host should be removed, i.e. 
http://www.example.com. becomes http://www.example.com

If the path is absent, a path of '/' must be appended to the host, i.e. 
http://www.example.com becomes http://www.example.com/

The following are all equivalent:
   http%3A%2F%2Fwww.example.com%2Ffoo
   HTTp%3a%2f%2fwww.Example.Com%2Ffoo
   http://www.example.com/foo

However, http://www.example.com/FOO is not equivalent since the path 
(FOO) is case sensitive.

==== ENDS ===

I've created a little script that does all this which you can try at [4].
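
For anyone who wants to experiment offline, the rules between BEGINS 
and ENDS can be approximated with the Python sketch below. This is not 
the script at [4]. The function name canonicalize is hypothetical, the 
steps are applied in the order the rules are listed, and the whole 
authority is treated as the host (so userinfo and ports are assumed to 
be absent).

```python
import re
from urllib.parse import unquote

# A scheme followed by "://" at the very start of the string.
_SCHEME = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*)://")


def canonicalize(uri: str) -> str:
    """Apply the proposed canonicalization rules to a candidate URI.

    Simplifying assumption: no userinfo or port in the authority.
    """
    # Decode percent-encoded triplets (hex digits are case-insensitive).
    uri = unquote(uri)

    # Replace plus signs with spaces. Because this runs after decoding,
    # a '+' that was written as %2B is replaced as well.
    uri = uri.replace("+", " ")

    # Default to http when no scheme is present.
    match = _SCHEME.match(uri)
    if match is None:
        uri = "http://" + uri
        match = _SCHEME.match(uri)
    scheme = match.group(1).lower()
    rest = uri[match.end():]

    # The host ends at the first '/', '?' or '#', whichever comes first.
    cut = len(rest)
    for sep in "/?#":
        pos = rest.find(sep)
        if pos != -1:
            cut = min(cut, pos)
    host, tail = rest[:cut], rest[cut:]

    # Lower-case the host and drop trailing '.' characters.
    host = host.lower().rstrip(".")

    # Make sure there is a path, leaving its case untouched.
    if not tail.startswith("/"):
        tail = "/" + tail

    return scheme + "://" + host + tail


print(canonicalize("HTTp%3a%2f%2fwww.Example.Com%2Ffoo"))
# http://www.example.com/foo
print(canonicalize("www.example.com"))
# http://www.example.com/
```

With this sketch the three URIs listed as equivalent above all come 
out as http://www.example.com/foo, while http://www.example.com/FOO 
is returned unchanged and so does not match them.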

Cheers

Phil.

[1] http://www.gbiv.com/protocols/uri/rfc/rfc3986.html
[2] http://www.w3.org/2005/Incubator/wcl/matching.html
[3] http://www.w3.org/2005/Incubator/wcl/matching.html
[4] http://www.fosi.org/test/uriparse/

Received on Thursday, 3 May 2007 13:56:00 UTC