- From: Phil Archer <parcher@icra.org>
- Date: Thu, 03 May 2007 14:55:53 +0100
- To: Public POWDER <public-powderwg@w3.org>
Notwithstanding the issue around IRI -> URI mapping just raised, I've been working on the URI Canonicalization section. Looking at the URI spec [1] and a Note prepared in 2001 by Mark Nottingham [2] (that inspired a lot of Jo Rabin's work on grouping in the XG [3]), I propose the following, which is an extension of what was discussed in Darmstadt:

==== Begins ===

Before any URI matching can take place, the following canonicalization rules, which are consistent with RFC3986 [URIS], should be applied to the candidate resource's URI.

- Percent-encoding triplets should be converted into their respective characters (e.g. %3A should be converted to :, %2F to / etc.). N.B. The hexadecimal digits are case-insensitive.
- Any plus signs (+) should be replaced with a blank space.
- If the scheme is absent, default to http, i.e. www.example.com becomes http://www.example.com
- The scheme and host are case-insensitive, but the canonical form of both is lower case. Therefore, these components in the candidate URI should be converted to lower case.
- Trailing '.' characters in the host should be removed, i.e. http://www.example.com. becomes http://www.example.com
- If the path is absent, a path of '/' must be appended to the host, i.e. http://www.example.com becomes http://www.example.com/

The following are all equivalent:

http%3A%2F%2Fwww.example.com%2Ffoo
HTTp%3a%2f%2fwww.Example.Com%2Ffoo
http://www.example.com/foo

However, http://www.example.com/FOO is not equivalent, since the path (/FOO) is case-sensitive.

==== ENDS ===

I've created a little script that does all this, which you can try at [4].

Cheers

Phil.

[1] http://www.gbiv.com/protocols/uri/rfc/rfc3986.html
[2] http://www.w3.org/2005/Incubator/wcl/matching.html
[3] http://www.w3.org/2005/Incubator/wcl/matching.html
[4] http://www.fosi.org/test/uriparse/
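
For illustration only, the proposed canonicalization rules could be sketched in Python roughly as follows. This is not the script referenced at [4]; the function name canonicalize is hypothetical, and the sketch assumes the rules are applied in the order they are listed (so percent-decoding happens before the URI is split into components) and that everything between "://" and the first '/', '?' or '#' is treated as the host.

```python
import re
from urllib.parse import unquote

# A scheme followed by "://". Plus signs are not allowed for here because
# step 2 has already turned them into spaces by the time this is used.
_SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9.\-]*)://")


def canonicalize(uri: str) -> str:
    """Apply the proposed rules, in listed order, to a candidate URI."""
    # 1. Convert percent-encoded triplets to characters
    #    (unquote accepts upper- and lower-case hex digits).
    uri = unquote(uri)

    # 2. Replace any plus signs with a blank space.
    uri = uri.replace("+", " ")

    # 3. If the scheme is absent, default to http.
    match = _SCHEME_RE.match(uri)
    if match:
        scheme = match.group(1)
        rest = uri[match.end():]
    else:
        scheme, rest = "http", uri

    # Split the host from whatever follows it (path, query or fragment).
    # Assumption: the host ends at the first '/', '?' or '#'.
    end = len(rest)
    for sep in "/?#":
        pos = rest.find(sep)
        if pos != -1:
            end = min(end, pos)
    host, tail = rest[:end], rest[end:]

    # 4. Scheme and host are converted to lower case.
    # 5. Trailing '.' characters in the host are removed.
    scheme = scheme.lower()
    host = host.lower().rstrip(".")

    # 6. If the path is absent, a path of '/' is appended to the host.
    if not tail.startswith("/"):
        tail = "/" + tail

    return f"{scheme}://{host}{tail}"


if __name__ == "__main__":
    for candidate in (
        "http%3A%2F%2Fwww.example.com%2Ffoo",
        "HTTp%3a%2f%2fwww.Example.Com%2Ffoo",
        "http://www.example.com/foo",
        "http://www.example.com/FOO",
    ):
        print(candidate, "->", canonicalize(candidate))
```

Under these assumptions the three "equivalent" examples all come out as http://www.example.com/foo, while http://www.example.com/FOO keeps its upper-case path and so remains distinct. Note that, because decoding comes first in the listed order, a %2B in the input would be decoded to '+' and then replaced with a space; whether that is intended is not stated in the proposal.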
Received on Thursday, 3 May 2007 13:56:00 UTC