Cwm doesn't canonicalize ' ' and %20 in URIs

I tracked a bug in the processing of GPS data and photos down to
the fact that cwm was holding two separate nodes in the store,
whose URIs differed only in using ' ' versus %20.

It turns out that some systems will just let spaces through and others
will properly escape them in URIs.  In IRIs, many characters are allowed
but are declared equivalent to their UTF-8 hex-encoded counterparts.

I made two test cases for spaces.

One test case is

	$ cat space-in-uri.n3
	# See what a parser does with (a) a space and (b) an encoded space
	
	@prefix : <http://example.com/baz#>.
	
	<http://example.com/foo bar> a :C.
	<http://example.com/foo%20bar> a :D.

	#ends

for which, currently,

     cwm http://www.w3.org/2000/10/swap/test/syntax/space-in-uri.n3

gives:

      @prefix : <http://example.com/baz#> .

     <http://example.com/foo%20bar>     a :C .

     <http://example.com/foo%20bar>     a :D .

Note that canonicalization has been done on output, so piping the
data through cwm twice gives the expected:

     <http://example.com/foo%20bar>     a :C,
                 :D .


There is another test case

	$ cat space-in-uri-rdf.rdf
	<rdf:RDF xmlns="http://example.com/baz#"
	    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

	    <C rdf:about="http://example.com/foo bar">
	    </C>

	    <D rdf:about="http://example.com/foo%20bar">
	    </D>
	</rdf:RDF>

which gives the same results.

I think that cwm should do canonicalization of URIs when making
internal symbols.  This means that all IRIs (including URIs) should
be stored as canonical URIs.

See http://www.ietf.org/rfc/rfc3986.txt esp. 2.1 and 2.4
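
A minimal sketch of the kind of canonicalization meant here, in Python.
This is not cwm's actual code: the name canonical_uri is made up for
illustration, and it assumes the rule is simply (a) percent-encode, as
UTF-8, any character that may not appear literally in a URI, (b) write
the hex digits of escapes in upper case, and (c) decode escapes of
unreserved characters, which RFC 3986 says are equivalent to the bare
character.

```python
import re

# Unreserved characters, RFC 3986 section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# Reserved characters, RFC 3986 section 2.2.
RESERVED = set(":/?#[]@!$&'()*+,;=")

_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _fix_escape(match):
    """Decode an escaped unreserved character; otherwise upper-case the hex."""
    ch = chr(int(match.group(1), 16))
    if ch in UNRESERVED:
        return ch
    return "%" + match.group(1).upper()


def canonical_uri(ref):
    """Return one canonical spelling for a URI or IRI string (hypothetical helper)."""
    out = []
    for ch in ref:
        if ch in UNRESERVED or ch in RESERVED or ch == "%":
            out.append(ch)
        else:
            # Space, non-ASCII, etc.: escape each UTF-8 byte as %XX.
            out.extend("%%%02X" % b for b in ch.encode("utf-8"))
    # Normalize escapes that were already present in the input.
    return _PCT.sub(_fix_escape, "".join(out))
```

If the store interned symbols under canonical_uri(ref) rather than the
raw string, the two subjects in the test cases above would map to the
same key, so a single node would be created.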

You can call it "URI-entailment" if you like.

Tim

Received on Saturday, 17 June 2006 19:02:43 UTC