Cwm doesn't canonicalize ' ' and %20 in URIs

I tracked a bug in the processing of GPS data and photos down to
the fact that cwm was holding two separate nodes in the store,
which differed only in that one had ' ' where the other had %20.

It turns out that some systems will just let spaces through and others
will properly escape them in URIs.  In IRIs, many things are allowed
but are declared equivalent to their UTF-8 hex-encoded counterparts.
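To make that equivalence concrete, here is a minimal Python sketch of the
mapping. The function name iri_to_uri and the example.org addresses are
made up for illustration, and this is not how cwm is written; it only
assumes that a raw space or non-ASCII character should become the
percent-encoded form of its UTF-8 bytes, while existing %xx escapes are
left alone.

```python
from urllib.parse import quote

# Characters that may appear literally in a URI (RFC 3986 reserved and
# unreserved punctuation), plus "%" so existing escapes are not re-encoded.
SAFE = "%:/?#[]@!$&'()*+,;=-._~"

def iri_to_uri(iri):
    """Hypothetical helper: percent-encode spaces and non-ASCII as UTF-8."""
    return quote(iri, safe=SAFE)

# Both spellings end up as the same string.
print(iri_to_uri("http://example.org/foo bar"))
print(iri_to_uri("http://example.org/foo%20bar"))
```

With this mapping, the ' ' form and the %20 form of an address become the
same string before anything is compared.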

I made two test cases for spaces.

One test case is

	$ cat space-in-uri.n3
	# See what a parser does with (a) a space and (b) an encoded space
	@prefix : <>.
	< bar> a :C.
	<> a :D.


where currently


which gives:

      @prefix : <> .

     <>     a :C .

     <>     a :D .

Note that canonicalization has been done on output, so actually piping
it through cwm twice gives the expected:

     <>     a :C,
                 :D .

There is another test case

	$ cat space-in-uri-rdf.rdf
	<rdf:RDF xmlns=""

	    <C rdf:about=" bar">

	    <D rdf:about="">

which gives the same results.

I think that cwm should do canonicalization of URIs when making
internal symbols.  This means that all IRIs (including URIs) should
be stored as canonical URIs.

See esp. 2.1 and 2.4
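
As a rough sketch of what canonicalizing at symbol-creation time could
look like (the SymbolTable and Symbol classes here are hypothetical
stand-ins, not cwm's actual store API), the table below keys every symbol
on its canonical URI, so the two spellings resolve to one node:

```python
from urllib.parse import quote

SAFE = "%:/?#[]@!$&'()*+,;=-._~"  # URI punctuation plus "%" for existing escapes

class Symbol:
    """Hypothetical node type; stands in for whatever the store uses."""
    def __init__(self, uri):
        self.uri = uri

class SymbolTable:
    """Hypothetical symbol table that interns symbols by canonical URI."""
    def __init__(self):
        self._symbols = {}

    def intern(self, uri):
        # Canonicalize first (assumption: spaces and non-ASCII characters
        # become UTF-8 percent-escapes), then look up or create the symbol.
        key = quote(uri, safe=SAFE)
        if key not in self._symbols:
            self._symbols[key] = Symbol(key)
        return self._symbols[key]

    def __len__(self):
        return len(self._symbols)

table = SymbolTable()
a = table.intern("http://example.org/ bar")
b = table.intern("http://example.org/%20bar")
print(a is b, len(table))
```

This sketch does not normalize the case of hex digits in existing escapes
(%2f versus %2F) or handle any other equivalences; a real fix would need
to decide how far to go.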

You can call it "URI-entailment" if you like.


Received on Saturday, 17 June 2006 19:02:43 UTC