Matching URIs in RDF (with SPARQL) from Dan Brickley on 2006-07-01 (public-xg-wcl@w3.org from July 2006)

From: Dan Brickley <danbri@danbri.org>
Date: Sat, 01 Jul 2006 12:03:36 +0100
To: public-xg-wcl@w3.org
Cc: timbl@w3.org, connolly@w3.org
Message-ID: <44A65688.7070905@danbri.org>
(Am sending this to the XG's public list, bcc:'d to the member one. 
We're all on both, right? It's a good discussion to have in public...)


OK, some progress, based on the regex from Jo's doc. Rough notes from the 
SW Interest Group IRC channel, where I got some help putting this 
together. I've got a quick perl script that generates an RDF description 
of each entry in a list of URIs, and a SPARQL query plus various filters 
which match against some/all of these URIs. It uses a fictional 
namespace in http://www.w3.org/2004/12/q/ which reminds me to 
investigate whether I still have write-access there, and if we can use 
it for the XG.

Am Cc:'ing TimBL and DanC who may be interested. Tim, Dan --- this work 
is motivated by a desire to attach RDF descriptive labels to collections 
of documents picked out either by enumeration or by patterns expressed 
against URIs/IRIs. Jo Rabin's doc at 
http://www.w3.org/2005/Incubator/wcl/matching.html has more background.
There's some related work from OpenSearch folks at 
http://www.snellspace.com/wp/?p=369 that we're loosely connected to via 
Elias Torres in #swig.


For today's hack, see 
http://swig.xmlhack.com/2006/07/01/2006-07-01.html#1151749799.081592

Perl script:       http://spypixel.com/2006/wcl/uri/uri-pl-source.txt
List of URIs:      http://spypixel.com/2006/wcl/uri/sites.txt
Generated RDF:     http://spypixel.com/2006/wcl/uri/_data.rdf

example:
<ID xmlns='http://www.w3.org/2004/12/q/idsyntax#'>
   <full>http://nobody:nothing@127.0.0.1:8080/dot/slash/dot?foo=bar;x=y</full>
   <nameFor rdf:resource='http://nobody:nothing@127.0.0.1:8080/dot/slash/dot?foo=bar;x=y'/>
   <scheme>http</scheme>
   <authority>nobody:nothing@127.0.0.1:8080</authority>
   <userinfo>nobody:nothing</userinfo>
   <host>127.0.0.1</host>
   <port>8080</port>
   <path>/dot/slash/dot</path>
   <query>foo=bar;x=y</query>
</ID>
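
For readers without the perl script to hand, the decomposition in the example above can be approximated in a few lines. This is a minimal Python sketch, not the linked script: it assumes the generic regular expression from RFC 3986 Appendix B, and `split_uri` and its dictionary keys are illustrative names chosen to mirror the property names in the example.

```python
import re

# Generic URI-splitting regex from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Split a URI into the components used as properties in the example.

    Hypothetical helper: components that are absent are left out of the
    result, much as the RDF simply omits the property.
    """
    scheme, authority, path, query, fragment = URI_RE.match(uri).group(2, 4, 5, 7, 9)
    parts = {"full": uri, "scheme": scheme, "authority": authority,
             "path": path, "query": query, "fragment": fragment}
    if authority is not None:
        userinfo, at, hostport = authority.rpartition("@")
        if at:
            parts["userinfo"] = userinfo
        # A trailing ":digits" is the port; bracketed IPv6 hosts are left intact.
        m = re.match(r"^(.*):(\d*)$", hostport)
        if m:
            parts["host"], parts["port"] = m.group(1), m.group(2)
        else:
            parts["host"] = hostport
    return {k: v for k, v in parts.items() if v is not None}

example = split_uri("http://nobody:nothing@127.0.0.1:8080/dot/slash/dot?foo=bar;x=y")
```

Note that the scheme is returned exactly as written, so an input such as HTTP://example.caps.example.org/ yields "HTTP"; any case normalisation would have to happen either here or in the query.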

Example SPARQL:    http://spypixel.com/2006/wcl/uri/filter-test2.rq
(this runs OK in Jena/ARQ eg through the Twinkle GUI)

Here's the SPARQL example in full. Basically we match the URI 
descriptions, and then filter the component strings using the
query language's FILTER functionality: regexes, boolean and/or 
combinations, and exact matching with "=". The lines starting with a # 
are commented out. Note that there are some cases here we'll want for 
testing, e.g. the case of the URI scheme (hTtp: etc) could easily trip us up.

PREFIX u: <http://www.w3.org/2004/12/q/idsyntax#>
SELECT DISTINCT *
WHERE {
   ?id a u:ID .
   ?id u:full ?full .
   ?id u:nameFor ?res .
   ?id u:scheme ?scheme .
   ?id u:authority ?authority .
   OPTIONAL { ?id u:userinfo ?userinfo } .
   OPTIONAL { ?id u:host ?host } .
   OPTIONAL { ?id u:port ?port } .
   OPTIONAL { ?id u:path ?path } .
   OPTIONAL { ?id u:query ?query } .
   OPTIONAL { ?id u:fragment ?fragment } .
#  FILTER regex ( ?scheme, "http" ) . # schemes matching "http", i.e. includes https:
#  FILTER regex ( ?scheme, "^http$" ) . # http: scheme
#  FILTER regex ( ?scheme, "^HTTP$" ) . # HTTP: scheme (do we normalise in the regex or the rdf?)
#  FILTER regex ( ?scheme, "^http$", "i" ) . # http: scheme, case insensitive (more robust)
#  FILTER ( regex(?scheme, "^http$", "i") && ( (?port = "8080") || (?port = "1234") ) ) .
#  FILTER regex(?userinfo, ":") . # password is given in the URI
   FILTER regex(?host, "^pics|www\\.pics") .
}
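
To make the behaviour of the filters above concrete, here is a rough Python emulation run over a few made-up rows. It is only a sketch: Python's `re` is not the XPath/XQuery regex flavour that SPARQL specifies, `sparql_regex` is a hypothetical helper, and the rows are invented for illustration rather than taken from sites.txt.

```python
import re

def sparql_regex(text, pattern, flags=""):
    """Approximate SPARQL regex(): unanchored search with an optional "i" flag."""
    if text is None:  # unbound variable: the FILTER rejects the row
        return False
    return re.search(pattern, text, re.IGNORECASE if "i" in flags else 0) is not None

# Invented rows standing in for query solutions (None means unbound).
rows = [
    {"scheme": "http",  "host": "pics.example.org",         "port": "8080"},
    {"scheme": "https", "host": "www.pics.example.org",     "port": None},
    {"scheme": "HTTP",  "host": "example.caps.example.org", "port": None},
    {"scheme": "hTtp",  "host": "my.www.pics.example.org",  "port": "1234"},
]

def select(predicate):
    """Return the hosts of the rows that pass the given filter."""
    return [r["host"] for r in rows if predicate(r)]

# Unanchored: "http" also matches "https", but not "HTTP" or "hTtp".
loose = select(lambda r: sparql_regex(r["scheme"], "http"))
# Anchored and case-sensitive: only the lowercase "http" row.
exact = select(lambda r: sparql_regex(r["scheme"], "^http$"))
# Anchored, case-insensitive: "http", "HTTP" and "hTtp".
nocase = select(lambda r: sparql_regex(r["scheme"], "^http$", "i"))
# Scheme test combined with an either/or test on the port string.
ports = select(lambda r: sparql_regex(r["scheme"], "^http$", "i")
               and r["port"] in ("8080", "1234"))
# "^" binds only to the first alternative, so "www.pics" matches anywhere.
pics = select(lambda r: sparql_regex(r["host"], r"^pics|www\.pics"))
```

In this emulation the last filter also accepts my.www.pics.example.org, because the alternation has lower precedence than the anchor; a pattern such as "^(pics|www\.pics)" would be needed if both alternatives are meant to match only at the start of the host.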



Easiest way to play with this is to download and run Twinkle from 
http://www.ldodds.com/projects/twinkle/ and use 
http://spypixel.com/2006/wcl/uri/_data.rdf as the data URI.

I've not got it running against the online Redland SPARQL query 
installation yet, will ask Dave Beckett where the problem is.

There are a few more comprehensive collections of 'tricky' URIs around. 
I'm not sure of the exact status of any URI test suite but have collected 
up some links at the bottom of the perl script, reproduced here.

http://www.w3.org/Addressing/url_test/url_grammar.tests
http://www.ninebynine.org/Software/HaskellUtils/Network/URITestDescriptions.html
http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html

I've not investigated the IRI side yet, nor taken any care with charset 
issues (either in the data, or the perl/regex).

Next steps in the XG? It would be great if someone could try 
re-expressing the contents of 
http://www.w3.org/2005/Incubator/wcl/matching.html or Phil's recent message 
http://lists.w3.org/Archives/Member/member-xg-wcl/2006Jun/0079.html 
(member-only link) using SPARQL filters plus this vocab. For those of us 
who prefer to do things with XML, I wonder whether the XML resultset 
format that SPARQL returns would be an acceptable compromise. If we run 
the above SPARQL query without any filters, it returns the following XML 
structure --- http://spypixel.com/2006/wcl/uri/_eg_results.txt

ie. markup like this:

     <result>
       <id bnodeid="b0"/>
       <full>HTTP://example.caps.example.org/</full>
       <res uri="HTTP://example.caps.example.org/"/>
       <scheme>HTTP</scheme>
       <authority>example.caps.example.org</authority>
       <userinfo bound="false"/>
       <host>example.caps.example.org</host>
       <port bound="false"/>
       <path>/</path>
       <query bound="false"/>
       <fragment bound="false"/>
     </result>

...for each result. Am thinking out loud here, not yet quite sure how 
all these ingredients fit together. And that's without even considering 
OWL, RIF etc. :)
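
For anyone wanting to consume that result markup from a script, a minimal Python sketch follows. It assumes exactly the element and attribute shapes shown in the sample above (`uri`, `bnodeid`, `bound="false"`) and has not been run against the full results file; `result_to_dict` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# The single <result> element quoted in the message.
SAMPLE = """<result>
  <id bnodeid="b0"/>
  <full>HTTP://example.caps.example.org/</full>
  <res uri="HTTP://example.caps.example.org/"/>
  <scheme>HTTP</scheme>
  <authority>example.caps.example.org</authority>
  <userinfo bound="false"/>
  <host>example.caps.example.org</host>
  <port bound="false"/>
  <path>/</path>
  <query bound="false"/>
  <fragment bound="false"/>
</result>"""

def result_to_dict(result_xml):
    """Turn one <result> element into a dict, skipping unbound variables."""
    row = {}
    for child in ET.fromstring(result_xml):
        if child.get("bound") == "false":
            continue  # variable was not bound in this solution
        if child.get("uri") is not None:
            row[child.tag] = child.get("uri")  # URI-valued binding
        elif child.get("bnodeid") is not None:
            row[child.tag] = "_:" + child.get("bnodeid")  # blank node label
        else:
            row[child.tag] = child.text or ""  # literal value
    return row

row = result_to_dict(SAMPLE)
```

Dropping the unbound variables keeps each row to the components that were actually present for that URI, which mirrors what the OPTIONAL clauses in the query do.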

cheers,

Dan
Received on Saturday, 1 July 2006 11:03:59 UTC