[Fwd: Matching URIs in RDF (with SPARQL)] from Shadi Abou-Zahra on 2007-01-15 (public-wai-ert@w3.org from January 2007)

From: Shadi Abou-Zahra <shadi@w3.org>
Date: Mon, 15 Jan 2007 14:14:16 +0100
To: public-wai-ert@w3.org
Message-ID: <45AB7E28.4000303@w3.org>
-------- Original Message --------
Subject: Matching URIs in RDF (with SPARQL)
Resent-Date: Sat, 01 Jul 2006 11:04:01 +0000
Resent-From: public-xg-wcl@w3.org
Date: Sat, 01 Jul 2006 12:03:36 +0100
From: Dan Brickley <danbri@danbri.org>
Reply-To: danbri@danbri.org
To: public-xg-wcl@w3.org
CC: timbl@w3.org, connolly@w3.org


(Am sending this to the XG's public list, bcc:'d to the member one.
We're all on both, right? It's a good discussion to have in public...)


OK Some progress, based on the regex from Jo's doc. Rough notes from the
SW Interest Group IRC channel, where I got some help putting this
together. I've got a quick perl script that generates an RDF description
of each entry in a list of URIs, and a SPARQL query plus various filters
which match against some/all of these URIs. It uses a fictional
namespace in http://www.w3.org/2004/12/q/ which reminds me to
investigate whether I still have write-access there, and if we can use
it for the XG.

Am Cc:'ing TimBL and DanC who may be interested. Tim, Dan --- this work
is motivated by a desire to attach RDF descriptive labels to collections
of documents picked out either by enumeration or by patterns expressed
against URIs/IRIs. Jo Rabin's doc at
http://www.w3.org/2005/Incubator/wcl/matching.html has more background.
There's some related work from OpenSearch folks at
http://www.snellspace.com/wp/?p=369 that we're loosely connected to via
Elias Torres in #swig.


For today's hack, see
http://swig.xmlhack.com/2006/07/01/2006-07-01.html#1151749799.081592

Perl script:       http://spypixel.com/2006/wcl/uri/uri-pl-source.txt
List of URIs:      http://spypixel.com/2006/wcl/uri/sites.txt
Generated RDF:     http://spypixel.com/2006/wcl/uri/_data.rdf

example:
<ID xmlns='http://www.w3.org/2004/12/q/idsyntax#'>
   <full>http://nobody:nothing@127.0.0.1:8080/dot/slash/dot?foo=bar;x=y</full>
   <nameFor rdf:resource='http://nobody:nothing@127.0.0.1:8080/dot/slash/dot?foo=bar;x=y'/>
   <scheme>http</scheme>
   <authority>nobody:nothing@127.0.0.1:8080</authority>
   <userinfo>nobody:nothing</userinfo>
   <host>127.0.0.1</host>
   <port>8080</port>
   <path>/dot/slash/dot</path>
   <query>foo=bar;x=y</query>
</ID>
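
To make the shape of these descriptions concrete, here is a minimal
Python sketch of the kind of decomposition shown above. It is not the
perl script linked earlier. It assumes the generic regular expression
from RFC 3986 Appendix B, and the function name split_uri and the rules
for splitting the authority are illustrative assumptions. Case is
deliberately left untouched, so a scheme written as HTTP stays HTTP.

```python
import re

# Generic URI regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')


def split_uri(uri):
    """Split a URI into the properties used in the example description.

    Components that are absent from the URI are left out of the result.
    No case normalisation is done.
    """
    m = URI_RE.match(uri)
    parts = {
        "full": uri,
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }
    authority = parts["authority"]
    if authority is not None:
        # Assumption: userinfo is everything before the last "@".
        userinfo, at, hostport = authority.rpartition("@")
        if at:
            parts["userinfo"] = userinfo
        # Assumption: a trailing ":digits" is the port.
        host, colon, port = hostport.rpartition(":")
        if colon and port.isdigit():
            parts["host"], parts["port"] = host, port
        else:
            parts["host"] = hostport
    return {k: v for k, v in parts.items() if v is not None}


if __name__ == "__main__":
    uri = "http://nobody:nothing@127.0.0.1:8080/dot/slash/dot?foo=bar;x=y"
    for name, value in split_uri(uri).items():
        print(name, "=", value)
```

Leaving absent components out of the result mirrors the use of OPTIONAL
in the query: a URI with no port simply has no port property.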

Example SPARQL:    http://spypixel.com/2006/wcl/uri/filter-test2.rq
(this runs OK in Jena/ARQ eg through the Twinkle GUI)

Here's the SPARQL example in full. Basically we match the URI
descriptions, and then filter against the various strings using the
query language's FILTER functionality, in particular regexes, and/or
combinations, and exact matching with "=". The lines with a # are commented
out. Note that there are some cases here we'll want for testing, eg.
case of the URI scheme (hTtp: etc) could easily trip us up.

PREFIX u: <http://www.w3.org/2004/12/q/idsyntax#>
SELECT DISTINCT *
WHERE {
   ?id a u:ID .
   ?id u:full ?full .
   ?id u:nameFor ?res .
   ?id u:scheme ?scheme .
   ?id u:authority ?authority .
   OPTIONAL { ?id u:userinfo ?userinfo } .
   OPTIONAL { ?id u:host ?host } .
   OPTIONAL { ?id u:port ?port } .
   OPTIONAL { ?id u:path ?path } .
   OPTIONAL { ?id u:query ?query } .
   OPTIONAL { ?id u:fragment ?fragment } .
#  FILTER regex ( ?scheme, "http" ) . # schemes matching "http", ie. includes https:
#  FILTER regex ( ?scheme, "^http$" ) . # http: scheme
#  FILTER regex ( ?scheme, "^HTTP$" ) . # HTTP: scheme (do we normalise in the regex or the rdf?)
#  FILTER regex ( ?scheme, "^http$", "i" ) . # http: scheme, case insensitive (more robust)
#  FILTER ( regex(?scheme, "^http$", "i") && ( (?port = "8080") || (?port = "1234") ) ) .
#  FILTER regex(?userinfo, ":") . # password is given in the URI
   FILTER regex(?host, "^pics|www\\.pics") . # host starts with "pics" or contains "www.pics"
}
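
To see which rows each of those filters would let through, the same
tests can be approximated outside a SPARQL engine. The following is a
rough Python sketch, not what Jena/ARQ does internally: the helper
sparql_regex, the rows list and the third row's hostname are made up
for illustration, and Python's re module is not the regex dialect SPARQL
specifies, though for patterns this simple the two should agree.

```python
import re


def sparql_regex(text, pattern, flags=""):
    """Approximate SPARQL regex(): an unanchored search, optional "i" flag."""
    if text is None:
        # Unbound variable: the filter raises an error, so the row is dropped.
        return False
    return re.search(pattern, text, re.I if "i" in flags else 0) is not None


# Hypothetical rows, shaped like the bindings the query returns.
rows = [
    {"scheme": "http", "host": "127.0.0.1", "port": "8080",
     "userinfo": "nobody:nothing"},
    {"scheme": "HTTP", "host": "example.caps.example.org"},
    {"scheme": "https", "host": "www.pics.example.org"},  # made-up host
]


def matching(pred):
    """Return the indexes of the rows that pass the given filter."""
    return [i for i, row in enumerate(rows) if pred(row)]


# One entry per FILTER line in the query above, in the same order.
loose = matching(lambda r: sparql_regex(r.get("scheme"), "http"))
exact = matching(lambda r: sparql_regex(r.get("scheme"), "^http$"))
exact_upper = matching(lambda r: sparql_regex(r.get("scheme"), "^HTTP$"))
exact_nocase = matching(lambda r: sparql_regex(r.get("scheme"), "^http$", "i"))
with_port = matching(
    lambda r: sparql_regex(r.get("scheme"), "^http$", "i")
    and r.get("port") in ("8080", "1234"))
has_password = matching(lambda r: sparql_regex(r.get("userinfo"), ":"))
pics_host = matching(lambda r: sparql_regex(r.get("host"), r"^pics|www\.pics"))

if __name__ == "__main__":
    print(loose, exact, exact_upper, exact_nocase,
          with_port, has_password, pics_host)
```

The unanchored "http" pattern lets https through but misses the
upper-case HTTP row, which is the scheme-case trap mentioned above.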



Easiest way to play with this is to download and run Twinkle from
http://www.ldodds.com/projects/twinkle/ and use
http://spypixel.com/2006/wcl/uri/_data.rdf as the data URI.

I've not got it running against the online Redland SPARQL query
installation yet, will ask Dave Beckett where the problem is.

There are a few more comprehensive collections of 'tricky' URIs around.
I'm not sure of the exact status of any URI test suite, but have collected
some links at the bottom of the perl script, reproduced here.

http://www.w3.org/Addressing/url_test/url_grammar.tests
http://www.ninebynine.org/Software/HaskellUtils/Network/URITestDescriptions.html
http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html

I've not investigated the IRI side yet, nor taken any care with charset
issues (either in the data, or the perl/regex).

Next steps in the XG? It would be great if someone could try
re-expressing the contents of
http://www.w3.org/2005/Incubator/wcl/matching.html or Phil's recent msg
http://lists.w3.org/Archives/Member/member-xg-wcl/2006Jun/0079.html
(member-only link) using SPARQL filters plus this vocab. For those of us
who prefer to do things with XML, I wonder whether the XML resultset
format that SPARQL returns would be an acceptable compromise. If we run
the above SPARQL query without any filters, it returns the following XML
structure --- http://spypixel.com/2006/wcl/uri/_eg_results.txt

ie. markup like this:

     <result>
       <id bnodeid="b0"/>
       <full>HTTP://example.caps.example.org/</full>
       <res uri="HTTP://example.caps.example.org/"/>
       <scheme>HTTP</scheme>
       <authority>example.caps.example.org</authority>
       <userinfo bound="false"/>
       <host>example.caps.example.org</host>
       <port bound="false"/>
       <path>/</path>
       <query bound="false"/>
       <fragment bound="false"/>
     </result>

...for each result. Am thinking out loud here, not yet quite sure how
all these ingredients fit together. And that's without even considering
OWL, RIF etc. :)
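
As a rough illustration of how XML-minded tools might consume result
markup like the above, here is a small Python sketch using only the
standard library. The element and attribute names (bound, uri, bnodeid)
are taken from the sample result; the function name result_to_dict and
the "_:" prefix for blank nodes are illustrative assumptions, and a real
result document may well wrap these elements in a namespace and an
outer container that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# The sample result from above, reproduced as a standalone fragment.
RESULT = """<result>
  <id bnodeid="b0"/>
  <full>HTTP://example.caps.example.org/</full>
  <res uri="HTTP://example.caps.example.org/"/>
  <scheme>HTTP</scheme>
  <authority>example.caps.example.org</authority>
  <userinfo bound="false"/>
  <host>example.caps.example.org</host>
  <port bound="false"/>
  <path>/</path>
  <query bound="false"/>
  <fragment bound="false"/>
</result>"""


def result_to_dict(result_elem):
    """Turn one <result> element into a dict, skipping unbound variables."""
    row = {}
    for child in result_elem:
        if child.get("bound") == "false":
            continue  # variable had no value in this result
        if child.get("uri") is not None:
            row[child.tag] = child.get("uri")
        elif child.get("bnodeid") is not None:
            # Assumption: mark blank nodes with a "_:" prefix.
            row[child.tag] = "_:" + child.get("bnodeid")
        else:
            row[child.tag] = child.text
    return row


if __name__ == "__main__":
    row = result_to_dict(ET.fromstring(RESULT))
    for name, value in row.items():
        print(name, "=", value)
```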

cheers,

Dan



-- 
Shadi Abou-Zahra     Web Accessibility Specialist for Europe |
Chair & Staff Contact for the Evaluation and Repair Tools WG |
World Wide Web Consortium (W3C)           http://www.w3.org/ |
Web Accessibility Initiative (WAI),   http://www.w3.org/WAI/ |
WAI-TIES Project,                http://www.w3.org/WAI/TIES/ |
Evaluation and Repair Tools WG,    http://www.w3.org/WAI/ER/ |
2004, Route des Lucioles - 06560,  Sophia-Antipolis - France |
Voice: +33(0)4 92 38 50 64          Fax: +33(0)4 92 38 78 22 |
Received on Monday, 15 January 2007 13:14:35 UTC