Re: Matching URIs in RDF (with SPARQL) from Phil Archer on 2006-07-03 (public-xg-wcl@w3.org from July 2006)

From: Phil Archer <parcher@icra.org>
Date: Mon, 03 Jul 2006 22:20:37 +0100
To: public-xg-wcl@w3.org
Message-ID: <44A98A25.8010504@icra.org>
Just as an aside, I've made Dan's script into a little API on icra.org.

The base URI of the application is 
http://www.icra.org/cgi-bin/wcl/uriparser.cgi

It takes two parameters:

uri (mandatory)
Not surprisingly, this is the URI you wish to parse

output = html | rdf (optional)
By default, the script returns an RDF description of the given URI using 
the notation Dan worked out below (the script is just an adaptation of 
his). If you want to see a breakdown of the URI in HTML tabular form, 
add output=html.

Some twiddly bits:

- If no scheme is given, it defaults to HTTP (this is declared in the 
HTML output)
- All URIs are normalised using the URI module for Perl [1]. Basically, 
the scheme and authority are put into lower case; everything from the 
path onwards is left as given.
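
The two points above could be approximated in Python roughly as follows. This is only a sketch of the behaviour as described here, not the Perl URI module's actual algorithm; the function name normalise and the "no '://' means no scheme" test are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise(uri, default_scheme="http"):
    """Approximate the normalisation described above (hypothetical helper)."""
    # Assumption: a string without "://" carries no scheme, so add the default.
    if "://" not in uri:
        uri = default_scheme + "://" + uri
    parts = urlsplit(uri)
    # Lower-case the scheme and authority; path, query and fragment stay as given.
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path,
        parts.query,
        parts.fragment,
    ))


print(normalise("example.org"))
print(normalise("HTTP://Example.ORG/Some/Path?Q=1"))
```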

So, for example,

http://www.icra.org/cgi-bin/wcl/uriparser.cgi?uri=example.org&output=html

Gives you a breakdown of http://example.org in HTML format.
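
For anyone calling this from code, here is a minimal sketch of building such a request URL in Python from the two parameters described above. The helper name build_parser_url is hypothetical, and the sketch only constructs the URL; it makes no assumption that the service is reachable.

```python
from urllib.parse import urlencode

# Base URI of the script, as given in this message.
BASE = "http://www.icra.org/cgi-bin/wcl/uriparser.cgi"


def build_parser_url(uri, output=None):
    """Build a request URL for the parser (hypothetical helper name).

    uri    -- mandatory: the URI to parse
    output -- optional: "html" or "rdf"; leaving it out gives the default (RDF)
    """
    if output not in (None, "html", "rdf"):
        raise ValueError("output must be 'html' or 'rdf'")
    params = {"uri": uri}
    if output is not None:
        params["output"] = output
    # urlencode percent-encodes reserved characters in the parameter values.
    return BASE + "?" + urlencode(params)


print(build_parser_url("example.org", output="html"))
```

A full URI passed as the uri parameter contains characters such as ":" and "/", which urlencode percent-encodes in this sketch.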

Phil.

[1] http://search.cpan.org/~gaas/URI-1.35/URI.pm


Dan Brickley wrote:
> 
> (Am sending this to the XG's public list, bcc:'d to the member one. 
> We're all on both, right? it's a good discussion to have in public...)
> 
> 
> OK Some progress, based on the regex from Jo's doc. Rough notes from the 
> SW Interest Group IRC channel, where I got some help putting this 
> together. I've got a quick perl script that generates an RDF description 
> of each entry in a list of URIs, and a SPARQL query plus various filters 
> which match against some/all of these URIs. It uses a fictional 
> namespace in http://www.w3.org/2004/12/q/ which reminds me to 
> investigate whether I still have write-access there, and if we can use 
> it for the XG.
> 
> Am Cc:'ing TimBL and DanC who may be interested. Tim, Dan --- this work 
> is motivated by a desire to attach RDF descriptive labels to collections 
> of documents picked out either by enumeration or by patterns expressed 
> against URIs/IRIs. Jo Rabin's doc at 
> http://www.w3.org/2005/Incubator/wcl/matching.html has more background.
> There's some related work from OpenSearch folks at 
> http://www.snellspace.com/wp/?p=369 that we're loosely connected to via 
> Elias Torres in #swig.
> 
> 
> For today's hack, see 
> http://swig.xmlhack.com/2006/07/01/2006-07-01.html#1151749799.081592
> 
> Perl script:       http://spypixel.com/2006/wcl/uri/uri-pl-source.txt
> List of URIs:      http://spypixel.com/2006/wcl/uri/sites.txt
> Generated RDF:     http://spypixel.com/2006/wcl/uri/_data.rdf
> 
> example:
> <ID xmlns='http://www.w3.org/2004/12/q/idsyntax#'>
> <full>http://nobody:nothing@127.0.0.1:8080/dot/slash/dot?foo=bar;x=y</full>
>   <nameFor rdf:resource='http://nobody:nothing@127.0.0.1:8080/dot/slash/dot?foo=bar;x=y'/>
>   <scheme>http</scheme>
>   <authority>nobody:nothing@127.0.0.1:8080</authority>
>   <userinfo>nobody:nothing</userinfo>
>   <host>127.0.0.1</host>
>   <port>8080</port>
>   <path>/dot/slash/dot</path>
>   <query>foo=bar;x=y</query>
> </ID>
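
As a rough illustration of the decomposition in the example above, the same fields can be pulled out with Python's standard library. This is a sketch, not a port of the Perl script: the function name describe is hypothetical, and urlsplit lower-cases the scheme and host, whereas the generated RDF keeps them as given.

```python
from urllib.parse import urlsplit


def describe(uri):
    """Split a URI into the fields used in the example description."""
    parts = urlsplit(uri)
    fields = {
        "full": uri,
        "scheme": parts.scheme,
        "authority": parts.netloc,
        # Everything before the last "@" in the authority, if there is one.
        "userinfo": parts.netloc.rpartition("@")[0] or None,
        "host": parts.hostname,
        "port": str(parts.port) if parts.port is not None else None,
        "path": parts.path or None,
        "query": parts.query or None,
        "fragment": parts.fragment or None,
    }
    # Leave out absent components, as the example leaves out <fragment>.
    return {name: value for name, value in fields.items() if value is not None}


example = "http://nobody:nothing@127.0.0.1:8080/dot/slash/dot?foo=bar;x=y"
for name, value in describe(example).items():
    print(name, "=", value)
```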
> 
> Example SPARQL:    http://spypixel.com/2006/wcl/uri/filter-test2.rq
> (this runs OK in Jena/ARQ eg through the Twinkle GUI)
> 
> Here's the SPARQL example in full. Basically we match the URI 
> descriptions, and then filter against the various strings using the
> query language's FILTER functionality: in particular, regexes, boolean 
> and/or combinations, and exact matching with "=". The lines with a # are 
> commented out. Note that there are some cases here we'll want for 
> testing, e.g. the case of the URI scheme (hTtp: etc) could easily trip 
> us up.
> 
> PREFIX u: <http://www.w3.org/2004/12/q/idsyntax#>
> SELECT DISTINCT *
> WHERE {
>   ?id a u:ID .
>   ?id u:full ?full .
>   ?id u:nameFor ?res .
>   ?id u:scheme ?scheme .
>   ?id u:authority ?authority .
>   OPTIONAL { ?id u:userinfo ?userinfo } .
>   OPTIONAL { ?id u:host ?host } .
>   OPTIONAL { ?id u:port ?port } .
>   OPTIONAL { ?id u:path ?path } .
>   OPTIONAL { ?id u:query ?query } .
>   OPTIONAL { ?id u:fragment ?fragment } .
> #  FILTER regex ( ?scheme, "http" ) . # schemes matching "http", i.e. includes https:
> #  FILTER regex ( ?scheme, "^http$" ) . # http: scheme
> #  FILTER regex ( ?scheme, "^HTTP$" ) . # HTTP: scheme (do we normalise in the regex or the rdf?)
> #  FILTER regex ( ?scheme, "^http$", "i" ) . # http: scheme, case insensitive (more robust)
> #  FILTER ( regex(?scheme, "^http$", "i") && ( (?port = "8080") || (?port = "1234") ) ) .
> #  FILTER regex(?userinfo, ":") . # password is given in the URI
>   FILTER regex(?host, "^pics|www\\.pics") .
> }
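
To make the filter semantics concrete, here is a rough Python analogue of three of the filters in the query above, applied to plain dictionaries instead of RDF. It is a sketch only: the sample records are invented for illustration, and Python's re.search stands in for SPARQL's regex(), which likewise matches anywhere in the string unless the pattern is anchored.

```python
import re

# Invented sample descriptions, shaped like the u:ID properties in the query.
records = [
    {"scheme": "http", "host": "pics.example.org", "port": "8080"},
    {"scheme": "HTTP", "host": "www.pics.example.org"},
    {"scheme": "https", "host": "example.org", "port": "1234"},
]


def http_any_case(rec):
    # The ?scheme filter with pattern ^http$ and the "i" flag.
    return re.search(r"^http$", rec["scheme"], re.IGNORECASE) is not None


def http_on_listed_port(rec):
    # Scheme filter combined with (?port = "8080" || ?port = "1234").
    # In SPARQL an unbound ?port makes the filter fail; .get() mimics that.
    return http_any_case(rec) and rec.get("port") in ("8080", "1234")


def pics_host(rec):
    # The ?host filter: "^" applies only to the first alternative, so the
    # second alternative can match anywhere in the host name.
    return re.search(r"^pics|www\.pics", rec["host"]) is not None


print([r["host"] for r in records if http_any_case(r)])
print([r["host"] for r in records if http_on_listed_port(r)])
print([r["host"] for r in records if pics_host(r)])
```

Note that the unanchored "http" pattern from the first commented-out filter would also accept the https record, which is exactly the difference the comments in the query point out.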
> 
> 
> 
> Easiest way to play with this is to download and run Twinkle from 
> http://www.ldodds.com/projects/twinkle/ and use 
> http://spypixel.com/2006/wcl/uri/_data.rdf as the data URI.
> 
> I've not got it running against the online Redland SPARQL query 
> installation yet, will ask Dave Beckett where the problem is.
> 
> There are a few more comprehensive collections of 'tricky' URIs around. 
> I'm not sure of the exact status of any URI test suite, but I have 
> collected some links at the bottom of the perl script, reproduced here.
> 
> http://www.w3.org/Addressing/url_test/url_grammar.tests
> http://www.ninebynine.org/Software/HaskellUtils/Network/URITestDescriptions.html 
> 
> http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html
> 
> I've not investigated the IRI side yet, nor taken any care with charset 
> issues (either in the data, or the perl/regex).
> 
> Next steps in the XG? It would be great if someone could try 
> re-expressing the contents of 
> http://www.w3.org/2005/Incubator/wcl/matching.html or Phil's recent msg 
> http://lists.w3.org/Archives/Member/member-xg-wcl/2006Jun/0079.html 
> (member-only link) using SPARQL filters plus this vocab. For those of us 
> who prefer to do things with XML, I wonder whether the XML resultset 
> format that SPARQL returns would be an acceptable compromise. If we run 
> the above SPARQL query without any filters, it returns the following XML 
> structure --- http://spypixel.com/2006/wcl/uri/_eg_results.txt
> 
> ie. markup like this:
> 
>     <result>
>       <id bnodeid="b0"/>
>       <full>HTTP://example.caps.example.org/</full>
>       <res uri="HTTP://example.caps.example.org/"/>
>       <scheme>HTTP</scheme>
>       <authority>example.caps.example.org</authority>
>       <userinfo bound="false"/>
>       <host>example.caps.example.org</host>
>       <port bound="false"/>
>       <path>/</path>
>       <query bound="false"/>
>       <fragment bound="false"/>
>     </result>
> 
> ...for each result. Am thinking out loud here, not yet quite sure how 
> all these ingredients fit together. And that's without even considering 
> OWL, RIF etc. :)
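
For those who would rather consume the XML, a result in the shape quoted above can be read with a few lines of Python. This is a sketch under assumptions: it takes the element and attribute names exactly as they appear in the snippet, assumes no XML namespace on the elements, and the helper name bindings is hypothetical.

```python
import xml.etree.ElementTree as ET

# One <result> element, copied from the markup quoted above.
SAMPLE = """
<result>
  <id bnodeid="b0"/>
  <full>HTTP://example.caps.example.org/</full>
  <res uri="HTTP://example.caps.example.org/"/>
  <scheme>HTTP</scheme>
  <authority>example.caps.example.org</authority>
  <userinfo bound="false"/>
  <host>example.caps.example.org</host>
  <port bound="false"/>
  <path>/</path>
  <query bound="false"/>
  <fragment bound="false"/>
</result>
"""


def bindings(result):
    """Map variable name to value, skipping variables marked bound="false"."""
    row = {}
    for child in result:
        if child.get("bound") == "false":
            continue
        # URIs and blank nodes sit in attributes; literals are element text.
        row[child.tag] = child.get("uri") or child.get("bnodeid") or (child.text or "")
    return row


row = bindings(ET.fromstring(SAMPLE))
print(row["scheme"], row["host"], sorted(row))
```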
> 
> cheers,
> 
> Dan
> 
> 
> 
> 

-- 
Phil Archer
Chief Technical Officer, ICRA
t. +44 (0)1473 434770
Skype: philarcher
w. http://www.icra.org/people/philarcher/

Working for a Safer Internet
Received on Monday, 3 July 2006 21:20:44 UTC