Re: Matching URIs in RDF (with SPARQL)

(Dan, I'm not sure everyone on the Member list is on the public one, 
hence Bcc to member as well).

This is obviously very interesting stuff! I can see that it would be 
useful to set up an open API that returned RDF descriptions of a list of 
URIs - it's a simple enough task as your Perl script shows. Since there 
is clearly interest in the idea of an RDF vocabulary for parts of a URI, 
and this probably has application well beyond WCL, I wonder whether we 
should define that in a separate namespace (and use it)?

Meanwhile, I sat down to work these ideas into the WCL requirements. We 
need to be able to define a group of URIs: "everything with this scheme, 
everything with this authority, everything on this list etc." Then we 
need to find out whether a given URI is a member of that group (and then 
we get into sub-groups that have different labels but let's leave that 
for now).
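
To make the "group of URIs" idea concrete, here is a minimal sketch in Python. It is not part of WCL: the class name UriGroup, its fields and the contains method are all hypothetical, and it assumes that a URI belongs to the group if it matches any one of the criteria (scheme, authority or explicit list). The requirements may well end up combining criteria differently.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class UriGroup:
    """Hypothetical container for a group of URIs; not a WCL structure."""
    schemes: set = field(default_factory=set)      # e.g. {"https"}
    authorities: set = field(default_factory=set)  # e.g. {"example.org:8080"}
    listed: set = field(default_factory=set)       # explicitly enumerated URIs

    def contains(self, uri: str) -> bool:
        """Assumption: matching ANY one criterion makes the URI a member."""
        parts = urlsplit(uri)
        # Scheme and host are case-insensitive, so compare in lower case.
        # Lower-casing the whole authority is a simplification, since any
        # userinfo part is strictly case-sensitive.
        if parts.scheme.lower() in self.schemes:
            return True
        if parts.netloc.lower() in self.authorities:
            return True
        return uri in self.listed


group = UriGroup(
    schemes={"https"},
    authorities={"example.org:8080"},
    listed={"http://example.net/a"},
)
print(group.contains("http://Example.ORG:8080/x"))  # matched by authority
print(group.contains("http://example.net/b"))       # matches nothing
```

The sub-groups with different labels mentioned above are deliberately left out of this sketch.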

The input URI is therefore the URI of the thing for which we want a 
label. That is, I have http://example.com and I want to find out what 
its Content Label is.

I've set up an examples directory on the WCL space and put up "eg 1" 
http://www.w3.org/2005/Incubator/wcl/examples/eg1.rdf which is the 
example in the current version of the Report [1]. It's bound to change 
soon but it's OK for this discussion.

Using Twinkle (thanks for that link!) I can run this query:

PREFIX wcl: <http://www.w3.org/2004/12/q/contentlabel#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT *
WHERE {
   ?r rdf:type wcl:Ruleset .
   ?r wcl:hasScope ?s .
   OPTIONAL { ?s wcl:scheme ?scheme }.
   OPTIONAL { ?s wcl:host ?host }.
   OPTIONAL { ?r wcl:hasDefaultLabel ?defLab } .
}

This identifies a WCL Ruleset, finds the Scope node (?s) and then looks 
to see whether any schemes or hosts are defined for the group. Looking 
for paths, queries and fragments would work in the same way. Running the 
query on eg 1 gives:

1 ( ?r = _:b8 ) ( ?s = _:b9 )
   ( ?host = "resources.example.co.uk" )
   ( ?defLab = <http://www.w3.org/2005/Incubator/wcl/examples/eg1.rdf#label_1> )

2 ( ?r = _:b8 ) ( ?s = _:b9 )
   ( ?host = "resources.example.com" )
   ( ?defLab = <http://www.w3.org/2005/Incubator/wcl/examples/eg1.rdf#label_1> )

Which tells me that the scope for the labels here is the hosts 
resources.example.co.uk and resources.example.com and that for both of 
them, the default label is 
<http://www.w3.org/2005/Incubator/wcl/examples/eg1.rdf#label_1>.

Now, to find out the label for my input URI (http://example.com) I can 
parse the URI and find that the host is example.com. This is neither the 
same as, nor a subdomain of, either of the two hosts in scope so, no, I 
have no label for this resource.
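
The host check just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name host_in_scope is made up, the two scope hosts are copied from the query results above, and "subdomain" is taken to mean a simple dot-separated suffix match.

```python
from urllib.parse import urlsplit

# Hosts in scope, copied from the query results for eg 1.
SCOPE_HOSTS = ["resources.example.co.uk", "resources.example.com"]


def host_in_scope(uri, scope_hosts=SCOPE_HOSTS):
    """Return True if the URI's host equals, or is a subdomain of, a scope host."""
    host = urlsplit(uri).hostname  # lower-cased by urlsplit; None if absent
    if host is None:
        return False
    # Prefixing the dot stops "notresources.example.com" from matching.
    return any(host == h or host.endswith("." + h) for h in scope_hosts)


print(host_in_scope("http://example.com"))                  # False
print(host_in_scope("http://resources.example.com/page"))   # True
print(host_in_scope("http://img.resources.example.com/x"))  # True
```

The first call is the case worked through above: example.com is a parent of resources.example.com rather than a subdomain of it, so no label applies.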

Are you, Dan, suggesting that the input URI should be parsed and 
converted into an RDF description? This might make sense if there's a 
simple SPARQL query that can match two things together (or not), but I 
don't think SPARQL is meant to work over multiple data sources, is it? I 
doubt that's what you mean.
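
For reference, the parsing half of that question is straightforward. The following Python sketch splits an input URI into the components named in Dan's example (full, scheme, authority, userinfo, host, port, path, query, fragment). The function name describe_uri is hypothetical, and it returns a plain dictionary rather than RDF; serialising it with the idsyntax vocabulary would be a separate step.

```python
from urllib.parse import urlsplit


def describe_uri(uri):
    """Split a URI into named components; absent components are omitted."""
    parts = urlsplit(uri)
    # userinfo is whatever precedes the last "@" in the authority, if any.
    userinfo = parts.netloc.rpartition("@")[0] if "@" in parts.netloc else None
    candidates = {
        "full": uri,
        # Note: urlsplit lower-cases the scheme and the host, so "HTTP"
        # comes back as "http"; the original spelling survives in "full".
        "scheme": parts.scheme,
        "authority": parts.netloc,
        "userinfo": userinfo,
        "host": parts.hostname,
        "port": str(parts.port) if parts.port is not None else None,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }
    return {name: value for name, value in candidates.items() if value}


uri = "http://nobody:nothing@127.0.0.1:8080/dot/slash/dot?foo=bar;x=y"
for name, value in describe_uri(uri).items():
    print(name, "=", value)
```

Whether normalising case at this stage is desirable is an open question; it bears directly on the hTtp: test cases raised in the quoted message.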

I do think that the report needs to include some sample SPARQL queries. 
When we get into the rules section these start to get a bit more 
complicated but they're not too bad (at least, they aren't in RDF-CL).

Phil.

[1] http://www.w3.org/2005/Incubator/wcl/XGR-report/





Dan Brickley wrote:
> 
> (Am sending this to the XG's public list, bcc:'d to the member one. 
> We're all on both, right? it's a good discussion to have in public...)
> 
> 
> OK Some progress, based on the regex from Jo's doc. Rough notes from the 
> SW Interest Group IRC channel, where I got some help putting this 
> together. I've got a quick perl script that generates an RDF description 
> of each entry in a list of URIs, and a SPARQL query plus various filters 
> which match against some/all of these URIs. It uses a fictional 
> namespace in http://www.w3.org/2004/12/q/ which reminds me to 
> investigate whether I still have write-access there, and if we can use 
> it for the XG.
> 
> Am Cc:'ing TimBL and DanC who may be interested. Tim, Dan --- this work 
> is motivated by a desire to attach RDF descriptive labels to collections 
> of documents picked out either by enumeration or by patterns expressed 
> against URIs/IRIs. Jo Rabin's doc at 
> http://www.w3.org/2005/Incubator/wcl/matching.html has more background.
> There's some related work from OpenSearch folks at 
> http://www.snellspace.com/wp/?p=369 that we're loosely connected to via 
> Elias Torres in #swig.
> 
> 
> For today's hack, see 
> http://swig.xmlhack.com/2006/07/01/2006-07-01.html#1151749799.081592
> 
> Perl script:       http://spypixel.com/2006/wcl/uri/uri-pl-source.txt
> List of URIs:      http://spypixel.com/2006/wcl/uri/sites.txt
> Generated RDF:     http://spypixel.com/2006/wcl/uri/_data.rdf
> 
> example:
> <ID xmlns='http://www.w3.org/2004/12/q/idsyntax#'>
> <full>http://nobody:nothing@127.0.0.1:8080/dot/slash/dot?foo=bar;x=y</full>
>   <nameFor rdf:resource='http://nobody:nothing@127.0.0.1:8080/dot/slash/dot?foo=bar;x=y'/>
>   <scheme>http</scheme>
>   <authority>nobody:nothing@127.0.0.1:8080</authority>
>   <userinfo>nobody:nothing</userinfo>
>   <host>127.0.0.1</host>
>   <port>8080</port>
>   <path>/dot/slash/dot</path>
>   <query>foo=bar;x=y</query>
> </ID>
> 
> Example SPARQL:    http://spypixel.com/2006/wcl/uri/filter-test2.rq
> (this runs OK in Jena/ARQ eg through the Twinkle GUI)
> 
> Here's the SPARQL example in full. Basically we match the URI 
> descriptions, and then filter against the various strings using the
> query language's FILTER functionality, in particular regexes, and/or 
> combinations, and exact matching with "=". The lines with a # are commented 
> out. Note that there are some cases here we'll want for testing, eg. 
> case of the URI scheme (hTtp: etc) could easily trip us up.
> 
> PREFIX u: <http://www.w3.org/2004/12/q/idsyntax#>
> SELECT DISTINCT *
> WHERE {
>   ?id a u:ID .
>   ?id u:full ?full .
>   ?id u:nameFor ?res .
>   ?id u:scheme ?scheme .
>   ?id u:authority ?authority .
>   OPTIONAL { ?id u:userinfo ?userinfo } .
>   OPTIONAL { ?id u:host ?host } .
>   OPTIONAL { ?id u:port ?port } .
>   OPTIONAL { ?id u:path ?path } .
>   OPTIONAL { ?id u:query ?query } .
>   OPTIONAL { ?id u:fragment ?fragment } .
> #  FILTER regex ( ?scheme, "http" ) . # schemes matching "http" ie. includes https:
> #  FILTER regex ( ?scheme, "^http$" ) . # http: scheme
> #  FILTER regex ( ?scheme, "^HTTP$" ) . # HTTP: scheme (do we normalise in the regex or the rdf?)
> #  FILTER regex ( ?scheme, "^http$", "i" ) . # http: scheme, case insensitive (more robust)
> #  FILTER ( regex(?scheme,"^http$","i") && ( (?port = "8080") || (?port = "1234") ) ) .
> #FILTER regex(?userinfo, ":") # password is given in the URI
> FILTER regex(?host, "^pics|www\\.pics") .
> }
> 
> 
> 
> Easiest way to play with this is to download and run Twinkle from 
> http://www.ldodds.com/projects/twinkle/ and use 
> http://spypixel.com/2006/wcl/uri/_data.rdf as the data URI.
> 
> I've not got it running against the online Redland SPARQL query 
> installation yet, will ask Dave Beckett where the problem is.
> 
> There are a few more comprehensive collections of 'tricky' URIs around, 
> I'm not sure the exact status of any URI test suite but have collected 
> up some links in the bottom of the perl script, reproduced here.
> 
> http://www.w3.org/Addressing/url_test/url_grammar.tests
> http://www.ninebynine.org/Software/HaskellUtils/Network/URITestDescriptions.html 
> 
> http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html
> 
> I've not investigated the IRI side yet, nor taken any care with charset 
> issues (either in the data, or the perl/regex).
> 
> Next steps in the XG? It would be great if someone could try 
> re-expressing the contents of 
> www.w3.org/2005/Incubator/wcl/matching.html or Phil's recent msg 
> http://lists.w3.org/Archives/Member/member-xg-wcl/2006Jun/0079.html 
> (member-only link) using SPARQL filters plus this vocab. For those of us 
> who prefer to do things with XML, I wonder whether the XML resultset 
> format that SPARQL returns would be an acceptable compromise. If we run 
> the above SPARQL query without any filters, it returns the following XML 
> structure --- http://spypixel.com/2006/wcl/uri/_eg_results.txt
> 
> ie. markup like this:
> 
>     <result>
>       <id bnodeid="b0"/>
>       <full>HTTP://example.caps.example.org/</full>
>       <res uri="HTTP://example.caps.example.org/"/>
>       <scheme>HTTP</scheme>
>       <authority>example.caps.example.org</authority>
>       <userinfo bound="false"/>
>       <host>example.caps.example.org</host>
>       <port bound="false"/>
>       <path>/</path>
>       <query bound="false"/>
>       <fragment bound="false"/>
>     </result>
> 
> ...for each result. Am thinking out loud here, not yet quite sure how 
> all these ingredients fit together. And that's without even considering 
> OWL, RIF etc. :)
> 
> cheers,
> 
> Dan
> 
> 
> 
> 

-- 
Phil Archer
Chief Technical Officer, ICRA
t. +44 (0)1473 434770
Skype: philarcher
w. http://www.icra.org/people/philarcher/

Working for a Safer Internet

Received on Monday, 3 July 2006 15:04:53 UTC