Matching same resources but with varying URL schemes (http / https)

Hi.

I hope such "design pattern" questions on consuming Linked Open Data are
on topic here... otherwise, please suggest an appropriate venue for questions ;)


I'm trying to figure out potential patterns for designing an application
/consuming/ Linked Data, typically using SPARQL over a local Virtuoso
triple store which was fed with harvested Linked Open Data.

I sometimes find the same resource identified with http in one place and
with https in another, the URLs being otherwise identical. Another issue is
the presence or absence of a trailing slash on directory-like URLs.

For instance, I'd like to match as "identical" two doap:Project resources
which have the "same" doap:homepage, i.e. if I can match
http://project1/example.com/home/ and https://project1/example.com/home/


It may happen that a document is rendered the same by the publishing
service, whichever way it is accessed, so I'd like to consider that
referencing it via URIs which contain http:// or https:// is equivalent.

Or a service may have chosen to adopt https:// as a canonical URI for
instance, but it may happen that users reference it via http somewhere
else... 

Obviously, direct matching on the same ?h URIRef won't work
in basic SPARQL queries like:
PREFIX doap:  <http://usefulinc.com/ns/doap#>

SELECT *
{
  GRAPH <http://myapp.example.com/graphs?source=http://publisher1.example.com/> {
   ?dp doap:homepage ?h.
   ?dp doap:name ?dn
  }
  GRAPH <http://myapp.example.com/graphs?source=https://publisher2.example.com/> {
   ?ap doap:homepage ?h.
   ?ap doap:name ?an
  }
}

I can think of some sort of regexp matching on the string after '://', but I
doubt it would perform well ;-)

Is there a way to create indexes over some URIs, or to use owl:sameAs
relations, to manage such URI matching in queries? Or am I left with
"normalizing" the URLs in the harvested data before storing them in the
triple store?
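
To make the "normalizing" option concrete, here is a minimal sketch in
Python of what I have in mind, to be applied to each homepage URL at harvest
time, before loading into the store. The normalize_url helper and the choice
of http as the canonical scheme are only my assumptions for illustration,
not anything standard, and it assumes the URLs carry no user:password part:

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url, canonical_scheme="http"):
    """Hypothetical helper: map http/https and trailing-slash variants
    of a URL to one comparable form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # Assumption: http and https variants designate the same resource.
    if scheme in ("http", "https"):
        scheme = canonical_scheme
    # Host names are case-insensitive; paths are not, so only the host
    # part is lowercased.
    netloc = parts.netloc.lower()
    # Treat "/home" and "/home/" as the same; keep the bare root as "/".
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


a = normalize_url("http://project1/example.com/home/")
b = normalize_url("https://project1/example.com/home/")
print(a == b)  # True
```

The drawback is that the stored URIs no longer match what the publisher
actually exposed, so the original value would probably have to be kept in a
separate property.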

Do you think there is a reasonably standard approach... or one that
would work with Virtuoso 6.1.3? ;)

I imagine this is something of a FAQ for consuming Linked (Open)
Data... but it seems many more people are concerned with publishing than
with consuming in public discussions ;-)


Thanks in advance.

P.S.: already posted a similar question on
http://answers.semanticweb.com/questions/23584/matching-ressources-with-variying-url-scheme-http-https
-- 
Olivier BERGER 
http://www-public.telecom-sudparis.eu/~berger_o/ - OpenPGP-Id: 2048R/5819D7E8
Ingenieur Recherche - Dept INF
Institut Mines-Telecom, Telecom SudParis, Evry (France)

Received on Thursday, 4 July 2013 15:50:24 UTC