Re: Matching same ressources but with varying URL schemes (http / https) from Steve Harris on 2013-07-04 (public-lod@w3.org from July 2013)

From: Steve Harris <steve.harris@garlik.com>
Date: Thu, 4 Jul 2013 17:12:11 +0100
To: Olivier Berger <olivier.berger@telecom-sudparis.eu>
Cc: public-lod@w3.org
Message-Id: <372747EB-E7FF-48C8-A564-23DEC2199979@garlik.com>
Of course you have to handle the case where http://foo.example/ and https://foo.example/ are materially different too…

One approach would be to have some sort of property like "canonical URI", which you can use for your matching; that way you can lean on the triplestore's built-in URI indexing.

{
  <foo> doap:homepage <https://Foo.example> .
  <foo> doap:name "Foo" .
  <https://Foo.example> :canonicalUri <http://foo.example/>
}

etc.

Then you can do

SELECT *
WHERE {
   ?x doap:homepage ?hp . 
   ?hp :canonicalUri <http://foo.example/> .
}

And you can have different canonical URIs for sites with different http and https content, if you want to.
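
As a rough sketch of how the :canonicalUri triples might be generated when the harvested data is loaded (the predicate IRI, the helper names and the DISTINCT_HTTPS_HOSTS set are all made up for illustration, and folding https onto http is just one possible policy):

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical predicate IRI; the message above only writes ":canonicalUri".
CANONICAL_URI = "http://example.org/vocab#canonicalUri"

# Assumed, hand-maintained set of hosts whose http and https sites really
# differ; URIs on these hosts keep their own scheme.
DISTINCT_HTTPS_HOSTS = set()

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonical_uri(uri):
    """Return an assumed canonical form: lowercase host, https folded to http,
    default port dropped, empty path replaced by "/". Userinfo is discarded."""
    parts = urlsplit(uri)
    original_scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    scheme = original_scheme
    if scheme == "https" and host not in DISTINCT_HTTPS_HOSTS:
        scheme = "http"

    # Keep only ports that are not the default for the original scheme.
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(original_scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"

    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


def canonical_triple(uri):
    """Format one N-Triples line linking a harvested URI to its canonical form."""
    return f"<{uri}> <{CANONICAL_URI}> <{canonical_uri(uri)}> ."


if __name__ == "__main__":
    print(canonical_triple("https://Foo.example"))
```

This deliberately leaves trailing slashes on non-empty paths alone, since whether /home and /home/ are the same resource depends on the publisher.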

I would avoid tying yourself to any non-standard SPARQL / RDF extensions; that way you avoid limiting yourself to a particular triplestore vendor.

- Steve

On 2013-07-04, at 16:49, Olivier Berger <olivier.berger@telecom-sudparis.eu> wrote:

> Hi.
> 
> I hope such "design pattern" questions on consuming Linked Open Data are
> on-topic... otherwise, please suggest an appropriate venue for questions ;)
> 
> 
> I'm trying to figure out potential patterns for designing an application
> /consuming/ Linked Data, typically using SPARQL over a local Virtuoso
> triple store which was fed with harvested Linked Open Data.
> 
> I happen to find resources sometimes identified with http, sometimes
> with https, which otherwise reference the same URL. Other issues may be
> the use or not of a trailing slash for dir-like URLs.
> 
> For instance, I'd like to match as "identical" two doap:Project resources
> which have "same" doap:homepage if I can match
> http://project1/example.com/home/ and https://project1/example.com/home/
> 
> 
> It may happen that a document is rendered the same by the publishing
> service, whichever way it is accessed, so I'd like to consider that
> referencing it via URIs which contain http:// or https:// is equivalent.
> 
> Or a service may have chosen to adopt https:// as a canonical URI for
> instance, but it may happen that users reference it via http somewhere
> else... 
> 
> Obviously, direct matching of the same ?h URIRef won't work
> in basic SPARQL queries like :
> PREFIX doap:  <http://usefulinc.com/ns/doap#>
> 
> SELECT *
> {
>  GRAPH <http://myapp.example.com/graphs?source=http://publisher1.example.com/> {
>   ?dp doap:homepage ?h.
>   ?dp doap:name ?dn
>  }
>  GRAPH <http://myapp.example.com/graphs?source=https://publisher2.example.com/> {
>   ?ap doap:homepage ?h.
>   ?ap doap:name ?an
>  }
> }
> 
> I can think of a sort of Regexp matching on the string after '://' but I
> doubt it would perform well ;-)
> 
> Is there a way to create indexes over some URIs, or owl:sameAs relations to
> manage such URI matching in queries ? Or am I left to "normalizing" my
> URLs in the harvested data before storing them in the triple store ?
> 
> Would you think there's a reasonably standard approach... or one that
> would work with Virtuoso 6.1.3 ? ;)
> 
> I imagine that this is a kinda FAQ for consuming Linked (Open)
> Data... but it seems many more people are concerned with publishing than
> with consuming in public discussions ;-)
> 
> 
> Thanks in advance.
> 
> P.S.: already posted a similar question on
> http://answers.semanticweb.com/questions/23584/matching-ressources-with-variying-url-scheme-http-https
> -- 
> Olivier BERGER 
> http://www-public.telecom-sudparis.eu/~berger_o/ - OpenPGP-Id: 2048R/5819D7E8
> Ingenieur Recherche - Dept INF
> Institut Mines-Telecom, Telecom SudParis, Evry (France)
> 
> 

-- 
Steve Harris
Experian
+44 20 3042 4132
Registered in England and Wales 653331 VAT # 887 1335 93
80 Victoria Street, London, SW1E 5JL
Received on Thursday, 4 July 2013 16:12:40 UTC