FW: Matching same resources but with varying URL schemes (http / https)

Colleagues,

 

Nice little piece from Hugh on reconciling vagaries in URIs (the query asked
what to do with http:// and https:// for the "same" URL/URI).

 

Hugh's ideas are based on how he manages the multiple links associated with a
"canonical" URI.

 

Jerry

 


-----Original Message-----
From: Hugh Glaser [mailto:hg@ecs.soton.ac.uk] 
Sent: Thursday, July 04, 2013 2:11 PM
To: Steve Harris; Olivier Berger
Cc: public-lod@w3.org
Subject: Re: Matching same resources but with varying URL schemes (http / https)

 

Hi Olivier,

Great problem, and something that happens quite a lot.

Of course, even worse is when the URI is the redirected one
(http://dbpedia.org/page/Luton instead of http://dbpedia.org/resource/Luton) - do
you simply reject the data, or try and patch something up? Or they forget
the #me or #person on the end.

Then there are things like where the site has worried about misspellings
etc. (look at the dbpedia URIs for
http://sameas.org/?uri=http://dbpedia.org/resource/Tim_Berners-Lee), or the
VIAF ones on the same page.

And see also the freebase ones with both a . and a / after the m, etc.

And you also get the dated/versioned URI versus the "generic" URI.

 

Anyway, my approach is similar to Steve's, but I use a sameAs store (of
course :-) ).

I don't actually want to pollute my store of what I think of as the
"knowledge" with all this identity bookkeeping, even though I certainly need it.

So all URIs that go in the endpoint are the canons from an associated sameAs
store.

Think of it as an identity management KB.

The sameAs service names a URI as the canon, which of course it does for
exactly this purpose.

And I usually make sure that the canon it names is the first URI it ever
got, so that I don't have to do any URI rewriting of the RDF in the store,
or similar.

Then the sameAs store is used as a lookup to rewrite the URIs in the
assertions and queries to the canons.
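
Concretely, the import-time rewrite is no more than this kind of thing (a
rough Python sketch, not my actual code - the table contents and all the
names here are invented):

# sameAs lookup: every URI maps to the canon the sameAs store named for it.
canon_of = {
    "https://project1/example.com/home/": "http://project1/example.com/home/",
    "http://dbpedia.org/page/Luton": "http://dbpedia.org/resource/Luton",
}

def canon(uri):
    # URIs the store has never seen simply become their own canon.
    return canon_of.setdefault(uri, uri)

def canonicalise(triple):
    # Rewrite subject and object to their canons before the triple goes
    # anywhere near the endpoint; the same lookup is applied to the URIs
    # appearing in queries.
    s, p, o = triple
    return (canon(s), p, canon(o) if o.startswith("http") else o)

# canonicalise(("http://example.org/project1",
#               "http://usefulinc.com/ns/doap#homepage",
#               "https://project1/example.com/home/"))
# -> ("http://example.org/project1",
#     "http://usefulinc.com/ns/doap#homepage",
#     "http://project1/example.com/home/")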

 

Also, I have never done it, because I'm not crawling the web like you, but
it would be fun to have a little script in the import phase that decided if
the URIs were sufficiently similar to be candidates for this treatment. You
could then resolve them and compare the pages, and apply a heuristic (90%?)
to catch the case when something strange had been done for http v. https,
for example.
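
Roughly this kind of thing, say (an untested sketch using only the Python
standard library; the 0.9 threshold and all the names are just guesses):

import difflib
import urllib.request

def same_page_heuristic(uri_a, uri_b, threshold=0.9):
    # Only bother with URIs that are identical apart from the scheme
    # and a possible trailing slash.
    strip = lambda u: u.split("://", 1)[-1].rstrip("/")
    if strip(uri_a) != strip(uri_b):
        return False
    # Resolve both URIs and compare what comes back; if the two pages are
    # "similar enough", the URIs are candidates for the same canon.
    page_a = urllib.request.urlopen(uri_a).read().decode("utf-8", "replace")
    page_b = urllib.request.urlopen(uri_b).read().decode("utf-8", "replace")
    return difflib.SequenceMatcher(None, page_a, page_b).ratio() >= threshold

# e.g. same_page_heuristic("http://project1/example.com/home/",
#                          "https://project1/example.com/home/")

SequenceMatcher is slow on big pages, so in practice you would probably
truncate or normalise the responses first, but it shows the idea.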

 

I hope you will report back what you do!

 

Best

Hugh

 


> 

> On 2013-07-04, at 16:49, Olivier Berger <olivier.berger@telecom-sudparis.eu> wrote:

> 

>> Hi.

>> 

>> I hope such "design pattern" questions on consuming Linked Open Data 

>> are on-topic... otherwise, please suggest an appropriate venue for 

>> questions ;)

>> 

>> 

>> I'm trying to figure out potential patterns for designing an 

>> application /consuming/ Linked Data, typically using SPARQL over a 

>> local Virtuoso triple store which was fed with harvested Linked Open Data.

>> 

>> I happen to find resources sometimes identified with http, sometimes 

>> with https, which otherwise reference the same URL. Another issue is 

>> whether or not a trailing slash is used for dir-like URLs.

>> 

>> For instance, I'd like to match as "identical" two doap:Projects 

>> resources which have "same" doap:homepage if I can match 

>> http://project1/example.com/home/ and 

>> https://project1/example.com/home/

>> 

>> 

>> It may happen that a document is rendered the same by the publishing 

>> service, whichever way it is accessed, so I'd like to consider that 

>> referencing it via URIs which contain http:// or https:// is equivalent.

>> 

>> Or a service may have chosen to adopt https:// as a canonical URI for 

>> instance, but it may happen that users reference it via http 

>> somewhere else...

>> 

>> Obviously, direct matching of the same ?h URIRef won't work in basic 

>> SPARQL queries like:

>> PREFIX doap: <http://usefulinc.com/ns/doap#>

>> 

>> SELECT *

>> {

>> GRAPH <http://myapp.example.com/graphs?source=http://publisher1.example.com/> {

>>  ?dp doap:homepage ?h.

>>  ?dp doap:name ?dn

>> }

>> GRAPH <http://myapp.example.com/graphs?source=https://publisher2.example.com/> {

>>  ?ap doap:homepage ?h.

>>  ?ap doap:name ?an

>> }

>> }

>> 

>> I can think of a sort of Regexp matching on the string after '://' 

>> but I doubt I'd get good performance ;-)

>> 

>> Is there a way to create indexes over some URIs, or owl:sameAs 

>> relations to manage such URI matching in queries? Or am I left to 

>> "normalizing" my URLs in the harvested data before storing them in the triple store?

>> 

>> Do you think there's a reasonably standard approach... or one that 

>> would work with Virtuoso 6.1.3? ;)

>> 

>> I imagine that this is kind of a FAQ for consuming Linked (Open) 

>> Data... but it seems many more people are concerned with publishing 

>> than with consuming in public discussions ;-)

>> 

>> 

>> Thanks in advance.

>> 

>> P.S.: already posted a similar question on 

>> http://answers.semanticweb.com/questions/23584/matching-ressources-with-variying-url-scheme-http-https

>> --

>> Olivier BERGER

>> http://www-public.telecom-sudparis.eu/~berger_o/ - OpenPGP-Id: 

>> 2048R/5819D7E8 Ingenieur Recherche - Dept INF Institut Mines-Telecom, 

>> Telecom SudParis, Evry (France)

>> 

>> 

> 

> --

> Steve Harris

> Experian

> +44 20 3042 4132

> Registered in England and Wales 653331 VAT # 887 1335 93

> 80 Victoria Street, London, SW1E 5JL

> 

> 

 

 

Received on Friday, 5 July 2013 16:40:39 UTC