Re: Matching same ressources but with varying URL schemes (http / https) from Hugh Glaser on 2013-07-04 (public-lod@w3.org from July 2013)

From: Hugh Glaser <hg@ecs.soton.ac.uk>
Date: Thu, 4 Jul 2013 21:10:40 +0000
To: Steve Harris <steve.harris@garlik.com>, Olivier Berger <olivier.berger@telecom-sudparis.eu>
CC: "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <E18BDE5D-1038-4E44-8731-725A50080938@soton.ac.uk>
Hi Olivier,
Great problem, and something that happens quite a lot.
Of course, even worse is when the URI is the redirected one (http://dbpedia.org/page/Luton instead of http://dbpedia.org/resource/Luton): do you simply reject the data, or try to patch something up? Or the publisher forgets the #me or #person on the end.
Then there are cases where the site has worried about misspellings etc. (look at the DBpedia URIs at http://sameas.org/?uri=http://dbpedia.org/resource/Tim_Berners-Lee ).
Or the VIAF ones on the same page.
And see also the Freebase ones with both a . and a / after the m, etc.
And you also get the dated/versioned URI versus the "generic" URI.

Anyway, my approach is similar to Steve's, but I use a sameAs store (of course :-) ).
I don't actually want to pollute my store of what I think of as the "knowledge" with all this identity bookkeeping, even though I certainly need that bookkeeping.
So all URIs that go in the endpoint are the canons from an associated sameAs store.
Think of it as an identity management KB.
The sameAs service names one URI in each set of equivalents as the canon, for exactly this purpose.
And I usually make sure that the canon it names is the first URI it ever got, so that I don't have to do any URI rewriting of the RDF in the store, or similar.
Then the sameAs store is used as a lookup to rewrite the URIs in the assertions and queries to the canons.
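A minimal sketch of that lookup, assuming a tiny in-memory stand-in for the sameAs store (the SameAsStore class and rewrite_triple helper are hypothetical names for illustration, not the sameAs service's actual API). It keeps the first URI it ever saw in each bundle as the canon, so already-stored canons never need rewriting:

```python
class SameAsStore:
    """Hypothetical in-memory stand-in for a sameAs store (illustration only)."""

    def __init__(self):
        self._canon = {}   # URI -> canon URI of its bundle
        self._order = {}   # URI -> arrival order, used to keep the earliest canon

    def _register(self, uri):
        # A URI seen for the first time starts as its own canon.
        if uri not in self._canon:
            self._canon[uri] = uri
            self._order[uri] = len(self._order)
        return self._canon[uri]

    def add_equivalence(self, a, b):
        """Record that URIs a and b identify the same thing."""
        ca, cb = self._register(a), self._register(b)
        if ca == cb:
            return
        # Keep whichever canon arrived first; repoint the other bundle to it.
        keep, drop = sorted((ca, cb), key=self._order.__getitem__)
        for uri, canon in list(self._canon.items()):
            if canon == drop:
                self._canon[uri] = keep

    def canon(self, uri):
        """Return the canon for uri, or uri itself if the store has never seen it."""
        return self._canon.get(uri, uri)


def rewrite_triple(store, triple):
    """Rewrite each term of a (subject, predicate, object) tuple to its canon.

    Terms unknown to the store, including plain literals, pass through unchanged.
    """
    return tuple(store.canon(term) for term in triple)
```

The same canon() call would be applied to the URIs in a query before it is sent to the endpoint, so assertions and queries agree on one form.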

Also, I have never done it, because I'm not crawling the web like you, but it would be fun to have a little script in the import phase that decides whether two URIs are sufficiently similar to be candidates for this treatment. You could then resolve them and compare the pages, and apply a similarity heuristic (90%?) to catch the case when something strange has been done for http v. https, for example.
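Something like the following might do as a starting point. It is only a sketch under assumptions: the function names are made up, "sufficiently similar" is taken to mean "differs only in http/https, host case, or a trailing slash", the page comparison uses Python's difflib ratio as the heuristic, and fetching the pages is left to the caller:

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit


def comparison_key(uri):
    """Key that ignores the scheme, the case of the host, and a trailing slash."""
    parts = urlsplit(uri)
    return (parts.netloc.lower(), parts.path.rstrip("/"), parts.query, parts.fragment)


def are_candidates(uri_a, uri_b):
    """True if two distinct http(s) URIs differ only in the ways ignored above."""
    schemes = {urlsplit(uri_a).scheme, urlsplit(uri_b).scheme}
    if not schemes <= {"http", "https"}:
        return False
    return uri_a != uri_b and comparison_key(uri_a) == comparison_key(uri_b)


def pages_match(body_a, body_b, threshold=0.9):
    """Heuristic: treat two fetched page bodies as the same if they are ~90% similar.

    The 0.9 default mirrors the "90%?" guess above; it is not a tested value.
    """
    return SequenceMatcher(None, body_a, body_b).ratio() >= threshold
```

Only pairs that pass are_candidates() would need to be resolved and compared, and only pairs that also pass pages_match() would be recorded as equivalent. That also covers Steve's point about http and https sites that are materially different.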

I hope you will report back what you do!

Best
Hugh

On 4 Jul 2013, at 17:12, Steve Harris <steve.harris@garlik.com> wrote:

> Of course you have to handle the case where http://foo.example/ and https://foo.example/ are materially different too…
> 
> One approach would be to have some sort of property like "canonical URI", which you can use for your matching, then you can lean on the triplestore's built in URI indexing.
> 
> {
>  <foo> doap:homepage <https://Foo.example> .
>  <foo> doap:name "Foo" .
>  <https://Foo.example> :canonicalUri <http://foo.example/>
> }
> 
> etc.
> 
> Then you can do
> 
> SELECT *
> WHERE {
>   ?x doap:homepage ?hp . 
>   ?hp :canonicalUri <http://foo.example/> .
> }
> 
> And you can have different canonical URIs for sites with different http and https content, if you want to.
> 
> I would avoid tying yourself to any non-standard SPARQL / RDF extensions, that way you avoid limiting yourself to a particular triplestore vendor.
> 
> - Steve
> 
> On 2013-07-04, at 16:49, Olivier Berger <olivier.berger@telecom-sudparis.eu> wrote:
> 
>> Hi.
>> 
>> I hope such "design pattern" questions on consuming Linked Open Data are
>> OT... otherwise, please suggest an appropriate venue for questions ;)
>> 
>> 
>> I'm trying to figure out potential patterns for designing an application
>> /consuming/ Linked Data, typically using SPARQL over a local Virtuoso
>> triple store which was fed with harvested Linked Open Data.
>> 
>> I happen to find resources sometimes identified with http, sometimes
>> with https, which otherwise reference the same URL. Other issues may be
>> the use or not of a trailing slash for dir-like URLs.
>> 
>> For instance, I'd like to match as "identical" two doap:Projects resources
>> which have "same" doap:homepage if I can match
>> http://project1/example.com/home/ and https://project1/example.com/home/
>> 
>> 
>> It may happen that a document is rendered the same by the publishing
>> service, whichever way it is accessed, so I'd like to consider that
>> referencing it via URIs which contain http:// or https:// is equivalent.
>> 
>> Or a service may have chosen to adopt https:// as a canonical URI for
>> instance, but it may happen that users reference it via http somewhere
>> else... 
>> 
>> Obviously, direct matching of the same ?h URIRef won't work
>> in basic SPARQL queries like :
>> PREFIX doap:  <http://usefulinc.com/ns/doap#>
>> 
>> SELECT *
>> {
>> GRAPH <http://myapp.example.com/graphs?source=http://publisher1.example.com/> {
>>  ?dp doap:homepage ?h.
>>  ?dp doap:name ?dn
>> }
>> GRAPH <http://myapp.example.com/graphs?source=https://publisher2.example.com/> {
>>  ?ap doap:homepage ?h.
>>  ?ap doap:name ?an
>> }
>> }
>> 
>> I can think of some sort of regexp matching on the string after '://', but I
>> doubt it would perform well ;-)
>> 
>> Is there a way to create indexes over some URIs, or owl:sameAs relations to
>> manage such URI matching in queries ? Or am I left to "normalizing" my
>> URLs in the harvested data before storing them in the triple store ?
>> 
>> Would you think there's a reasonably standard approach... or one that
>> would work with Virtuoso 6.1.3 ? ;)
>> 
>> I imagine that this is a kind of FAQ for consuming Linked (Open)
>> Data... but it seems many more people are concerned with publishing than
>> with consuming in public discussions ;-)
>> 
>> 
>> Thanks in advance.
>> 
>> P.S.: already posted a similar question on
>> http://answers.semanticweb.com/questions/23584/matching-ressources-with-variying-url-scheme-http-https
>> -- 
>> Olivier BERGER 
>> http://www-public.telecom-sudparis.eu/~berger_o/ - OpenPGP-Id: 2048R/5819D7E8
>> Ingenieur Recherche - Dept INF
>> Institut Mines-Telecom, Telecom SudParis, Evry (France)
>> 
>> 
> 
> -- 
> Steve Harris
> Experian
> +44 20 3042 4132
> Registered in England and Wales 653331 VAT # 887 1335 93
> 80 Victoria Street, London, SW1E 5JL
> 
>
Received on Thursday, 4 July 2013 21:11:27 UTC