- From: Kingsley Idehen <kidehen@openlinksw.com>
- Date: Sat, 14 Feb 2009 14:27:20 -0500
- To: Richard Cyganiak <richard@cyganiak.de>
- CC: Hugh Glaser <hg@ecs.soton.ac.uk>, "Hausenblas, Michael" <michael.hausenblas@deri.org>, Bernhard Haslhofer <bernhard.haslhofer@univie.ac.at>, Linked Data community <public-lod@w3.org>
Richard Cyganiak wrote:
>
> On 14 Feb 2009, at 15:59, Hugh Glaser wrote:
>> Now I think about it, I have checked what dbpedia does to
>> http://dbpedia.org/resource/Esperanta it does the blank doc thing.
>> (I guess we need to work out what is best practice for this and then add it
>> to the How to Publish? I think my view is that something like
>> http://dbpedia.org/data/Esperanta.rdf should 404.)
>
> FWIW, DBpedia does a bit of 404ing:
>
> http://dbpedia.org/page/Esperanta is an empty HTML document
> http://dbpedia.org/data/Esperanta is 404
> http://dbpedia.org/data/Esperanta.rdf is an empty RDF document
>
> These should all 404, and at least the first one used to on the
> previous incarnation of the DBpedia server software.

Richard,

We'll deal with it. It can 404 or smartly do something like:

http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&should-sponge=&query=select+distinct+*+where+{%3Fs+%3Fp+%3Fo.+%3Fo+bif%3Acontains+%22Esperanta%22}&format=text%2Fhtml&debug=on

Make a suggestion doc on the fly.
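Decoded, that URL is simply the Virtuoso SPARQL endpoint running a full-text (bif:contains) search for the missing label. As a minimal sketch of how a publisher might assemble such a fallback "suggestion" URL when a resource has no data, assuming nothing about how DBpedia actually builds it:

    from urllib.parse import urlencode

    def suggestion_url(label):
        # Full-text search of the DBpedia default graph for the missing label,
        # mirroring the query string in the URL quoted above.
        query = 'select distinct * where {?s ?p ?o. ?o bif:contains "%s"}' % label
        params = {
            'default-graph-uri': 'http://dbpedia.org',
            'query': query,
            'format': 'text/html',
        }
        return 'http://dbpedia.org/sparql?' + urlencode(params)

    # A 404 handler for /resource/Esperanta could point the client here:
    print(suggestion_url('Esperanta'))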
>
> Richard
>
>>
>> So either way, in LOD sites of the sort that have DBs or KBs behind them,
>> either it is not possible to get a 404 (dbpedia), or you can't distinguish
>> between a rubbish URI that might have been generated and one you want to
>> know about.
>> I find the idea that I might give people the expectation that I will create
>> triples (as your point 2) rather strange - if I knew triples I would have
>> served them in the first place. Of course if we consider a URI I don't know
>> as a request for me to go and find knowledge about it, fair enough, but I
>> would expect a more explicit service for that. In that sense it would not be
>> a "broken link".
>> Maybe the world is different for the other RDFa etc ways of publishing LD,
>> but in the DB/KB world, I don't see broken incoming links as something that
>> can be usefully dealt with, other than the maintainer checking what is
>> happening, as you do with a normal site.
>> ======================================
>>
>> Now turning to the second possible meaning.
>> We are concerned with the place that gave you the URI, which is possibly
>> more interesting. And I think this is actually the case for your TAG example.
>> If I gave you (by which I mean an agent) such a link and you discovered it
>> was broken, it would be helpful to me and the LOD world if you could tell me
>> about it, so I could fix it. In fact it would also be helpful if you had a
>> suggestion as to the fix (ie a better URI), which is not out of the
>> question. And if I trust you (when we understand what that means), I might
>> even do a replacement or some equivalent triples without further
>> intervention.
>>
>> ======================================
>> In the case of our RKB system, we actually do something like this already.
>> If we find that there is nothing about a URI in the KB that should have it,
>> we don't immediately return 404, but look it up in the associated CRS
>> (coreference service), and possibly others, to see if there is an equivalent
>> URI in the same KB that could be used (we do not return RDF from other KB,
>> although we could). So if you try to resolve
>> http://southampton.rkbexplorer.com/description/person-07113
>> You actually get the data for
>> http://southampton.rkbexplorer.com/id/person-0a36cf76d1a3e99f9267ce3d0b95e42e-06999d58799cb8a3a55d3c69efcc9ba6
>> and a message telling you to use the new one next time.
>> (I'm not sure we have got the RDF perfectly right, but that is the idea.)
>> In effect, if we are asked for a broken link, we have a quick look around to
>> see if there is anything we do know, and give that back.
>> Of course, the CRS also gives the requestor the chance to do the same fixing up.
>> The reason that there might be a URI in the KB that has no triples, but we
>> know about, is because we "deprecate" URIs to reduce the number, and then
>> use the CRS to resolve from deprecated to non-deprecated.
>> So a deprecated URI is one we used to know about, and may still be being
>> used "out there", but don't want to continue to use - sort of a broken link.
>> Hence our dynamic broken link fixing.
>>
>> Best
>> Hugh
>>
>> PS.
>> My choice of http://dbpedia.org/data/Esperanta.rdf as a misspelling of
>> http://dbpedia.org/data/Esperanto.rdf turned out to be fascinating.
>> It turns out that wikipedia tells me that there used to be a page
>> http://en.wikipedia.org/wiki/Esperanta, but it has been deleted.
>> So what is returned is different from
>> http://en.wikipedia.org/wiki/Esperanti.
>> Although http://dbpedia.org/data/Esperanta.rdf and
>> http://dbpedia.org/data/Esperanti.rdf both return empty RDF documents, I think.
>> It looks to me that this is trying to solve a similar problem to that which
>> our deprecated URIs is doing in our CRS.
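A minimal sketch of the lookup-before-404 behaviour Hugh describes: when a URI has no triples in the KB, consult the coreference service for equivalent URIs before giving up. The helper functions (kb_describe, crs_equivalents) are hypothetical placeholders for illustration, not the actual RKB Explorer API:

    def resolve(uri, kb_describe, crs_equivalents):
        # kb_describe(uri)     -> serialized RDF for uri, or None if unknown (assumed helper)
        # crs_equivalents(uri) -> URIs the coreference service deems equivalent (assumed helper)
        data = kb_describe(uri)
        if data is not None:
            return 200, data
        # Nothing known directly: see whether an equivalent (e.g. non-deprecated)
        # URI in the same KB has the data, and serve that instead of a bare 404.
        for alt in crs_equivalents(uri):
            alt_data = kb_describe(alt)
            if alt_data is not None:
                note = '# no data for <%s>; please use <%s> in future\n' % (uri, alt)
                return 200, note + alt_data
        return 404, None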
>>
>> On 14/02/2009 14:06, "Hausenblas, Michael" <michael.hausenblas@deri.org> wrote:
>>
>>> Kingsley,
>>>
>>> Grounding in 404 and 30x makes sense to me. However I am still in the
>>> conception phase ;)
>>>
>>> Sent from my iPhone
>>>
>>> On 12 Feb 2009, at 14:02, "Kingsley Idehen" <kidehen@openlinksw.com> wrote:
>>>
>>>> Michael Hausenblas wrote:
>>>>> Bernhard, All,
>>>>>
>>>>> So, another take on how to deal with broken links: a couple of days ago I
>>>>> reported two broken links in a TAG finding [1] which was (quickly and
>>>>> pragmatically, bravo, TG!) addressed [2], recently.
>>>>>
>>>>> Let's abstract this away and apply it to data rather than documents. The
>>>>> mechanism could work as follows:
>>>>>
>>>>> 1. A *human* (e.g. through a built-in feature in a Web of Data browser
>>>>> such as Tabulator) encounters a broken link and reports it to the
>>>>> respective dataset publisher (the authoritative one who 'owns' it)
>>>>>
>>>>> OR
>>>>>
>>>>> 1. A machine encounters a broken link (should it then directly ping the
>>>>> dataset publisher or first 'ask' its master for permission?)
>>>>>
>>>>> 2. The dataset publisher acknowledges the broken link and creates the
>>>>> corresponding triples, as done in the case for documents (cf. [2])
>>>>>
>>>>> In case anyone wants to pick that up, I'm happy to contribute. The name?
>>>>> Well, a straw-man proposal could be called *re*pairing *vi*ntage link
>>>>> *val*ues (REVIVAL) - anyone? :)
>>>>>
>>>>> Cheers,
>>>>> Michael
>>>>>
>>>>> [1] http://lists.w3.org/Archives/Public/www-tag/2009Jan/0118.html
>>>>> [2] http://lists.w3.org/Archives/Public/www-tag/2009Feb/0068.html
>>>>>
>>>> Michael,
>>>>
>>>> If the publisher is truly dog-fooding and they know what data objects
>>>> they are publishing, condition 404 should be the trigger for a
>>>> self-directed query to determine:
>>>>
>>>> 1. what's happened to the entity URI
>>>> 2. lookup similar entities
>>>> 3. then self-fix if possible (e.g. a 302)
>>>>
>>>> Basically, Linked Data publishers should make 404s another Linked Data
>>>> prowess exploitation point :-)
>>>>
>>>> --
>>>>
>>>> Regards,
>>>>
>>>> Kingsley Idehen        Weblog: http://www.openlinksw.com/blog/~kidehen
>>>> President & CEO
>>>> OpenLink Software      Web: http://www.openlinksw.com
>>>>
>>>
>>
>

--

Regards,

Kingsley Idehen        Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO
OpenLink Software      Web: http://www.openlinksw.com
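A minimal sketch of the three-step 404 handling Kingsley outlines above (find out what happened to the entity URI, look up similar entities, then self-fix, e.g. with a 302). The lookup helpers are assumptions for illustration, not any actual OpenLink/Virtuoso behaviour:

    def handle_missing(uri, find_replacement, find_similar):
        # find_replacement(uri) -> current URI if the entity moved or was deprecated, else None (assumed helper)
        # find_similar(uri)     -> list of candidate URIs that look related (assumed helper)

        # 1. What's happened to the entity URI?
        replacement = find_replacement(uri)
        if replacement is not None:
            # 3. Self-fix: redirect the client to the current URI.
            return 302, {'Location': replacement}, ''

        # 2. No direct replacement: look up similar entities and offer them
        #    in the 404 body instead of returning an empty error page.
        suggestions = '\n'.join('# did you mean <%s> ?' % c for c in find_similar(uri))
        return 404, {}, suggestions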
Received on Saturday, 14 February 2009 19:28:02 UTC