- From: Richard Cyganiak <richard@cyganiak.de>
- Date: Sat, 14 Feb 2009 19:05:30 +0000
- To: Hugh Glaser <hg@ecs.soton.ac.uk>
- Cc: "Hausenblas, Michael" <michael.hausenblas@deri.org>, Kingsley Idehen <kidehen@openlinksw.com>, Bernhard Haslhofer <bernhard.haslhofer@univie.ac.at>, Linked Data community <public-lod@w3.org>
On 14 Feb 2009, at 15:59, Hugh Glaser wrote:

> Now I think about it, I have checked what dbpedia does to
> http://dbpedia.org/resource/Esperanta it does the blank doc thing.
> (I guess we need to work out what is best practice for this and then add it
> to the How to Publish? I think my view is that something like
> http://dbpedia.org/data/Esperanta.rdf should 404.)

FWIW, DBpedia does a bit of 404ing:

http://dbpedia.org/page/Esperanta is an empty HTML document
http://dbpedia.org/data/Esperanta is 404
http://dbpedia.org/data/Esperanta.rdf is an empty RDF document

These should all 404, and at least the first one used to on the previous
incarnation of the DBpedia server software.

Richard

> So either way, in LOD sites of the sort that have DBs or KBs behind them,
> either it is not possible to get a 404 (dbpedia), or you can't distinguish
> between a rubbish URI that might have been generated and one you want to
> know about.
> I find the idea that I might give people the expectation that I will create
> triples (as your point 2) rather strange - if I knew triples I would have
> served them in the first place. Of course if we consider a URI I don't know
> as a request for me to go and find knowledge about it, fair enough, but I
> would expect a more explicit service for that. In that sense it would not be
> a "broken link".
> Maybe the world is different for the other RDFa etc ways of publishing LD,
> but in the DB/KB world, I don't see broken incoming links as something that
> can be usefully dealt with, other than the maintainer checking what is
> happening, as you do with a normal site.
> ======================================
>
> Now turning to the second possible meaning.
> We are concerned with the place that gave you the URI, which is possibly
> more interesting. And I think this is actually the case for your TAG example.
> If I gave you (by which I mean an agent) such a link and you discovered it
> was broken, it would be helpful to me and the LOD world if you could tell me
> about it, so I could fix it. In fact it would also be helpful if you had a
> suggestion as to the fix (ie a better URI), which is not out of the question.
> And if I trust you (when we understand what that means), I might even do a
> replacement or some equivalent triples without further intervention.
>
> ======================================
> In the case of our RKB system, we actually do something like this already.
> If we find that there is nothing about a URI in the KB that should have it,
> we don't immediately return 404, but look it up in the associated CRS
> (coreference service), and possibly others, to see if there is an equivalent
> URI in the same KB that could be used (we do not return RDF from other KB,
> although we could). So if you try to resolve
> http://southampton.rkbexplorer.com/description/person-07113
> You actually get the data for
> http://southampton.rkbexplorer.com/id/person-0a36cf76d1a3e99f9267ce3d0b95e42e-06999d58799cb8a3a55d3c69efcc9ba6
> and a message telling you to use the new one next time.
> (I'm not sure we have got the RDF perfectly right, but that is the idea.)
> In effect, if we are asked for a broken link, we have a quick look around to
> see if there is anything we do know, and give that back.
> Of course, the CRS also gives the requestor the chance to do the same fixing
> up.
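
For illustration only, here is a minimal sketch in Python of the lookup Hugh
describes above (his message continues below). This is not the actual RKB
explorer code: the example URIs, the in-memory KB and CRS tables, and the
"useInstead" hint predicate are all invented stand-ins for the real services.

# Dereference handler: if the requested URI has no triples in the KB,
# consult the coreference service (CRS) for an equivalent URI before
# giving up with a 404.

KB = {  # triples indexed by subject URI (toy stand-in for the knowledge base)
    "http://example.org/id/person-CANONICAL": [
        ("http://example.org/id/person-CANONICAL",
         "http://xmlns.com/foaf/0.1/name", "Jane Example"),
    ],
}

CRS = {  # deprecated/duplicate URI -> equivalent URI, as the CRS would answer
    "http://example.org/id/person-DEPRECATED":
        "http://example.org/id/person-CANONICAL",
}

def resolve(uri):
    """Return (http_status, triples) for a dereference request."""
    if uri in KB:
        return 200, KB[uri]
    equivalent = CRS.get(uri)
    if equivalent and equivalent in KB:
        # Serve the data of the equivalent URI, plus a hint telling the
        # client which URI to use next time (hypothetical predicate).
        hint = (uri, "http://example.org/vocab/useInstead", equivalent)
        return 200, KB[equivalent] + [hint]
    return 404, []

print(resolve("http://example.org/id/person-DEPRECATED"))

In a real deployment the same CRS answer could of course also be handed to
the client, so it can do the fixing up itself, as Hugh notes.
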
> The reason that there might be a URI in the KB that has no triples, but we
> know about, is because we "deprecate" URIs to reduce the number, and then
> use the CRS to resolve from deprecated to non-deprecated.
> So a deprecated URI is one we used to know about, and may still be being
> used "out there", but don't want to continue to use - sort of a broken link.
> Hence our dynamic broken link fixing.
>
> Best
> Hugh
>
> PS.
> My choice of http://dbpedia.org/data/Esperanta.rdf as a misspelling of
> http://dbpedia.org/data/Esperanto.rdf turned out to be fascinating.
> It turns out that wikipedia tells me that there used to be a page
> http://en.wikipedia.org/wiki/Esperanta, but it has been deleted.
> So what is returned is different from http://en.wikipedia.org/wiki/Esperanti.
> Although http://dbpedia.org/data/Esperanta.rdf and
> http://dbpedia.org/data/Esperanti.rdf both return empty RDF documents,
> I think.
> It looks to me that this is trying to solve a similar problem to that which
> our deprecated URIs are doing in our CRS.
>
>
> On 14/02/2009 14:06, "Hausenblas, Michael" <michael.hausenblas@deri.org> wrote:
>
>> Kingsley,
>>
>> Grounding in 404 and 30x makes sense to me. However I am still in the
>> conception phase ;)
>>
>> Sent from my iPhone
>>
>> On 12 Feb 2009, at 14:02, "Kingsley Idehen" <kidehen@openlinksw.com> wrote:
>>
>>> Michael Hausenblas wrote:
>>>> Bernhard, All,
>>>>
>>>> So, another take on how to deal with broken links: a couple of days ago
>>>> I reported two broken links in a TAG finding [1] which was (quickly and
>>>> pragmatically, bravo, TAG!) addressed [2], recently.
>>>>
>>>> Let's abstract this away and apply it to data rather than documents.
>>>> The mechanism could work as follows:
>>>>
>>>> 1. A *human* (e.g. through a built-in feature in a Web of Data browser
>>>> such as Tabulator) encounters a broken link and reports it to the
>>>> respective dataset publisher (the authoritative one who 'owns' it)
>>>>
>>>> OR
>>>>
>>>> 1. A machine encounters a broken link (should it then directly ping the
>>>> dataset publisher or first 'ask' its master for permission?)
>>>>
>>>> 2. The dataset publisher acknowledges the broken link and creates
>>>> according triples as done in the case for documents (cf. [2])
>>>>
>>>> In case anyone wants to pick that up, I'm happy to contribute. The name?
>>>> Well, a straw-man proposal could be called *re*pairing *vi*ntage link
>>>> *val*ues (REVIVAL) - anyone? :)
>>>>
>>>> Cheers,
>>>> Michael
>>>>
>>>> [1] http://lists.w3.org/Archives/Public/www-tag/2009Jan/0118.html
>>>> [2] http://lists.w3.org/Archives/Public/www-tag/2009Feb/0068.html
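
Before Kingsley's reply below, a rough sketch of step 1 of the REVIVAL idea
just quoted: a machine agent dereferences the links it holds and collects the
broken ones for reporting. This uses only the Python standard library; the
example URI comes from earlier in this thread, the Accept header choice and
the crude "empty document" test are assumptions, and how (or whether) the
report is actually delivered to the publisher is left open, as in the thread.

import urllib.error
import urllib.request

def check_link(uri, timeout=10):
    """Return a short status string for one dereferenceable URI."""
    req = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = resp.read()
            # Crude test for the DBpedia-style failure mode discussed above:
            # 200 OK but a (near-)empty document instead of a proper 404.
            # A real agent would parse the RDF and count the triples.
            return "ok" if body.strip() else "empty-document"
    except urllib.error.HTTPError as err:
        return f"http-{err.code}"
    except OSError:
        return "unreachable"

def broken_links(uris):
    """Yield (uri, status) for every link that did not come back healthy."""
    for uri in uris:
        status = check_link(uri)
        if status != "ok":
            yield uri, status  # hand these to the publisher, however agreed

for uri, status in broken_links(["http://dbpedia.org/data/Esperanta.rdf"]):
    print(f"would report broken link: {uri} ({status})")
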
>>> Michael,
>>>
>>> If the publisher is truly dog-fooding and they know what data objects
>>> they are publishing, condition 404 should be the trigger for a
>>> self-directed query to determine:
>>>
>>> 1. what's happened to the entity URI
>>> 2. lookup similar entities
>>> 3. then self fix if possible (e.g. a 302)
>>>
>>> Basically, Linked Data publishers should make 404s another Linked Data
>>> prowess exploitation point :-)
>>>
>>> --
>>>
>>> Regards,
>>>
>>> Kingsley Idehen       Weblog: http://www.openlinksw.com/blog/~kidehen
>>> President & CEO
>>> OpenLink Software     Web: http://www.openlinksw.com
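
To make Kingsley's three steps concrete, here is a rough sketch of the
behaviour he suggests, not any particular server's implementation: on a
request for an unknown URI the publisher first asks its own store what
happened to it, then tries a similarity lookup, and only answers 404 if both
come up empty. The store contents and the find_similar() heuristic below are
invented for illustration.

# 404 as a trigger for self-repair: check the publisher's own records
# before admitting the entity is unknown.

KNOWN_ENTITIES = {"http://example.org/resource/Esperanto"}

SUPERSEDED_BY = {  # URIs the publisher itself retired or merged (step 1)
    "http://example.org/resource/Esperanto_language":
        "http://example.org/resource/Esperanto",
}

def find_similar(uri):
    """Hypothetical 'similar entity' lookup (step 2), e.g. a label search."""
    # Toy heuristic: treat URIs differing only in a trailing vowel as the
    # same entity ("Esperanta" vs "Esperanto"). A real site might use a
    # text index over labels instead.
    for candidate in KNOWN_ENTITIES:
        if candidate.rstrip("aeiou") == uri.rstrip("aeiou"):
            return candidate
    return None

def handle(uri):
    """Return (http_status, headers) for a dereference request."""
    if uri in KNOWN_ENTITIES:
        return 200, {}                    # serve the data as usual
    target = SUPERSEDED_BY.get(uri) or find_similar(uri)
    if target:
        return 302, {"Location": target}  # step 3: self-fix via redirect
    return 404, {}                        # honest failure

print(handle("http://example.org/resource/Esperanta"))

Whether a 302 or some other 30x is the right status code here is its own
discussion; the point is only that the 404 path can consult the publisher's
own knowledge before giving up.
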
Received on Saturday, 14 February 2009 19:06:12 UTC