Re: Broken Links in LOD Data Sets

Hugh,

As so often, you are right (my usage of the term "publisher" was sloppy),
and I think your analysis below is indeed close to what I was thinking
as well. Let's move over to the ESW Wiki and write this up; a paste from
your email might be a good start! Mind minting a URI for it and starting
to fill in the Wiki page? I'm travelling and rather limited in what I
can do at the moment ;)

Cheers, Michael

Sent from my iPhone

On 14 Feb 2009, at 16:00, "Hugh Glaser" <hg@ecs.soton.ac.uk> wrote:

> Hi Michael.
> I got thoroughly confused, I think, by your use of the "dataset  
> publisher
> (the authoritative one who 'owns' it)".
> That made me think you were talking about the owner of the broken  
> URI (ie,
> where it should have resolved to), rather than the place that gave  
> you the
> URI. (Which was it? :-) )
>
> So the next bit is the first of those:
> ======================================
> I think in a lot of the LOD world, a 404 means "I don't know anything
> about that URI", rather than a broken link.
> Certainly for us, that is all we can do.
> In fact, what we are actually doing is manually generating the 404  
> when we
> find there is nothing in the KB; we could instead return a blankish  
> RDF
> document, but that didn't seem sensible.
> Now I think about it, I have checked what dbpedia does to
> http://dbpedia.org/resource/Esperanta - it does the blank doc thing.
> (I guess we need to work out what is best practice for this and then  
> add it
> to the How to Publish? I think my view is that something like
> http://dbpedia.org/data/Esperanta.rdf should 404.)
> So either way, in LOD sites of the sort that have DBs or KBs behind
> them, it is either not possible to get a 404 (dbpedia), or you can't
> distinguish between a rubbish URI that might have been generated and
> one you want to know about.
> I find the idea that I might give people the expectation that I will
> create triples (as in your point 2) rather strange - if I knew of any
> triples I would have served them in the first place. Of course if we
> consider a URI I
> don't know
> as a request for me to go and find knowledge about it, fair enough,  
> but I
> would expect a more explicit service for that. In that sense it  
> would not be
> a "broken link".
> Maybe the world is different for the other RDFa etc ways of  
> publishing LD,
> but in the DB/KB world, I don't see broken incoming links as  
> something that
> can be usefully dealt with, other than the maintainer checking what is
> happening, as you do with a normal site.
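>
> For concreteness, a rough sketch of the two policies in Python; the
> toy KB dict and helper are illustrative stand-ins, not how dbpedia or
> the RKB servers are actually implemented:
>
> # Toy in-memory "KB"; a real site would query a triple store instead.
> KB = {"http://example.org/id/alice": "<alice> <knows> <bob> ."}
>
> def resolve(uri):
>     triples = KB.get(uri)
>     if triples:
>         return 200, triples
>     # Policy A: a plain 404 - "I don't know anything about that URI".
>     # return 404, ""
>     # Policy B (blank-doc style): return an empty but valid RDF
>     # document, so a rubbish URI looks the same as a real one with no
>     # triples.
>     return 200, ""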
> ======================================
>
> Now turning to the second possible meaning.
> We are concerned with the place that gave you the URI, which is  
> possibly
> more interesting. And I think this is actually the case for your TAG
> example.
> If I gave you (by which I mean an agent) such a link and you  
> discovered it
> was broken, it would be helpful to me and the LOD world if you could  
> tell me
> about it, so I could fix it. In fact it would also be helpful if you  
> had a
> suggestion as to the fix (ie a better URI), which is not out of the
> question. And if I trust you (when we understand what that means), I  
> might
> even do a replacement or some equivalent triples without further
> intervention.
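>
> To sketch that last idea very roughly: if the suggestion comes from a
> source I trust, the fix could be as simple as rewriting the triples
> that mention the broken URI. rdflib is used below; the function name
> and the trust flag are purely illustrative:
>
> from rdflib import Graph, URIRef
>
> def apply_fix(graph, broken, suggested, trusted):
>     # Only rewrite when the suggestion comes from a trusted source
>     # (whatever "trust" turns out to mean here).
>     if not trusted:
>         return graph
>     old, new = URIRef(broken), URIRef(suggested)
>     for s, p, o in list(graph):
>         if old in (s, o):
>             graph.remove((s, p, o))
>             graph.add((new if s == old else s, p, new if o == old else o))
>     return graph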
>
> ======================================
> In the case of our RKB system, we actually do something like this  
> already.
> If we find that there is nothing about a URI in the KB that should  
> have it,
> we don't immediately return 404, but look it up in the associated CRS
> (coreference service), and possibly others, to see if there is an  
> equivalent
> URI in the same KB that could be used (we do not return RDF from
> other KBs,
> although we could). So if you try to resolve
> http://southampton.rkbexplorer.com/description/person-07113
> you actually get the data for
> http://southampton.rkbexplorer.com/id/person-0a36cf76d1a3e99f9267ce3d0b95e42e-06999d58799cb8a3a55d3c69efcc9ba6
> and a message telling you to use the new one next time.
> (I'm not sure we have got the RDF perfectly right, but that is the  
> idea.)
> In effect, if we are asked for a broken link, we have a quick look  
> around to
> see if there is anything we do know, and give that back.
> Of course, the CRS also gives the requestor the chance to do the  
> same fixing
> up.
> The reason there might be a URI that we know about but that has no
> triples in the KB is that we "deprecate" URIs to reduce their number,
> and then use the CRS to resolve from deprecated to non-deprecated.
> So a deprecated URI is one we used to know about, and may still be
> being used "out there", but that we don't want to continue to use -
> sort of a broken link.
> Hence our dynamic broken link fixing.
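>
> A rough sketch of that fallback, with toy dicts standing in for the KB
> and the CRS (illustrative only, not the actual RKB code):
>
> KB  = {"http://example.org/id/person-new": "...RDF about the person..."}
> CRS = {"http://example.org/id/person-old": "http://example.org/id/person-new"}
>
> def describe(uri):
>     if uri in KB:
>         return 200, KB[uri]
>     equivalent = CRS.get(uri)          # ask the coreference service
>     if equivalent and equivalent in KB:
>         # serve the equivalent URI's data, plus a hint to use it next time
>         return 200, KB[equivalent] + "\n# please use " + equivalent
>     return 404, ""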
>
> Best
> Hugh
>
> PS.
> My choice of http://dbpedia.org/data/Esperanta.rdf as a misspelling of
> http://dbpedia.org/data/Esperanto.rdf turned out to be fascinating.
> It turns out that wikipedia tells me that there used to be a page
> http://en.wikipedia.org/wiki/Esperanta, but it has been deleted.
> So what is returned is different from
> http://en.wikipedia.org/wiki/Esperanti.
> Although http://dbpedia.org/data/Esperanta.rdf and
> http://dbpedia.org/data/Esperanti.rdf both return empty RDF  
> documents, I
> think.
> It looks to me like this is trying to solve a similar problem to the
> one that our deprecated URIs and CRS are addressing.
>
>
> On 14/02/2009 14:06, "Hausenblas, Michael" <michael.hausenblas@deri.org>
> wrote:
>
>> Kingsley,
>>
>> Grounding in 404 and 30x makes sense to me. However I am still in the
>> conception phase ;)
>>
>> Sent from my iPhone
>>
>> On 12 Feb 2009, at 14:02, "Kingsley Idehen"  
>> <kidehen@openlinksw.com> wrote:
>>
>>> Michael Hausenblas wrote:
>>>> Bernhard, All,
>>>>
>>>> So, another take on how to deal with broken links: a couple of days
>>>> ago I reported two broken links in a TAG finding [1], which was
>>>> (quickly and pragmatically, bravo, TAG!) addressed [2] recently.
>>>>
>>>> Let's abstract this away and apply it to data rather than documents.
>>>> The mechanism could work as follows:
>>>>
>>>> 1. A *human* (e.g. through a built-in feature in a Web of Data
>>>> browser such as Tabulator) encounters a broken link and reports it
>>>> to the respective dataset publisher (the authoritative one who
>>>> 'owns' it)
>>>>
>>>> OR
>>>>
>>>> 1. A machine encounters a broken link (should it then directly  
>>>> ping the
>>>> dataset publisher or first 'ask' its master for permission?)
>>>>
>>>> 2. The dataset publisher acknowledges the broken link and creates
>>>> the corresponding triples, as done in the case of documents (cf. [2])
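>>>>
>>>> Roughly, the report in step 1 could be as small as this (the field
>>>> names here are made up, since there is no agreed vocabulary yet):
>>>>
>>>> report = {
>>>>     "broken":    "http://example.org/id/thing-42",    # the URI that failed
>>>>     "found_in":  "http://example.org/doc/source.rdf", # where the link appears
>>>>     "suggested": None,                                 # optional replacement URI
>>>> }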
>>>>
>>>> In case anyone wants to pick that up, I'm happy to contribute.  
>>>> The name?
>>>> Well, a straw-man proposal could be called *re*pairing *vi*ntage  
>>>> link
>>>> *val*ues (REVIVAL) - anyone? :)
>>>>
>>>> Cheers,
>>>>      Michael
>>>>
>>>> [1] http://lists.w3.org/Archives/Public/www-tag/2009Jan/0118.html
>>>> [2] http://lists.w3.org/Archives/Public/www-tag/2009Feb/0068.html
>>>>
>>>>
>>> Michael,
>>>
>>> If the publisher is truly dog-fooding and knows what data objects
>>> they are publishing, a 404 condition should be the trigger for a
>>> self-directed query to:
>>> 1. work out what's happened to the entity URI
>>> 2. look up similar entities
>>> 3. then self-fix if possible (e.g. with a 302)
>>>
>>> Basically, Linked Data publishers should make 404s another Linked  
>>> Data
>>> prowess exploitation point  :-)
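>>>
>>> Something like the following, perhaps - a toy lookup standing in for
>>> the publisher's own record of what it has published (illustrative
>>> only):
>>>
>>> MOVED = {"http://example.org/id/old-1": "http://example.org/id/new-1"}
>>>
>>> def handle_404(uri):
>>>     replacement = MOVED.get(uri)              # self-directed lookup
>>>     if replacement:
>>>         return 302, {"Location": replacement} # self-fix via redirect
>>>     return 404, {}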
>>>
>>>
>>> --
>>>
>>>
>>> Regards,
>>>
>>> Kingsley Idehen       Weblog: http://www.openlinksw.com/blog/~kidehen
>>> President & CEO
>>> OpenLink Software     Web: http://www.openlinksw.com
>>>
>>>
>>>
>>>
>>
>

Received on Saturday, 14 February 2009 16:33:19 UTC