Re: Broken Links in LOD Data Sets

Bernhard, All,

I agree that the process should be automated, or at least automatable;
however, I guess we still need humans in the loop (as so often ;).

Now, thinking more about this, I'm actually unsure whether this issue
should be addressed on the 'descriptive' level (that is, via voiD ;). To
broaden the discussion, I've put my thoughts together at [1]; the two main
'design' criteria for the solution, IMHO, are:

1. someone (machine or human) who *uses* the data *reports* it, and
2. the dataset publisher (rather than a centralised service) *fixes* it.
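
To make (1) a bit more concrete, the consumer side could be as simple as
POSTing a report to some endpoint the publisher advertises. A minimal
Python sketch, assuming a hypothetical report endpoint and JSON payload
(no such reporting protocol exists yet):

  import json
  import urllib.request

  def report_broken_link(source_uri, target_uri, report_endpoint):
      """Tell a dataset publisher that target_uri (referenced from
      source_uri) no longer dereferences. The endpoint and JSON payload
      are hypothetical; any agreed-upon reporting interface would do."""
      payload = json.dumps({"source": source_uri,
                            "target": target_uri}).encode("utf-8")
      req = urllib.request.Request(report_endpoint, data=payload,
                                   headers={"Content-Type": "application/json"})
      with urllib.request.urlopen(req, timeout=10) as resp:
          return resp.status   # 2xx would mean the report was accepted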

Cheers,
      Michael
[1] http://webofdata.wordpress.com/2009/02/12/how-to-deal-with-broken-data-links/

-- 
Dr. Michael Hausenblas
DERI - Digital Enterprise Research Institute
National University of Ireland, Lower Dangan,
Galway, Ireland, Europe
Tel. +353 91 495730
http://sw-app.org/about.html


> From: Bernhard Haslhofer <bernhard.haslhofer@univie.ac.at>
> Date: Thu, 12 Feb 2009 09:25:07 +0100
> To: Michael Hausenblas <michael.hausenblas@deri.org>
> Cc: Linked Data community <public-lod@w3.org>
> Subject: Re: Broken Links in LOD Data Sets
> 
> Morning,
> 
> first of all, thanks for your input on that issue. I've started this
> thread because it is always one of the first questions I get from
> potential content providers, especially from institutions that must
> guarantee a certain level of quality in their data, such as
> libraries. If, for instance, they link to a LOD-published concept in
> a thesaurus or a DBpedia resource, and these resources change or
> disappear over time, it is difficult to provide that kind of quality.
> 
> I partly agree with Kingsley's answer "You have to test for Null
> Pointers (URIs) when programming for the Linked Data Web too" - this
> is of course true. But imagine programming against a DB which does
> not provide referential integrity - a nightmare. Besides that, I am
> not sure that this is the answer those institutions expect. Of
> course, in an open world we cannot provide the "quality" features a
> DBMS provides, but we can at least provide some mechanism that helps
> solve broken-link issues.
> 
> Michael, in my opinion this should be an automated process that can
> 
> 1.) discover broken links - this could be the LOD source itself,
> Sindice, or any other client. If an LOD source "knows" that all its
> links/references are OK, it could publish that info using voiD -
> @Michael: do you think that makes sense? Maybe introduce
> "void:numberOfDanglingLinks" in the dataset statistics?
> 
> 2.) notify other clients / data sources about broken links - here I
> thought about a kind of inotify [1] service for LOD sources.
> 
> 3.) fix the problem, if possible, by redirecting to alternative link
> targets where they exist
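
A rough sketch of what step (1) could look like on the checking side,
publishing the result as the voiD statistic suggested above (note that
void:numberOfDanglingLinks is a hypothetical property, not part of the
voiD vocabulary):

  import urllib.error
  import urllib.request

  def is_dangling(uri, timeout=10):
      """Heuristic check: does the link target still dereference?
      4xx/5xx answers and network failures both count as dangling."""
      req = urllib.request.Request(uri, method="HEAD")
      try:
          with urllib.request.urlopen(req, timeout=timeout):
              return False
      except urllib.error.URLError:  # HTTPError is a subclass, so 4xx/5xx land here too
          return True

  def void_statistics(dataset_uri, link_targets):
      """Emit a tiny voiD description carrying the hypothetical statistic."""
      dangling = sum(1 for uri in link_targets if is_dangling(uri))
      return ("@prefix void: <http://rdfs.org/ns/void#> .\n"
              f"<{dataset_uri}> void:numberOfDanglingLinks {dangling} .")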
> 
> Since we are already working on a service which should provide (2) and
> (3), I would be happy to contribute to a kind of "REVIVAL" thing, or
> whatever you call it ;-)
> 
> Best,
> Bernhard
> 
> 
> 
> [1] http://en.wikipedia.org/wiki/Inotify
> 
> 
> On Feb 12, 2009, at 8:08 AM, Michael Hausenblas wrote:
> 
>> 
>> Bernhard, All,
>> 
>> So, another take on how to deal with broken links: a couple of days
>> ago I reported two broken links in a TAG finding [1], which were
>> (quickly and pragmatically, bravo, TAG!) addressed [2] recently.
>> 
>> Let's abstract this away and apply it to data rather than
>> documents. The mechanism could work as follows:
>> 
>> 1. A *human* (e.g. through a built-in feature in a Web of Data
>> browser such as Tabulator) encounters a broken link and reports it
>> to the respective dataset publisher (the authoritative one who
>> 'owns' it)
>> 
>> OR
>> 
>> 1. A machine encounters a broken link (should it then directly ping
>> the
>> dataset publisher or first 'ask' its master for permission?)
>> 
>> 2. The dataset publisher acknowledges the broken link and creates
>> the corresponding triples, as done in the document case (cf. [2])
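
What those triples might look like is open; one possible (purely
illustrative) shape is to point from the dead URI to a successor using
DCMI Terms:

  def acknowledgement_triples(old_uri, successor_uri):
      """Turtle a publisher could serve once a broken link is
      acknowledged. dcterms:isReplacedBy is just one candidate
      property for the job."""
      return ("@prefix dcterms: <http://purl.org/dc/terms/> .\n"
              f"<{old_uri}> dcterms:isReplacedBy <{successor_uri}> .")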
>> 
>> In case anyone wants to pick that up, I'm happy to contribute. The
>> name?
>> Well, a straw-man proposal could be called *re*pairing *vi*ntage link
>> *val*ues (REVIVAL) - anyone? :)
>> 
>> Cheers,
>>      Michael
>> 
>> [1] http://lists.w3.org/Archives/Public/www-tag/2009Jan/0118.html
>> [2] http://lists.w3.org/Archives/Public/www-tag/2009Feb/0068.html
>> 
>> -- 
>> Dr. Michael Hausenblas
>> DERI - Digital Enterprise Research Institute
>> National University of Ireland, Lower Dangan,
>> Galway, Ireland, Europe
>> Tel. +353 91 495730
>> http://sw-app.org/about.html
>> 
>> 
>>> From: Bernhard Haslhofer <bernhard.haslhofer@univie.ac.at>
>>> Date: Thu, 5 Feb 2009 16:35:35 +0100
>>> To: Linked Data community <public-lod@w3.org>
>>> Subject: Broken Links in LOD Data Sets
>>> Resent-From: Linked Data community <public-lod@w3.org>
>>> Resent-Date: Thu, 05 Feb 2009 15:36:13 +0000
>>> 
>>> 
>>> Hi all,
>>> 
>>> we are currently working on the question of how to deal with
>>> broken links/references between resources in (distinct) LOD data
>>> sets and would like to know your opinion on that issue. If there is
>>> some work going on in this direction, please let me know.
>>> 
>>> I think I do not really need to explain the problem. Everybody
>>> knows it from the "human" Web: you follow a link and get an
>>> annoying 404 response.
>>> 
>>> If we assume that the consumers of LOD data are not humans but
>>> applications, broken links/references are not only "annoying" but
>>> could lead to severe processing errors if an application relies on a
>>> kind of "referential integrity".
>>> 
>>> Assume we have an LOD data source X exposing resources that
>>> describe images, and these images are linked to resources in
>>> DBpedia (e.g., http://dbpedia.org/resource/Berlin). An application
>>> built on top of X follows links to retrieve the geo-coordinates in
>>> order to display the images on a virtual map. If now, for some
>>> reason, the URI of the linked DBpedia resource changes, either
>>> because DBpedia is moved or reorganized (which I guess could happen
>>> to any LOD source in the long term), the application might crash if
>>> it doesn't consider that referenced resources might move or become
>>> unavailable.
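
Concretely, a defensive version of that geo lookup could look roughly
like this; rdflib is used purely for illustration, and the point is only
that a missing or moved resource degrades to "no coordinates" instead of
an exception:

  from rdflib import Graph, Namespace, URIRef

  GEO = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")

  def safe_coordinates(resource_uri):
      """Dereference a linked resource and pull out WGS84 lat/long,
      returning None instead of failing when the resource is gone,
      moved, or not parseable."""
      g = Graph()
      try:
          g.parse(resource_uri)   # HTTP GET with content negotiation
      except Exception:           # deliberately broad: any failure means "no data"
          return None
      subj = URIRef(resource_uri)
      lat, lon = g.value(subj, GEO.lat), g.value(subj, GEO["long"])
      if lat is None or lon is None:
          return None
      return float(lat), float(lon)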
>>> 
>>> I know that "cool URIs don't change", but I am not sure whether
>>> this assumption holds in practice, especially in the long term.
>>> 
>>> For the "human" Web several solutions have been proposed, e.g.,
>>> 1.) PURL and DOI services for translating URNs into resolvable URLs
>>> 2.) forward references
>>> 3.) robust link implementations, i.e., with each link you keep a set
>>> of related search terms to retrieve moved / changed resources
>>> 4.) observer / notification mechanisms
>>> X.) ?
>>> 
>>> I guess (1) is not really applicable to LOD resources because of
>>> scalability and single-point-of-failure issues. (2) would require
>>> that LOD providers take care of setting up HTTP redirects for their
>>> moved resources - no idea whether anybody will do that in reality,
>>> or how it can scale. (3) could help re-locate moved resources via
>>> search engines like Sindice, but not fully automatically. (4) could
>>> at least inform a data source that certain references are broken so
>>> that it can remove them.
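
For (2), at least the client side is cheap: a consumer can notice a
permanent redirect and rewrite its stored link. A small sketch (HEAD
request, no automatic redirect following):

  import http.client
  from urllib.parse import urlsplit

  def updated_target(uri):
      """If the link target answers with a 301, return the new location
      so the stored link can be rewritten; otherwise return the URI."""
      parts = urlsplit(uri)
      conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                  else http.client.HTTPConnection)
      conn = conn_cls(parts.netloc, timeout=10)
      try:
          conn.request("HEAD", parts.path or "/")
          resp = conn.getresponse()
          if resp.status == 301 and resp.getheader("Location"):
              return resp.getheader("Location")
      except OSError:
          pass   # unreachable host: nothing to update here
      finally:
          conn.close()
      return uri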
>>> 
>>> Another alternative is of course to leave the problem completely
>>> to the application developers, which means that they must consider
>>> that a referenced resource might or might not exist. I am not sure
>>> about the practical consequences of that approach, especially if
>>> several data sources are involved, but I have the feeling that
>>> things get really complicated if one cannot rely on any kind of
>>> referential integrity.
>>> 
>>> Are there any existing mechanisms that can give us at least some
>>> basic feedback about the "quality" of an LOD data source? I think
>>> referential integrity could be such a quality property...
>>> 
>>> Thanks for your input on that issue,
>>> 
>>> Bernhard
>>> 
>>> ______________________________________________________
>>> Research Group Multimedia Information Systems
>>> Department of Distributed and Multimedia Systems
>>> Faculty of Computer Science
>>> University of Vienna
>>> 
>>> Postal Address: Liebiggasse 4/3-4, 1010 Vienna, Austria
>>> Phone: +43 1 42 77 39635 Fax: +43 1 4277 39649
>>> E-Mail: bernhard.haslhofer@univie.ac.at
>>> WWW: http://www.cs.univie.ac.at/bernhard.haslhofer
>>> 
>>> 
>> 
>> 
> 
> ______________________________________________________
> Research Group Multimedia Information Systems
> Department of Distributed and Multimedia Systems
> Faculty of Computer Science
> University of Vienna
> 
> Postal Address: Liebiggasse 4/3-4, 1010 Vienna, Austria
> Phone: +43 1 42 77 39635 Fax: +43 1 4277 39649
> E-Mail: bernhard.haslhofer@univie.ac.at
> WWW: http://www.cs.univie.ac.at/bernhard.haslhofer
> 
