Re: Broken Links in LOD Data Sets

Morning,

First of all, thanks for your input on that issue. I started this
thread because it is always one of the first questions I get from
potential content providers, especially from institutions that must
guarantee a certain level of quality in their data, such as libraries.
If, for instance, they link to a LOD-published concept in a thesaurus
or a DBPedia resource, and these resources change or disappear over
time, it is difficult to provide that kind of quality.

I partly agree with Kingsley's answer, "You have to test for Null
Pointers (URIs) when programming for the Linked Data Web too" - this
is of course true. But imagine programming against a DB which does not
provide referential integrity - a nightmare. Besides that, I am not
sure that is the answer those institutions expect. Of course,
in an open world we cannot provide the "quality" features a DBMS
provides, but we can at least provide some mechanism that helps solve
broken link issues.
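
To make the "test for Null Pointers" idea concrete, here is a minimal
Python sketch of a defensive link check. The helper name link_status,
the three result labels, and the choice to treat only 404 and 410 as
"broken" are assumptions made for illustration; none of them comes from
any tool mentioned in this thread.

```python
import urllib.error
import urllib.request


def link_status(uri, opener=urllib.request.urlopen, timeout=10):
    """Classify a URI as 'ok', 'broken', or 'unreachable' (hypothetical helper)."""
    # HEAD avoids downloading the representation; we only need the status.
    request = urllib.request.Request(uri, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return "ok" if response.status < 400 else "broken"
    except urllib.error.HTTPError as err:
        # Assumption: 404 and 410 mean the target is gone, while other
        # HTTP errors (e.g. 503) may be transient and are not counted.
        return "broken" if err.code in (404, 410) else "unreachable"
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return "unreachable"
```

Passing the opener as a parameter keeps the check testable without
network access. Separating "broken" from "unreachable" is one possible
design choice, so that a temporary outage is not reported as a dead link.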

Michael, in my opinion this should be an automated process which can

1.) discover broken links - this could be done by the LOD source
itself, Sindice, or any other client. If an LOD source "knows" that all
the links/references are OK, it could publish that info using voiD -
@Michael: do you think that makes sense? Maybe introduce
"void:numberOfTanglingLinks" in the dataset statistics?

2.) notify other clients / datasources about broken links - here I  
thought about a kind of iNotify [1] service for LOD sources.

3.) fix the problem, if possible, by redirecting to alternative link
targets, if there are any
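
The three steps above could be sketched roughly as follows in Python.
This is only an illustration under stated assumptions: check_dataset,
LinkReport, the is_alive callback, and the alternatives mapping are all
hypothetical names, and a real service would still have to decide how
notices are delivered and where alternative targets come from.

```python
from dataclasses import dataclass, field


@dataclass
class LinkReport:
    """Outcome of one integrity pass over a dataset's outgoing links."""
    broken: list = field(default_factory=list)    # (subject, target) pairs
    repaired: dict = field(default_factory=dict)  # dead target -> replacement
    notices: list = field(default_factory=list)   # messages for subscribers


def check_dataset(links, is_alive, alternatives=None):
    """Run discover / notify / fix over (subject URI, target URI) pairs.

    is_alive is a callable that returns True if a target still resolves;
    alternatives optionally maps dead targets to replacement URIs.
    """
    alternatives = alternatives or {}
    report = LinkReport()
    for subject, target in links:
        if is_alive(target):
            continue
        # Step 1: discover the broken link.
        report.broken.append((subject, target))
        # Step 3: record an alternative target, if one is known.
        replacement = alternatives.get(target)
        if replacement is not None:
            report.repaired[target] = replacement
        # Step 2: queue a notice for interested clients / data sources.
        hint = f" (try {replacement})" if replacement else ""
        report.notices.append(f"{subject} -> {target} is broken{hint}")
    return report
```

len(report.broken) is the kind of number the statistic suggested in (1)
would carry; that property is only a proposal in this message, not an
existing voiD term.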

Since we are already working on a service which should provide (2) and  
(3), I would be happy to contribute to a kind of "REVIVAL" thing, or  
whatever you call it ;-)

Best,
Bernhard



[1] http://en.wikipedia.org/wiki/Inotify


On Feb 12, 2009, at 8:08 AM, Michael Hausenblas wrote:

>
> Bernhard, All,
>
> So, another take on how to deal with broken links: a couple of days
> ago I reported two broken links in a TAG finding [1], which was
> (quickly and pragmatically, bravo, TG!) addressed [2] recently.
>
> Let's abstract this away and apply to data rather than documents. The
> mechanism could work as follows:
>
> 1. A *human* (e.g. through a built-in feature in a Web of Data browser
> such as Tabulator) encounters a broken link and reports it to the
> respective dataset publisher (the authoritative one who 'owns' it)
>
> OR
>
> 1. A machine encounters a broken link (should it then directly ping the
> dataset publisher or first 'ask' its master for permission?)
>
> 2. The dataset publisher acknowledges the broken link and creates the
> corresponding triples, as done in the case of documents (cf. [2])
>
> In case anyone wants to pick that up, I'm happy to contribute. The name?
> Well, a straw-man proposal could be called *re*pairing *vi*ntage link
> *val*ues (REVIVAL) - anyone? :)
>
> Cheers,
>      Michael
>
> [1] http://lists.w3.org/Archives/Public/www-tag/2009Jan/0118.html
> [2] http://lists.w3.org/Archives/Public/www-tag/2009Feb/0068.html
>
> -- 
> Dr. Michael Hausenblas
> DERI - Digital Enterprise Research Institute
> National University of Ireland, Lower Dangan,
> Galway, Ireland, Europe
> Tel. +353 91 495730
> http://sw-app.org/about.html
>
>
>> From: Bernhard Haslhofer <bernhard.haslhofer@univie.ac.at>
>> Date: Thu, 5 Feb 2009 16:35:35 +0100
>> To: Linked Data community <public-lod@w3.org>
>> Subject: Broken Links in LOD Data Sets
>> Resent-From: Linked Data community <public-lod@w3.org>
>> Resent-Date: Thu, 05 Feb 2009 15:36:13 +0000
>>
>>
>> Hi all,
>>
>> we are currently working on the question of how to deal with broken
>> links/references between resources in (distinct) LOD data sets and
>> would like to know your opinion on that issue. If there is some work
>> going on in this direction, please let me know.
>>
>> I think I do not really need to explain the problem. Everybody knows
>> it from the "human" Web, where you follow a link and get an annoying
>> 404 response.
>>
>> If we assume that the consumers of LOD data are not humans but
>> applications, broken links/references are not only "annoying" but
>> could lead to severe processing errors if an application relies on a
>> kind of "referential integrity".
>>
>> Assume we have an LOD data source X exposing resources that describe
>> images, and these images are linked with resources in DBPedia (e.g.,
>> http://dbpedia.org/resource/Berlin). An application built on top of X
>> follows links to retrieve the geo-coordinates in order to display the
>> images on a virtual map. If now, for some reason, the URL of the
>> linked DBPedia resource changes, either because DBPedia is moved or
>> re-organized, which I guess could happen to any LOD source in a
>> long-term perspective, the application might crash if it doesn't
>> consider that referenced resources might move or become unavailable.
>>
>> I know that "cool URIs don't change" but I am not sure if this
>> assumption holds in practice, especially in a long-term perspective.
>>
>> For the "human" Web several solutions have been proposed, e.g.,
>> 1.) PURL and DOI services for translating URNs into resolvable URLs
>> 2.) forward references
>> 3.) robust link implementations, i.e., with each link you keep a set
>> of related search terms to retrieve moved / changed resources
>> 4.) observer / notification mechanisms
>> X.) ?
>>
>> I guess (1) is not really applicable for LOD resources because of
>> scalability and single-point-of-failure issues. (2) would require that
>> LOD providers take care of setting up HTTP redirects for their moved
>> resources - no idea if anybody will do that in reality and how this
>> can scale. (3) could help to re-locate moved resources via search
>> engines like Sindice, but not fully automatically. (4) could at
>> least inform a data source that certain references are broken so that
>> it could remove them.
>>
>> Another alternative is of course to leave the problem completely to
>> the application developers, which means that they must consider that a
>> referenced resource might or might not exist. I am not sure about the
>> practical consequences of that approach, especially if several data
>> sources are involved, but I have the feeling that it gets really
>> complicated if one cannot rely on any kind of referential integrity.
>>
>> Are there any existing mechanisms that can give us at least some basic
>> feedback about the "quality" of an LOD data source? I think
>> referential integrity could be such a quality property...
>>
>> Thanks for your input on that issue,
>>
>> Bernhard
>>
>> ______________________________________________________
>> Research Group Multimedia Information Systems
>> Department of Distributed and Multimedia Systems
>> Faculty of Computer Science
>> University of Vienna
>>
>> Postal Address: Liebiggasse 4/3-4, 1010 Vienna, Austria
>> Phone: +43 1 42 77 39635 Fax: +43 1 4277 39649
>> E-Mail: bernhard.haslhofer@univie.ac.at
>> WWW: http://www.cs.univie.ac.at/bernhard.haslhofer
>>
>>
>
>

Received on Thursday, 12 February 2009 08:25:50 UTC