Re: Broken Links in LOD Data Sets

On 2/5/09 10:35 AM, Bernhard Haslhofer wrote:
>
> Hi all,
>
> we are currently working on the question of how to deal with broken 
> links/references between resources in (distinct) LOD data sets and 
> would like to know your opinion on that issue. If there is some work 
> going on in this direction, please let me know.
>
> I think I do not really need to explain the problem. Everybody knows 
> it from the "human" Web when you follow a link and you get an annoying 
> 404 response.
>
> If we assume that the consumers of LOD data are not humans but 
> applications, broken links/references are not only "annoying" but 
> could lead to severe processing errors if an application relies on a 
> kind of "referential integrity".
>
> Assume we have an LOD data source X exposing resources that describe 
> images and these images are linked with resources in DBpedia (e.g., 
> http://dbpedia.org/resource/Berlin). An application built on top of X 
> follows links to retrieve the geo-coordinates in order to display the 
> images on a virtual map. If, for some reason, the URL of the 
> linked DBpedia resource changes, either because DBpedia is moved or 
> reorganized, which I guess could happen to any LOD source in the 
> long term, the application might crash if it doesn't consider 
> that referenced resources might move or become unavailable.
>
> I know that "cool URIs don't change" but I am not sure if this 
> assumption holds in practice, especially in a long-term perspective.
Well, null pointers have dogged programmers for eons.
You have to test for null pointers (URIs) when programming for the 
Linked Data Web too.

Now if a pointer is null, you have to think (in an application-specific 
way) about how to locate the same values elsewhere, and then decide, 
should you find what you seek, how to persist them in your own 
space (i.e. make a new URI for these values).
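
The null-pointer analogy can be sketched in Python. This is only an illustration under stated assumptions: the names `http_fetch`, `dereference`, `resolve`, and `local_store` are hypothetical and not part of any library mentioned in this thread, and where the alternate URIs come from is left to the application.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical fetcher type: takes a URI, returns (HTTP status, body).
Fetcher = Callable[[str], Tuple[int, str]]


def http_fetch(uri: str) -> Tuple[int, str]:
    """One possible fetcher, built on the standard library."""
    request = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 404, 410, 500 and so on arrive as exceptions; report the code.
        return err.code, ""


def dereference(uri: str, fetch: Fetcher) -> Optional[str]:
    """Return the body for `uri`, or None if the URI is a "null pointer"."""
    try:
        status, body = fetch(uri)
    except OSError:
        # Network-level failure (DNS error, refused connection, timeout).
        return None
    return body if status == 200 else None


def resolve(uri: str, alternates: Iterable[str], fetch: Fetcher,
            local_store: Dict[str, str]) -> Optional[str]:
    """Try `uri`, then each alternate; keep a local copy of the first hit.

    If every remote source fails, fall back to a previously stored copy.
    """
    for candidate in [uri, *alternates]:
        body = dereference(candidate, fetch)
        if body is not None:
            local_store[uri] = body  # persist in "your own space"
            return body
    return local_store.get(uri)
```

Passing the fetcher in as a parameter keeps the null check separate from the transport, so an application can swap `http_fetch` for a SPARQL lookup or a cache without touching `resolve`.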

You can do many things with a 404 condition courtesy of SPARQL.

Coincidentally, I touched on this resilience matter during the Beijing 
Linked Data Workshop, as per this excerpt from Orri's blog post [1], 
following the workshop:

"What to do when identity expires?

Giovanni of Sindice said that a document should be removed from search 
if it was no longer available. Kingsley pointed out that resilience of 
reference requires some way to recover data. The data web cannot be less 
resilient than the document web, and there is a point to having access 
to history. He recommended hooking up with the Internet Archive, since 
they make long term persistence their business. In this way, if an 
application depends on data, and the URIs on which it depends are no 
longer dereferenceable or provide content from a new owner of the 
domain, those who need the old version can still get it and host it 
themselves."
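
As a rough sketch of the archive fallback, the Python below builds a query for the Internet Archive's Wayback availability endpoint and pulls the closest snapshot URL out of its JSON reply. The endpoint is not mentioned anywhere in this thread, so its use here is an assumption, and the function names are hypothetical. The HTTP call itself is left out so the parsing can be read in isolation.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint: the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(uri: str) -> str:
    """Build the lookup URL for a URI that no longer dereferences."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": uri})


def closest_snapshot(response_text: str) -> Optional[str]:
    """Return the archived copy's URL from the JSON reply, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None
```

An application would fetch `availability_query(uri)`, pass the response body to `closest_snapshot`, and then decide whether to retrieve and host the old version itself.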

>
> For the "human" Web several solutions have been proposed, e.g.,
> 1.) PURL and DOI services for translating URNs into resolvable URLs
> 2.) forward references
> 3.) robust link implementations, i.e., with each link you keep a set 
> of related search terms to retrieve moved / changed resources
> 4.) observer / notification mechanisms
> X.) ?
All nice ideas. Usage will be application and scenario specific, naturally.
>
> I guess (1) is not really applicable for LOD resources because of 
> scalability and single-point of failure issues.
If you take a closer look at the federation that EC2 accords, and how we 
are making it easy for anyone to have their own Linked Data driven 
knowledge bases for personal and service-specific use [2], you might spot 
a little nuance: we always link back to the original source data object 
URI (a form of intrinsic Attribution by Reference). The idea is that 
this kind of federation ultimately builds up URI resilience in a manner 
that's similar to general Internet resilience (you can slow it down or 
inconvenience it, but never erase it, due to the "scale-free" attribute 
of real federation).
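
A minimal Python sketch of how such attribution links could be used for recovery, assuming the link back to the source is expressed with owl:sameAs (as noted in [2]) and that the triples are already available as plain tuples. The function name and the `http://example.org/mirror/Berlin` URI are hypothetical.

```python
from collections import defaultdict, deque
from typing import Iterable, List, Tuple

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def same_as_alternates(uri: str, triples: Iterable[Triple]) -> List[str]:
    """Collect every URI connected to `uri` through owl:sameAs links.

    The links are followed in both directions and transitively, so a
    mirror of a mirror is also found.
    """
    neighbours = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == OWL_SAME_AS:
            neighbours[subject].add(obj)
            neighbours[obj].add(subject)

    seen = {uri}
    queue = deque([uri])
    while queue:
        current = queue.popleft()
        for other in neighbours.get(current, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)

    seen.discard(uri)
    return sorted(seen)
```

Each alternate found this way is a candidate to try when the original URI stops dereferencing, which is where the resilience of the federation comes from.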

> (2) would require that LOD providers take care of setting up HTTP 
> redirects for their moved resources - no idea if anybody will do that 
> in reality and how this can scale. (3) could help to re-locate moved 
> resources via search engines like Sindice but not really fully 
> automatically. (4) could at least inform a data source that certain 
> references are broken and it could remove them.
>
> Another alternative is of course to completely leave the problem to 
> the application developers, which means that they must consider that a 
> referenced resource might exist or not. I am not sure about the 
> practical consequences of that approach, especially if several data 
> sources are involved, but I have the feeling that it is getting really 
> complicated if one cannot rely on any kind of referential integrity.
In a nutshell, yes, but this is about data architects and developers 
working in concert as part of product and service delivery.

>
> Are there any existing mechanisms that can give us at least some basic 
> feedback about the "quality" of an LOD data source? I think, the 
> referential integrity could be such a quality property...
In an "Open World" the notion of "Quality" is inherently "Subjective".  
The "Beauty & Beholder" rules apply at all scales in our universe :-)
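
One application-specific reading of "quality", following the referential-integrity suggestion in the question, is the share of a data set's outbound links that still resolve. The Python below is a hypothetical sketch: `referential_integrity` and the `is_alive` callback are illustrative names, and what counts as "alive" is left to the application.

```python
from typing import Callable, Iterable


def referential_integrity(links: Iterable[str],
                          is_alive: Callable[[str], bool]) -> float:
    """Fraction of distinct link targets for which `is_alive` is true.

    A data set with no outbound links has nothing broken, so it scores 1.0.
    """
    targets = set(links)
    if not targets:
        return 1.0
    alive = sum(1 for uri in targets if is_alive(uri))
    return alive / len(targets)
```

Two consumers can plug in different `is_alive` checks (strict HTTP 200, or "any copy reachable via an alternate URI") and get different scores for the same data set.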

Links:

1. 
http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1347
2. http://dbpedia2.openlinksw.com:8895/resource/Berlin - localized 
de-referencing and attribution link to source via owl:sameAs (all EC2 
versions of DBpedia, Bio2RDF, NeuroCommons, and MusicBrainz get this. 
Ditto the imminent Virtuoso Cluster Edition hosted LOD Cloud)
3. 
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtInstallationEC2 
- EC2 AMI Home Page


Kingsley
>
> Thanks for your input on that issue,
>
> Bernhard
>
> ______________________________________________________
> Research Group Multimedia Information Systems
> Department of Distributed and Multimedia Systems
> Faculty of Computer Science
> University of Vienna
>
> Postal Address: Liebiggasse 4/3-4, 1010 Vienna, Austria
> Phone: +43 1 42 77 39635 Fax: +43 1 4277 39649
> E-Mail: bernhard.haslhofer@univie.ac.at
> WWW: http://www.cs.univie.ac.at/bernhard.haslhofer
>
>
>


-- 


Regards,

Kingsley Idehen	      Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO
OpenLink Software     Web: http://www.openlinksw.com

Received on Thursday, 5 February 2009 17:41:46 UTC