- From: Paul A Houle <devonianfarm@gmail.com>
- Date: Thu, 17 Sep 2009 11:59:09 -0400
- To: Kingsley Idehen <kidehen@openlinksw.com>
- Cc: public-lod@w3.org
- Message-ID: <ad20490909170859x7c7e540di57bc4b4194437c59@mail.gmail.com>
On Thu, Sep 17, 2009 at 7:23 AM, Kingsley Idehen <kidehen@openlinksw.com> wrote:

> This is basically an aspect of the whole Linked Data meme that is lost on
> too many.

I've got to thank the book by Allemang and Hendler
http://www.amazon.com/Semantic-Web-Working-Ontologist-Effective/dp/0123735564
for setting me straight about data modeling in RDF. RDFS and OWL are based on a system of duck typing that turns conventional object or object-relational thinking inside out. It's not necessarily good or bad, but it's really different.

Even though types matter, predicates come before types: using predicate A can make resource B a member of class C, even if B is never explicitly put in class C. Looking at the predicates in RDFS or OWL without understanding the whole, it's pretty easy to think "oh, this isn't too different from a relational database" and miss the point that RDFS and OWL are much more about inference (creating new triples) than about constraints or the physical layout of the data. (A small sketch of this appears at the end of this message.)

One consequence of this is that using an existing predicate can drag in a lot more baggage than you might want; it's pretty easy to get the inference engine to infer too much, and false inferences can snowball like a katamari.

A lot of people are in the habit of reusing vocabularies and seem to forget that the natural answer to most RDF modeling problems is to create a new predicate. OWL has a rich set of mechanisms that can tell systems that x A y -> x B y, where A is your new predicate and B is a well-known predicate (the second sketch below shows one such bridge). Once you merge two "almost-but-not-the-same" things by actually using the same predicate, it's very hard to fix the damage. If you use inference instead, it's easy to change your mind.

--------------

It may be different with other data sets, but data cleaning is absolutely essential when working with dbpedia if you want to make production-quality systems.

For instance, people build bizapps all the time and need a list of US states... Usually we go and cut and paste one from somewhere... But now I've got dbpedia, and I should be able to do this systematically. There's a category in wikipedia for that:

http://en.wikipedia.org/wiki/Category:States_of_the_United_States

If you ignore the subcategories and just take the actual pages, it's (almost) what you need, except for some weirdos like User:Beebarose/Alabama <http://en.wikipedia.org/wiki/User:Beebarose/Alabama> and one state that's got a disambiguator in the name: Georgia (U.S. state) <http://en.wikipedia.org/wiki/Georgia_%28U.S._state%29>. It's not hard to clean up this list (the third sketch below is one way), but it takes some effort, and ultimately you're probably going to materialize something new.

These sorts of issues turn up even in highly clean data sets. Once I built a webapp that had a list of countries in it; the list was used to draw a dropdown, but the dropdown was excessively wide, busting the layout of the site. The list was really long because a few authoritarian countries have long and flowery official names. The transformation "Democratic People's Republic of Korea" -> "North Korea" improved the usability of the site while eliminating Orwellian language (the last sketch below carries that fix in the simplest possible way).

This kind of "fit and finish" is needed to make quality sites, and semweb systems are going to need automated and manual ways of doing it so that "Web 3.0" looks like a step forward, not a step back.
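Here's a minimal sketch of the duck-typing point, in Python with rdflib (the ex: vocabulary and the alice/bob data are made up for illustration). One rdfs:range declaration plus one ordinary use of the predicate is enough to pull ex:bob into a class nobody ever asserted for him; the CONSTRUCT query applies the inference rule by hand:

```python
from rdflib import Graph

# Toy data: ex:advisor is declared to have rdfs:range ex:Professor,
# and ex:bob is never explicitly typed as a Professor.
data = """
@prefix ex:   <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:advisor rdfs:range ex:Professor .
ex:alice   ex:advisor ex:bob .
"""

g = Graph()
g.parse(data=data, format="turtle")

# The rdfs:range rule: if ?p has range ?c and ?s ?p ?o, then ?o is a ?c.
# The result is a brand-new triple created by inference, not by assertion.
for triple in g.query("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    CONSTRUCT { ?o a ?c }
    WHERE     { ?p rdfs:range ?c . ?s ?p ?o }
"""):
    print(triple)  # ex:bob rdf:type ex:Professor
```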
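And a sketch of the x A y -> x B y bridge. The mechanism shown is rdfs:subPropertyOf (OWL adds richer ones, such as owl:equivalentProperty and property chains); ex:penName and the sample data are invented:

```python
from rdflib import Graph

# Coin your own predicate, then declare how it maps to the well-known one.
data = """
@prefix ex:   <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

ex:penName rdfs:subPropertyOf foaf:name .
ex:twain   ex:penName "Mark Twain" .
"""

g = Graph()
g.parse(data=data, format="turtle")

# The property path rdfs:subPropertyOf* follows the bridge at query time:
# asking for foaf:name also finds data asserted with ex:penName. If the
# bridge turns out to be wrong, delete one triple -- the data is untouched.
for who, name in g.query("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?who ?name
    WHERE { ?p rdfs:subPropertyOf* foaf:name . ?who ?p ?name }
"""):
    print(who, name)  # ex:twain "Mark Twain"
```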
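For the states list, something like this query sketch (assuming the public DBpedia SPARQL endpoint and the SPARQLWrapper library; note that the predicate linking an article to its categories has varied between DBpedia releases, skos:subject in older ones and dcterms:subject in newer ones, so the prefix may need adjusting):

```python
import re
from SPARQLWrapper import SPARQLWrapper, JSON

# Direct members of the Wikipedia category, skipping User: sandbox pages
# (whether those even appear depends on the extraction).
endpoint = SPARQLWrapper("http://dbpedia.org/sparql")
endpoint.setQuery("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?state ?label WHERE {
      ?state dcterms:subject
             <http://dbpedia.org/resource/Category:States_of_the_United_States> ;
             rdfs:label ?label .
      FILTER (lang(?label) = "en")
      FILTER (!regex(str(?state), "/User:"))
    }
    ORDER BY ?label
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    # Strip the disambiguator: "Georgia (U.S. state)" -> "Georgia".
    print(re.sub(r"\s*\([^)]*\)$", "", row["label"]["value"]))
```

Even with the query in hand, you end up materializing the cleaned list rather than trusting the live results, which is the point above about cleaning taking real effort.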
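Finally, the crudest possible version of the fit-and-finish pass: a hand-maintained override table applied after the labels come out of the data set (the one entry is just the example from above):

```python
# Hand-maintained overrides: official label -> short display name.
DISPLAY_OVERRIDES = {
    "Democratic People's Republic of Korea": "North Korea",
    # ...add more as layout-busting official names turn up
}

def display_name(label: str) -> str:
    """Prefer the short colloquial name for the dropdown; fall back to the raw label."""
    return DISPLAY_OVERRIDES.get(label, label)

print(display_name("Democratic People's Republic of Korea"))  # North Korea
print(display_name("France"))                                 # France
```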
Received on Thursday, 17 September 2009 15:59:49 UTC