Re: Linked Data and IRI dereferencing (scale limits?) from Paul Houle on 2010-08-05 (public-lod@w3.org from August 2010)

From: Paul Houle <ontology2@gmail.com>
Date: Thu, 5 Aug 2010 10:07:32 -0400
To: Jörn Hees <j_hees@cs.uni-kl.de>
Cc: public-lod@w3.org
Message-ID: <AANLkTinf-UP3VizFJ1cMoP+z8saoCURxKyX7Rd+sx+Vr@mail.gmail.com>

If you want to get something done with dbpedia,  you should (i) work from
the data dumps,  or (ii) give up and use Freebase instead.

I used to spend weeks figuring out how to clean up the mess in dbpedia until
the day I wised up and realized I could do in 15 minutes w/ Freebase what
takes 2 weeks to do w/ dbpedia,  because w/ dbpedia you need to do a huge
amount of data cleaning to get anything that makes sense.

The issue here isn't primarily "RDF vs Freebase" but it's really a matter of
the business model (or lack thereof) behind dbpedia;  frankly,  nobody gets
excited when dbpedia doesn't work,  and that's the problem.  For instance,
nobody at dbpedia seems to give a damn that dbpedia contains 3000
"countries",  whereas there's more like 200 actual active countries in the
world...  Sure,  it's great to have a category for things like
"Austria-Hungary" and "The Teutonic Knights",  but an awful lot of people
give up on dbpedia when they see they can't easily get a list of very basic
things,  like a list of countries.

Now,  I was able to,  more-or-less,  define "active country" as a
restriction type:  anything that has an ISO country code in freebase is an
active country,  or is pretty close.  The ISO codes aren't in dbpedia
(because they're not in wikipedia infoboxes) so this can't be done with
dbpedia:  I'd probably need to code some complex rules that try to guess at
this based on category memberships and what facts are available in the
infobox.
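
As a rough sketch of that restriction (the records and the `iso_code` field
name below are made up for illustration,  they are not Freebase's actual
schema or data),  "active country" is just "has an ISO code":

```python
# Hypothetical country-like records; "iso_code" is an assumed field name,
# not the real Freebase property.
entities = [
    {"name": "Austria", "iso_code": "AT"},
    {"name": "Austria-Hungary", "iso_code": None},
    {"name": "Japan", "iso_code": "JP"},
    {"name": "Teutonic Knights"},  # no ISO code recorded at all
]


def active_countries(records):
    """Keep only the records that carry a non-empty ISO country code."""
    return [r for r in records if r.get("iso_code")]


active = [r["name"] for r in active_countries(entities)]
print(active)
```

The point is that the test is one property lookup,  not a pile of guesses
about category memberships.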

I complained on both dbpedia and freebase discussion lists,  and found
that:  (i) nobody at dbpedia wants to do anything about this,  and (ii) the
people at freebase have investigated this and they are going to do something
about it.

--------

In my mind,  anyway,  the semantic web is a set of structured boxes. It's
not like there's one "T Box" and one "A Box" but there are nested boxes of
increasing specificity.  In the systems I'm building,  a Freebase-dbpedia
merge is used as a sort of "T' Box" that helps to structure and interpret
information that comes from other sources.  With a little thinking about
data structures,  it's efficient to have a local copy of this data and use
it as a skeleton that gets fleshed out with other stuff.  Closed-world
reasoning about this "taxonomic core" is useful in a number of ways,
particularly in the detection of key integrity problems,  data holes,
inconsistencies,  junk data,  etc.  I think the "dereference and merge"
paradigm is useful once you've got the taxocore and you're merging little
bits of high-quality data,  but w/o control of the taxocore you're just
doomed.
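
To make the closed-world idea a bit more concrete,  here is a minimal
sketch,  assuming the taxocore is a local dictionary keyed by identifier.
The identifiers,  field names and helper functions are all hypothetical,
not taken from my systems or from Freebase/dbpedia:

```python
# Hypothetical local "taxonomic core": identifier -> known facts.
taxocore = {
    "country/AT": {"type": "country", "iso_code": "AT"},
    "country/JP": {"type": "country", "iso_code": "JP"},
    "country/XX": {"type": "country", "iso_code": None},  # a data hole
}

# Hypothetical facts merged in from some other source.
incoming = [
    {"subject": "city/vienna", "located_in": "country/AT"},
    {"subject": "city/atlantis", "located_in": "country/ZZ"},  # unknown key
]


def data_holes(core, required_field):
    """Core entries that are missing a value we expect every entry to have."""
    return sorted(key for key, facts in core.items()
                  if not facts.get(required_field))


def dangling_references(core, facts, ref_field):
    """Closed-world check: a reference to a key absent from the core is an error."""
    return [f["subject"] for f in facts if f[ref_field] not in core]


holes = data_holes(taxocore, "iso_code")
dangling = dangling_references(taxocore, incoming, "located_in")
print(holes, dangling)
```

Because the core is treated as complete,  a key that isn't in it gets
flagged as a problem instead of being quietly accepted as a new entity.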

Received on Thursday, 5 August 2010 14:08:05 UTC