W3C home > Mailing lists > Public > semantic-web@w3.org > April 2010

Re: data quality

From: Alexander Johannesen <alexander.johannesen@gmail.com>
Date: Tue, 20 Apr 2010 09:08:17 +1000
Message-ID: <l2jf950954e1004191608p2c9f888cm7e5f581871902837@mail.gmail.com>
To: Michael Schneider <schneid@fzi.de>
Cc: "Polleres, Axel" <axel.polleres@deri.org>, semantic-web@w3.org, paoladimaio10@googlemail.com

Michael Schneider <schneid@fzi.de> wrote:
> The quality-of-data question is not an easy one, and it's very vague what
> "good quality" means for data. What you are up to there on pedantic-web.org
> [1] seems to be an effort to establish some sort of "minimum practically
> achievable quality" for the data existing on the web. This is very important
> IMO, but other people probably won't be satisfied by this, because (among
> other things) this minimum standard won't match their tools' requirements.

Let me give you an example of just how bad these things can be. A few
years ago I worked for a national library as a technology manager of
sorts, and one of the things I brought with me to that position was my
knowledge and love of Topic Maps. So the natural idea was to take
library information, known as MARC (MAchine-Readable Cataloging)
records, or rather the whole culture of MARC, and convert the metadata
within into glorious semantic knowledge maps.

The library world has been tinkering with metadata practically
forever, and with MARC since the 80's. They've been polishing,
refining and tinkering with their MARC data for over 30 years, and not
only that: catalogers are pedantic, thorough and neat. If any
collection of metadata were going to be in a useful state, it would be
this one.

But sadly this is not the case. There is no schema for the data and no
typing; rules and tricks are upheld by humans. The result is a huge
hotch-potch of good and bad records all mixed together, with no
identity management, and with large and costly match/merge processes
that continuously try to wash and clean these records. Even then, the
records turned out to be too hard to do anything automatic with. It
was a disaster on many levels, not least for me, who in the end chose
to quit.
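The match/merge problem above can be sketched with a toy example (the records and field names here are hypothetical, not real MARC): without a schema or shared identifiers, exact matching merges nothing, and even an aggressive wash/clean pass cannot decide whether two strings name the same entity.

```python
# Hypothetical catalog records: the "same" author under three string forms.
records = [
    {"id": 1, "author": "Ibsen, Henrik", "title": "Peer Gynt"},
    {"id": 2, "author": "Ibsen, H.",     "title": "Peer Gynt"},
    {"id": 3, "author": "Henrik Ibsen",  "title": "Peer Gynt "},  # stray space
]

def naive_key(rec):
    # Exact string match: every spelling variant becomes a distinct entity.
    return (rec["author"], rec["title"])

def normalized_key(rec):
    # A typical wash/clean step: lowercase, strip, reorder "Last, First".
    author = rec["author"].strip().lower().rstrip(".")
    if "," in author:
        last, _, first = author.partition(",")
        author = f"{first.strip()} {last.strip()}".strip()
    title = " ".join(rec["title"].split()).lower()
    return (author, title)

def merge(records, key):
    # Cluster record ids by whatever key function we trust.
    groups = {}
    for rec in records:
        groups.setdefault(key(rec), []).append(rec["id"])
    return groups

print(len(merge(records, naive_key)))       # 3 clusters: nothing merged
print(len(merge(records, normalized_key)))  # 2 clusters: "Ibsen, H." still unresolved
```

Even after normalization, the abbreviated "Ibsen, H." cannot be safely merged with "Henrik Ibsen" by string logic alone, which is exactly why these processes stay large, costly and still fall short of anything you can automate against.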

When even your pedantic librarians can't get this right, let me just
say that this problem is bigger than you might think or even imagine.
There's a reason the brilliant minds at Google aren't doing RDF or
strongly typed data. Yet.


 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
--- http://shelter.nu/blog/ ----------------------------------------------
------------------ http://www.google.com/profiles/alexander.johannesen ---
Received on Monday, 19 April 2010 23:08:49 UTC
