Re: Question on press use cases from Ed Summers on 2010-09-19 (public-lld@w3.org from September 2010)

From: Ed Summers <ehs@pobox.com>
Date: Sun, 19 Sep 2010 06:12:55 -0400
To: Antoine Isaac <aisaac@few.vu.nl>
Cc: public-lld <public-lld@w3.org>
Message-ID: <AANLkTi=Mvw8zPX-0hjxfb+SRDHnKQt8TPEAe9RfkXwu+@mail.gmail.com>

On Sat, Sep 18, 2010 at 10:42 AM, Antoine Isaac <aisaac@few.vu.nl> wrote:
> So the question is whether the current situation results rather from:
> - a conscious choice of ignoring part of the legacy data you had in the
> original data sources, in the light of the requirements of your scenarios?
> - a too great effort needed to move legacy data to linked data, considering
> the resources you had?
> - the lack of legacy data--you just converted all what you had?

That's a really good question, Antoine. I think it's often easy to get
lost (and demoralized) when trying to figure out what granularity to
model existing data at in RDF...especially when there's a lot of it.
In our case we had quite a bit of machine readable data in METS XML
[1] and MARC. The Chronicling America web application needed to model
only a small fraction of this data in order to deliver content
meaningfully on the web.

For example, when we loaded our "batches" of content into Chronicling
America, we didn't need to model (in the database) the intricacies of
the MIX metadata (colorspaces, scanning systems, sampling frequencies,
etc) -- we just needed to know the image format, and that it had
particular dimensions in order to render the page. And when we loaded
newspaper title metadata we didn't need to model all of the MARC
record, we just needed to model its name, where it was published, its
geographic coverage, etc.

When we decided to use Linked Data, and in particular OAI-ORE, to make
our digital resources harvestable, we didn't go back and exhaustively
model all the things we could have in order to make them available in
RDF. We simply made the things we already had modeled in our
relational database available, using pre-existing vocabularies
wherever possible. This made the task of implementing Linked Data
pretty easy, and it was coded up in a couple days of work. In some
cases it was also possible to link resources to non-RDF documents
(like the METS and MARC XML). We focused on the use case of making the
titles, issues, pages, and their bit streams web harvestable.
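
To make that concrete, here is a rough sketch of the idea, not the
actual Chronicling America code: take a row that is already modeled in
the database and write it out as N-Triples using existing vocabularies
(Dublin Core terms, OAI-ORE, rdfs:seeAlso). The row layout, the
example.org URL patterns, and the function names are all assumptions
made up for illustration.

```python
# Vocabulary namespaces (these are the real, published namespace URIs).
DCTERMS = "http://purl.org/dc/terms/"
ORE = "http://www.openarchives.org/ore/terms/"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def uri(value):
    """Format a URI as an N-Triples term."""
    return f"<{value}>"


def literal(value):
    """Format a plain string literal, escaping backslashes and quotes."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def title_to_ntriples(row, base="http://example.org"):
    """Serialize one newspaper title row as N-Triples.

    `row` is a hypothetical dict with the keys id, name, issue_ids and
    marc_url; the URL layout under `base` is also hypothetical.
    """
    subject = uri(f"{base}/titles/{row['id']}#title")
    triples = [(subject, uri(DCTERMS + "title"), literal(row["name"]))]
    # The title aggregates its issues (OAI-ORE).
    for issue_id in row["issue_ids"]:
        issue = uri(f"{base}/issues/{issue_id}#issue")
        triples.append((subject, uri(ORE + "aggregates"), issue))
    # Point at the non-RDF MARC XML document rather than remodeling it.
    triples.append((subject, uri(RDFS + "seeAlso"), uri(row["marc_url"])))
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)
```

Only the fields the application already stores end up in the output;
everything else stays reachable through the link to the MARC XML.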

One early consumer of the data was another unit at the Library of
Congress that wanted to periodically select and upload images of
newspaper front pages to Flickr [2]. In order to do this they wanted a
thumbnail, and to know the dimensions of the original jpeg2000 file in
order to construct a custom URL for a high resolution image to upload
to Flickr. So we added these things to the RDF representation of the
Page. If you are interested I described this process a bit more last
year [3].
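
As a rough illustration of why the dimensions matter to a consumer
like this, here is a minimal sketch that scales the original width and
height down to a target size and builds an image URL from the result.
The URL pattern and the function names are hypothetical, invented for
this example; they are not the URLs Chronicling America serves.

```python
def scaled_size(width, height, max_side):
    """Shrink (width, height) to fit within max_side, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        # Already small enough; never scale up.
        return width, height
    scale = max_side / longest
    return round(width * scale), round(height * scale)


def image_url(page_url, width, height):
    """Build an image URL for a page at the requested size.

    Hypothetical URL pattern, assumed purely for illustration.
    """
    return f"{page_url.rstrip('/')}/image_{width}x{height}.jpg"


# The original dimensions would come from the RDF description of the Page.
width, height = scaled_size(6000, 8000, 2000)
url = image_url("http://example.org/pages/1/", width, height)
```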

I guess this is a long way of saying our Linked Data was "a conscious
choice of ignoring part of the legacy data [we] had in the original
data sources, in the light of the requirements of [our] scenarios".
Letting what's easy and actually useful to someone drive what Linked
Data gets published is a good way to get something working quickly,
and to enrich it over time.

//Ed

[1] http://www.loc.gov/ndnp/pdf/NDNP_201113TechNotes.pdf (73 pages of
notes on the data)
[2] http://www.flickr.com/photos/library_of_congress/sets/72157619452486566/
[3] http://inkdroid.org/journal/2009/07/09/flickr-digital-curation-and-the-web/

Received on Sunday, 19 September 2010 10:13:23 UTC