Chronicling America and Linked Data from Ed Summers on 2009-05-26 (public-lod@w3.org from May 2009)

From: Ed Summers <ehs@pobox.com>
Date: Tue, 26 May 2009 11:19:58 -0400
To: "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <f032cc060905260819p4831dbfeyeccec4a21a66b8ee@mail.gmail.com>

There is a new pool of linked-data up at the Library of Congress in
the Chronicling America application [1]. Chronicling America is the
web view on data collected for the National Digital Newspaper Program
(NDNP). NDNP is a 20-year joint project of the National Endowment for
the Humanities and the Library of Congress to digitize and aggregate
historic newspapers in the United States.

Right now there are close to a million digitized newspaper pages
available, and information about 140,000 newspaper titles...all of
which have individual web views, for example:

 Newspaper Title: San Francisco Call [2]
 Issue: San Francisco Call, 1895-03-05 [3]
 Page: San Francisco Call, 1895-03-05, page sequence 1 [4]

If you view source on them you should be able to auto-discover the
application/rdf+xml representations that bundle up information about
the newspaper titles, issues and pages. You can also browse around
using a linked data viewer like uriburner [5].
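
As a rough sketch of that auto-discovery step: the snippet below scans a
page's HTML for <link rel="alternate" type="application/rdf+xml"> elements
and resolves their href values. It assumes the representations are
advertised that way; the sample markup and the ".rdf" href are hypothetical
stand-ins, not copied from the live site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RdfLinkFinder(HTMLParser):
    """Collect href values of <link> elements that advertise RDF/XML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        is_rdf = attributes.get("type") == "application/rdf+xml"
        if "alternate" in rels and is_rdf and attributes.get("href"):
            # Resolve relative hrefs against the page URL.
            self.rdf_links.append(urljoin(self.base_url, attributes["href"]))


def discover_rdf(html, base_url):
    """Return absolute URLs of the RDF/XML alternates found in html."""
    finder = RdfLinkFinder(base_url)
    finder.feed(html)
    return finder.rdf_links


# Hypothetical page head; the real markup may differ.
SAMPLE_HTML = """<html><head>
<title>San Francisco Call</title>
<link rel="alternate" type="application/rdf+xml" href="/lccn/sn85066387.rdf">
<link rel="stylesheet" type="text/css" href="/site.css">
</head><body></body></html>"""

print(discover_rdf(SAMPLE_HTML,
                   "http://chroniclingamerica.loc.gov/lccn/sn85066387/"))
```

To try it against a real page, fetch the HTML (for example with
urllib.request) and pass it in along with the page URL.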

The implementation is a moving target, but you'll see we've cherry
picked a few vocabularies: Dublin Core [6], Bibliographic Ontology
[7], FOAF [8], and Object Reuse and Exchange (OAI-ORE) [9]. ORE in
particular was extremely useful to us, since we wanted to enable the
application's repository function, by exposing the digital objects
(image files, ocr/xml files, pdfs) that make up the individual Page
resources. For example:

<http://chroniclingamerica.loc.gov/lccn/sn84026749/1905-01-29/ed-1/seq-1#page>
    ore:aggregates
<http://chroniclingamerica.loc.gov/lccn/sn84026749/1905-01-29/ed-1/seq-1.jp2>,
<http://chroniclingamerica.loc.gov/lccn/sn84026749/1905-01-29/ed-1/seq-1.pdf>,
<http://chroniclingamerica.loc.gov/lccn/sn84026749/1905-01-29/ed-1/seq-1/ocr.txt>,
<http://chroniclingamerica.loc.gov/lccn/sn84026749/1905-01-29/ed-1/seq-1/ocr.xml>,
<http://chroniclingamerica.loc.gov/lccn/sn84026749/1905-01-29/ed-1/seq-1/thumbnail.jpg>
.
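
A harvester could pull those aggregated files out of a Page's RDF/XML with
nothing but the standard library. This is a minimal sketch, assuming the
triples are serialized as rdf:Description elements with rdf:resource
attributes; the sample document is hand-written from the example above, and
a real client would more likely use a full RDF parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ORE_NS = "http://www.openarchives.org/ore/terms/"


def aggregated_resources(rdf_xml):
    """Map each subject URI to the list of URIs it ore:aggregates."""
    root = ET.fromstring(rdf_xml)
    result = {}
    for node in root.iter():
        about = node.get(f"{{{RDF_NS}}}about")
        if about is None:
            continue
        for child in node.findall(f"{{{ORE_NS}}}aggregates"):
            target = child.get(f"{{{RDF_NS}}}resource")
            if target:
                result.setdefault(about, []).append(target)
    return result


# Hand-written sample (an assumption about the serialization shape).
PAGE = "http://chroniclingamerica.loc.gov/lccn/sn84026749/1905-01-29/ed-1/seq-1"
SAMPLE_RDF = f"""<rdf:RDF
    xmlns:rdf="{RDF_NS}"
    xmlns:ore="{ORE_NS}">
  <rdf:Description rdf:about="{PAGE}#page">
    <ore:aggregates rdf:resource="{PAGE}.jp2"/>
    <ore:aggregates rdf:resource="{PAGE}.pdf"/>
    <ore:aggregates rdf:resource="{PAGE}/ocr.txt"/>
    <ore:aggregates rdf:resource="{PAGE}/ocr.xml"/>
    <ore:aggregates rdf:resource="{PAGE}/thumbnail.jpg"/>
  </rdf:Description>
</rdf:RDF>"""

for subject, files in aggregated_resources(SAMPLE_RDF).items():
    print(subject)
    for url in files:
        print("  ", url)
```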

The idea is to enable the harvesting of these repository objects out
of the Chronicling America webapp. The only links out we have so far
are from Newspaper Titles to the geographic regions that they are
"about", and to their languages. So for example:

<http://chroniclingamerica.loc.gov/lccn/sn85066387#title>
    dcterms:coverage
<http://dbpedia.org/resource/San_Francisco%2C_California>,
<http://sws.geonames.org/5391959/> ;
    dcterms:language <http://www.lingvoj.org/lang/en> .
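
For a crawler, "links out" simply means objects whose host is not
chroniclingamerica.loc.gov. The sketch below groups such objects by host;
the triples are copied from the examples in this message, while the
tuple-based representation and the links_out helper are illustrative
assumptions rather than part of the application.

```python
from urllib.parse import urlsplit

LOCAL_HOST = "chroniclingamerica.loc.gov"

TITLE = "http://chroniclingamerica.loc.gov/lccn/sn85066387#title"
PAGE = "http://chroniclingamerica.loc.gov/lccn/sn84026749/1905-01-29/ed-1/seq-1"

# (subject, predicate, object) tuples taken from the examples above.
TRIPLES = [
    (TITLE, "dcterms:coverage",
     "http://dbpedia.org/resource/San_Francisco%2C_California"),
    (TITLE, "dcterms:coverage", "http://sws.geonames.org/5391959/"),
    (TITLE, "dcterms:language", "http://www.lingvoj.org/lang/en"),
    (PAGE + "#page", "ore:aggregates", PAGE + ".jp2"),  # stays local
]


def links_out(triples, local_host=LOCAL_HOST):
    """Group (predicate, object) pairs by the external host they point to."""
    by_host = {}
    for _subject, predicate, obj in triples:
        host = urlsplit(obj).hostname
        if host and host != local_host:
            by_host.setdefault(host, []).append((predicate, obj))
    return by_host


for host, links in sorted(links_out(TRIPLES).items()):
    print(host, links)
```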

Just these minimal links provide a huge amount of data enrichment to
our original data. We also needed to create a handful of new
vocabulary terms, which we made available as RDFa [10]. I would be
interested in any feedback you have. Also, please feel free to fire up
linked-data bots to crawl the space.

//Ed

[1] http://chroniclingamerica.loc.gov
[2] http://chroniclingamerica.loc.gov/lccn/sn85066387/
[3] http://chroniclingamerica.loc.gov/lccn/sn85066387/1895-03-05/ed-1/
[4] http://chroniclingamerica.loc.gov/lccn/sn85066387/1895-03-05/ed-1/seq-1/
[5] http://linkeddata.uriburner.com/about/html/http/chroniclingamerica.loc.gov/lccn/sn84026749%23title
[6] http://dublincore.org/
[7] http://bibliontology.com/
[8] http://xmlns.com/foaf/spec/
[9] http://www.openarchives.org/ore/1.0/vocabulary.html
[10] http://chroniclingamerica.loc.gov/terms/
Received on Tuesday, 26 May 2009 15:20:41 UTC