questions about Library of Congress Thesaurus of Graphic Materials

Hi, 

These questions are aimed mainly at Eric and MacKenzie, but I welcome input
from other people.

I've spent a bit of time searching for datasets on the internet since the
team visited the UK, and I was surprised how little data there is out there.
So my latest observation is that in order to realise the semantic web, we
need to encourage more people and organisations to release data publicly. At
the moment organisations give out human readable versions of their data, but
that doesn't support reuse in the same way. Now the W3C has been actively
promoting RDF here, but I don't think that is the problem - I think the
message they need to get across is just "make your data available in any
machine readable format". It doesn't really matter if the data is RDF or not
- as long as it is in some kind of labelled form, it's generally not too much
of a problem to convert it to RDF. Then some RDF geek can make it available
as RDF, so we get to the point we want to reach anyway. The RDF bit is easy,
it's getting the data that is hard.

As an example of this problem, NASA gives away human readable versions of one
of their thesauri, but then charges for machine readable versions - see
http://www.sti.nasa.gov/products.html

Now one data source that is available on the web is the Library of Congress
Thesaurus of Graphic Materials, Parts 1 and 2 -
http://lcweb.loc.gov/rr/print/tgm1/downloadtgm1.html

I've been playing with this, and I've written a short Java program to
translate it to the SKOS thesaurus vocabulary
http://www.w3.org/2004/02/skos/core
represented in RDF. There is an ISO standard for thesauri, so it should be
possible to adapt the program to work with other thesauri. I've also
incorporated this data source into the demo dataset in Longwell, so people
with Subversion (or CVS, because it's in the CVS version as well) can see
this. However, incorporating the LOC TGM seems to have caused some strange
side effects, so I'm struggling at the moment to understand whether this is
because my approach to data integration is broken or whether it is also
partly due to the structure of the LOC TGM itself.
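
To give a flavour of the kind of translation involved, here is a minimal
sketch in Python. It is not the Java converter itself, just an illustration
under some assumptions: the thesaurus has already been parsed into
(term, broader terms) pairs, labels are single-line strings, and the
http://example.org/tgm/ namespace, the function names and the rule for
minting concept URIs are all made up for the example. The two sample records
are borrowed from the hierarchies listed later in this message.

```python
from urllib.parse import quote

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/tgm/"  # hypothetical namespace for concept URIs


def concept_uri(term):
    """Mint a URI for a term (assumed rule: lower-case, underscores, percent-encode)."""
    return BASE + quote(term.strip().lower().replace(" ", "_"))


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(records):
    """Turn (term, [broader terms]) records into a list of N-Triples lines."""
    lines = []
    for term, broader_terms in records:
        subject = concept_uri(term)
        lines.append(f"<{subject}> <{RDF_TYPE}> <{SKOS}Concept> .")
        lines.append(f'<{subject}> <{SKOS}prefLabel> "{escape_literal(term)}" .')
        for broader in broader_terms:
            lines.append(f"<{subject}> <{SKOS}broader> <{concept_uri(broader)}> .")
    return lines


# Illustrative records only; the real input would come from parsing the TGM file.
records = [
    ("Highways", ["Transportation facilities"]),
    ("Roads", ["Transportation facilities"]),
]
for line in to_ntriples(records):
    print(line)
```

N-Triples is used here only because it needs no RDF library; the same triples
could equally be serialised as RDF/XML.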

So some questions:

i) Now that we have an RDF/XML version of LOC TGM, it would be great to make
it more widely available, although on a "there may be problems, feedback
please" basis, as it will help both SKOS and RDF. But I guess it would be a
good idea to talk to LOC about this first - has anybody on the team got
contacts there?

ii) In the LOC TGM, I don't understand some of the relations. For example,
the subject term "cadaver" has a broader term "animals", or "ordnance" has
the broader term "household goods". Why?

One of the things I've done is taken the LOC TGM and applied it to our
Artstor demo dataset, and then used inference to add the broader terms to
collection items. There are two reasons why it is potentially useful to do
this:

- if someone searches for a broader term, then they will get items that
match the broader term. Before the LOC TGM information was added, the
broader term relations were not described at all, so a search for the
broader term would have returned no matches.

- our browser can make use of hierarchical information, but the subject
index in Artstor is very flat. However by adding LOC TGM, we can in effect
add hierarchical information which makes browsing a little better (although
admittedly the concept space is still fairly flat). 
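
The inference step described above can be sketched roughly as follows. This
is a minimal Python illustration, not the actual Longwell code: it assumes
the broader-term relations are held in a plain dictionary, that "broader" is
treated as transitive, and the item identifier and function names are
hypothetical. The relations themselves are ones mentioned in this message.

```python
def ancestors(term, broader):
    """Return every term reachable from `term` by following broader links."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in broader.get(current, ()):
            if parent not in found:  # also guards against cycles in the data
                found.add(parent)
                stack.append(parent)
    return found


def expand_subjects(items, broader):
    """Add the inferred broader terms to each item's set of subjects."""
    expanded = {}
    for item, subjects in items.items():
        inferred = set()
        for subject in subjects:
            inferred |= ancestors(subject, broader)
        expanded[item] = set(subjects) | inferred
    return expanded


def search(term, expanded):
    """Return the items whose (expanded) subjects include `term`."""
    return sorted(item for item, subjects in expanded.items() if term in subjects)


# Broader-term relations as reported in this message.
broader = {
    "mausoleums": ["funerary facilities"],
    "funerary facilities": ["facilities"],
    "cadaver": ["animals"],
}
# Hypothetical item identifier, for illustration only.
items = {"example-item-1": {"mausoleums"}}

expanded = expand_subjects(items, broader)
print(search("facilities", expanded))
```

Without the expansion step, a search for "facilities" would find nothing,
because the item is only indexed with the narrower term.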

Here are some examples of the hierarchical relations inferred from the LOC
TGM for the Artstor demo dataset:

===========

ONES THAT MAKE SENSE

"crimes"
+-- "murders"
+-- "man slaughter"
+-- "homicides"

"Natural phenomena"
+-- "Climate"

"Supernatural"
+-- "Characters, Fictitious"
+-- "Imaginary beings"

"Transportation facilities"
+-- "Highways"
+-- "Roads"

"Vehicles"
+-- "Freight Wagons"
+-- "Coaches"

===========

ONES THAT DO NOT

"Facilities"
+-- "Funerary facilities"
|  +-- "Mausoleums"
|  +-- "Tombstones"
|
+-- "Exhibition facilities"
   +-- "Exposition pavilions"
   +-- "museums"

It makes sense to group mausoleums and tombstones under funerary facilities,
but much less sense to group them with exhibition facilities under
facilities. 

"Events"
+-- "bomb damage"
+-- "processions"

Processions under events makes sense, but not bomb damage.

"Household goods"
+-- "Ordnance"
+-- "Tableware"

Are they thinking of the Anarchist Cookbook?

"People"
+-- "Dead bodies"
+-- "Deceased"
+-- "Dead animals"
+-- "Personnel"

All the terms relating to death seem related, but then "people" and
"personnel" seem unrelated.

"Pictures"
+-- "Cartoons"
+-- "Comic pictures"
+-- "Humorous pictures"
+-- "Ornaments"
+-- "Paintings"
+-- "Reconstructions"

This is passable, but I'm not sure why "ornaments" is here, as I'd class an
ornament as a 3D rather than a 2D object.

This leads to some strange side effects. For example, some images in Goya's
Disasters of War series are indexed with the subject "cadaver", which makes
sense. But adding the LOC TGM broader terms means Disasters of War now also
has "animals" as a subject term, even though many of the images contain
people rather than animals. Why is "animals" a broader term for "cadaver"?

Have I fundamentally misunderstood something about how thesauri work?

iii) There are plenty of other organisations producing thesauri e.g.

WordHoard Initiative http://www.mda.org.uk/wrdhrd1.htm 

English Heritage Thesauri
http://www.english-heritage.org.uk/thesaurus/frequentuser.htm 

Entomology Database library http://entomology.si.edu/entomology/data.lasso 

Schools Online Thesaurus http://scot.curriculum.edu.au/

Thesaurus for applied life sciences http://194.203.77.66/

National Criminal Justice Reference Service Thesaurus
http://abstractsdb.ncjrs.org/content/Thesaurus/Thesaurus_AlphabeticalList.asp

Medline / MESH subject headings
http://www.nlm.nih.gov/mesh/newd2004.html

So should we pick a few interesting thesauri, approach the organisations and
try to make them available as SKOS / RDF?

Any other comments?

Dr Mark H. Butler
Research Scientist, HP Labs Bristol
http://www-uk.hpl.hp.com/people/marbut 

Received on Monday, 19 April 2004 13:02:04 UTC