[making-corpus-rdf] discussion notes on ims

Kevin and I just spent some time discussing IMS on the phone - here are the
major points:

1. Different URIs for the same resource

I made a mistake: in fact these two URIs point to the same document:

<http://ocw.mit.edu/NR/rdonlyres/Urban-Studies-and-Planning/11-208Introduction-to-Computers-in-Public-Management-IIJanuary--IAP-2002/C7D9370D-8BD3-48C2-B367-D4917E9BDC14/0/lect4.pdf>

<http://ocw.mit.edu/NR/rdonlyres/C7D9370D-8BD3-48C2-B367-D4917E9BDC14/0/lect4.pdf>

so there must be some kind of redirection on the OCW website. Kevin noted
that OCW itself uses the first URI, so we are probably better off using that
URI, although it's a bit longer.

=> Decision: use first version.

I suggested we could shorten these URIs in the N3 by adding a namespace
definition to the RDF/XML, like
xmlns:ocwcontent='http://ocw.mit.edu/NR/rdonlyres/'
as I thought the N3 writer would then abbreviate them, but this didn't work.
Andy, is it possible to use prefixes to shorten URIs in N3?
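
One possible explanation, offered as an assumption rather than a statement
about how the actual N3 writer behaves: a writer may only abbreviate a URI
when the part left after the namespace is a simple local name, and here the
remainder still contains slashes. A minimal Python sketch of that rule
follows. The abbreviate function and the LOCAL_NAME pattern are hypothetical,
and the pattern is deliberately conservative rather than the real N3 grammar.

```python
import re

# Conservative local-name pattern. This is an assumption for illustration,
# not the exact N3 grammar.
LOCAL_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_-]*$")

def abbreviate(uri, prefixes):
    """Return 'prefix:local' if a namespace fits, else the full '<uri>'."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            local = uri[len(namespace):]
            # Only abbreviate when the remainder is a simple name.
            if LOCAL_NAME.match(local):
                return f"{prefix}:{local}"
    return f"<{uri}>"

prefixes = {"ocwcontent": "http://ocw.mit.edu/NR/rdonlyres/"}
short_uri = ("http://ocw.mit.edu/NR/rdonlyres/"
             "C7D9370D-8BD3-48C2-B367-D4917E9BDC14/0/lect4.pdf")

# The remainder contains '/' and '.', so under this rule it stays unabbreviated.
print(abbreviate(short_uri, prefixes))
```

Under this assumption, a namespace that ends at the last slash of each URI
would leave a shorter remainder, but it would also need a separate prefix for
every directory, which rather defeats the purpose.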

2. lom-tech:location

Kevin said the spec says to use this only if the resource is somewhere it
can't be retrieved by a URL, so he just put the file path to the resource
there. We can retrieve the resource via the subject URL, and the file path
is only of use to OCW, so I don't think we need to include it. We could
instead use lom-tech:location to point back to the subject, but that would
introduce a loop, which is confusing.

=> Decision: omit lom-tech:location.

3. Canonicalizing names

There has been a bit of discussion about canonicalizing names, and we decided
to try to do what is easy but leave hard canonicalization up to authority
file webservices. There are two unresolved questions here:

- The IMS metadata uses VCard, whereas at the moment the Artstor transform
uses a homegrown Person class with four properties: forename, surname, birth
and death. Andy, do you think I should switch the Artstor transform to use
VCard?

- The most common format for names in Artstor is "surname, forename,
birth-death" whereas the most common format in IMS is "forename surname",
although both collections contain variations. Should we at least try to
create a common format, e.g. "forename surname", even though this won't work
for all instances?
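
To make the second question concrete, here is a minimal Python sketch of the
"easy" case only: rewriting "surname, forename, birth-death" as "forename
surname". The function name to_forename_surname and the DATE_RANGE pattern are
hypothetical, and anything that does not fit the simple shape is returned
unchanged rather than guessed at.

```python
import re

# A part that looks like "1830-1903" or "1830-" is treated as a date range.
DATE_RANGE = re.compile(r"^\d{4}\s*-\s*(\d{4})?$")

def to_forename_surname(raw):
    """Best-effort rewrite to 'forename surname'; unchanged if unsure."""
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    names = [p for p in parts if not DATE_RANGE.match(p)]
    if len(names) == 2:
        # Assume "surname, forename" order, as is most common in Artstor.
        return f"{names[1]} {names[0]}"
    if len(names) == 1:
        # Already a single name string, e.g. "forename surname".
        return names[0]
    # Anything else (extra commas, titles, ...) is left alone.
    return raw.strip()

print(to_forename_surname("Pissarro, Camille, 1830-1903"))
print(to_forename_surname("Camille Pissarro, 1830-1903"))
```

Both calls give "Camille Pissarro". Note that the sketch drops the dates,
which would have to be kept in separate properties.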

One more point about canonicalizing names. Kevin noted that when we try to
match duplicates, it's harder if we try to guess what the different parts of
a name field mean, e.g. consider "Pissarro, Camille, 1830-1903" versus
"Camille Pissarro, 1830-1903". We can try to use the additional tokens to
guess surname and forename order. However, the most accurate representation
might be to reflect the fact that all we know is that these are identifiers:

<person:identifiers>
	<rdf:Bag>
		<rdf:li>Pissarro</rdf:li>
		<rdf:li>Camille</rdf:li>
		<rdf:li>1830</rdf:li>
		<rdf:li>1903</rdf:li>
	</rdf:Bag>
</person:identifiers>

and then try a multiple permutation match to see if two records refer to the
same person.
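
A minimal Python sketch of such a match, assuming each name field is split
into word and number tokens and compared as an unordered bag. Comparing bags
gives the same answer as trying every permutation of the tokens, without
enumerating them. The function names identifiers and same_identifiers are
hypothetical.

```python
import re
from collections import Counter

def identifiers(raw):
    """Split a name field into an unordered bag of lower-cased tokens."""
    # Runs of letters or runs of digits; punctuation and spaces are dropped.
    tokens = re.findall(r"[^\W\d_]+|\d+", raw)
    return Counter(t.lower() for t in tokens)

def same_identifiers(a, b):
    """True if both fields contain exactly the same tokens, in any order."""
    return identifiers(a) == identifiers(b)

print(same_identifiers("Pissarro, Camille, 1830-1903",
                       "Camille Pissarro, 1830-1903"))  # True
```

An exact bag match is strict: a record with missing dates or an extra middle
name would not match, so a real matcher would probably need some overlap
threshold instead.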

Dr Mark H. Butler
Research Scientist                HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/

Received on Friday, 24 October 2003 12:09:15 UTC