RE: Record Linkage in Simile from Butler, Mark on 2003-10-27 (www-rdf-dspace@w3.org from October 2003)

From: Butler, Mark <Mark_Butler@hplb.hpl.hp.com>
Date: Mon, 27 Oct 2003 11:32:17 -0000
To: "'Nick Matsakis'" <matsakis@mit.edu>, SIMILE public list <www-rdf-dspace@w3.org>
Message-ID: <E864E95CB35C1C46B72FEA0626A2E80820620E@0-mail-br1.hpl.hp.com>
Hi Nick

> So, if you are interested, please let me know what I can do 
> to help out.
> At the end of the day, what I would like to have is a corpus 
> of RDF data
> with the duplicate entries labelled so that I can test different
> algorithms on it, and hopefully a body of code to support such tests.

Yes, we've had exactly the same problem: we needed to get hold of corpora
to start work. We now have two: 

1. a large one from ArtStor of visual image metadata (approx 100,000
records).

2. a much smaller one of learning object data from OpenCourseWare (27
courses).

In addition, we may get a third one of CIDOC data. This is museum data,
although I'm not quite sure of the other details at the moment.

These corpora were obtained in XML form, so at the moment we are working on
XSLT transforms to convert the data to RDF. One of the issues we've been
discussing is actually how much of the name canonicalization should be done
in the XSLT transform, and how much should be left to programs that work on
the RDF once it has been ingested.
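
To illustrate the second option (canonicalizing after ingest rather than in
the XSLT), here is a minimal Python sketch. The function name, the
"surname, forenames" key format and the sample names are assumptions made
for illustration; they are not taken from either corpus or from any SIMILE
code.

```python
import re
import unicodedata


def canonicalize_name(raw: str) -> str:
    """Return a crude 'surname, forenames' key for a personal name.

    Hypothetical helper: real name authority work needs far more care.
    """
    # Strip accents so 'Dürer' and 'Durer' produce the same key.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Drop parenthesised qualifiers such as life dates.
    text = re.sub(r"\([^)]*\)", " ", text)
    # Drop punctuation other than commas and hyphens, then tidy whitespace.
    text = re.sub(r"[^\w\s,-]", " ", text)
    text = re.sub(r"\s+", " ", text).strip().lower()
    if "," in text:
        # Already in 'surname, forenames' order.
        surname, _, forenames = text.partition(",")
    else:
        # Assume 'forenames surname' order; the last token is the surname.
        parts = text.split(" ")
        surname, forenames = parts[-1], " ".join(parts[:-1])
    return f"{surname.strip()}, {forenames.strip()}".rstrip(", ")


if __name__ == "__main__":
    for name in ["Albrecht Dürer", "Dürer, Albrecht (1471-1528)"]:
        print(name, "->", canonicalize_name(name))
```

Doing this on the RDF side keeps the transform simple and lets the rules
change without re-running the XSLT, at the cost of carrying the raw strings
through ingest.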

At the moment we need to be very careful with the metadata and ensure it
does not become publicly available, so the corpora are not available on
the web. However, you already have an IPS Sources login, so you may already
have access to the SIMILE CVS. More details of how to set up CVS are
available at http://ipssources.com/ 

Here are some more details of the contents of the CVS that are relevant to
you:

1. We haven't loaded the entire Artstor corpus into CVS because even
compressed it is quite large (15 megabytes). However there are some samples
from this corpus - see 

simile/corpus/artstor/metadata/sample_single.xml
simile/corpus/artstor/metadata/sample_small.xml
simile/corpus/artstor/metadata/sample_medium.xml

The XSLT transform to turn these files into RDF, and the RDFS Schema are

simile/corpus/artstor/artstor.xsl
simile/corpus/artstor/vra-schema-andy-revised.n3

respectively, although both files are still being revised. The stylesheet
uses XSLT 2.0, so to run it you will need Saxon 7.7, available from
http://saxon.sourceforge.net/

2. The OpenCourseWare corpus is in the CVS - see
simile/corpus/IMS/ocw/xml/*.xml

The XSLT transform to turn these files into RDF and the RDFS Schema(s) are

simile/corpus/IMS/ocw/templateSaxon.xsl
simile/corpus/IMS/schema/*.rdfs

Feedback on the RDF output by the transforms and the VRA schema is welcome. 

Once we are happy with the output of these transforms, we plan to load them
into a number of Joseki servers, for use internally in HP and at MIT.
However to do this we need to ensure the servers are secure. There has been
some discussion about how to do this already. Hopefully we should have this
available within the next 2-3 weeks. 

> I've only been following the simile discussions at a high 
> level, so I'm
> not sure exactly what your needs are or exactly what the dataflow in
> Simile looks like.  

At the moment we are working on a demo that demonstrates the possibility of
mapping between different vocabularies. We haven't proposed a dataflow for
the demo, and if we do it is unlikely to reflect the one finally used in
SIMILE. For more details of the demo see 
http://lists.w3.org/Archives/Public/www-rdf-dspace/2003Sep/att-0033/demo_script_v2.pdf

However John Gilbert, an intern working with me over the summer, has looked
at a possible SIMILE work flow. John's second tech report discusses this
(I'm still working on this one), but for now see slide 6 of this
presentation
http://lists.w3.org/Archives/Public/www-rdf-dspace/2003Sep/att-0055/sw_tool_investigation.pdf

(BTW, I forgot to mention it, John's first report is now available here
http://www-uk.hpl.hp.com/people/marbut/schemasMetadataThesauri.pdf)
 
> Do you have a fixed corpus for which you need to
> extract duplicates, 

For the demo we effectively have fixed corpora. For SIMILE as a system
however this is something that will happen whenever records are ingested. 

> or do you expect to be doing this 
> frequently?  What
> are the short term and long term applications for record linking in
> Simile?

Well my hypothesis (and I want to qualify it as mine, it may not be held by
the rest of the team) is that record linking is potentially more important
for interoperability between collections than simply mapping between
different schema properties and classes. I've written about this a few times
on the list, but I'll try to restate my argument to see if I can make it
clearer. 

When we build systems that help users retrieve resources, it is tempting
just to build catalogues of those resources. However such models are of
limited effectiveness. A better approach is to build a system that
formalizes the conceptual model used by users to navigate those resources.
One possible conceptual model is outlined in the FRBR (Functional
Requirements for Bibliographic Records) specification, which doesn't just
represent the collection (called group 1 entities in FRBR terminology) but
also people and organizations who are related to the collection (group 2
entities) and concepts that are related to the collection (group 3
entities). For more details of FRBR see 
http://www.ifla.org/VII/s13/frbr/frbr.pdf

Now the problem with just mapping between schemas is that different schemas
describe different relationships between these entities. For example, a
learning object may have an artist as its subject whereas a visual resource
has that artist as its creator. Now mappings between these two schemas are
likely to say that learning_object:creator is a similar concept to
visual_image:creator, but that mapping may not be helpful if we want to
retrieve all the resources relating to a particular artist. The only way to
do this is to do record linking between group 1, 2 and 3 entities. So my
expectation is that record linking will identify relationships between
collections which would not be identified solely by simple schema mapping. 
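
To make the creator/subject example concrete, here is a minimal Python
sketch. The two records, their identifiers, the property names and the
name_key helper are all invented for illustration; none of them come from
the ArtStor or OCW data, and a real linker would need much better matching
than this token comparison.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical records from two collections with different schemas.
RECORDS = [
    {"id": "ocw:lecture-12", "schema": "learning_object",
     "creator": "A. Lecturer", "subject": "Dürer, Albrecht"},
    {"id": "artstor:img-001", "schema": "visual_image",
     "creator": "Albrecht Durer", "subject": "rhinoceros"},
]


def name_key(name: str) -> tuple:
    """Crude matching key: accent-free, lower-case tokens in sorted order."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return tuple(sorted(re.findall(r"[a-z]+", text.lower())))


def find_by_schema_mapping(records, artist):
    """Treat learning_object:creator and visual_image:creator as equivalent
    and search only that mapped property."""
    key = name_key(artist)
    return [r["id"] for r in records if name_key(r["creator"]) == key]


def link_entities(records, properties=("creator", "subject")):
    """Index every value of the given properties by its matching key, so one
    entity is linked to each record that mentions it in any role."""
    index = defaultdict(list)
    for record in records:
        for prop in properties:
            index[name_key(record[prop])].append((record["id"], prop))
    return index


if __name__ == "__main__":
    print(find_by_schema_mapping(RECORDS, "Albrecht Dürer"))
    print(link_entities(RECORDS)[name_key("Albrecht Dürer")])
```

With these made-up records the mapped creator search returns only the
visual image, while the linked index also returns the learning object that
has the artist as its subject.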

For more discussion of this see
http://lists.w3.org/Archives/Public/www-rdf-dspace/2003Sep/0035.html
http://lists.w3.org/Archives/Public/www-rdf-dspace/2003Sep/0038.html
http://lists.w3.org/Archives/Public/www-rdf-dspace/2003Sep/0043.html

However this is a hypothesis so we will see.

Dr Mark H. Butler
Research Scientist                HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/
Received on Monday, 27 October 2003 06:33:27 UTC