- From: Butler, Mark <Mark_Butler@hplb.hpl.hp.com>
- Date: Mon, 27 Oct 2003 11:32:17 -0000
- To: "'Nick Matsakis'" <matsakis@mit.edu>, SIMILE public list <www-rdf-dspace@w3.org>
Hi Nick

> So, if you are interested, please let me know what I can do to help out.
> At the end of the day, what I would like to have is a corpus of RDF data
> with the duplicate entries labelled so that I can test different
> algorithms on it, and hopefully a body of code to support such tests.

Yes, we've had exactly the same problem: we needed to get hold of corpora to start work. We now have two:

1. A large one from ArtStor of visual image metadata (approx. 100,000 records).
2. A much smaller one of learning object data from OpenCourseWare (27 courses).

In addition, we may get a third one of CIDOC data. This is museum data, although I'm not quite sure of the other details at the moment.

These corpora were obtained in XML form, so at the moment we are working on XSLT transforms to convert the data to RDF. One of the issues we've been discussing is how much of the name canonicalization should be done in the XSLT transform, and how much should be left to programs that work on the RDF once it has been ingested.

At the moment we need to be very careful with the metadata and ensure it does not become publicly available, so the corpora are not available on the web. However, you already have an IPS Sources login, so you potentially have access to the SIMILE CVS. More details of how to set up CVS are available at http://ipssources.com/

Here are some more details of the contents of the CVS that are relevant to you:

1. We haven't loaded the entire ArtStor corpus into CVS because even compressed it is quite large (15 megabytes). However, there are some samples from this corpus; see

simile/corpus/artstor/metadata/sample_single.xml
simile/corpus/artstor/metadata/sample_small.xml
simile/corpus/artstor/metadata/sample_medium.xml

The XSLT transform to turn these files into RDF and the RDFS schema are

simile/corpus/artstor/artstor.xsl
simile/corpus/arstor/vra-schema-andy-revised.n3

respectively, although both these files are still under revision.
The stylesheet uses XSLT 2.0, so to run it you will need Saxon 7.7, available from http://saxon.sourceforge.net/

2. The OpenCourseWare corpus is in the CVS; see

simile/corpus/IMS/ocw/xml/*.xml

The XSLT transform to turn these files into RDF and the RDFS schema(s) are

simile/corpus/IMS/ocw/templateSaxon.xsl
simile/corpus/IMS/schema/*.rdfs

Feedback on the RDF output by the transforms and on the VRA schema is welcome.

Once we are happy with the output of these transforms, we plan to load the results into a number of Joseki servers, for use internally at HP and at MIT. However, to do this we need to ensure the servers are secure. There has already been some discussion about how to do this. Hopefully we should have this available within the next 2-3 weeks.

> I've only been following the simile discussions at a high level, so I'm
> not sure exactly what your needs are or exactly what the dataflow in
> Simile looks like.

At the moment we are working on a demo that demonstrates the possibility of mapping between different vocabularies. We haven't proposed a dataflow for the demo, and if we do, it is unlikely to reflect the one finally used in SIMILE. For more details of the demo see

http://lists.w3.org/Archives/Public/www-rdf-dspace/2003Sep/att-0033/demo_script_v2.pdf

However, John Gilbert, an intern working with me over the summer, has looked at a possible SIMILE workflow. John's second tech report discusses this (I'm still working on this one), but for now see slide 6 of this presentation:

http://lists.w3.org/Archives/Public/www-rdf-dspace/2003Sep/att-0055/sw_tool_investigation.pdf

(BTW, I forgot to mention it: John's first report is now available at http://www-uk.hpl.hp.com/people/marbut/schemasMetadataThesauri.pdf)

> Do you have a fixed corpus for which you need to extract duplicates,

For the demo we effectively have fixed corpora. For SIMILE as a system, however, this is something that will happen whenever records are ingested.
> or do you expect to be doing this frequently? What are the short term
> and long term applications for record linking in Simile?

Well, my hypothesis (and I want to qualify it as mine; it may not be held by the rest of the team) is that record linking is potentially more important for interoperability between collections than simply mapping between different schema properties and classes. I've written about this a few times on the list, but I'll try to restate my argument to see if I can make it clearer.

When we build systems that help users retrieve resources, it is tempting just to build catalogues of those resources. However, such models are of limited effectiveness. A better approach is to build a system that formalizes the conceptual model used by users to navigate those resources. One possible conceptual model is outlined in the FRBR (Functional Requirements for Bibliographic Records) specification, which doesn't just represent the collection (called group 1 entities in FRBR terminology) but also people and organizations who are related to the collection (group 2 entities) and concepts that are related to the collection (group 3 entities). For more details of FRBR see

http://www.ifla.org/VII/s13/frbr/frbr.pdf

Now the problem with just mapping between schemas is that different schemas describe different relationships between these entities. For example, a learning object may have an artist as its subject, whereas a visual resource has that artist as its creator. Mappings between these two schemas are likely to say that learning_object:creator is a similar concept to visual_image:creator, but that mapping may not be helpful if we want to retrieve all the resources relating to a particular artist. The only way to do this is to do record linking between group 1, 2 and 3 entities.

So my expectation is that record linking will identify relationships between collections which would not be identified solely by simple schema mapping.
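As a rough illustration of the creator/subject example above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the two records are invented, the field names simply follow the learning_object and visual_image example, and canonical_name is a deliberately crude matching key, not the canonicalization done in the SIMILE transforms.

```python
import re
import unicodedata
from collections import defaultdict

# Invented sample records; the field names and values are illustrative only.
RECORDS = [
    {"id": "artstor:1", "schema": "visual_image", "creator": "Monet, Claude"},
    {"id": "ocw:7", "schema": "learning_object",
     "creator": "Smith, Jane", "subject": "Claude Monet"},
]


def canonical_name(name):
    """Crude matching key: strip accents and punctuation, lower-case, sort tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(sorted(re.findall(r"[a-z]+", ascii_only.lower())))


def find_by_creator_mapping(records, person):
    """Schema mapping only: treat creator in both schemas as the same property."""
    key = canonical_name(person)
    return [r["id"] for r in records if canonical_name(r.get("creator", "")) == key]


def link_people(records, fields=("creator", "subject")):
    """Record linking: index every person-valued field by canonical name."""
    index = defaultdict(set)
    for record in records:
        for field in fields:
            if field in record:
                index[canonical_name(record[field])].add((record["id"], field))
    return index


if __name__ == "__main__":
    print(find_by_creator_mapping(RECORDS, "Claude Monet"))
    print(sorted(link_people(RECORDS)[canonical_name("Claude Monet")]))
```

Querying through the creator-to-creator mapping alone finds only the visual image record, while the linked index also returns the learning object that has the same person as its subject. A real linker would need fuzzier matching than sorted name tokens.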
For more discussion of this see

http://lists.w3.org/Archives/Public/www-rdf-dspace/2003Sep/0035.html
http://lists.w3.org/Archives/Public/www-rdf-dspace/2003Sep/0038.html
http://lists.w3.org/Archives/Public/www-rdf-dspace/2003Sep/0043.html

However this is a hypothesis so we will see.

Dr Mark H. Butler
Research Scientist
HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/
Received on Monday, 27 October 2003 06:33:27 UTC