- From: Nick Matsakis <matsakis@mit.edu>
- Date: Mon, 27 Oct 2003 11:59:26 -0500 (EST)
- To: "Butler, Mark" <Mark_Butler@hplb.hpl.hp.com>
- Cc: SIMILE public list <www-rdf-dspace@w3.org>
On Mon, 27 Oct 2003, Butler, Mark wrote:

> We now have two [corpora]:
> 1. a large one from ArtStor of visual image metadata
> 2. a much smaller one of learning object data from OpenCourseWare

Is my impression correct that, in these corpora, it is only 'people' that are being matched up, rather than, say, published works or other resource classes?

> One of the issues we've been discussing is actually how much of the name
> canonicalization should be done in the XSLT transform, and how much
> should be left to programs that work on the [ingested] RDF

I have faced this problem in bibliographic data extracted from bibtex files. My philosophy is that software that translates other formats into RDF should be split into two functional stages. The first stage should be a simple translation from the native format to RDF; the second stage should apply more complex transformations to the translated RDF to ensure the resulting metadata is as clean as possible. You can think of these stages as first 'ingesting' and then 'digesting' the source metadata. In the case of the current corpora, it seems like the XSLT transformations should be considered 'ingesting' software.

There are two properties that I think ingesting software should have. First, it should translate the source format (vcard, bibtex, vera, ...) into an RDF schema that has a one-to-one mapping between the properties in the source format and those in the RDF schema. This separates domain knowledge about the syntax of the original format from domain knowledge about the semantics of the original format. As a side benefit, it encourages the creation of an RDF schema for the original source format. This allows us to go 'back to the source' when debugging digesting software without literally going back to the source.

My second requirement for ingestion software is that any record linkage it does, including name canonicalization, should err on the side of caution. There are two types of linkage error: linking distinct entities, or failing to link identical entities. Of these, linking distinct entities is the more grave, for reasons I hope are obvious. Thus, it should be unacceptable for ingestion software to assign the same URI to distinct entities, even if the cost is that it will miss some 'obvious' connections. Making those connections is the task of digesting software.

Nick Matsakis
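
A minimal sketch of the ingest/digest split described above, assuming a hypothetical bibtex record; the namespace URIs, field names, and the exact-string linkage rule are illustrative choices, not anything prescribed in the message:

```python
# Stage 1: 'ingest' -- a one-to-one mapping from bibtex fields to an RDF
# schema that mirrors the source format, with no interpretation applied.
BIBTEX_NS = "http://example.org/schema/bibtex#"   # hypothetical namespace

def ingest(record_uri, bibtex_fields):
    """Emit exactly one triple per source field, preserving the source
    property names and literal values verbatim."""
    return [(record_uri, BIBTEX_NS + field, value)
            for field, value in bibtex_fields.items()]

# Stage 2: 'digest' -- cleanup and cautious record linkage over the
# already-ingested RDF.  The linkage rule here is deliberately strict:
# author literals are merged only on an exact string match, so distinct
# people are never assigned the same URI at this stage.
PERSON_NS = "http://example.org/id/person/"       # hypothetical namespace

def digest(triples):
    canonical = {}   # exact author string -> person URI
    derived = []
    for subj, pred, obj in triples:
        if pred == BIBTEX_NS + "author":
            uri = canonical.setdefault(obj, PERSON_NS + str(len(canonical)))
            derived.append((subj, "http://purl.org/dc/elements/1.1/creator", uri))
    return triples + derived

triples = ingest("http://example.org/id/record/1",
                 {"author": "Matsakis, Nick",
                  "title": "Some Paper",
                  "year": "2003"})
for t in digest(triples):
    print(t)
```

The point of the split in this sketch is that stage 1 keeps a faithful RDF copy of the source record under its own schema, so stage 2 can be re-run or debugged against it without going back to the original file, and stage 2's conservative matching prefers missing 'obvious' variants over ever giving two distinct people the same URI.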
Received on Monday, 27 October 2003 12:07:43 UTC