- From: Butler, Mark <Mark_Butler@hplb.hpl.hp.com>
- Date: Mon, 13 Oct 2003 12:02:04 +0100
- To: "'www-rdf-dspace@w3.org'" <www-rdf-dspace@w3.org>
Hi Kevin,

Kevin writes in regard to Andy's suggestion:

> Creating a class for Person is fine, but combining multiple schemas
> into the same Person object I think is an error.

then later you say

> in other words, instead of extending Person with the contents
> of each new corpus, each new corpus can maintain its own Person class,
> each with its own meaning,

I don't think this is a problem, because RDF supports multiple inheritance, so each new corpus can still maintain its own Person class. We have a single URI that represents the concept of Leonardo da Vinci, and this URI can be an instance of several different classes concurrently, with the properties necessary to be a member of each class. The important point is identifying that these instances apply to the same individual, and indicating that via the URI. This is what your SoundExSimilarPerson and GettyULANPerson classes are doing, right? We also get deconfliction automatically, because the properties are in different namespaces.

To put it another way,

objectA [ rdf:type typeB
          rdf:type typeC
          b:propertyD "value1"
          c:propertyE "value2" ]

is equivalent to

objectD [ rdf:type typeB
          b:propertyD "value1"
          owl:sameAs objectE ]

objectE [ rdf:type typeC
          c:propertyE "value2"
          owl:sameAs objectD ]

right?

> Rather than replace the original meaning, what you
> need is to apply an adaptor pattern to adjust the meaning to a new
> context;

By adaptor pattern, do you envisage an ontology (OWL) or RDFS document, or do you mean a programmatic description? One reason we might want to use an adaptor pattern is that it allows us to normalize the data. We are used to the idea of normalizing data in relational databases, but the idea is also applicable to XML - see [1] and [2] - and, I hypothesise, to RDF.
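To make the equivalence above concrete, here is a minimal sketch in plain Python, modelling triples as (subject, predicate, object) tuples rather than using an RDF toolkit. The node and property names are the illustrative ones from the example; the merge function is my own sketch of how an owl:sameAs link can be collapsed, not any particular library's smushing algorithm.

```python
def merge_same_as(triples, a, b, merged):
    """Collapse two owl:sameAs-linked nodes a and b into a single node,
    dropping the owl:sameAs links themselves."""
    out = set()
    for s, p, o in triples:
        if p == "owl:sameAs" and {s, o} <= {a, b}:
            continue  # the identity link is absorbed by the merge
        s = merged if s in (a, b) else s
        o = merged if o in (a, b) else o
        out.add((s, p, o))
    return out

# The two-node form from the example...
split = {
    ("objectD", "rdf:type", "typeB"),
    ("objectD", "b:propertyD", "value1"),
    ("objectD", "owl:sameAs", "objectE"),
    ("objectE", "rdf:type", "typeC"),
    ("objectE", "c:propertyE", "value2"),
    ("objectE", "owl:sameAs", "objectD"),
}

# ...collapses to the single multi-typed node, with properties from
# both namespaces coexisting without conflict.
merged = merge_same_as(split, "objectD", "objectE", "objectA")
assert merged == {
    ("objectA", "rdf:type", "typeB"),
    ("objectA", "rdf:type", "typeC"),
    ("objectA", "b:propertyD", "value1"),
    ("objectA", "c:propertyE", "value2"),
}
```

Because b:propertyD and c:propertyE live in different namespaces, the merged node keeps both values with no clash, which is the "deconfliction for free" point made above.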
It seems counterintuitive to talk about normalization in RDF, because if we pick our first-class entities correctly we get normalization for free, but I guess that by thinking about (RDF) models from a normalization perspective we can check how well designed a model is. When we map between corpora, and come up with representations of individuals that combine multiple vocabularies similar to those above, we can consider normalization as well. Clearly it is a bad idea for an instance to have multiple properties, associated with different namespaces, that duplicate the same value. Where there is consistent duplication, we could omit properties and use inference and subproperty relations instead.

However, compound relations are more complicated. For example, in Andy's example there is a relation between artstorID and familyName, givenName, dateOfBirth and dateOfDeath. In the subsequent discussion, let's call the latter the galleryCard representation (because it's similar to vCard, but with DOB/DOD as well). The relationship between artstorID and the galleryCard representation is more complicated in one direction than the other: to go from artstorID to galleryCard we have to do some kind of tokenization, which is potentially unreliable, whereas to go from galleryCard to artstorID is easier because we just aggregate. Therefore, to perform normalization, it seems attractive to take artstorID at ingest, break it into galleryCard, and then implement some kind of viewer to aggregate back to the artstorID representation. We can represent both relations between the galleryCard properties and artstorID programmatically, but I don't think we can indicate such relations using languages like OWL - perhaps an OWL expert can correct me here if I'm wrong? However, I think there is another design principle here that overrides the need for normalization.
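The tokenize/aggregate asymmetry can be sketched as follows. This is plain Python, and the parsing rule is my own guess at the layout of the artstorID string ("familyName,givenName,dateOfBirth-dateOfDeath"); the field-to-token mapping is illustrative, which is exactly why the tokenization direction is the unreliable one.

```python
def split_artstor_id(artstor_id):
    """Tokenize an artstorID into galleryCard properties.
    Assumes the layout "familyName,givenName,DOB-DOD" - guesswork,
    hence potentially unreliable."""
    family, given, dates = artstor_id.split(",")
    dob, dod = dates.split("-")
    return {"familyName": family, "givenName": given,
            "dateOfBirth": dob, "dateOfDeath": dod}

def aggregate_gallery_card(card):
    """Rebuild the artstorID form from galleryCard properties.
    Pure aggregation, so this direction is easy and reliable."""
    return "%s,%s,%s-%s" % (card["familyName"], card["givenName"],
                            card["dateOfBirth"], card["dateOfDeath"])

card = split_artstor_id("Leonardo,da Vinci,1452-1519")
# Round-tripping only works while the tokenization guess holds.
assert aggregate_gallery_card(card) == "Leonardo,da Vinci,1452-1519"
```

The asymmetry is visible in the code: split_artstor_id encodes assumptions about delimiters and field order, while aggregate_gallery_card is a mechanical join.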
Historians talk about primary and secondary sources, and the problem with the split-at-ingest / reaggregate approach is that we have thrown away a primary source and are rebuilding it from a secondary source. Despite the appeal of normalization, this seems a bad idea. So I think it is okay to split to galleryCard at ingest, but I'm keen for us to keep the original "Leonardo,da Vinci,1452-1519" as well.

[1] Normalizing XML, Part 1, Will Provost, XML.com,
http://www.xml.com/pub/a/2002/11/13/normalizing.html
[2] Normalizing XML, Part 2, Will Provost, XML.com,
http://www.xml.com/pub/a/2002/12/04/normalizing.html

Dr Mark H. Butler
Research Scientist
HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/
Received on Monday, 13 October 2003 07:02:36 UTC