RE: Overview of design decisions made creating stylesheet and schema for Artstor data from Seaborne, Andy on 2003-10-14 (www-rdf-dspace@w3.org from October 2003)

From: Seaborne, Andy <Andy_Seaborne@hplb.hpl.hp.com>
Date: Tue, 14 Oct 2003 11:14:51 +0100
To: "'Kevin Smathers'" <kevin.smathers@hp.com>
Cc: "'www-rdf-dspace@w3.org'" <www-rdf-dspace@w3.org>
Message-ID: <E864E95CB35C1C46B72FEA0626A2E8083025D0@0-mail-br1.hpl.hp.com>
Kevin wrote:
> objectA
> [
> rdf:type typeA
> b:propertyB "Yin"
> c:propertyC "valuec"
> ]
> 
> objectB
> [
> rdf:type typeB
> b:propertyB "Yang"
> d:propertyD "valued"
> ]
> 
> objectC
> [
> rdf:type someEquivalenceType
>  equivalent <objectA>
>  equivalent <objectB>
> ]

But this is a bit different.  It is talking about the metadata records, not
the thing being modelled, because if <a> is equivalent to <b> then the
statements about <a> are also true of <b>:

<a> <prop> "v" .
<a> owl:sameAs <b> .
=>
<b> <prop> "v" .

In saying that the concept "Leonardo da Vinci" is the same in two corpora
we are saying that statements in one are true in the other.  This is both
desirable and a real nuisance.  Sometimes you want provenance, sometimes
not.  This is where the logical nature of RDF means we cannot think of it
as a data structure with local meaning only.
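
As a rough illustration of that entailment (a sketch only, assuming triples
are plain Python tuples; same_as_closure is a made-up helper, not part of
any RDF toolkit, and it only rewrites the subject position):

```python
SAME_AS = "owl:sameAs"

def same_as_closure(triples):
    """Copy statements across owl:sameAs links (subject position only)."""
    triples = set(triples)
    changed = True
    while changed:
        changed = False
        links = [(s, o) for s, p, o in triples if p == SAME_AS]
        for a, b in links:
            for x, y in ((a, b), (b, a)):  # treat sameAs as symmetric
                for s, p, o in list(triples):
                    if s == x and p != SAME_AS and (y, p, o) not in triples:
                        triples.add((y, p, o))
                        changed = True
    return triples

data = {("a", "prop", "v"), ("a", SAME_AS, "b")}
inferred = same_as_closure(data)
print(("b", "prop", "v") in inferred)  # True
```

Once the closure has run there is no record of which corpus originally said
what, which is the nuisance side of the same coin.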

This is one use of quads: data management and one-level provenance
tracking.  There are several different ways to use the fourth slot, and
they are not all equivalent.
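
One possible reading of the fourth slot, sketched in Python.  The Quad
tuple, the ex: and corpus: names and the two helper functions are all
assumptions made for illustration; here the fourth slot simply names the
source that asserted the triple:

```python
from collections import namedtuple

# subject, predicate, object, plus a fourth slot naming the asserting source
Quad = namedtuple("Quad", "subject predicate obj source")

quads = [
    Quad("ex:leonardo", "ex:name", "Leonardo da Vinci", "corpus:artstor"),
    Quad("ex:leonardo", "ex:born", "1452", "corpus:biography"),
]

def statements_from(quads, source):
    """Return the plain triples asserted by one source."""
    return [(q.subject, q.predicate, q.obj) for q in quads if q.source == source]

def merged(quads):
    """Drop the fourth slot: the view when provenance is not wanted."""
    return {(q.subject, q.predicate, q.obj) for q in quads}

print(statements_from(quads, "corpus:artstor"))
print(len(merged(quads)))  # 2
```

This gives exactly one level of provenance: it says which source asserted a
triple, but nothing about where that source got it from.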

In the specific demo scenario we are focusing on, if we have (an image
of) a work of art by "Leonardo da Vinci" and a biography about him, we do
want to say they are the same person.  That increases the value of the
information by making such relationships explicit.

Recording all the equivalences means that at a single point on the web,
someone (something) is recording all these mappings.  But if A = B and
B = C, then stating this through a concept implies that A = C.

If that concept is what your objectC is, then we are doing the same thing,
and a consequence is
 objectA b:propertyB "Yang"
because objectA equivalent objectB means it is the same thing.

If that objectC is recording lists of pairwise mappings, this does not
happen without A and C being brought together somewhere.  In a global
federated system, this isn't practical.
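
The two readings of objectC can be contrasted with a small Python sketch.
The names concept_of, site1 and site2 are hypothetical; the point is only
that the pairwise reading yields A = C when both pairs are visible in one
place:

```python
def same_via_concept(concept_of, x, y):
    """Hub reading: two items are the same if they map to the same concept."""
    return x in concept_of and concept_of[x] == concept_of.get(y)

def same_via_pairs(pairs, x, y):
    """Pairwise reading: follow the mapping pairs known at this one site."""
    seen, frontier = {x}, [x]
    while frontier:
        node = frontier.pop()
        for a, b in pairs:
            for src, dst in ((a, b), (b, a)):  # pairs are symmetric
                if src == node and dst not in seen:
                    seen.add(dst)
                    frontier.append(dst)
    return y in seen

concept_of = {"A": "concept1", "B": "concept1", "C": "concept1"}
site1 = [("A", "B")]  # mappings recorded at one site
site2 = [("B", "C")]  # mappings recorded at another site

print(same_via_concept(concept_of, "A", "C"))   # True
print(same_via_pairs(site1, "A", "C"))          # False
print(same_via_pairs(site1 + site2, "A", "C"))  # True
```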

	Andy

-----Original Message-----
From: Kevin Smathers [mailto:kevin.smathers@hp.com] 
Sent: 13 October 2003 18:36
To: Butler, Mark
Cc: 'www-rdf-dspace@w3.org'
Subject: Re: Overview of design decisions made creating stylesheet and schema for Artstor data



Butler, Mark wrote:

>Hi Kevin,
>
>>>Kevin writes in regard to Andy's suggestion
>
>>Creating a class for Person is fine, but combining multiple schemas 
>>into the same Person object I think is an error.  
>
>then later you say
>
>>in other words, instead of extending Person with the contents of each
>>new corpus, each new corpus can maintain its own Person class, each
>>with its own meaning,
>
>
>I don't think this is a problem, because RDF supports multiple inheritance,
>so each new corpus can still maintain its own Person class. We have a single
>URI that represents the concept of Leonardo da Vinci, and this can be an
>instance of several different classes concurrently, with the properties
>necessary to be a member of each class. The important point is identifying
>that these instances apply to the same individual, and indicating that via
>the URI. This is what your SoundExSimilarPerson and GettyULANPerson classes
>are doing, right? We also get deconfliction automatically, as the
>properties are in different namespaces.
>
>To put it another way, 
>
>objectA
>[
>rdf:type typeB
>rdf:type typeC
>b:propertyD "value1"
>c:propertyE "value2"
>]
>
>is equivalent to
>
>objectD
>[
>rdf:type typeB
>b:propertyD "value1"
>b:sameAs objectE
>]
>
>objectE
>[
>rdf:type typeC
>c:propertyE "value2"
>c:sameAs objectD
>]
>
>right?
>

I agree that the two cases that you show are equivalently expressive, 
but I wasn't talking about multiple classification.  In cases where 
objectA and objectB are independently developed, the semantic value of 
some propertyB is likely to vary even when referring to the same 
property.  Andy proposes moving the discordant element into a new 
property that is a schema-specific identifier, but the way I would model 
it is that the instances remain separate, in other words:

objectA
[
rdf:type typeA
b:propertyB "Yin"
c:propertyC "valuec"
]

objectB
[
rdf:type typeB
b:propertyB "Yang"
d:propertyD "valued"
]

objectC
[
rdf:type someEquivalenceType
:equivalent <objectA>
:equivalent <objectB>
]

In your example objectA is inextricably both typeB and typeC.  Thus in 
your example instances of typeB can be equivalent to instances of typeC 
for only one sense of equivalence -- there can't be any conflicts (one 
references Getty, another references some homebrew canonical 
transformation), nor can objectA take one equivalence with different 
objects depending on the context of the equivalence.
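
To make the separate-instances model concrete, here is a small Python
sketch using plain dictionaries.  The dictionary layout and the
equivalents_of helper are assumptions for illustration; the names and
values come from the example above:

```python
objects = {
    "objectA": {"rdf:type": "typeA", "b:propertyB": "Yin", "c:propertyC": "valuec"},
    "objectB": {"rdf:type": "typeB", "b:propertyB": "Yang", "d:propertyD": "valued"},
}

# objectC only points at the instances; it does not copy their properties.
equivalences = {
    "objectC": {"rdf:type": "someEquivalenceType",
                "equivalent": ["objectA", "objectB"]},
}

def equivalents_of(name):
    """Find instances linked to `name` through any equivalence record."""
    found = set()
    for record in equivalences.values():
        members = record["equivalent"]
        if name in members:
            found.update(m for m in members if m != name)
    return found

print(objects["objectA"]["b:propertyB"])  # Yin
print(equivalents_of("objectA"))          # {'objectB'}
```

objectA keeps its own value of b:propertyB, and further equivalence records
can be added or removed without touching either instance.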

>
>  
>
>>Rather than replace the original meaning, what you 
>>need is to apply an adaptor pattern to adjust the meaning to a new 
>>context; 
>>    
>>
>
>By adaptor pattern, do you envisage an ontology (OWL) or RDFS document, or
>do you mean a programmatic description?
>

Here I'm trying to develop a theory for handling opposing theories of 
classification.  Again, Andy's approach, if I understand correctly, is 
to rationalize the opposing views -- that is to choose a dominant view, 
and relegate sub-dominant views to historical references.  By using an 
adaptor pattern what I propose is that each data source should be able 
to maintain its own dominant view, with adaptive extensions to allow it 
to be queried in the opposing domain.  In other words, a library that, 
for example, indexes its collections by Library of Congress identifiers 
should continue to see the Library of Congress identifier as the primary 
identifier of its records, but those records could be mapped for 
interlibrary use to a library that indexes using Dewey Decimal identifiers 
by an adaptive wrapper around the original instance.  The adaptive 
wrapper adds flexibility in the mapping and can conceivably be 
instantiated differently for each peer that would like to see Dewey 
Decimal numbers.  (Feel free to replace LOC or Dewey with, e.g., URLs, 
ISBNs, or UPC numbers.)
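
A programmatic reading of the adaptive wrapper might look like the Python
sketch below.  The class names, the identifiers and the mapping tables are
all hypothetical placeholders, not real classification data:

```python
class Record:
    """A catalogue record whose primary identifier is its LOC identifier."""
    def __init__(self, loc_id, title):
        self.loc_id = loc_id
        self.title = title

class DeweyAdapter:
    """Wraps a Record so that a peer can see it under a Dewey identifier."""
    def __init__(self, record, loc_to_dewey):
        self.record = record               # the original, left unmodified
        self._loc_to_dewey = loc_to_dewey  # mapping chosen by this peer

    @property
    def dewey_id(self):
        # None when this peer's mapping has no entry for the record
        return self._loc_to_dewey.get(self.record.loc_id)

record = Record("loc:example-1", "Example title")
peer1 = DeweyAdapter(record, {"loc:example-1": "dewey:example-A"})
peer2 = DeweyAdapter(record, {"loc:example-1": "dewey:example-B"})

print(peer1.dewey_id, peer2.dewey_id, record.loc_id)
```

Each peer gets its own wrapper and its own mapping, while the owning
library's record and primary identifier are never rewritten.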


>
>One reason we might want to use an adaptor pattern is that it allows us to
>normalize the data. We are used to the idea of normalizing data in
>relational databases, but the idea is also applicable to XML - see [1] and
>[2] - and, I hypothesise, to RDF. It seems counterintuitive to talk about
>normalization in RDF, because if we pick our first class entities
>correctly, we get normalization for free, but I guess by thinking about
>(RDF) models from a normalization perspective we can check how well
>designed a model is.

>  
>

I'm not sure that there is any 'correct' set of first class entities 
that can be determined a priori.  Philosophically this is a question 
of episteme; the root assumptions provide the context within which to 
select the first class entities, but those first class entities will of 
necessity be different from the classes chosen by people operating in a 
distinct paradigm.  Certain epistemological systems have shown
great durability in the face of change, but specialized contexts will 
always require specialized classification which can be of value to the 
users of that system even when its classifications seem absurd or 
nonsensical in the context of one of the common durable systems.

>When we map between corpora, and come up with representations of
>individuals that combine multiple vocabularies similar to those above, we
>can consider normalization also. Clearly an instance having multiple
>properties, associated with different namespaces, that contain duplicates
>of the same value is a bad idea. Where there is a consistent duplication,
>we could omit properties and use inference and subproperty relations
>instead.
>
>However, compound relations are more complicated, e.g. in Andy's example
>there is a relation between artstorID and familyName, givenName,
>dateOfBirth, dateOfDeath. In the subsequent discussion, let's call the
>latter the galleryCard representation (because it's similar to vCard but
>we have DOB/DOD also). The relationship between artstorID and the
>galleryCard representation is more complicated one way than the other: to
>go from artstorID to galleryCard we have to do some kind of tokenization,
>which is potentially unreliable. However, to move from the galleryCard to
>artstorID is easier because we just aggregate.
>
>Therefore to perform normalization, it seems attractive to take artstorID
>at ingest, break it in to galleryCard, and then implement some kind of
>viewer to aggregate back to the artstorID representation. We can represent
>both relations between the galleryCard properties and artstorID
>programmatically, but I don't think we can indicate such relations using
>languages like OWL - perhaps an OWL expert can correct me here if I'm
>wrong?
>
>However I think there is another design principle here that overrides the
>need for normalization. Historians talk about primary and secondary
>sources, so the problem with using the split at ingest / reaggregate is we
>have thrown away a primary source and are rebuilding it using a secondary
>source. Despite the need for normalization, this seems a bad idea. So I
>think it is okay to split to galleryCard at ingest, but I'm keen for us to
>keep the original "Leonardo,da Vinci,1452-1519" as well. 
>
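
The split-at-ingest and reaggregate steps described in the quoted text can
be sketched in Python as follows.  This is only an illustration: the
assumption that the first token is the givenName and the second the
familyName is a guess, and split_artstor_id and join_gallery_card are
made-up names:

```python
def split_artstor_id(artstor_id):
    """Tokenize an artstorID into galleryCard fields; raise if the shape is unexpected."""
    parts = artstor_id.split(",")
    if len(parts) != 3 or parts[2].count("-") != 1:
        raise ValueError("unexpected artstorID shape: %r" % artstor_id)
    born, died = parts[2].split("-")
    return {
        "givenName": parts[0],     # assumption: first token is the given name
        "familyName": parts[1],    # assumption: second token is the family name
        "dateOfBirth": born,
        "dateOfDeath": died,
        "artstorID": artstor_id,   # keep the primary source alongside the split
    }

def join_gallery_card(card):
    """Aggregate galleryCard fields back into the artstorID form."""
    return "%s,%s,%s-%s" % (card["givenName"], card["familyName"],
                            card["dateOfBirth"], card["dateOfDeath"])

card = split_artstor_id("Leonardo,da Vinci,1452-1519")
print(join_gallery_card(card) == card["artstorID"])  # True
```

The tokenizer is the fragile direction (any value that does not fit the
three-field shape is rejected), while the aggregation is a simple join.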

It is sometimes very difficult to talk about this without sounding 
absurd, but consider the following if you can.  Suppose there is a 
school of the occult that teaches that every soul goes through multiple 
incarnations, and just for the sake of argument, let's suppose that they 
had through some divine means determined that J.S. Bach, and Elvis 
happened to be the same person (qua soul).  So they diligently enter 
that 'fact' into their database.  While that representation undoubtedly 
might have value to the school of the occult, it is unlikely that most 
other schools would have any use for that information.  Clearly, even 
though the epistemological systems interact, they must not inadvertently 
pollute the other systems.  The decision of the occult school to join 
together those records should be available but ignored unless you are 
working in the context of the occult.

My argument is that things like this occur to a lesser degree all the 
time.  Equivalence shouldn't be expressed by multiple classification 
because it is too final; rather equivalence should be expressed by 
indexing where the index can be maintained by the organizations that are 
interested. 
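
A minimal Python sketch of equivalence by indexing, where each organization
keeps its own index and lookups always name a context.  The context names
and the equivalents helper are invented for illustration:

```python
# Each organization maintains its own index of equivalence sets.
indexes = {
    "occult-school": [{"J.S. Bach", "Elvis"}],
    "music-library": [],
}

def equivalents(name, context):
    """Return names treated as equivalent to `name`, within one context only."""
    result = {name}
    for group in indexes.get(context, []):
        if name in group:
            result |= group
    return result

print(equivalents("Elvis", "occult-school") == {"J.S. Bach", "Elvis"})  # True
print(equivalents("Elvis", "music-library"))                            # {'Elvis'}
```

The occult school's join is available to anyone who asks in that context,
and invisible to everyone else.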

>
>[1] Normalizing XML, part 1, Will Provost, XML.com
>http://www.xml.com/pub/a/2002/11/13/normalizing.html
>
>[2] Normalizing XML, part 2, Will Provost, XML.com,
>http://www.xml.com/pub/a/2002/12/04/normalizing.html
>
>Dr Mark H. Butler
>Research Scientist                HP Labs Bristol
>mark-h_butler@hp.com
>Internet: http://www-uk.hpl.hp.com/people/marbut/
>  
>


-- 
========================================================
   Kevin Smathers                kevin.smathers@hp.com    
   Hewlett-Packard               kevin@ank.com            
   Palo Alto Research Lab                                 
   1501 Page Mill Rd.            650-857-4477 work        
   M/S 1135                      650-852-8186 fax         
   Palo Alto, CA 94304           510-247-1031 home        
========================================================
use "Standard::Disclaimer";
carp("This message was printed on 100% recycled bits.");
Received on Tuesday, 14 October 2003 06:17:53 UTC