Re: Overview of design decisions made creating stylesheet and schema for Artstor data

Seaborne, Andy wrote:

>Kevin wrote:
>  
>
>>objectA
>>[
>>rdf:type typeA
>>b:propertyB "Yin"
>>c:propertyC "valuec"
>>]
>>
>>objectB
>>[
>>rdf:type typeB
>>b:propertyB "Yang"
>>d:propertyD "valued"
>>]
>>
>>objectC
>>[
>>rdf:type someEquivalenceType
>> equivalent <objectA>
>> equivalent <objectB>
>>]
>>    
>>
>
>But this is a bit different.  It is talking about the metadata records, not
>the thing being modelled, because if <a> is equivalent to <b> the statements
>about <a> are true about <b>:
>
><a> <prop> "v" .
><a> owl:sameAs <b>
>=>
><b> <prop> "v" .
>
>In saying that the concept "Leonardo da Vinci" is the same in two corpuses
>we are saying that statements in one are true in the other.  This is both
>desirable and a real nuisance.  Sometimes you want provenance, sometimes
>not.  It is where the logical nature of RDF means we cannot think of it as a
>data structure with local meaning only.
>
>This is one use of quads for data management and one-level provenance
>tracking.  There are several different ways to use the fourth slot, and they
>are not all the same either.
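As an aside, here is a minimal sketch of that "fourth slot" idea in plain
Python -- the identifiers and sources below are made up purely for
illustration, not taken from the demo data:

# Quads: (subject, predicate, object, source).  The fourth slot records
# which corpus asserted the triple, giving one-level provenance.
quads = [
    ("objectA", "b:propertyB", "Yin",     "corpusA"),
    ("objectB", "b:propertyB", "Yang",    "corpusB"),
    ("objectA", "owl:sameAs",  "objectB", "mappingService"),
]

def triples(quads):
    # Merged view with provenance stripped -- the "nuisance" case.
    return set((s, p, o) for (s, p, o, src) in quads)

def triples_from(quads, source):
    # View of what a single corpus asserted -- provenance preserved.
    return set((s, p, o) for (s, p, o, src) in quads if src == source)

print(triples_from(quads, "corpusA"))   # only corpusA's statements
print(triples(quads))                   # everything, with origin lost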
>
>In the specific demo scenario we are focusing on, if we have (an image
>of) a work of art by "Leonardo da Vinci" and a biography about him, we do
>want to say they are the same person.  That is increasing the value of the
>information by making such relationships explicit.
>
>Recording all the equivalences means that at a single point on the web,
>someone (something) is recording all these mappings.  But if A = B and B = C,
>then by stating this through a concept it follows that A = C.
>
>If that concept is what your objectC is, then we are doing the same thing,
>and a consequence is
> objectA b:propertyB "Yang"
>because objectA equivalent objectB means they are the same thing.
>
>If that objectC is recording lists of pairwise mappings, this does not
>happen without A and C being brought together somewhere.  In a global
>federated system, this isn't practical.
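To illustrate that difference in plain Python (the identifiers are
invented): equivalences routed through a single concept merge transitively,
while pairwise mappings held at different sites only merge once somebody
gathers them and computes the closure.

# One concept node listing A, B and C as equivalent: A = C falls out directly.
concept = {"A", "B", "C"}

# Pairwise mappings recorded at two different sites.
site1 = [("A", "B")]
site2 = [("B", "C")]

def closure(pairs):
    # Naive merge of pairs into equivalence classes.
    classes = []
    for a, b in pairs:
        hits = [c for c in classes if a in c or b in c]
        merged = set([a, b]).union(*hits)
        classes = [c for c in classes if c not in hits] + [merged]
    return classes

print(closure(site1))           # [{'A', 'B'}] -- no single site concludes A = C
print(closure(site1 + site2))   # [{'A', 'B', 'C'}] -- only after bringing them together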
>  
>
For the equivalence listed, both objectA and objectB should be included
in an index of propertyB under the terms "Yin" and "Yang", but unlike
an OWL equivalence, the property value "Yang" need not be visible within
objectA even though it is indexed in that position.  For the purposes of
this equivalence, 'Yin' and 'Yang' are considered synonymous and thus
duplicate one another.

In a different equivalence context 'Yin' and 'Yang' might be considered 
opposites, and thus objectA and objectB would not be indexed in the same 
positions for propertyB. 
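To make that concrete, here is a rough sketch in Python -- the records and
values are just placeholders from the discussion, not real Artstor data.
The equivalence context supplies the synonym sets; the records themselves
are never modified, so 'Yang' never appears inside objectA:

records = {
    "objectA": {"rdf:type": "typeA", "b:propertyB": "Yin",  "c:propertyC": "valuec"},
    "objectB": {"rdf:type": "typeB", "b:propertyB": "Yang", "d:propertyD": "valued"},
}

# Each equivalence context supplies its own synonym sets for propertyB.
contexts = {
    "synonymContext":  [{"Yin", "Yang"}],   # Yin and Yang duplicate one another
    "oppositeContext": [],                  # no synonyms: they stay distinct
}

def index_propertyB(records, synonym_sets):
    # Index each object under its own propertyB value plus any synonyms
    # that the chosen equivalence context declares for that value.
    index = {}
    for obj, props in records.items():
        value = props["b:propertyB"]
        terms = {value}
        for group in synonym_sets:
            if value in group:
                terms |= group
        for term in terms:
            index.setdefault(term, set()).add(obj)
    return index

print(index_propertyB(records, contexts["synonymContext"]))
# both objects are indexed under both "Yin" and "Yang"
print(index_propertyB(records, contexts["oppositeContext"]))
# objectA only under "Yin", objectB only under "Yang"
print(records["objectA"])
# objectA still carries only "Yin"; "Yang" was never copied into it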

This is all independent of whether objectA takes on the properties of 
objectB.  Consider the question 'how do I display the retrieved 
record?'  When looking at a multiply classified object the answer is 
ambiguous -- the subgraph can be instantiated as either of two classes, 
typeA or typeB.  If the two instances are distinct then there is no 
such confusion.  Of course it is always possible to build a rationalized 
record that integrates the properties of typeA and typeB to the extent 
that they don't conflict, or to remap conflicting properties to a new 
property, e.g. typeB :propertyB becomes typeA :identifierFromB, but this 
approach requires all users of typeB to modify their usage of the 
combined records in order to continue using the conflicting element with 
its original semantics.
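For comparison, a rough sketch of that rationalized-record approach (the
remapping table and the "a:" property name are illustrative only).  The
conflicting typeB property is renamed in the combined record, which is
exactly what forces typeB users to change their queries:

# Hypothetical remapping: typeB's conflicting property gets a new name in
# the combined typeA record; everything else passes through unchanged.
REMAP_FROM_TYPEB = {"b:propertyB": "a:identifierFromB"}

def rationalize(record_a, record_b):
    # Merge a typeB record into a typeA record, renaming conflicts.
    combined = dict(record_a)
    for prop, value in record_b.items():
        combined[REMAP_FROM_TYPEB.get(prop, prop)] = value
    return combined

record_a = {"rdf:type": "typeA", "b:propertyB": "Yin", "c:propertyC": "valuec"}
record_b = {"b:propertyB": "Yang", "d:propertyD": "valued"}

merged = rationalize(record_a, record_b)
print(merged)
# merged["b:propertyB"] is still "Yin"; "Yang" now lives under
# "a:identifierFromB", so an existing typeB query against b:propertyB
# no longer finds it in the combined record.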

 

>	Andy
>
>-----Original Message-----
>From: Kevin Smathers [mailto:kevin.smathers@hp.com] 
>Sent: 13 October 2003 18:36
>To: Butler, Mark
>Cc: 'www-rdf-dspace@w3.org'
>Subject: Re: Overview of design decisions made creating stylesheet and schema
>for Artstor data
>
>
>
>Butler, Mark wrote:
>
>  
>
>>Hi Kevin,
>>
>> 
>>
>>    
>>
>>>>Kevin writes in regard to Andy's suggestion
>>>>     
>>>>
>>>>        
>>>>
>> 
>>
>>    
>>
>>>Creating a class for Person is fine, but combining multiple schemas 
>>>into the same Person object I think is an error.  
>>>   
>>>
>>>      
>>>
>>then later you say
>>
>> 
>>
>>    
>>
>>>in other words, instead of extending Person with the 
>>>contents 
>>>of each new corpus, each new corpus can maintain its own 
>>>Person class, 
>>>each with its own meaning,
>>>   
>>>
>>>      
>>>
>>I don't think this is a problem, because RDF supports multiple inheritance,
>>so each new corpus can still maintain its own Person class. We have a single
>>URI that represents the concept of Leonardo da Vinci, and this can be an
>>instance of several different classes concurrently, with the properties
>>necessary to be members of each class. The important point is identifying
>>that these instances apply to the same individual, and indicating that via
>>the URI. This is what your SoundExSimilarPerson and GettyULANPerson classes
>>are doing, right? We also get deconfliction automatically as the properties
>>are in different namespaces.
>>
>>To put it another way, 
>>
>>objectA
>>[
>>rdf:type typeB
>>rdf:type typeC
>>b:propertyD "value1"
>>c:propertyE "value2"
>>]
>>
>>is equivalent to
>>
>>objectD
>>[
>>rdf:type typeB
>>b:propertyD "value1"
>>b:sameAs objectE
>>]
>>
>>objectE
>>[
>>rdf:type typeC
>>c:propertyE "value2"
>>c:sameAs objectD
>>]
>>
>>right?
>>
>>    
>>
>
>I agree that the two cases that you show are equivalently expressive, 
>but I wasn't talking about multiple classification.  In cases where 
>objectA and objectB are independently developed, the semantic value of 
>some propertyB is likely to vary even when referring to the same 
>property.  Andy proposes moving the discordant element into a new 
>property that is a schema-specific identifier, but the way I would model 
>it is that the instances remain separate, in other words:
>
>objectA
>[
>rdf:type typeA
>b:propertyB "Yin"
>c:propertyC "valuec"
>]
>
>objectB
>[
>rdf:type typeB
>b:propertyB "Yang"
>d:propertyD "valued"
>]
>
>objectC
>[
>rdf:type someEquivalenceType
>:equivalent <objectA>
>:equivalent <objectB>
>]
>
>In your example objectA is inextricably both typeB and typeC.  Thus 
>instances of typeB can be equivalent to instances of typeC for only one 
>sense of equivalence -- there can't be any conflicts (one references 
>Getty, another references some homebrew canonical transformation), nor 
>can objectA take on equivalences with different objects depending on the 
>context of the equivalence.
>
>  
>
>> 
>>
>>    
>>
>>>Rather than replace the original meaning, what you 
>>>need is to apply an adaptor pattern to adjust the meaning to a new 
>>>context; 
>>>   
>>>
>>>      
>>>
>>By adaptor pattern, do you envisage an ontology (OWL) or RDFS document, or
>>do you mean a programmatic description?
>>
>>    
>>
>
>Here I'm trying to develop a theory for handling opposing theories of 
>classification.  Again, Andy's approach, if I understand correctly, is 
>to rationalize the opposing views -- that is, to choose a dominant view 
>and relegate sub-dominant views to historical references.  By using an 
>adaptor pattern what I propose is that each data source should be able 
>to maintain its own dominant view, with adaptive extensions to allow it 
>to be queried in the opposing domain.  In other words, a library that, 
>for example, indexes its collections in Library of Congress should 
>continue to see the Library of Congress identifier as the primary 
>identifier of its records, but those records could be mapped for 
>interlibrary use with a library that indexes using Dewey Decimal 
>identifiers by an adaptive wrapper around the original instance.  The 
>adaptive wrapper adds flexibility in the mapping and can conceivably be 
>instantiated differently for each peer that would like to see Dewey 
>Decimal numbers.  (Feel free to replace LOC or Dewey with e.g. URLs, 
>ISBN, or UPC numbers.)
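Here is a very rough Python sketch of such a wrapper -- the call numbers
and the LC-to-Dewey table are invented for illustration, and a real
mapping would be far messier; this is not a proposal for an actual
interface:

# The record keeps its Library of Congress call number as its primary
# identifier; nothing in the record itself is rewritten.
record = {"title": "Some work", "lc:callNumber": "ND623.L5"}

# A hypothetical, per-peer mapping table.  Each peer that wants Dewey
# numbers can be handed a wrapper instantiated with its own table.
LC_TO_DEWEY = {"ND623.L5": "759.5"}

class DeweyAdaptor:
    # Present an LC-indexed record to a peer that queries by Dewey.
    def __init__(self, record, mapping):
        self._record = record
        self._mapping = mapping

    def __getitem__(self, key):
        if key == "ddc:callNumber":
            return self._mapping.get(self._record["lc:callNumber"])
        return self._record[key]   # everything else passes straight through

view_for_peer = DeweyAdaptor(record, LC_TO_DEWEY)
print(view_for_peer["ddc:callNumber"])   # the adapted, Dewey view
print(record["lc:callNumber"])           # the original, untouched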
>
>
>  
>
>>One reason we might want to use an adaptor pattern is it allows us to
>>normalize the data. We are used to the idea of normalizing data in
>>relational databases, but the idea is also applicable to XML - see [1] and
>>[2] - and, I hypothesise, RDF. It seems counterintuitive to talk about
>>normalization in RDF, because if we pick our first class entities
>>correctly, we get normalization for free, but I guess by thinking about
>>(RDF) models from a normalization perspective we can check how well
>>designed a model is.
>
>  
>
>> 
>>
>>    
>>
>
>I'm not sure that there is any 'correct' set of first class entities 
>that can be determined a priori.  Philosophically this is a question 
>of episteme; the root assumptions provide the context within which to 
>select the first class entities, but those first class entities will of 
>necessity be different from the classes chosen by people operating in a 
>distinct paradigm.  Certain epistemological systems have shown
>great durability in the face of change, but specialized contexts will 
>always require specialized classification which can be of value to the 
>users of that system even when its classifications seem absurd or 
>nonsensical in the context of one of the common durable systems.
>
>  
>
>>When we map between corpora, and come up with representations of
>>individuals
>>that combine multiple vocabularies similar to those above, we can consider
>>normalization also. Clearly an instance having multiple properties,
>>associated with different namespaces, that contains duplicates of the same
>>value is a bad idea. Where there is a consistent duplication, we could omit
>>properties and use inference and subproperty relations instead. 
>>
>>However compound relations are more complicated, e.g. in Andy's example
>>there is a relation between artstorID and familyName, givenName,
>>dateOfBirth, dateOfDeath. In the subsequent discussion, let's call the
>>latter the galleryCard representation (because it's similar to vCard but
>>we have DOB/DOD also). The relationship between artstorID and the
>>galleryCard representation
>>is more complicated one way than the other: to go from artstorID to
>>galleryCard we have to do some kind of tokenization, which is potentially
>>unreliable. However to move from the galleryCard to artstorID is easier
>>because we just aggregate.
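A rough sketch of that round trip in Python, assuming -- and this
assumption is exactly the unreliable part -- that the artstorID always has
the shape 'name,rest of name,birth-death' and that the tokens map onto the
galleryCard fields as below:

def to_gallery_card(artstor_id):
    # Tokenize e.g. "Leonardo,da Vinci,1452-1519"; which token is the
    # given name and which the family name is itself an assumption here.
    given, family, dates = artstor_id.split(",")
    born, died = dates.split("-")
    return {"givenName": given, "familyName": family,
            "dateOfBirth": born, "dateOfDeath": died}

def to_artstor_id(card):
    # Re-aggregation is the easy direction, but the result is now a
    # secondary source rebuilt from the tokenized form.
    return "%s,%s,%s-%s" % (card["givenName"], card["familyName"],
                            card["dateOfBirth"], card["dateOfDeath"])

card = to_gallery_card("Leonardo,da Vinci,1452-1519")
assert to_artstor_id(card) == "Leonardo,da Vinci,1452-1519"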
>>
>>Therefore to perform normalization, it seems attractive to take artstorID
>>at ingest, break it into galleryCard, and then implement some kind of
>>viewer to aggregate back to the artstorID representation. We can represent
>>both relations between the galleryCard properties and artstorID
>>programmatically,
>>but I don't think we can indicate such relations using languages like OWL -
>>perhaps an OWL expert can correct me here if I'm wrong?
>>
>>However I think there is another design principle here that overrides the
>>need for normalization. Historians talk about primary and secondary
>>sources, so the problem with using the split-at-ingest / reaggregate
>>approach is we have thrown away a primary source and are rebuilding it
>>using a secondary source.
>>Despite the need for normalization, this seems a bad idea. So I think it is
>>okay to split to galleryCard at ingest, but I'm keen for us to keep the
>>original "Leonardo,da Vinci,1452-1519" as well. 
>>
>>    
>>
>
>It is sometimes very difficult to talk about this without sounding 
>absurd, but consider the following if you can.  Suppose there is a 
>school of the occult that teaches that every soul goes through multiple 
>incarnations, and just for the sake of argument, let's suppose that they 
>had through some divine means determined that J.S. Bach and Elvis 
>happened to be the same person (qua soul).  So they diligently enter 
>that 'fact' into their database.  While that representation undoubtedly 
>might have value to the school of the occult, it is unlikely that most 
>other schools would have any use for that information.  Clearly, even 
>though the epistemological systems interact, they must not inadvertently 
>pollute the other systems.  The decision of the occult school to join 
>together those records should be available but ignored unless you are 
>working in the context of the occult.
>
>My argument is that things like this occur to a lesser degree all the 
>time.  Equivalence shouldn't be expressed by multiple classification 
>because it is too final; rather equivalence should be expressed by 
>indexing, where the index can be maintained by the organizations that are 
>interested. 
>
>  
>
>>[1] Normalizing XML, part 1, Will Provost, XML.com
>>http://www.xml.com/pub/a/2002/11/13/normalizing.html
>>
>>[2] Normalizing XML, part 2, Will Provost, XML.com,
>>http://www.xml.com/pub/a/2002/12/04/normalizing.html
>>
>>Dr Mark H. Butler
>>Research Scientist                HP Labs Bristol
>>mark-h_butler@hp.com
>>Internet: http://www-uk.hpl.hp.com/people/marbut/
>> 
>>
>>    
>>
>
>
>  
>


-- 
========================================================
   Kevin Smathers                kevin.smathers@hp.com    
   Hewlett-Packard               kevin@ank.com            
   Palo Alto Research Lab                                 
   1501 Page Mill Rd.            650-857-4477 work        
   M/S 1135                      650-852-8186 fax         
   Palo Alto, CA 94304           510-247-1031 home        
========================================================
use "Standard::Disclaimer";
carp("This message was printed on 100% recycled bits.");

Received on Tuesday, 14 October 2003 12:56:12 UTC