FW: Overview of design decisions made creating stylesheet and schema for Artstor data

-----Original Message-----
From: David R. Karger [mailto:karger@mit.edu] 
Sent: 30 April 2004 18:16
To: kevin.smathers@hp.com
Cc: Mark_Butler@hplb.hpl.hp.com
Subject: Re: Overview of design decisions made creating stylesheet and schema for Artstor data



see end.

   Date: Mon, 13 Oct 2003 10:34:13 -0700
   From: Kevin Smathers <kevin.smathers@hp.com>
   Cc: "'www-rdf-dspace@w3.org'" <www-rdf-dspace@w3.org>
   X-Archived-At: http://www.w3.org/mid/3F8AE215.5040505@hp.com


   Butler, Mark wrote:

   >Hi Kevin,
   >
   >>>Kevin writes in regard to Andy's suggestion
   >
   >>Creating a class for Person is fine, but combining multiple schemas
   >>into the same Person object I think is an error.
   >
   >then later you say
   >
   >>in other words, instead of extending Person with the contents of each
   >>new corpus, each new corpus can maintain its own Person class, each
   >>with its own meaning,
   >
   >I don't think this is a problem, because RDF supports multiple
   >inheritance, so each new corpus can still maintain its own Person
   >class. We have a single URI that represents the concept of Leonardo da
   >Vinci, and this can be an instance of several different classes
   >concurrently, with the properties necessary to be a member of each
   >class. The important point is identifying that these instances refer
   >to the same individual, and indicating that via the URI. This is what
   >your SoundExSimilarPerson and GettyULANPerson classes are doing,
   >right? We also avoid property conflicts automatically, as the
   >properties are in different namespaces.
   >
   >To put it another way,
   >
   >:objectA
   >    rdf:type :typeB , :typeC ;
   >    b:propertyD "value1" ;
   >    c:propertyE "value2" .
   >
   >is equivalent to
   >
   >:objectD
   >    rdf:type :typeB ;
   >    b:propertyD "value1" ;
   >    b:sameAs :objectE .
   >
   >:objectE
   >    rdf:type :typeC ;
   >    c:propertyE "value2" ;
   >    c:sameAs :objectD .
   >
   >right?
   >

   I agree that the two cases you show are equivalently expressive, but I
   wasn't talking about multiple classification.  In cases where objectA
   and objectB are independently developed, the semantic value of some
   propertyB is likely to vary even when it refers to the same property.
   Andy proposes moving the discordant element into a new property that is
   a schema-specific identifier, but the way I would model it is that the
   instances remain separate, in other words:

   :objectA
       rdf:type :typeA ;
       b:propertyB "Yin" ;
       c:propertyC "valuec" .

   :objectB
       rdf:type :typeB ;
       b:propertyB "Yang" ;
       d:propertyD "valued" .

   :objectC
       rdf:type :someEquivalenceType ;
       :equivalent :objectA ;
       :equivalent :objectB .

   In your example objectA is inextricably both typeB and typeC.  Thus in
   your example instances of typeB can be equivalent to instances of typeC
   in only one sense of equivalence -- there can be no conflicting senses
   (one references Getty, another references some homebrew canonical
   transformation), nor can objectA stand in different equivalences with
   different objects depending on the context of the equivalence.
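
   For illustration only, context-dependent equivalence might be written
   in the same notation as the examples above; the :basedOn property and
   the resources named here are invented for the sketch (prefixes
   omitted, as in the examples above):

   # Two independent equivalence resources, each owned by a different
   # authority; nothing forces them to agree.
   :equivGetty
       rdf:type :someEquivalenceType ;
       :basedOn :gettyULANAuthority ;        # hypothetical provenance marker
       :equivalent :objectA ;
       :equivalent :objectB .

   :equivSoundEx
       rdf:type :someEquivalenceType ;
       :basedOn :homebrewSoundExTransform ;  # hypothetical
       :equivalent :objectA ;
       :equivalent :objectX .                # a different partner in this context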

   >>Rather than replace the original meaning, what you need is to apply
   >>an adaptor pattern to adjust the meaning to a new context;
   >
   >By adaptor pattern, do you envisage an ontology (OWL) or RDFS
   >document, or do you mean a programmatic description?
   >

   Here I'm trying to develop a theory for handling opposing theories of
   classification.  Again, Andy's approach, if I understand correctly, is
   to rationalize the opposing views -- that is, to choose a dominant view
   and relegate sub-dominant views to historical references.  By using an
   adaptor pattern, what I propose is that each data source should be able
   to maintain its own dominant view, with adaptive extensions to allow it
   to be queried in the opposing domain.  In other words, a library that,
   for example, indexes its collections in Library of Congress should
   continue to see the Library of Congress identifier as the primary
   identifier of its records, but those records could be mapped for
   interlibrary use to a library that indexes using Dewey Decimal
   identifiers by an adaptive wrapper around the original instance.  The
   adaptive wrapper adds flexibility in the mapping and can conceivably be
   instantiated differently for each peer that would like to see Dewey
   Decimal numbers.  (Feel free to replace LOC or Dewey with e.g. URLs,
   ISBNs, or UPC numbers.)
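
   As a rough sketch (the lcc:, ddc:, :wraps, and :CatalogRecord names
   are invented for illustration), such an adaptive wrapper might look
   like:

   # The source library's record keeps its own dominant view.
   :record42
       rdf:type :CatalogRecord ;
       lcc:classification "QA76.9.D3" .      # primary identifier (assumed property)

   # A per-peer wrapper adapts the record for a Dewey-indexed peer
   # without touching the original instance.
   :record42forDeweyPeer
       rdf:type :AdaptiveWrapper ;           # hypothetical class
       :wraps :record42 ;                    # hypothetical property
       ddc:classification "005.74" .         # the mapped view that peer wants to see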


   >One reason we might want to use an adaptor pattern is that it allows
   >us to normalize the data. We are used to the idea of normalizing data
   >in relational databases, but the idea is also applicable to XML - see
   >[1] and [2] - and, I hypothesise, to RDF. It seems counterintuitive to
   >talk about normalization in RDF, because if we pick our first class
   >entities correctly we get normalization for free, but I guess that by
   >thinking about (RDF) models from a normalization perspective we can
   >check how well designed a model is.
   >

   I'm not sure that there is any 'correct' set of first class entities
   that can be determined a priori.  Philosophically this is a question
   of episteme; the root assumptions provide the context within which to
   select the first class entities, but those first class entities will of
   necessity be different from the classes chosen by people operating in a
   distinct paradigm.  Certain epistemological systems have shown great
   durability in the face of change, but specialized contexts will always
   require specialized classification, which can be of value to the users
   of that system even when its classifications seem absurd or nonsensical
   in the context of one of the common durable systems.

   >When we map between corpora, and come up with representations of
   >individuals that combine multiple vocabularies similar to those above,
   >we can consider normalization also. Clearly an instance having
   >multiple properties, associated with different namespaces, that
   >contain duplicates of the same value is a bad idea. Where there is
   >consistent duplication, we could omit properties and use inference and
   >subproperty relations instead.
   >
   >However, compound relations are more complicated: e.g. in Andy's
   >example there is a relation between artstorID and familyName,
   >givenName, dateOfBirth, dateOfDeath. In the subsequent discussion,
   >let's call the latter the galleryCard representation (because it's
   >similar to vCard but we have DOB/DOD also). The relationship between
   >artstorID and the galleryCard representation is more complicated one
   >way than the other: to go from artstorID to galleryCard we have to do
   >some kind of tokenization, which is potentially unreliable. However,
   >to move from the galleryCard to artstorID is easier because we just
   >aggregate.
   >
   >Therefore, to perform normalization, it seems attractive to take
   >artstorID at ingest, break it into galleryCard, and then implement
   >some kind of viewer to aggregate back to the artstorID
   >representation. We can represent both relations between the
   >galleryCard properties and artstorID programmatically, but I don't
   >think we can indicate such relations using languages like OWL -
   >perhaps an OWL expert can correct me here if I'm wrong?
   >
   >However, I think there is another design principle here that overrides
   >the need for normalization. Historians talk about primary and
   >secondary sources, so the problem with the split-at-ingest /
   >reaggregate approach is that we have thrown away a primary source and
   >are rebuilding it from a secondary source. Despite the need for
   >normalization, this seems a bad idea. So I think it is okay to split
   >to galleryCard at ingest, but I'm keen for us to keep the original
   >"Leonardo,da Vinci,1452-1519" as well.
   >
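
   (A minimal sketch of the keep-the-original-plus-tokenize idea above,
   in the same notation; the artstor: and gc: property names, and the
   token assignment, are guesses for illustration:)

   :personLeonardo
       artstor:artstorID "Leonardo,da Vinci,1452-1519" ;  # primary source, kept verbatim
       gc:givenName   "Leonardo" ;   # derived by tokenization (unreliable, as noted)
       gc:familyName  "da Vinci" ;
       gc:dateOfBirth "1452" ;
       gc:dateOfDeath "1519" .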

   It is sometimes very difficult to talk about this without sounding
   absurd, but consider the following if you can.  Suppose there is a
   school of the occult that teaches that every soul goes through multiple
   incarnations, and just for the sake of argument, let's suppose that
   they had through some divine means determined that J.S. Bach and Elvis
   happened to be the same person (qua soul).  So they diligently enter
   that 'fact' into their database.  While that representation undoubtedly
   might have value to the school of the occult, it is unlikely that most
   other schools would have any use for that information.  Clearly, even
   though the epistemological systems interact, they must not inadvertently
   pollute one another.  The decision of the occult school to join those
   records together should be available but ignored unless you are working
   in the context of the occult.

   My argument is that things like this occur, to a lesser degree, all
   the time.  Equivalence shouldn't be expressed by multiple
   classification because it is too final; rather, equivalence should be
   expressed by indexing, where the index can be maintained by the
   organizations that are interested.

Granted that it is "final" to merge two asserted-equivalent objects into
one, it is also extremely efficient.  While there is great expressive value
in using predicates that assert equivalence between different objects, I
think in practice we will want a preprocessing stage that takes our "base"
data, determines equivalences, and coins new single-resource representations
of each equivalence class.  When a new assertion of equivalence arrives, we
have to revise our resources, but this will hopefully be rare.  I fear that
explicitly managing inference over equivalence in realtime could just get
too complex.
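
As a rough sketch of what that preprocessing stage might emit, in the same
notation as the examples above (the :mergedFrom property and the :canonical1
resource are invented for illustration):

# Before: base data plus an explicit equivalence assertion.
:objectA  b:propertyB "Yin" .
:objectB  b:propertyB "Yang" .
:equiv1   :equivalent :objectA ;
          :equivalent :objectB .

# After: one coined resource per equivalence class, with provenance, so a
# newly arriving equivalence assertion forces only a (hopefully rare)
# re-coining rather than realtime inference.
:canonical1
    :mergedFrom :objectA , :objectB ;   # hypothetical provenance property
    b:propertyB "Yin" , "Yang" .        # conflicting values kept side by side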

d
