- From: Butler, Mark <mark-h.butler@hp.com>
- Date: Tue, 4 May 2004 14:32:55 +0100
- To: SIMILE public list <www-rdf-dspace@w3.org>
-----Original Message-----
From: David R. Karger [mailto:karger@mit.edu]
Sent: 30 April 2004 18:16
To: kevin.smathers@hp.com
Cc: Mark_Butler@hplb.hpl.hp.com
Subject: Re: Overview of design decisions made creating stylesheet and
schema for Artstor data

see end.

X-Original-To: www-rdf-dspace@frink.w3.org
Date: Mon, 13 Oct 2003 10:34:13 -0700
From: Kevin Smathers <kevin.smathers@hp.com>
Cc: "'www-rdf-dspace@w3.org'" <www-rdf-dspace@w3.org>
X-Archived-At: http://www.w3.org/mid/3F8AE215.5040505@hp.com
X-Mailing-List: <www-rdf-dspace@w3.org> archive/latest/643

Butler, Mark wrote:

>Hi Kevin,
>
>>>Kevin writes in regard to Andy's suggestion
>>
>>Creating a class for Person is fine, but combining multiple schemas
>>into the same Person object I think is an error.
>
>then later you say
>
>>in other words, instead of extending Person with the contents
>>of each new corpus, each new corpus can maintain its own Person class,
>>each with its own meaning,
>
>I don't think this is a problem, because RDF supports multiple inheritance,
>so each new corpus can still maintain its own Person class. We have a single
>URI, that represents the concept of Leonardo da Vinci, and this can be an
>instance of several different classes concurrently, with the properties
>necessary to be members of each class. The important point is identifying
>that these instances apply to the same individual, and indicating that via
>the URI. This is what your SoundExSimilarPerson and GettyULANPerson classes
>are doing, right? We also get deconfliction automatically as the properties
>are in different namespaces.
>
>To put it another way,
>
>objectA
>[
>rdf:type typeB
>rdf:type typeC
>b:propertyD "value1"
>c:propertyE "value2"
>]
>
>is equivalent to
>
>objectD
>[
>rdf:type typeB
>b:propertyD "value1"
>b:sameAs objectE
>]
>
>objectE
>[
>rdf:type typeC
>c:propertyE "value2"
>c:sameAs objectD
>]
>
>right?
>

I agree that the two cases that you show are equivalently expressive,
but I wasn't talking about multiple classification. In cases where
objectA and objectB are independently developed, the semantic value of
some propertyB is likely to vary even when referring to the same
property. Andy proposes moving the discordant element into a new
property that is a schema-specific identifier, but the way I would model
it is that the instances remain separate, in other words:

objectA
[
rdf:type typeA
b:propertyB "Yin"
c:propertyC "valuec"
]

objectB
[
rdf:type typeB
b:propertyB "Yang"
d:propertyD "valued"
]

objectC
[
rdf:type someEquivalenceType
:equivalent <objectA>
:equivalent <objectB>
]

In your example objectA is inextricably both typeB and typeC. Thus in
your example instances of typeB can be equivalent to instances of typeC
for only one sense of equivalence -- there can't be any conflicts (one
references Getty, another references some homebrew canonical
transformation), nor can objectA take one equivalence with different
objects depending on the context of the equivalence.

>>Rather than replace the original meaning, what you
>>need is to apply an adaptor pattern to adjust the meaning to a new
>>context;
>
>By adaptor pattern, do you envisage an ontology (OWL) or RDFS document, or
>do you mean a programmatic description?
>

Here I'm trying to develop a theory for handling opposing theories of
classification. Again, Andy's approach, if I understand correctly, is to
rationalize the opposing views -- that is to choose a dominant view, and
relegate sub-dominant views to historical references.
By using an adaptor pattern what I propose is that each data source
should be able to maintain its own dominant view, with adaptive
extensions to allow it to be queried in the opposing domain. In other
words, a library that, for example, indexes its collections in Library
of Congress should continue to see the Library of Congress identifier as
the primary identifier of its records, but those records could be mapped
for interlibrary use to a library that indexes using Dewey Decimal
identifiers by an adaptive wrapper around the original instance. The
adaptive wrapper adds flexibility in the mapping and can conceivably be
instantiated differently for each peer that would like to see Dewey
Decimal numbers. (Feel free to replace LOC or Dewey with e.g. URLs,
ISBNs, or UPC numbers.)

>
>One reason we might want to use an adaptor pattern is it allows us to
>normalize the data. We are used to the idea of normalizing data in
>relational databases, but the idea is also applicable to XML - see [1] and
>[2] - and I hypothesise RDF. It seems counterintuitive to talk about
>normalization in RDF, because if we pick our first class entities correctly,
>we get normalization for free, but I guess by thinking about (RDF) models
>from a normalization perspective we can check how well designed a model is.
>

I'm not sure that there is any 'correct' set of first class entities
that can be determined a priori. Philosophically this is a question of
episteme; the root assumptions provide the context within which to
select the first class entities, but those first class entities will of
necessity be different from the classes chosen by people operating in a
distinct paradigm. Certain epistemological systems have shown great
durability in the face of change, but specialized contexts will always
require specialized classification which can be of value to the users of
that system even when its classifications seem absurd or nonsensical in
the context of one of the common durable systems.
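
As a minimal sketch of the adaptive wrapper described above, assuming a
programmatic rather than an OWL/RDFS description: the owning library keeps
its Library of Congress identifier as primary, and each peer gets its own
wrapper with its own mapping. The names LocRecord and DeweyAdaptor and the
identifiers are hypothetical placeholders, not real classification numbers
or anything defined in this thread.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class LocRecord:
    """A record as the owning library sees it (hypothetical type).

    The Library of Congress identifier stays the primary identifier.
    """
    loc_id: str
    title: str


class DeweyAdaptor:
    """Wraps a LocRecord so a peer can read it by Dewey identifier.

    The mapping is supplied per peer, so two peers may see different
    Dewey identifiers for the same unchanged underlying record.
    """

    def __init__(self, record: LocRecord, mapping: Dict[str, str]) -> None:
        self._record = record
        self._mapping = mapping

    @property
    def title(self) -> str:
        return self._record.title

    @property
    def dewey_id(self) -> Optional[str]:
        # None means this peer's mapping has no entry for the record.
        return self._mapping.get(self._record.loc_id)


# Placeholder identifiers, assumed for illustration only.
record = LocRecord(loc_id="LOC-0001", title="Example title")
peer_a_view = DeweyAdaptor(record, {"LOC-0001": "DEWEY-A"})
peer_b_view = DeweyAdaptor(record, {"LOC-0001": "DEWEY-B"})
unmapped_view = DeweyAdaptor(record, {})
```

The original record is never modified; each peer's view of it lives only
in the wrapper, which is the property the adaptor approach relies on.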
>When we map between corpora, and come up with representations of individuals
>that combine multiple vocabularies similar to those above, we can consider
>normalization also. Clearly an instance having multiple properties,
>associated with different namespaces, that contain duplicates of the same
>value is a bad idea. Where there is a consistent duplication, we could omit
>properties and use inference and subproperty relations instead.
>
>However compound relations are more complicated e.g. in Andy's example there
>is a relation between artstorID and familyName, givenName, dateOfBirth,
>dateOfDeath. In the subsequent discussion, let's call the latter the
>galleryCard representation (because it's similar to vCard but we have DOB/DOD
>also). The relationship between artstorID and the galleryCard representation
>is more complicated one way than the other: to go from artstorID to
>galleryCard we have to do some kind of tokenization, which is potentially
>unreliable. However to move from the galleryCard to artstorID is easier
>because we just aggregate.
>
>Therefore to perform normalization, it seems attractive to take artstorID at
>ingest, break it into galleryCard, and then implement some kind of viewer
>to aggregate back to the artstorID representation. We can represent both
>relations between the galleryCard properties and artstorID programmatically,
>but I don't think we can indicate such relations using languages like OWL -
>perhaps an OWL expert can correct me here if I'm wrong?
>
>However I think there is another design principle here that overrides the
>need for normalization. Historians talk about primary and secondary sources,
>so the problem with using the split at ingest / reaggregate is we have
>thrown away a primary source and are rebuilding it using a secondary source.
>Despite the need for normalization, this seems a bad idea.
>So I think it is okay to split to galleryCard at ingest, but I'm keen for
>us to keep the original "Leonardo,da Vinci,1452-1519" as well.
>

It is sometimes very difficult to talk about this without sounding
absurd, but consider the following if you can. Suppose there is a school
of the occult that teaches that every soul goes through multiple
incarnations, and just for the sake of argument, let's suppose that they
had through some divine means determined that J.S. Bach and Elvis
happened to be the same person (qua soul). So they diligently enter that
'fact' into their database.

While that representation undoubtedly might have value to the school of
the occult, it is unlikely that most other schools would have any use
for that information. Clearly, even though the epistemological systems
interact, they must not inadvertently pollute the other systems. The
decision of the occult school to join together those records should be
available but ignored unless you are working in the context of the
occult.

My argument is that things like this occur to a lesser degree all the
time. Equivalence shouldn't be expressed by multiple classification
because it is too final; rather equivalence should be expressed by
indexing where the index can be maintained by the organizations that are
interested.

Granted that it is "final" to merge two asserted-equivalent objects into
one, it is also extremely efficient. While there is great expressive
value in using predicates that assert equivalence between different
objects, I think in practice we will want a preprocessing stage that
takes out "base" data, determines equivalences, and coins new
single-resource representations of each equivalence class. When a new
assertion of equivalence arrives, we have to revise our resources, but
this will hopefully be rare. I fear that explicitly managing inference
over equivalence in realtime could just get too complex.

d
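
The preprocessing stage suggested above could be sketched roughly as
follows, assuming equivalence assertions arrive as pairs of resource
identifiers and that equivalence is treated as transitive. Union-find is
one possible technique, swapped in here for illustration; the function
names, the resource identifiers and the urn:example: prefix are all
hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def equivalence_classes(
    resources: Iterable[str],
    assertions: Iterable[Tuple[str, str]],
) -> List[List[str]]:
    """Group resources into classes using pairwise equivalence assertions."""
    parent: Dict[str, str] = {r: r for r in resources}

    def find(x: str) -> str:
        # Resources seen only in assertions start as their own class.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in assertions:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two classes

    groups: Dict[str, List[str]] = {}
    for r in list(parent):
        groups.setdefault(find(r), []).append(r)
    # Sort members and classes so the output is deterministic.
    return sorted(sorted(members) for members in groups.values())


def coin_resources(classes: List[List[str]]) -> Dict[str, List[str]]:
    """Coin one new identifier per class (assumed urn:example: scheme)."""
    return {
        f"urn:example:merged:{i}": members
        for i, members in enumerate(classes)
    }


# Hypothetical base data and assertions, for illustration only.
base = ["artstor:leonardo", "getty:leonardo", "soundex:leonardo",
        "artstor:bach"]
asserted = [("artstor:leonardo", "getty:leonardo"),
            ("getty:leonardo", "soundex:leonardo")]
classes = equivalence_classes(base, asserted)
merged = coin_resources(classes)
```

When a new assertion arrives, this sketch simply reruns the whole stage,
which matches the expectation that such revisions are rare. The original
resources stay listed under each coined identifier, so a context that
rejects a given assertion could rebuild the classes without it.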
Received on Tuesday, 4 May 2004 09:33:55 UTC