Overview of design decisions made creating stylesheet and schema for Artstor data

Hi team,

In prep for the telecon today, here is an overview of the design decisions I
made when creating the stylesheet and schema for the Artstor data.

Artstor.xsd describes records that contain the following nested elements:
Image, MediaFiles, MediaFile, Collection, MetaData, Relation, Source,
Style_Period, Creator, Date, Description, ID_Number, Location, Title,
Material, Measurements, Subject. 

The first problem was that Relation, Source, Style_Period, Creator, Date,
Description, ID_Number, Location, Title, Material, Measurements and Subject
have mixed content, i.e. they can contain text or other elements, e.g.

<Creator>
	<Personal_Name>Leonardo,da Vinci,1452-1519</Personal_Name>
	<Corporate_Name>School of Leonardo</Corporate_Name>
</Creator>

or

<Creator>Leonardo,da Vinci,1452-1519</Creator>

are both acceptable. In fact, an element can even contain both at the same
time, as does <Subject> in the sample instance data:

<Subject>Portraits 
  <Geographic>Italy</Geographic> 
  <Topic>Drawing</Topic> 
</Subject>

So the XSLT stylesheet had to use some tricks to get around this. The schema
also needs to define properties in addition to classes wherever a nested
element with mixed content is converted to a class, e.g. a Creator element
with nested elements is an instance of the Creator class, but if it also
contains text, that text is represented as a creator property.
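
To illustrate the mixed-content problem, here is a minimal sketch in Python
(not the XSLT actually used in the stylesheet) that separates the text held
directly by an element from its nested elements. The function name
split_mixed and the sample strings are assumptions made for illustration
only.

```python
import xml.etree.ElementTree as ET


def split_mixed(element):
    """Return (own_text, children) for an element that may have mixed content."""
    # Text before the first child is in .text; text after each child is in
    # that child's .tail. Collect both and collapse the whitespace.
    pieces = [element.text or ""] + [child.tail or "" for child in element]
    own_text = " ".join(" ".join(pieces).split())
    # Nested elements are reported as (tag, text) pairs.
    children = [(child.tag, (child.text or "").strip()) for child in element]
    return own_text, children


SUBJECT_XML = """<Subject>Portraits
  <Geographic>Italy</Geographic>
  <Topic>Drawing</Topic>
</Subject>"""

CREATOR_TEXT_XML = "<Creator>Leonardo,da Vinci,1452-1519</Creator>"

CREATOR_NESTED_XML = """<Creator>
  <Personal_Name>Leonardo,da Vinci,1452-1519</Personal_Name>
  <Corporate_Name>School of Leonardo</Corporate_Name>
</Creator>"""

subject_text, subject_children = split_mixed(ET.fromstring(SUBJECT_XML))
creator_text, creator_children = split_mixed(ET.fromstring(CREATOR_TEXT_XML))
nested_text, nested_children = split_mixed(ET.fromstring(CREATOR_NESTED_XML))
```

For the Creator case, a non-empty own_text would become the value of the
creator property, and a non-empty children list would become an instance of
the Creator class.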

The second problem was deciding whether the nested elements indicated
superproperties, classes or context (see earlier email). I decided that:

- Image, MediaFiles/MediaFile, Collection, Relation and Creator are classes.

- Source is a superproperty used as a context. There are two possible
instances of the Location element: one is a nested element occurring in
MetaData, the other is a text element occurring in Source. I inferred that
these two uses of Location are different, so Source is being used here as a
context. As there are only two uses of Location, I decided to replace
<Source><Location> with a property called sourceLocation.

- MetaData, Style_Period, Date, Description, ID_Number, Title, Material,
Measurements and Subject are super-properties. As they are super-properties,
it is possible to flatten out the subproperties they contain (see Andy's
note about why this is desirable). For example, instead of Style_Period
being a property of MetaData, it is a property of the Image class that is a
subproperty of MetaData. One issue here is that both ID_Number and Date can
contain the elements Current_Repository and Former_Repository. However, the
meaning of these elements does not seem to change based on the nesting, so
I resolved this by creating two properties called current_repository and
former_repository that are subproperties of both id_number and date.
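
The "subproperty of both id_number and date" arrangement can be sketched as
a small table of (subproperty, superproperty) pairs standing in for
rdfs:subPropertyOf statements. This is only an illustration in Python, not
the contents of the schema file; the lower-case name metadata, and the
assumption that id_number and date are themselves subproperties of it, are
mine.

```python
# (subproperty, superproperty) pairs, playing the role of rdfs:subPropertyOf.
# Assumption: id_number and date sit under metadata in the same way as
# style_period does.
SUB_PROPERTY_OF = {
    ("style_period", "metadata"),
    ("id_number", "metadata"),
    ("date", "metadata"),
    ("current_repository", "id_number"),
    ("current_repository", "date"),
    ("former_repository", "id_number"),
    ("former_repository", "date"),
}


def superproperties(prop, pairs=SUB_PROPERTY_OF):
    """Return every property that prop is a direct or indirect subproperty of."""
    found = set()
    frontier = [prop]
    while frontier:
        current = frontier.pop()
        for sub, sup in pairs:
            if sub == current and sup not in found:
                found.add(sup)
                frontier.append(sup)
    return found


repository_supers = superproperties("current_repository")
style_supers = superproperties("style_period")
```

Because a property may have more than one superproperty, a query on either
id_number or date would pick up a current_repository value.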

Outstanding issues:

1. Does everybody agree with the choice of classes, as this strongly
influences the model?

2. Andy raised some good issues about sub-properties / super-properties that
I have yet to resolve:

"I don't think there is a uniform approach because the 
original specs aren't uniform in use of a "qualifier".

Looking at vra3:title:

vra3:Title.Variant
    could be subProperty of title
    its still a title for the work
vra3:Title.Translation
    could be subProperty of title
    its still a title for the work

but

vra3:Title.Series
   Not a subproperty
   Would seem preferrable to link to the 
   "series" description
vra3:Title.LargerEntity
   Not a subproperty - this isn't a title for the work"

So my question here is: can we solve this using the approach Dave Reynolds
suggested, i.e. create a property called qualifier, which has a subproperty
called subproperty, and then use subproperty in the schema in the cases
where a qualifier indicates a subproperty relationship, but qualifier in
the other cases? If not, what other approaches could we take?
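
One possible reading of that suggestion, sketched in Python with plain
tuples rather than real RDF: every qualified element is linked to its base
property by either qualifier or subproperty, and anything linked by
subproperty also counts as a qualifier. The property names below
(title_variant and so on) are hypothetical, and the split between the two
groups simply follows Andy's four examples.

```python
QUALIFIER = "qualifier"
SUBPROPERTY = "subproperty"  # assumed to be declared as a subproperty of QUALIFIER

# (qualified property, relation, base property)
SCHEMA = [
    ("title_variant", SUBPROPERTY, "title"),
    ("title_translation", SUBPROPERTY, "title"),
    ("title_series", QUALIFIER, "title"),
    ("title_larger_entity", QUALIFIER, "title"),
]


def qualifiers_of(base, schema=SCHEMA):
    """All qualified forms of base; SUBPROPERTY statements count too."""
    return sorted(s for s, p, o in schema
                  if o == base and p in (QUALIFIER, SUBPROPERTY))


def subproperties_of(base, schema=SCHEMA):
    """Only the qualified forms that are true subproperties of base."""
    return sorted(s for s, p, o in schema if o == base and p == SUBPROPERTY)


all_title_qualifiers = qualifiers_of("title")
true_title_subproperties = subproperties_of("title")
```

Under this reading, a search for titles of the work would follow only
subproperty, while a tool that wants to show everything related to the
original vra3:title element would follow qualifier.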

3. The next step is to add a DC mapping, based on the information in the VRA
Core 3.0 specification (see John's slides). I have created a version of the
schema with a naive attempt at mapping to DC, although this has a number of
problems, also pointed out by Andy:

"The var3 mappings to DC also need thinking about 

vra3:measurements is defined to map to dc:format
  measurements.{dimensions,format,resolution}
is about the image (actually about the work or about the image)

vra3:material is defined to dc:format but is about the 
substance of the work."

So the next step is to review the Artstor to DC mapping?

I enclose the current versions of the stylesheet, the schema with no DC
mapping (artstor_nodc.rdfs), and the schema with a DC mapping. They can also
be found in the SIMILE CVS. 

For anyone who is interested, one way to examine the schemas and the
transform is to:
1. download Protege (http://protege.stanford.edu) and the RDF(S) plugin
2. use a transform engine like Saxon (http://saxon.sourceforge.net) to
transform the example Artstor data to RDF/XML
3. load the schema and the Artstor data into Protege.

Note I encountered problems in Protege 1.9 because it couldn't cope with RDF
datatyping, so I removed the datatyping from the styled RDF/XML by hand for
the height, width and creation_date properties. This is related to Andy's
point about whether we should use datatyping. In terms of the schema /
instance data, I think we are okay using it for Artstor (although the proof
will be in the Artstor instance data), as the Artstor XML Schema indicates
these fields are coming from a database where they are datatyped to integers
and dates respectively. However, Andy makes a valid point, as the VRA Core
spec does not mandate a particular format or datatype for these elements.
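
For anyone who would rather not strip the datatyping by hand, the same
clean-up could be scripted. The following is a minimal sketch in Python,
assuming the datatyping appears as rdf:datatype attributes on the property
elements; the function name strip_datatypes, the example.org namespace and
the sample record are all made up for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DATATYPE_ATTR = "{%s}datatype" % RDF_NS


def strip_datatypes(rdf_xml, property_names=("height", "width", "creation_date")):
    """Remove rdf:datatype from the named properties and return the new XML."""
    root = ET.fromstring(rdf_xml)
    for element in root.iter():
        # Tags look like "{namespace}local"; compare on the local part only.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in property_names:
            element.attrib.pop(DATATYPE_ATTR, None)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample record, not taken from the Artstor data.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:a="http://example.org/artstor#">
  <a:Image rdf:about="http://example.org/image/1">
    <a:height rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">480</a:height>
    <a:width rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">640</a:width>
  </a:Image>
</rdf:RDF>"""

cleaned = strip_datatypes(SAMPLE)
```

Note that ElementTree may rename namespace prefixes when it writes the
document back out, which should not matter to an RDF parser.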

One practical problem is that not all tools (e.g. Protege) support
datatyping, so it may be desirable to omit it. Does anybody know if Protege
2.0 supports datatyping?

Dr Mark H. Butler
Research Scientist                HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/

Received on Thursday, 9 October 2003 10:44:19 UTC