ISSUE: StructuredDatatypes

The issue of how the Web Ontology Language (OWL) should handle structured
(e.g. complex XML and multimedia) datatypes was raised at the A'dam F2F. I
have written it up and given it a URI:
http://www.openhealth.org/WOWG/IssueStructuredDatatypes

Integrating structured (e.g. XML and multimedia) datatypes into the Web
Ontology Language falls within the charter and is an explicit requirement
for the language.

Dan Brickley has posted a terrific (IMHO) summary of the requirement:
http://lists.w3.org/Archives/Public/public-webont-comments/2002Apr/0004.html

The XML community and, before it, the SGML community have long been
interested in the graph representation and manipulation of structured
(including multimedia) information, known as "Groves" (Graph
Representation Of property ValuES) [1,2]. Such representations lend
themselves naturally to RDF descriptions [3]. It has been the explicit hope
that an RDF Schema description of the XML Infoset will allow "validation" of
an RDF/Infoset representation of an XML document [4].

I propose that WebOnt accept this challenge (my preliminary work suggests
that we are up to the task). Integrating XML and XML Schema datatypes in
this fashion will give the XML community a concrete and tangible benefit
from OWL, and will allow OWL to reason properly about structured XML and
multimedia datatypes.

RDF Core is developing model theory (MT) extensions for simple or concrete
datatypes. The proposal outlined below does not duplicate that effort;
rather, it is directed at complex or structured datatypes. I will discuss
why the approach taken by RDF Datatypes, while perfectly reasonable for
concrete datatypes, cannot be directly extended to structured datatypes,
primarily because of some technical details of XML Schema.

To summarize:
1) There is a desire to incorporate and reason about structured datatypes
(e.g. XML Schema complexTypes).
2) RDF Datatypes, and by extension OWL's DatatypeProperty, deal with
concrete or string-based datatypes (e.g. XML Schema simpleTypes). A
preliminary WD is at http://www-nrc.nokia.com/sw/rdf-datatyping.html
3) Technical issues involved in integrating general XML types, XML
Schema datatypes and XQuery formal types are discussed below.
4) A proposed solution to the problem of integrating general XML Schema
datatypes is presented.

Issues involved with integration of XML types and XML Schema datatypes into
OWL:

In a perfect RDF world there would be a URIreference for each XML Schema
type (otherwise known as an XML Schema particle). XML Schema has in fact
defined URIs for a fixed set of basic datatypes, but this involves a bit of
weirdness with an internal DTD subset and labelling those specific XML
Schema particles with "id"s. Suffice it to say that 99.9% of XML Schemas in
the wild don't go to this effort, nor should our solution mandate it. See
http://www.w3.org/2001/XMLSchema.xsd for details.

For those of you at home, XML Schema type names are XML QNames (e.g.
xsd:string), and at face value it should be, and is, possible to derive a
URIreference from a QName. The problem is that an XML Schema may use the
same QName for an element declaration, an attribute declaration and a type
definition. That is, the QName does not uniquely identify an XML Schema
particle.
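
As a minimal sketch of the QName-to-URIreference derivation mentioned
above: the function name qname_to_uriref and the prefix map are
hypothetical, and joining namespace and local name with '#' is an
assumption that follows the form of the URIs XML Schema gives its built-in
datatypes. It is not something schemas in general define.

```python
XSD_NS = "http://www.w3.org/2001/XMLSchema"


def qname_to_uriref(qname, nsmap):
    """Expand 'prefix:local' to namespace + '#' + local (assumed convention)."""
    prefix, sep, local = qname.partition(":")
    if not sep:
        # No colon: treat the whole string as a local name in the
        # default namespace, stored under the empty-string key.
        prefix, local = "", qname
    return nsmap[prefix] + "#" + local


# Hypothetical prefix bindings, as they would appear on a schema document.
nsmap = {"xsd": XSD_NS}
string_uri = qname_to_uriref("xsd:string", nsmap)
print(string_uri)  # http://www.w3.org/2001/XMLSchema#string
```

The derivation itself is mechanical; what it cannot do is say which kind of
schema component the resulting URIreference is meant to name.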

RDF Datatypes assume XML Schema simple types, so for this specific purpose a
URIreference would work. (Nothing in the XML world connects an XML Schema
particle's name="foo" attribute value to a URIreference, but that is another
issue.)

XML Schema's overloading of particle names was an explicit design decision
taken directly from how XML 1.0 itself defines types and type names. XML 1.0
(http://www.w3.org/TR/REC-xml) defines an element type as the GI (generic
identifier), or name, of the element. Element and attribute names are,
however, not disjoint; e.g. the following is perfectly legal XML:

<foo foo="12345" />

An attribute itself has a type: CDATA, which is text; ID, which is a unique
identifier; IDREF, whose value references an element with such a uniquely
identifying attribute; NMTOKEN, which constrains the string (e.g. no
whitespace); NMTOKENS, which allows multiple NMTOKEN values; IDREFS; etc.

It is apparent that creating a URIreference by composing an XML document's
base URI with the element or attribute name will not uniquely identify the
element or attribute type definition, i.e. the corresponding part of the
DTD (document type definition), because an element and an attribute may
carry the same name. This has been carried over to XML Schema.

In XML Schema:

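<xsd:element name="foo" />
<xsd:attribute name="foo" />
<xsd:complexType name="foo" />

are all allowed in the same schema, since element declarations, attribute
declarations and type definitions each have their own symbol space. (Simple
and complex type definitions share one symbol space, so a simpleType and a
complexType cannot both be named "foo".) Indeed: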

<xsd:element name="foo" type="foo" />

defines an element "foo" which has a type defined by the complex type whose
name="foo".
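
To make the name overloading concrete, here is a rough sketch that lists
the top-level components of a small schema by name. The schema text and
the helper components_by_name are illustrative assumptions. The code only
reads the XML; it does not validate the schema.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XSD = "{http://www.w3.org/2001/XMLSchema}"

# Illustrative schema: one element, one attribute and one complex type,
# all named "foo".
SCHEMA = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="foo" type="foo"/>
  <xsd:attribute name="foo" type="xsd:string"/>
  <xsd:complexType name="foo">
    <xsd:sequence/>
  </xsd:complexType>
</xsd:schema>
"""


def components_by_name(schema_text):
    """Map each top-level name to the kinds of component that carry it."""
    root = ET.fromstring(schema_text)
    found = defaultdict(list)
    for child in root:
        name = child.get("name")
        if name is not None and child.tag.startswith(XSD):
            # Strip the namespace part of the tag to get the component kind.
            found[name].append(child.tag[len(XSD):])
    return dict(found)


by_name = components_by_name(SCHEMA)
print(by_name)  # {'foo': ['element', 'attribute', 'complexType']}
```

A URIreference built from the name alone would be shared by all three
components; a key such as (kind, name) would be needed to tell them apart.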

XML Schema does, however, define a type hierarchy, and it is the goal of
this proposal to seamlessly integrate the XML Schema type hierarchy into the
OWL class hierarchy. Indeed, an XML Schema processor, which accepts an input
XML infoset and adorns it with types (and other bits of information) to
produce a "post schema validation infoset" (PSVI, in XML Schema terms), can
be seen as a specialized 'classifier' that operates on 'StructuredProperty'
values.

A proposed solution

Class membership of instances can be represented by a subClassOf
relationship between the class composed of a single individual and a
particular super class. An individual represents some particular RDF graph.
In the case of an XML document, or part of an XML document, there exists an
Infoset representation. The infoset is modelled as an RDF graph in a very
straightforward fashion. Indeed, a simple XSLT transform converts an
arbitrary XML document into the RDF graph form (e.g.
http://www.openhealth.org/WOWG/XMLtoSchema.xsl).
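
The mapping from infoset to graph can be sketched in a few lines of Python.
This is not the XSLT transform referenced above. It is an illustration
only: the function infoset_triples, the blank-node labels and the
"infoset:" property names (loosely modelled on Infoset property names) are
all assumptions, and character content and namespaces are ignored.

```python
import xml.etree.ElementTree as ET
from itertools import count


def infoset_triples(xml_text):
    """Return (subject, predicate, object) tuples for elements and attributes."""
    ids = count(1)
    triples = []

    def visit(elem, parent_id):
        node = f"_:n{next(ids)}"  # blank node standing for this element
        triples.append((node, "rdf:type", "infoset:Element"))
        triples.append((node, "infoset:localName", elem.tag))
        if parent_id is not None:
            triples.append((parent_id, "infoset:children", node))
        for name, value in elem.attrib.items():
            attr = f"_:n{next(ids)}"  # blank node for the attribute item
            triples.append((node, "infoset:attributes", attr))
            triples.append((attr, "rdf:type", "infoset:Attribute"))
            triples.append((attr, "infoset:localName", name))
            triples.append((attr, "infoset:normalizedValue", value))
        for child in elem:
            visit(child, node)

    visit(ET.fromstring(xml_text), None)
    return triples


for triple in infoset_triples('<foo foo="12345"/>'):
    print(triple)
```

Note that the element "foo" and the attribute "foo" become distinct nodes
with distinct rdf:type values, so the graph form keeps apart what the bare
name conflates.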

An XML Schema [5], an XQuery formal type [6], or a type in another schema
language represented as a DOM Abstract Schema [7] may each express
constraints on a particular piece of XML. Such a type defines a class whose
instance set is the set of XML data values whose Infoset conforms to the
constraints of the type declaration.

As such, one can in principle develop an OWL class definition such that
infoset graphs representing pieces of XML that conform to a particular type
are members of the class.
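
The idea of a type as a class whose instances are the conforming pieces of
XML can be illustrated with a toy classifier. Everything here is a
stand-in: make_type, the class names "ex:Patient" and "ex:Anything", and
the constraints (element name, required attributes, allowed child names)
are hypothetical and far weaker than XML Schema or OWL. The sketch shows
membership testing only, not OWL reasoning.

```python
import xml.etree.ElementTree as ET


def make_type(element_name, required_attrs, allowed_children):
    """Build a predicate that tests whether an element conforms to a toy type."""
    def conforms(elem):
        return (
            elem.tag == element_name
            and all(attr in elem.attrib for attr in required_attrs)
            and all(child.tag in allowed_children for child in elem)
        )
    return conforms


# Hypothetical classes, each defined by a membership predicate.
CLASSES = {
    "ex:Patient": make_type("patient", ["id"], {"name", "dob"}),
    "ex:Anything": lambda elem: True,  # plays the role of a top class
}


def classify(xml_text):
    """Return the sorted names of all classes the XML fragment belongs to."""
    elem = ET.fromstring(xml_text)
    return sorted(name for name, conforms in CLASSES.items() if conforms(elem))


print(classify('<patient id="p1"><name>A</name></patient>'))
# ['ex:Anything', 'ex:Patient']
print(classify('<patient><name>A</name></patient>'))
# ['ex:Anything']
```

Every member of "ex:Patient" is also a member of "ex:Anything", which is
the kind of subclass relationship the proposal aims to carry over from the
XML Schema type hierarchy.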

This work has begun with the development of XSLT transforms that convert
XML Schemas and XQuery formal language type declarations into OWL Class
definitions: http://www.openhealth.org/WOWG/XSDtoSchema.xsl and
http://www.openhealth.org/WOWG/RNGtoSchema.xsl. Although these transforms
are not yet complete, they outline how the proposed solution would work and
how an OWL processor might actually go about deciding, for example, whether
a piece of XML belongs to a particular class. Note that this approach will
work both for classes defined by an XML Schema QName and for classes written
directly in OWL.

Jonathan Borden, M.D.
Assistant Professor of Neurosurgery
Tufts-New England Medical Center
Boston MA
jonathan@openhealth.org

References:

[1] Groves: http://www.oasis-open.org/cover/groves.html
[2] Groves illustrated: http://www.cogsci.ed.ac.uk/~ht/grove.html
[3] The XML grove as RDF: http://www.openhealth.org/XSet
[4] RDF Schema for XML Infoset: http://www.w3.org/TR/xml-infoset-rdfs
[5] XML Schema Part 1: http://www.w3.org/TR/xmlschema-1
[6] XQuery formal semantics: http://www.w3.org/TR/query-semantics/
[7] DOM Level 3 Abstract Schema:
http://www.w3.org/TR/DOM-Level-3-ASLS/abstract-schemas.html

Received on Saturday, 13 April 2002 13:42:53 UTC