wp5 notes on schema mapping from Dan Brickley on 2003-04-02 (public-esw@w3.org from April 2003)

From: Dan Brickley <danbri@w3.org>
Date: Wed, 2 Apr 2003 03:59:50 -0500
To: public-esw@w3.org
Message-ID: <20030402085949.GB17126@tux.w3.org>
(recovered belatedly from my laptop; from a meeting at Stilo on Wp5)


Questions
---------

Is there a line between 'multi-namespace chaos' that RDF is good for, vs 
static, tightly controlled homogeneous namespaces, where schema annotation stops making sense?

If you have a data model, how to get to your concrete xml schema encoding, plus 
whatever annotations are needed to get a round trip? (wizard)

sb "writing schemas is quite difficult. People tend to think about instances and 
then work backwards. People get started by creating instances and then 
reverse-engineering the schema."
db "do they do it well?" 
s "not bad..."

purchase orders
 buyers, sellers, items
 a po has one or more items
... this tells us nothing about what a po document tells us

looking at po

...issue of metadata about the doc, header info etc., which 
tends to come at the beginning of the document considered as a
tree, and mixes in with the 'data proper'.


from po.xml
  <purchaseOrder orderDate="1999-10-20">
  <shipTo country="US">
  <name>Alice Smith</name> 
  <street>123 Maple Street</street> 
  <city>Mill Valley</city> 
  <state>CA</state> 
  <zip>90952</zip> 

  </shipTo>

if po.xml is an xml document, ie a member of class eg:PODoc, 
is shipTo an attribute/property of that thing, or of some other
entity which the eg:PODoc describes?

ie. 
<rdf:Description rdf:about="po.xml">
 <eg:shipTo>
   <rdf:Description>
     <eg:country>US</eg:country>
      ....


<rdf:Description rdf:about="po.xml">
 <x:descriptionOf>
 <eg:PurchaseOrder>
   <eg:shipTo>
     <rdf:Description>
       <eg:country>US</eg:country>
        ....
   </>
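
A minimal sketch of the two readings as plain triple sets, to make the difference concrete. The namespace URIs for eg: and x: and the blank node labels are assumptions made for illustration; only the property and class names come from the examples above.

```python
# Hypothetical namespace URIs standing in for the eg: and x: prefixes.
EG = "http://example.com/eg#"
X = "http://example.com/x#"


def reading_one(doc):
    """Reading 1: shipTo is a property of the XML document itself."""
    addr = "_:addr"  # blank node label, illustrative only
    return {
        (doc, EG + "shipTo", addr),
        (addr, EG + "country", "US"),
    }


def reading_two(doc):
    """Reading 2: the document describes an order; shipTo belongs to the order."""
    order, addr = "_:order", "_:addr"
    return {
        (doc, X + "descriptionOf", order),
        (order, "rdf:type", EG + "PurchaseOrder"),
        (order, EG + "shipTo", addr),
        (addr, EG + "country", "US"),
    }


def subjects_of(triples, predicate):
    """Return every subject that carries the given predicate."""
    return {s for s, p, o in triples if p == predicate}
```

Asking subjects_of(..., EG + "shipTo") gives the document in the first reading and the order node in the second, which is exactly the choice the question above is about.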




[brian arrives]


Can we write po.xsd po.dtd po.rng po.xtr
...so that they have the same 'extensions' ie pick out 
the same xml docs as valid instances?

given a dtd, can produce an xsd with the same class extension

for some xsds, can do reverse?
q: namespace prefixes, for example.

brian: 
 the class of classes you can describe within dtds is entirely 
 contained within xsd.

taking anything in the dtd class you can do a purely syntactic 
change to get the xsd equivalent. 
A mapping, 

	D: dtd -> xsd
	such that
	L(d) = L(D(d))

	(L being legal extension / Language)

brian/s agrees
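
A toy sketch of what a purely syntactic D might look like, for a tiny subset of DTD only: element declarations whose content model is either (#PCDATA) or a flat comma-separated sequence with optional ?, * or + indicators. The function name dtd_to_xsd, the sample DTD and the restriction to this subset are all assumptions made for illustration; attributes, choice groups, mixed content and entities are not handled.

```python
import re
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)

# DTD occurrence indicator -> (minOccurs, maxOccurs)
OCCURS = {
    "": ("1", "1"),
    "?": ("0", "1"),
    "*": ("0", "unbounded"),
    "+": ("1", "unbounded"),
}

# Matches only simple declarations such as <!ELEMENT a (b, c+)>
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\w+)\s+\(([^)]*)\)\s*>")


def dtd_to_xsd(dtd_text):
    """Translate the supported DTD subset into an xs:schema element tree."""
    schema = ET.Element(f"{{{XS}}}schema")
    for name, model in ELEMENT_DECL.findall(dtd_text):
        el = ET.SubElement(schema, f"{{{XS}}}element", name=name)
        if model.strip() == "#PCDATA":
            el.set("type", "xs:string")
            continue
        ctype = ET.SubElement(el, f"{{{XS}}}complexType")
        seq = ET.SubElement(ctype, f"{{{XS}}}sequence")
        for particle in model.split(","):
            # Each particle must be a bare element name plus an optional indicator.
            m = re.fullmatch(r"\s*(\w+)([?*+]?)\s*", particle)
            lo, hi = OCCURS[m.group(2)]
            ET.SubElement(seq, f"{{{XS}}}element",
                          ref=m.group(1), minOccurs=lo, maxOccurs=hi)
    return schema


# Illustrative DTD: "a po has one or more items".
PO_DTD = """
<!ELEMENT purchaseOrder (shipTo, item+)>
<!ELEMENT shipTo (name, street?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT item (#PCDATA)>
"""
```

ET.tostring(dtd_to_xsd(PO_DTD), encoding="unicode") gives the schema text. Whether L(d) = L(D(d)) holds for the full DTD language is the claim discussed above; this sketch does not establish it.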


	S: xsd -> dtd
	such that 
	L(s)  subsetof<  L(S(s))
     &  forall d' in 


mapping such that set of xsd describable is ... of dtd


counter example: character Entities

(see mathml, html for eg)

can we think of this as a preprocessing stage?

xsd's view: you can have a dtd as well as a schema, for entity stuff




<!ELEMENT eg:thing>

we can write an xml schema that accepts xmlns:eg 
but it will also accept xmlns:eg2, so long as namespace URIs are
same.

any document that's acceptable via a dtd, we can have a 
schema that generates exactly the same extension. (?except ns)

S: if there are no namespaces in the dtd, we can generate a schema
that has exactly same extension.
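
A small sketch of the namespace point, using Python's standard xml.etree.ElementTree: a namespace-aware parser reports expanded names (URI plus local name), so the prefix drops out. The two sample documents and the URI http://example.com/ns are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Same namespace URI, different prefixes (illustrative documents).
DOC_EG = '<eg:thing xmlns:eg="http://example.com/ns"/>'
DOC_EG2 = '<eg2:thing xmlns:eg2="http://example.com/ns"/>'


def expanded_names(xml_text):
    """List element names in {namespace-uri}local form, in document order."""
    return [el.tag for el in ET.fromstring(xml_text).iter()]
```

Both documents yield the same list of expanded names, which is why a namespace-aware schema cannot tell eg: from eg2:, while a DTD, matching on the literal qualified name, can.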



<n:a xmlns+:n="myurl">
  <n:b ...
  </n:b>
</n:a>


<m:a xmlns+:n="myurl">
  <n:b ...
  </n:b>
</m:a>
(or omitting the ns decl)
  
Are these equal, equiv etc in any 
sense? 

(generate same psvi, for eg? or same infoset?)
(xml c14n same?)

eg. ' vs "


how far in this direction does xml canonicalisation go?
(@@todo)
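
One way to probe this, sketched with the canonicalize function in Python's xml.etree.ElementTree (Python 3.8 or later, implementing C14N 2.0). The sample documents are simplified variants of the two above, with each prefix declared consistently and an absolute URI in place of "myurl"; both are assumptions made for illustration.

```python
from xml.etree.ElementTree import canonicalize  # available from Python 3.8

# Two documents differing only in namespace prefix (illustrative).
DOC_N = '<n:a xmlns:n="http://example.com/myurl"><n:b/></n:a>'
DOC_M = '<m:a xmlns:m="http://example.com/myurl"><m:b/></m:a>'

# Two documents differing only in attribute quoting.
SINGLE = "<a x='1'/>"
DOUBLE = '<a x="1"/>'


def same_canonical_form(doc1, doc2, **options):
    """True when both documents serialise to the same canonical string."""
    return canonicalize(doc1, **options) == canonicalize(doc2, **options)
```

Under these assumptions, quoting differences disappear in the canonical form, while prefix differences survive by default and only disappear when the optional rewrite_prefixes=True setting is used.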

'there is some strange sense in which these are the
same document. unfortunately you can write a 
dtd that accepts one and rejects the other'.

Do these have the same PSVI?
problem: PSVI is tech specific to XML Schema.
We want something common across all xml document typing tech.?

Is there some (XML 1.0+ns based) characterisation of the commonality
between these two/three/etc documents?

What is common across the whole space? 

PSVI is stated only as 'what happens for xml schemas', perhaps there
 should be a generalised statement of this, ie. that the above 2 egs
give the same canonicalised-in-some-sense representation.


From an RDF perspective, we need to get from infosets to 
a set of RDF statements about the world, and then ask
whether the two sets have the same truth conditions / make 
the same claims about the world. Could one be false while the 
other is true... etc
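
A rough sketch of that comparison, assuming a deliberately naive mapping: each element becomes a node identified by its position in the tree, attributes and leaf child elements become literal-valued properties, and other child elements become edges. The mapping, the function name naive_triples and the sample documents are all assumptions, not a proposal from the meeting.

```python
import xml.etree.ElementTree as ET


def naive_triples(xml_text):
    """Map an XML document to a set of (subject, property, value) triples."""
    root = ET.fromstring(xml_text)
    triples = set()

    def walk(el, node_id):
        for name, value in el.attrib.items():
            triples.add((node_id, name, value))
        for i, child in enumerate(el):
            if len(child) == 0 and not child.attrib:
                # Leaf element with no attributes: treat its text as a literal.
                triples.add((node_id, child.tag, (child.text or "").strip()))
            else:
                child_id = f"{node_id}/{i}"  # positional id, stands in for a blank node
                triples.add((node_id, child.tag, child_id))
                walk(child, child_id)

    walk(root, "_:root")
    return triples


# Same content, different prefix and attribute quoting (illustrative).
PO_A = ('<p:purchaseOrder xmlns:p="http://example.com/po" orderDate="1999-10-20">'
        '<p:shipTo country="US"><p:city>Mill Valley</p:city></p:shipTo>'
        '</p:purchaseOrder>')
PO_B = ("<q:purchaseOrder xmlns:q='http://example.com/po' orderDate='1999-10-20'>"
        "<q:shipTo country='US'><q:city>Mill Valley</q:city></q:shipTo>"
        "</q:purchaseOrder>")
```

With this mapping the two documents produce equal triple sets, so under the naive reading they make the same claims; a richer mapping would need real blank-node comparison rather than positional ids.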


==========


sticking with po.xml and po.xsd

we want to be able to _generate_ a sensible po.xsd
but starting from our uml/rdf/etc model

triples:

(i) payload of the instance data as rdf statements
(ii) rdf schema statements implied by the instance data
   (property, class skeletal definitions; that the domain includes ...)
(iii) more statements, not implied by instance data, 
      that give domain/range for these properties
(iv) statements about classes of xml document, eg. 
   a PODocType?


If we....

 - have an ontology/schema for purchase order world
 - have picked a schema language (XSD)
 - have picked a serialization strategy / xml writing convention 
   (eg. no attributes, edges-encode-properties)
 - (anything else?)

...what do we need before we can (auto)generate an XSD?

- need to choose a root class (or is this arbitrary?)
- need one root class from each disconnected segment...
(because serializer could be serializing a disjoint graph)

The classes and properties may be disconnected at schema level
...also
The individuals and relations may or may not be connected.
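
A sketch of the root-picking step, assuming the model is given as a set of class names plus (domain, range) edges for its properties. It finds the disconnected segments and picks one root per segment, preferring a class that nothing points at. The preference rule and the sample classes are assumptions; the notes leave open whether the choice is arbitrary.

```python
from collections import defaultdict


def disconnected_segments(nodes, edges):
    """Group nodes into connected components, ignoring edge direction."""
    adjacent = defaultdict(set)
    for source, target in edges:
        adjacent[source].add(target)
        adjacent[target].add(source)
    seen, segments = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacent[current] - component)
        seen |= component
        segments.append(component)
    return segments


def pick_roots(nodes, edges):
    """One root per segment: a node with no incoming edge, else the first by name."""
    targets = {target for _, target in edges}
    roots = []
    for component in disconnected_segments(nodes, edges):
        candidates = sorted(component - targets) or sorted(component)
        roots.append(candidates[0])
    return roots


# Illustrative model: a purchase-order segment and an unrelated wine segment.
CLASSES = {"PurchaseOrder", "Address", "Item", "Wine", "Region"}
PROPERTIES = [("PurchaseOrder", "Address"), ("PurchaseOrder", "Item"),
              ("Wine", "Region")]
```

The same functions apply at instance level, with individuals as nodes and relations as edges.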


"although this is same as in rdf, someone looking at the 
instance data may be puzzled if it 'starts in wrong place'".

eg. if shipTo has the PO xml-inside it.

The property/edge/element names encode assumptions about directedness, 
and about the use of the document.

Example from Professional XML Schema book, re RDBMS mappings:
ch12 creating XML Schema from existing databases.

...generate several different xml schemas from same data, for 
different purposes.

RDF selling pt: it's an account of what all the instance data from
 these various instance formats have in common.

<e:Document dc:title="...">
 <e:author>
  <e:Person foaf:name="Tim">

...this couples our serialisation strategy to choice of 
namespace / vocab.

<e:Document dc:title="...">
 <e2:wrote x:map="inverse">
  <e:Person foaf:name="Tim">

...we're free (in principle) to do this. But it's ugly 
and not typical colloquial XML.

We can say in OWL

<rdf:Property rdf:about="http://example.com/e#author">
  <owl:inverseOf rdf:resource="http://example.com/e2#wrote"/>
</rdf:Property>
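
A sketch of what the inverse declaration buys us, assuming triples are held as plain tuples: statements made with e2:wrote can be rewritten as e:author statements with subject and object swapped, so both serialisations reduce to the same data. The function name normalise and the blank node labels are illustrative assumptions.

```python
E = "http://example.com/e#"
E2 = "http://example.com/e2#"

# Property -> the inverse property we prefer to store (from the OWL statement above).
INVERSE_OF = {E2 + "wrote": E + "author"}


def normalise(triples, inverse_of=INVERSE_OF):
    """Rewrite triples using a known inverse so only one direction remains."""
    result = set()
    for subject, predicate, obj in triples:
        if predicate in inverse_of:
            result.add((obj, inverse_of[predicate], subject))
        else:
            result.add((subject, predicate, obj))
    return result


# Document-first and person-first descriptions of the same fact.
LIBRARY_VIEW = {("_:doc", E + "author", "_:tim")}
DIRECTORY_VIEW = {("_:tim", E2 + "wrote", "_:doc")}
```

After normalisation the two views are the same set, independent of which class the XML happened to put at the top of the tree.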


Hypothesis: people create vocabulary (xml elements and hence implied 
RDF properties, if we take a naive mapping approach)
...where they start with classes they're more concerned about, and 
put inside their xml-encoded descriptions mentions of instances of 
less interesting-to-them classes.
So, a library might have Document at the top of the xml tree,
which leads them to use an 'author' relation.

A white pages directory might start with people, and have 
a 'wrote' relation pointing to docs.

This relates to expected search strategies
 
 - do i look for papers written by ?
 or 
 - documents about ?




Serialization strategy depends on expected usage.

We're generating from RDF world, an annotated XSD which 
includes hints, mapping rules, xslt etc that lets us get our 
RDF out again.

We could generate:

<e:Document dc:title="...">
 <e2:wrote x:map="inverse">
  <e:Person foaf:name="Tim">

or even (though evil)

<e:Document dc:title="...">
 <e2:wrote>
  <e:Person foaf:name="Tim">


or 
 <e2:wrote>
 <e:Book foaf:name="Timetable">
 <e:Person foaf:name="Tim">
 </e2:wrote>

or

 <e2:wrote>
 <e:Person foaf:name="Tim">
 <e:Book foaf:name="Timetable">
 </e2:wrote>

 <s:claim>
 <e2:wrote/>
 <e:Person foaf:name="Tim">
 <e:Book foaf:name="Timetable">
 </s:claim>
 <!-- polish form -->


 <s:claim>
 <s:rel reluri="e2:wrote"/>
 <e:Person foaf:name="Tim">
 <e:Book foaf:name="Timetable">
 </s:claim>

 <rdf:Statement>
  <rdf:predicate rdf:resource="http://example.com/e#wrote"/>
   <!-- ... -->
  </rdf:Statement>


OpenMath adopts a similar very generalised style.

 <s:claim>
 <s:rel reluri="e2:wrote"/>
  <s:obj objuri="e:Person" foaf:name="Tim"/>
 </s:claim>

...things become very regular, and data is pushed into 
content rather than markup.

Similar strategy seen in RDF SQL triplestores, where
the anticipated schema becomes general, and the content
does all the work.
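
A minimal sketch of that triplestore pattern using Python's built-in sqlite3 module. The single-table layout and the helper names are assumptions for illustration, not a description of any particular RDF store.

```python
import sqlite3


def make_store():
    """Create an in-memory store whose whole schema is one generic table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)"
    )
    return db


def add(db, subject, predicate, obj):
    """Record one statement; the vocabulary lives in the rows, not the schema."""
    db.execute("INSERT INTO triples VALUES (?, ?, ?)", (subject, predicate, obj))


def objects(db, subject, predicate):
    """Return all objects for a subject/predicate pair, sorted for stable output."""
    rows = db.execute(
        "SELECT object FROM triples WHERE subject = ? AND predicate = ? "
        "ORDER BY object",
        (subject, predicate),
    )
    return [row[0] for row in rows]
```

Adding a new property such as e2:wrote needs no schema change here, which is the same trade the s:claim / s:rel markup makes: the structure stays fixed and the content carries the meaning.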

"Deep embedding"





Looking at Henry's work:

Q: how tied to XML Schema is this? eg. need for PSVI... maps on types as well as 
elements and attributes.

Q for Henry: in po-mapped.xml why 
- <ns_2:shipTo xmlns:ns_2="" country="US" map:item-to="property" map:item-name="" map:minOccurs="" map:maxOccurs="" map:type-to="" map:type-name="{}type.Address.1096">
...is country still an attribute, not normalised?


Re the generated Java, what's the purpose?  Why aren't property names apparent?
Why not use Java classes more explicitly?


What's the value of creating mapping to java objects, versus using Java interfaces 
to the original data, XML (SAX, DOM), RDF etc?

Comparison: SOAP serializers that dump Java OO stuff into XML (-> WP5)

notes: Schema Adjunct can map to SQL...



Next steps:

make the report page into a table of contents. Separate docs for dan, brian, stephen

aim to release draft for review in 2 weeks time.

Next meeting: feb 13th, review and publish meeting. Stilo 10.15am 2003-02-13.


Possible Stilo staff: Steve Healey


Examples / test data:

 - PO and other Edinburgh stuff (quicken?)
 - Doc/Person/wrote example + illustration (also bibliography/RAL)
 - Wine ontology simple egs. (8 line DTD)
 - projects/people/docs

more real world examples:
 - danbri: wsdl, rss, calendar (ongoing not per feb deadline)
 - ral: cerif (common euro research info format sql/xml and rdf reps)
Received on Wednesday, 2 April 2003 03:59:50 UTC