Information components (DTD fragments) from Peter Murray-Rust on 1997-05-04 (w3c-sgml-wg@w3.org from May 1997)

From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
Date: Sun, 04 May 1997 19:31:03 GMT
To: w3c-sgml-wg@w3.org
Message-Id: <6135@ursus.demon.co.uk>

I would like to develop a protocol (?manifesto :-) for the use of
WF-XML information components ('tagsets', 'DTD fragments') and hope that
I can explain my position here.  I feel sufficiently confident of the
potential value of this that I am announcing (later today) the Virtual
School's (practical) offering in this area, an 8-week virtual course on:
	"Scientific Information Components using Java and XML"
at the VSMS site.  I apologise for 'advertising', but the spinoff 
will be that all the discussion will be publicly archived and therefore may 
be of benefit to the members of this group and the ERB.

Although some of this may appear off-topic, the key point (which is 
essential for XML) is how we can reliably and robustly resolve semantics
in XML without a DTD (as I've said, a DTD isn't much use to my community).
This is not meant to be confrontational - it's simply opening up a huge
field which hasn't been available to XML/SGML up to now.

<BASELINE>
My approach is aimed at people who are HTML-proficient, but have never seen
SGML or XML before.  I shall refer to this process and people as HTML2XML 
(H2X).  I shall use an 'extremely simple' subset of XML.
<AXIOM>
H2X are capable of understanding simpleXML.
</AXIOM>
<AXIOM>
H2X can be persuaded of the value of well-formed documents.  They care about
errors.
</AXIOM>
<AXIOM>
The formal DTD plays no CLIENT-SIDE part in this process.
</AXIOM>
<AXIOM>
Architectural Forms, which might form part of an alternative approach, are 
not yet part of XML and neither I nor my community understand them yet or 
have tools to use them.
</AXIOM>
</BASELINE>

<PROPOSAL>
We define a protocol for authenticating Well-formed documents (AWF).
This includes
- identification of the DTD (or more likely DTDs) whose tagsets are used
in the document.
- certification that the document is well-formed and a *transmission*
validation mechanism client-side (e.g. checksum - I'm not an expert here)
- a means of stamping the document as AWF
- means of resolving those mechanisms used to apply semantics.  These might
include:
	- textual annotations of the DTD
	- glossary resolution of terms (e.g. Virtual HyperGlossary)
	- algorithm/java resolution of algorithmic semantics (URL-based 
		*.class)
</PROPOSAL>
<FOOTNOTE>
This depends on having a mechanism for resolving resources (PUBLIC
identifiers, URNs, etc.).  Without that, XML may rapidly come to be seen
as simply a transport protocol, IMO.
</FOOTNOTE>

When the client receives a document it is stamped as AWF-XML.  This means
that *no XML validation is required*.  (The XML validation can be done 
author-side, server-side or produced by a database engine.  Whether XML 
wishes to prescribe the mechanisms for stamping I leave unanswered ... but 
I think it should define that there *are* such mechanisms.)  The client
validates that *no transmission error* has occurred.
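
A minimal sketch of the client-side transmission check, assuming the stamp carries a digest of the exact text that was validated upstream.  The class name AwfStamp, its fields and the choice of SHA-256 are illustrative assumptions; the proposal itself only says "e.g. checksum".

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical AWF stamp: the DTD identifiers used plus a digest of the document text. */
final class AwfStamp {
    final String[] dtdIdentifiers;
    final String sha256Hex;

    AwfStamp(String[] dtdIdentifiers, String sha256Hex) {
        this.dtdIdentifiers = dtdIdentifiers;
        this.sha256Hex = sha256Hex;
    }

    /** Author/server side: validate by whatever means, then stamp the exact text sent. */
    static AwfStamp stamp(String document, String... dtdIdentifiers) {
        return new AwfStamp(dtdIdentifiers, digest(document));
    }

    /** Client side: no XML validation, only a check that the text arrived unchanged. */
    boolean matches(String receivedDocument) {
        return sha256Hex.equals(digest(receivedDocument));
    }

    /** Lower-case hex SHA-256 of the UTF-8 encoding of the text. */
    static String digest(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required of every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

A client would call matches() on the received text and only go on to resolve semantics if it returns true.  A bare digest of this kind detects accidental corruption; it says nothing about who applied the stamp.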

The client then reads/processes/feeds_to_nuclear_reactor the document.
If the document fails its AWF certification, 
then goto dogshit/error_recovery/extermination.  Let's assume it passes.

The client needs to resolve the semantics.  It/she *may* be quite happy
to read the document as is (after all there are (hopefully) WF elements
in *this* document.  Do you need a DTD to impose the semantics?).  More
likely they need to know what DTDs were used to define those tags.

At present JUMBO uses the DOCTYPE statement to identify the DTD used.  That's
ALL it does with the DOCTYPE.  However it then tries to resolve the resources
belonging to that DTD.  For example, with <!DOCTYPE PLAY> it loads
PLAY.class  which contains STAGEDIR.class and other tag-related semantics.
It loads it by looking up the address where  PLAY.class is stored - in 
principle this can be determined by CATALOG, URL/Ns, or any of the other
mechanisms we have debated.  I think it is not unreasonable to assume we
can have a resolution mechanism for (say) 
<!DOCTYPE CML "-//Chemical Markup Language//DTD//EN" ...>
which retrieves CML.class and the associated *.class files.  In the case of CML there
is one *.class per Element, so that (say) <CRYST> has its behaviour determined
by pmr.cml.CRYST.class, XLIST has pmr.tecml.XLIST.class and so on.
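
The lookup described above might be sketched as follows.  This is not JUMBO's code: the DoctypeResolver class, its package table and the search order are assumptions made for illustration, and only the package names pmr.cml and pmr.tecml come from the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DoctypeResolver {
    // Hypothetical table: DOCTYPE name -> packages searched for per-element classes.
    // The search order is an assumption.
    static final Map<String, List<String>> PACKAGES =
            Map.of("CML", List.of("pmr.cml", "pmr.tecml"));

    private static final Pattern DOCTYPE =
            Pattern.compile("<!DOCTYPE\\s+([A-Za-z_][\\w.-]*)");

    /** Returns the name declared in the DOCTYPE statement, or null if there is none. */
    static String doctypeName(String document) {
        Matcher m = DOCTYPE.matcher(document);
        return m.find() ? m.group(1) : null;
    }

    /** Candidate class names for an element, e.g. CRYST -> pmr.cml.CRYST. */
    static List<String> candidates(String doctype, String gi) {
        List<String> names = new ArrayList<>();
        for (String pkg : PACKAGES.getOrDefault(doctype, List.of())) {
            names.add(pkg + "." + gi);
        }
        return names;
    }

    /** Loads the first candidate found on the class path, or returns null. */
    static Class<?> load(List<String> classNames) {
        for (String name : classNames) {
            try {
                return Class.forName(name);
            } catch (ClassNotFoundException e) {
                // Not available locally; a fuller resolver might fetch it by URL here.
            }
        }
        return null;
    }
}
```

Here the DOCTYPE is used only as a key into the table, which matches the "that's ALL it does" usage above.  Where the classes actually live (CATALOG, URL, URN) is left to load().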

So we have a mechanism for adding semantics on a per-tag basis.  The behavior
of XLIST is determined by the *.java.  This is NOT a hardcoded solution
as it's possible to override the XLIST.class with your own.  Whether this
is allowed depends on the community.  It looks to me at least as robust
as implementing semantics from a non-annotated DTD or from a textual/prose
annotation which may often be difficult to obtain. [For example, when I got
Panorama, I got lots of DTDs, but - at least the way I got it - there was
no per-element documentation for each of them.  And, even with such a 
well used DTD as HTML, I have had the greatest difficulty in determining the
semantics of REL, REV and META.  META, for example, is a license to issue 
semantically unresolved NAME/VALUE pairs.  (Yes, I know and approve of the 
Dublin Core effort :-)]
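
The "not hardcoded" point about overriding XLIST.class can be illustrated with a small registry in which a later registration replaces an earlier one.  TagSemantics and its methods are hypothetical names, and a function on strings stands in for whatever a real per-element class would do.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class TagSemantics {
    private final Map<String, Function<String, String>> handlers = new HashMap<>();

    /** Registers the behaviour for one tag (GI), replacing any earlier one. */
    void register(String gi, Function<String, String> handler) {
        handlers.put(gi, handler);
    }

    /** Applies the tag's behaviour to its content; unknown tags pass content through. */
    String process(String gi, String content) {
        return handlers.getOrDefault(gi, Function.identity()).apply(content);
    }
}
```

A community that does not want its semantics overridden could simply refuse a second register() call for the same tag; that policy decision sits outside the mechanism.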

Now let's consider 'tag soup' derived from more-than-one-DTD.  Suppose a 
chemist writes a poem about a compound - it happens occasionally.  It would
be pointless revising TEI to accommodate chemistry, and vice versa for CML
to accommodate verse.  So I might write the XML below (the tagset is a 
mixture of 'A Gentle Introduction to SGML' (<STANZA>, <LINE>) and CML).  
I forget the author of the verse - it was a Christmas competition from the
Chem Soc about 30 years ago, I think.

<XLIST>
<POEM>
<STANZA>
	<LINE>A mosquito was heard to complain</LINE>
	<LINE>The chemists have poisoned my brain</LINE>
	<LINE>The cause of his sorrow</LINE>
	<LINE>was <XPTR HREF="DDT" ACTUATE="USER">para-dichloro</XPTR></LINE>
<!-- apologies if the syntax of XPTR is wrong; the HREF attribute name is a guess -->
	<LINE>diphenyl-trichloro-ethane</LINE>
</STANZA>
</POEM>
<MOL ID="DDT">
<FORMULA>
<XVAR CONVENTION="SMILES">C(Cl)(Cl)(Cl)C(c1ccc(Cl)cc1)c1ccc(Cl)cc1</XVAR>
</FORMULA>
</MOL>
</XLIST>

(Note that since XLIST has a content model of ANY in TecML, the above 
would be *valid* CML so long as the TEI DTD was included in TecML.  Forgive
me if STANZA, etc. are not part of TEI...)

This is slightly artificial, but the need to combine components of different
DTDs will be enormous.  The document above is WF, the only question being how
to resolve the semantics, since it doesn't come from a single DTD.
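
Checking well-formedness needs no DTD at all, which a non-validating parser can demonstrate.  The sketch below uses the standard Java SAX API; the class name WellFormedCheck is an assumption, and the test strings in the usage note are cut-down fragments rather than the full poem.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedCheck {
    /** True if the text parses as well-formed XML; no DTD is consulted or required. */
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(false);
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // DefaultHandler rethrows fatal errors, which include well-formedness errors.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A mixed fragment such as `<XLIST><POEM><LINE>x</LINE></POEM><MOL ID="DDT"/></XLIST>` passes, whichever DTDs its tags came from.  An unclosed `<LINE>`, or a start tag carrying a value with no attribute name, does not.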

<REQUIREMENT>
It should be possible to label any Element in an XML document with knowledge
of which DTD it belongs to, and enough ancillary information to resolve it.
</REQUIREMENT>

So a possible solution would be something like:
<MOL DTD="http://www.vsms.nottingham.ac.uk/vsms/pmr/cml/MOL.class">
which uniquely defines the semantics of the components under MOL.  This
could be made more pleasing to the eye by using entities...
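
One way a client might apply such a label to "the components under MOL" is to walk up from any element to the nearest ancestor that carries the attribute.  This is only a sketch using the standard Java DOM API; the class DtdLabels and the inheritance rule itself are assumptions, since the text does not spell out how far a label reaches.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class DtdLabels {
    /** Parses a well-formed document without validating it. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("document could not be parsed", e);
        }
    }

    /**
     * Returns the DTD label governing an element: its own DTD attribute if present,
     * otherwise that of the nearest labelled ancestor, otherwise null.
     */
    static String governingDtd(Element element) {
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            Element e = (Element) n;
            if (e.hasAttribute("DTD")) {
                return e.getAttribute("DTD");
            }
        }
        return null;
    }
}
```

With the MOL start tag above, a FORMULA inside MOL would resolve to the MOL.class URL, while a LINE inside POEM would resolve to nothing until POEM (or an ancestor) is labelled too.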

I accept that AFs and other external mechanisms of imposing structure are
probably 'better', but we don't have them in XML.  I am also a great 
believer in putting everything in a single document where possible, 
particularly for archival.  Thus, it would make sense to append all glossary
entries (in VHG format) to the end of a document which linked to them.  In
50 years' time it would still be possible to see what was meant by a term.

The benefit of the approach I've suggested is that it's fairly self-evident
for an H2X reading the document to see what is going on.  The WF document has 
a meaningful structure, and wherever necessary there are pointers to
resources which resolve the semantics.  If really critical some of these 
could be expanded into the document.

AFAICS it works except for the namespace problem for tags.  If we agree
on a hierarchical convention for tags (e.g. DTD.GI) then it seems we
can code that.  If we *don't* do something about it, then we shall either:
	- be overwhelmed by clashing tag-soup
	- have no effective interoperability
	- build more and more complex DTDs
	- adopt a complex solution which some of us may struggle to understand
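
If a DTD.GI convention were adopted, splitting a tag name into its two parts is easy to code.  The sketch below is one possible reading: the class QualifiedTag is hypothetical, and splitting at the last dot (so that a DTD name may itself contain dots) is an assumption.

```java
final class QualifiedTag {
    final String dtd; // null when the tag carries no DTD prefix
    final String gi;

    private QualifiedTag(String dtd, String gi) {
        this.dtd = dtd;
        this.gi = gi;
    }

    /** Splits "CML.MOL" into dtd "CML" and gi "MOL"; a bare "MOL" has no dtd. */
    static QualifiedTag parse(String tagName) {
        int dot = tagName.lastIndexOf('.');
        if (dot < 0) {
            return new QualifiedTag(null, tagName);
        }
        return new QualifiedTag(tagName.substring(0, dot), tagName.substring(dot + 1));
    }
}
```

Under this reading TEI.LINE and a hypothetical CML.LINE share the GI "LINE" but differ in their dtd part, so they no longer clash.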

Anyway, I shall go ahead with a DTDless-WF approach and see how easily 
H2X people accept it :-)  

	P.


-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/
Received on Sunday, 4 May 1997 14:33:50 UTC