- From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
- Date: Sun, 04 May 1997 19:31:03 GMT
- To: w3c-sgml-wg@w3.org
I would like to develop a protocol (?manifesto :-) for the use of WF-XML information components ('tagsets', 'DTD fragments') and hope that I can explain my position here. I feel sufficiently confident of the potential value of this that I am announcing (later today) the Virtual School's (practical) offering in this area, an 8-week virtual course on "Scientific Information Components using Java and XML" at the VSMS site. I apologise for 'advertising', but the spinoff will be that all the discussion will be publicly archived and therefore may be of benefit to the members of this group and the ERB.

Although some of this may appear off-topic, the key point (which is essential for XML) is how we reliably and robustly resolve semantics in XML without a DTD (as I've said, a DTD isn't much use to my community). This is not meant to be confrontational - it's simply opening up a huge field which hasn't been available to XML/SGML up to now.

<BASELINE>
My approach is aimed at people who are HTML-proficient, but have never seen SGML or XML before. I shall refer to this process and people as HTML2XML (H2X). I shall use an 'extremely simple' subset of XML.

<AXIOM>
H2X are capable of understanding simpleXML.
</AXIOM>
<AXIOM>
H2X can be persuaded of the value of well-formed documents. They care about errors.
</AXIOM>
<AXIOM>
The formal DTD plays no CLIENT-SIDE part in this process.
</AXIOM>
<AXIOM>
Architectural Forms, which might form part of an alternative approach, are not yet part of XML and neither I nor my community understand them yet or have tools to use them.
</AXIOM>
</BASELINE>

<PROPOSAL>
We define a protocol for authenticating well-formed documents (AWF). This includes:

- identification of the DTD (or more likely DTDs) whose tagsets are used in the document.
- certification that the document is well-formed and a *transmission* validation mechanism client-side (e.g. checksum - I'm not an expert here)
- a means of stamping the document as AWF
- means of resolving those mechanisms used to apply semantics. These might include:
  - textual annotations of the DTD
  - glossary resolution of terms (e.g. Virtual HyperGlossary)
  - algorithm/java resolution of algorithmic semantics (URL-based *.class)
</PROPOSAL>

<FOOTNOTE>
This depends on having a mechanism for resolving resources (PUBLIC, URNs, etc.). Without that, XML may rapidly be seen as simply a transport protocol, IMO.
</FOOTNOTE>

When the client receives a document it is stamped as AWF-XML. This means that *no XML validation is required*. (The XML validation can be done author-side, server-side or produced by a database engine. Whether XML wishes to prescribe the mechanisms for stamping I leave unanswered ... but I think it should define that there *are* such mechanisms.) The client validates that *no transmission error* has occurred. The client then reads/processes/feeds_to_nuclear_reactor the document. If the document fails its AWF certification, then goto dogshit/error_recovery/extermination. Let's assume it passes.

The client needs to resolve the semantics. It/she *may* be quite happy to read the document as is (after all there are (hopefully) WF elements in *this* document. Do you need a DTD to impose the semantics?). More likely they need to know what DTDs were used to define those tags.

At present JUMBO uses the DOCTYPE statement to identify the DTD used. That's ALL it does with the DOCTYPE. However it then tries to resolve the resources belonging to that DTD. For example, with <!DOCTYPE PLAY> it loads PLAY.class which contains STAGEDIR.class and other tag-related semantics. It loads it by looking up the address where PLAY.class is stored - in principle this can be determined by CATALOG, URL/Ns, or any of the other mechanisms we have debated.
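
To make the DOCTYPE-to-class lookup concrete, here is a minimal Java sketch. It is not JUMBO's actual code: the class DoctypeResolver, its method names, the in-memory Map standing in for a CATALOG or URL/URN lookup, and the class name pmr.cml.CML are all assumptions made for illustration. It only reads the name from the DOCTYPE statement and looks up which class would carry the semantics; actually fetching and loading that class is left out.

```java
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: map a DOCTYPE name to the class holding its semantics. */
final class DoctypeResolver {
    // Picks out the document type name in a declaration such as <!DOCTYPE PLAY>.
    private static final Pattern DOCTYPE =
        Pattern.compile("<!DOCTYPE\\s+([A-Za-z_][A-Za-z0-9._-]*)");

    // Assumption: a plain map stands in for CATALOG or URL/URN resolution.
    private final Map<String, String> catalog;

    DoctypeResolver(Map<String, String> catalog) {
        this.catalog = catalog;
    }

    /** Returns the name given in the DOCTYPE statement, if there is one. */
    static Optional<String> doctypeName(String document) {
        Matcher m = DOCTYPE.matcher(document);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Returns the class name registered for the document's DOCTYPE, if any. */
    Optional<String> semanticsClassFor(String document) {
        // Optional.map yields an empty Optional when the catalog has no entry.
        return doctypeName(document).map(catalog::get);
    }

    public static void main(String[] args) {
        // "pmr.cml.CML" is an assumed class name, used only for illustration.
        DoctypeResolver resolver =
            new DoctypeResolver(Map.of("PLAY", "PLAY", "CML", "pmr.cml.CML"));
        String doc = "<!DOCTYPE PLAY>\n<PLAY><STAGEDIR>Enter Hamlet</STAGEDIR></PLAY>";
        System.out.println(resolver.semanticsClassFor(doc).orElse("(no semantics registered)"));
    }
}
```

A real client would pass the resolved name to a class loader pointed at the address the catalog supplies; the sketch stops at the name so that it runs without any of those classes being present.
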

I think it is not unreasonable to assume we can have a resolution mechanism for (say) <!DOCTYPE CML "-//Chemical Markup Language//DTD//EN" ...> which retrieves CML.class and associated *.class files. In the case of CML there is one *.class per Element, so that (say) <CRYST> has its behaviour determined by pmr.cml.CRYST.class, XLIST has pmr.tecml.XLIST.class and so on. So we have a mechanism for adding semantics on a per-tag basis. The behaviour of XLIST is determined by the *.java. This is NOT a hardcoded solution as it's possible to override the XLIST.class with your own. Whether this is allowed depends on the community.

It looks to me at least as robust as implementing semantics from a non-annotated DTD or from a textual/prose annotation which may often be difficult to obtain. [For example, when I got Panorama, I got lots of DTDs, but - at least the way I got it - there was not a per-element documentation of each of them. And, even with such a well used DTD as HTML, I have had the greatest difficulty in determining the semantics of REL, REV and META. META, for example, is a license to issue semantically unresolved NAME/VALUE pairs. (Yes, I know and approve of the Dublin Core effort :-)]

Now let's consider 'tag soup' derived from more-than-one-DTD. Suppose a chemist writes a poem about a compound - it happens occasionally. It would be pointless revising TEI to accommodate chemistry, and vice versa for CML to accommodate verse. So I might write the XML below (the tagset is a mixture of 'A Gentle Introduction to SGML' (<STANZA>, <LINE>) and CML). I forget the author - it was a Christmas competition from the Chem Soc about 30 years ago, I think.

<XLIST>
  <POEM>
    <STANZA>
      <LINE>A mosquito was heard to complain</LINE>
      <LINE>The chemists have poisoned my brain</LINE>
      <LINE>The cause of his sorrow</LINE>
      <LINE>was <XPTR="DDT" ACTUATE="USER">para-dichloro</XPTR></LINE>
      <!-- apologies if the syntax of XPTR is wrong -->
      <LINE>diphenyl-trichloro-ethane</LINE>
    </STANZA>
  </POEM>
  <MOL ID="DDT">
    <FORMULA>
      <XVAR CONVENTION="SMILES">C(Cl)(Cl)(Cl)C(c1ccc(Cl)cc1)c1ccc(Cl)cc1</XVAR>
    </FORMULA>
  </MOL>
</XLIST>

(Note that since XLIST has a content model of ANY in TecML, the above would be *valid* CML so long as the TEI DTD was included in TecML. Forgive me if STANZA, etc. are not part of TEI...)

This is slightly artificial, but the need to combine components of different DTDs will be enormous. The document above is WF, the only question being how to resolve the semantics, since it doesn't come from a single DTD.

<REQUIREMENT>
It should be possible to label any Element in an XML document with knowledge of which DTD it belongs to, and enough ancillary information to resolve it.
</REQUIREMENT>

So a possible solution would be something like:

<MOL DTD="http://www.vsms.nottingham.ac.uk/vsms/pmr/cml/MOL.class">

which uniquely defines the semantics of the components under MOL. This could be made more pleasing to the eye by using entities... I accept that AFs and other external mechanisms of imposing structure are probably 'better', but we don't have them in XML.

I am also a great believer in putting everything in a single document where possible, particularly for archival. Thus, it would make sense to append all glossary entries (in VHG format) to the end of a document which linked to them. In 50 years' time it would still be possible to see what was meant by a term.

The benefit of the approach I've suggested is that it's fairly self-evident for an H2X reading the document to see what is going on. The WF document has a meaningful structure, and wherever necessary there are pointers to resources which resolve the semantics.
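
One way a client might act on the DTD="..." label suggested above is sketched below in Java. This is only an illustration under assumptions: the class DtdLabelResolver and its methods are hypothetical, and the rule that an element with no label of its own inherits the label of its nearest labelled ancestor is my reading of "the components under MOL", not a settled part of the proposal. The document is parsed without any DTD, relying on well-formedness alone, and the sample is a cut-down version of the poem.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical sketch: find the semantics pointer that applies to an element. */
final class DtdLabelResolver {

    /** Parses a well-formed document; no DTD is consulted. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("document is not well-formed", e);
        }
    }

    /**
     * Returns the DTD attribute of the element itself or, failing that, of its
     * nearest labelled ancestor; null if nothing above it carries a label.
     */
    static String semanticsFor(Element element) {
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            String label = ((Element) n).getAttribute("DTD"); // "" when absent
            if (!label.isEmpty()) {
                return label;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String xml =
            "<XLIST><POEM><LINE>A mosquito was heard to complain</LINE></POEM>"
            + "<MOL ID=\"DDT\" DTD=\"http://www.vsms.nottingham.ac.uk/vsms/pmr/cml/MOL.class\">"
            + "<FORMULA/></MOL></XLIST>";
        Document doc = parse(xml);
        Element formula = (Element) doc.getElementsByTagName("FORMULA").item(0);
        Element line = (Element) doc.getElementsByTagName("LINE").item(0);
        System.out.println("FORMULA -> " + semanticsFor(formula));
        System.out.println("LINE -> " + semanticsFor(line));
    }
}
```

In this cut-down sample FORMULA picks up the pointer from MOL, while LINE has no label anywhere above it, which is exactly the case where the client has to fall back on the DOCTYPE or on reading the tags as they stand.
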
If really critical, some of these resources could be expanded into the document.

AFAICS it works except for the namespace problem for tags. If we agree on a hierarchical convention for tags (e.g. DTD.GI) then it seems we can code that. If we *don't* do something about it, then we shall either:

- be overwhelmed by clashing tag-soup
- have no effective interoperability
- build more and more complex DTDs
- adopt a complex solution which some of us may struggle to understand

Anyway, I shall go ahead with a DTDless-WF approach and see how easily H2X people accept it :-)

P.

--
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/
Received on Sunday, 4 May 1997 14:33:50 UTC