Re: DTD Fragments and XML from Peter Murray-Rust on 1997-05-03 (w3c-sgml-wg@w3.org from May 1997)

From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
Date: Sat, 03 May 1997 20:08:12 GMT
To: w3c-sgml-wg@w3.org
Message-Id: <6101@ursus.demon.co.uk>
In message <336B24F3.CA442B10@calum.csclub.uwaterloo.ca> Paul Prescod writes:
> Let's say that I am an XML user. I am happy with, say, the DocBook DTD,
> but need to insert a chemical formula in CML format. Or perhaps I want
> to insert something more mundane, some small element that does not have
> an expression in DocBook: <GRADE> for a student's grade on a project. In
> the SGML world I would combine the two DTDs manually. This is probably a
> painful process of examining content models and parameter entities and
> finding the right place to shoe-horn in my element type. 

This is an extremely important question to me - I'm not sure whether it's
on topic, so please shut me up if appropriate...  It is an opportunity
to describe where I have arrived at with CML and see whether it represents a
practical way forward. 

CML actually consists of three DTDs:
	HTML2.0 (3.2 when I get around to writing some rendering code or,
		better, finding someone else who has already done it).
	TecML (orig. XML, no relation :-) )
	MOL
They have been designed so that they can be extended in several ways.
(Before everyone tells me I should have been using Architectural Forms and
HyTime for this, please remember I'm still learning about them.  I also have
to use something that is understandable by my community.)  

<AXIOM>
Every tag has to have code to process its semantics.  The greater the variety
of tags the greater the problem.
</AXIOM>
[Of course there are ways of reducing this, such as stylesheets and 
inheritance.  However stylesheets primarily deal with one sort of tag - text -
from which everything else is inherited.  An object like <REACTION> requires
more than current stylesheets deliver.]

The consequence is that there is a direct interaction between the tagset and
the postprocessing code - the DTD is largely irrelevant.  I started off with
a medium hardcoded DTD which I kept relaxing.  For example, can a MOLECULE
contain a PERSON? (ans.  quite easily - the person who made it, or sells it).
Can a PERSON contain a MOLECULE?  (yes, we're made up of them).
Indeed I could not think of useful content models which some application
didn't require me to break later.   Therefore one logical conclusion is that
the DTD looks like:

<!ELEMENT OBJECT (#PCDATA|OBJECT)*>
<!ATTLIST OBJECT NAME CDATA #REQUIRED>

This has a LISP-like feel so I suppose it would have its supporters.
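
To make the LISP comparison concrete, here is a minimal sketch in Python
(not part of CML or JUMBO) that reads a document conforming to the generic
OBJECT DTD above and prints it as an s-expression.  The sample document and
the function name to_sexpr are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET


def to_sexpr(elem):
    """Render an OBJECT element and its children as a LISP-style s-expression."""
    parts = [elem.attrib["NAME"]]  # NAME is #REQUIRED in the DTD above
    if elem.text and elem.text.strip():
        parts.append(repr(elem.text.strip()))
    for child in elem:
        parts.append(to_sexpr(child))
        # Mixed content: text that follows a child element
        if child.tail and child.tail.strip():
            parts.append(repr(child.tail.strip()))
    return "(" + " ".join(parts) + ")"


# Hypothetical instance: a molecule that contains a person (its supplier)
SAMPLE = (
    '<OBJECT NAME="MOLECULE">'
    '<OBJECT NAME="FORMULA">H2O</OBJECT>'
    '<OBJECT NAME="PERSON">supplier</OBJECT>'
    '</OBJECT>'
)

print(to_sexpr(ET.fromstring(SAMPLE)))
# prints: (MOLECULE (FORMULA 'H2O') (PERSON 'supplier'))
```

All the meaning lives in the NAME values, which is exactly why the
processing code, not the DTD, has to carry the semantics.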

This is too generic for my community so I have tried to construct a few 
generic *Information Components* which apply to the scientific community.
These are, roughly:
	- (hyper) text (inc XML HREF and some XML-LINK)
	- figures (i.e. structured diagrams with internal semantics)
	- images (bitmaps)
	- parsable graphs 
	- parsable tables (unlike HTML3.2 which is often a set of trees, not
		a table)
	- bibliography
	- terminology
	- numeric data of various dimensionality, including UNITS, fuzziness
		etc.
	- relations between objects (XML-LINK)
	- math

and then it would be extended by discipline-specific DTDs (which have to have
their own processing code) like MOL, SMDL, etc.

My hope is that the community will see the benefit of supporting these 
components.  For example a decent HTML3.2 renderer is a key tool.  (I hope
that HTML3.2(XML) will be tweaked in the near future to allow XML-LINK).
Ideally CGM would serve as a tool for structured diagrams (but I haven't
seen a strong groundswell for CGM tools).  The various MATH DTDs (which for
us must include *parsable* math) could be included (though I still have to
work out how to include symbolic variables in text and tables - that is 
dependent on the MATH DTDs).  So TecML would include lines like:

<!ENTITY % isomath 'IGNORE'>
<!ENTITY % w3math 'IGNORE'>

<![%isomath;[
<!ENTITY % mathdtd PUBLIC "+//ISO//DTD MATH//EN">  <!-- I don't know the name -->
%mathdtd;
]]>

<!-- and similarly for the others if they are to be used as alternatives-->

XML has no mechanism for multiple namespaces (I assume SUBDOC and AFs cater
for this?).  Therefore the scientific/technical community could try to
avoid namespace collisions *now*.  It may be impossible without more powerful
tools, but it's worth a try.   If each DTD is limited to (say) about 20 tags
it should be possible.  

I started CML by creating two namespaces, e.g.:
	CML.MOL
	XML.VAR      <!-- *my* XML :-) -->
and this avoided the clash with HTML's VAR, P, TITLE, etc.  I see no reason
why this can't be done, but it's a people problem not a technical one.  *If*
XML has a namespace registry, hierarchical tagsets could be very productive.

<PROPOSAL>
Within the XML/W3C effort, hierarchical tagnamespaces should be investigated.
</PROPOSAL>
[Now, I have been through the effort of trying to get MIME types registered and
fallen at this hurdle, so it's not trivial.  In fact it's quite similar.]
The result could be that CML tags are of the form:
CML.SYMMETRY
whilst maths ones could be
W3MATH.SYMMETRY

If we don't mind long tagnames (and the article of faith says we don't) we 
can use URNs, FPIs, or any other resolving mechanisms in tagnames.
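
As a rough illustration of how a processor might use dotted tagnames, the
Python sketch below looks up the longest registered prefix of a tagname and
returns the owner of that namespace plus the remaining local name.  The
REGISTRY contents and the function name find_namespace are assumptions for
the example; no such registry exists.

```python
# Hypothetical registry: namespace prefix -> who maintains its semantics
REGISTRY = {
    "CML": "chemistry",
    "W3MATH": "maths",
}


def find_namespace(tag, registry=REGISTRY):
    """Return (owner, local name) for the longest registered prefix of a dotted tag."""
    parts = tag.split(".")
    # Try the longest prefix first so nested namespaces such as CML.MOL can win
    for cut in range(len(parts) - 1, 0, -1):
        prefix = ".".join(parts[:cut])
        if prefix in registry:
            return registry[prefix], ".".join(parts[cut:])
    raise KeyError(f"no registered namespace for tag {tag!r}")


print(find_namespace("CML.SYMMETRY"))     # ('chemistry', 'SYMMETRY')
print(find_namespace("W3MATH.SYMMETRY"))  # ('maths', 'SYMMETRY')
```

The two SYMMETRY tags no longer clash, and an unprefixed tag is rejected
rather than silently guessed at.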

<FOOTNOTE>
A worry that I have with XML is that it won't become liberating except for 
very generic textual applications (where HTML makes quite a good job anyway).  
This leads to orthogonal communities using independent (and
doubtless clashing) tagsets.  E.g. the local store uses RETAIL.DTD, the
bookie uses GAMBLE.DTD, the disco uses MUSIC.DTD and these never intersect.
It comes very close to requiring separate plugins for each of these 
applications and XML is an unexciting part of the transport mechanism.
</FOOTNOTE>

So - are there some brave souls who believe that a significant number of
disciplines can interoperate?  My natural grouping would be the major
learned disciplines (though not necessarily with the learned societies
in the vanguard :-( ).  Math, chemistry, engineering, medicine, biology,
etc.

As a result of my analysis I have created a meta-like approach to describing
scientific information, which uses only about 6 tags (and this could be
reduced to about 2 if critical).
	- a generic container (XLIST).  Content model ANY
	- a generic variable (XVAR).  Content model #PCDATA
	- a generic n-dimensional array. Content model #PCDATA (essentially
		shorthand for homogeneous XVAR arrays).
and a few support containers (really to help humans, rather than because
they are needed):
	- BIB - a collection of XVARs relating to a citation
	- TERMENTRY - a glossary entry using XVARs from ISO 12620
	- PERSON - a person
Foreign objects (e.g. images) are handled by XNOTATION.

To have any value these objects must have their semantics resolved.  (*DTDs are
no use at resolving semantics and give no help as to how to do it*.  For
my purposes, therefore, DTDs are largely irrelevant.)
The basic approach is to point these objects at *glossaries*, and this is
now implemented in JUMBO.  For example, instead of using <GRADE> it would be
attractive to refer to the precise terminology for Grade from an 
authoritative body.  Thus the IEEE is in the early stages of developing a
set of standards for distance education, so this concept would be ideally
represented by something like:

<XVAR UNIT="percent" CONVENTION="IEEE.P1484" DICTNAME="GRADE">67</XVAR>

JUMBO can retrieve the glossary contents (e.g. for UNITS) and apply them.
In this way we have a potentially universal system for XML, which is
scalable and robust and maintained independently and distributed by people
who care.  [We are developing this concept in the Virtual HyperGlossary
(which is fundamental to JUMBO) (http://www.venus.co.uk/vhg).]
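
The lookup itself can be sketched in a few lines.  The Python below is an
illustration only (JUMBO is written in Java and its glossary format is not
shown here): the GLOSSARY dictionary, its field names and the placeholder
definition are all assumptions, standing in for an entry that would really
be retrieved from the body that maintains it.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory glossary keyed by (CONVENTION, DICTNAME)
GLOSSARY = {
    ("IEEE.P1484", "GRADE"): {
        "definition": "placeholder text standing in for the official definition",
        "unit_full_scale": {"percent": 100.0},  # assumed: value / full scale = fraction
    },
}


def resolve_xvar(xml_text, glossary=GLOSSARY):
    """Attach glossary semantics to a single XVAR element."""
    elem = ET.fromstring(xml_text)
    key = (elem.get("CONVENTION"), elem.get("DICTNAME"))
    entry = glossary.get(key)
    if entry is None:
        raise LookupError(f"no glossary entry for {key!r}")
    unit = elem.get("UNIT")
    full_scale = entry["unit_full_scale"].get(unit)
    if full_scale is None:
        raise ValueError(f"unit {unit!r} is not defined for {key!r}")
    value = float(elem.text)
    return {
        "term": key[1],
        "definition": entry["definition"],
        "value": value,
        "unit": unit,
        "fraction": value / full_scale,
    }


XVAR = '<XVAR UNIT="percent" CONVENTION="IEEE.P1484" DICTNAME="GRADE">67</XVAR>'
resolved = resolve_xvar(XVAR)
print(resolved["term"], resolved["value"], resolved["unit"])
```

An XVAR that points at an unknown convention or unit fails loudly, which is
the behaviour one wants when the glossary, not the DTD, defines the meaning.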

The only technical problem is glossary resolution, and that seems to me to
be identical with the problem of stylesheets and CATALOG/PUBLIC/SYSTEM, etc.
View a glossary as a stylesheet - a set of terms with associated definitions
and methods - and the problem seems eminently tractable.  IMO it's one
of the simplest ways to get the XML-friendly community working together.

> 
> I don't think that we can really expect that to happen often in the XML
> world. People will just remove the DOCTYPE line and depend on
> well-formedness. But having removed the DOCTYPE line, they have now
> taken all responsibility for the semantics of that document upon
> themselves. It can no longer be validated. The user agent cannot use any

IMO validation by DTD is almost irrelevant for my problems - what can it
add?  I'm busily removing all my attribute namegroups - I would be arrogant
if I thought I'd got those right for all time - and replacing them by CDATA.
This means that new attribute values are added to the postprocessor and editor
(though not directly to the DTD).

**This means that validation is done by the postprocessor**  - in my case the 
Java class for each element.  For example, a molecule is invalid if it
has a different number of atomic x-coordinates from y-coordinates, and it's
impossible for the DTD to specify that.  So complicated Elements have 
1500 lines of Java code of which quite a lot is 'validation'. 
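
A stripped-down version of the coordinate check, sketched in Python rather
than the Java actually used, might look like this.  The element names MOL
and XARRAY, the DICTNAME values "X" and "Y", and the whitespace-separated
array format are assumptions for the example, not the real CML markup.

```python
import xml.etree.ElementTree as ET


def validate_molecule(elem):
    """Return a list of problems a DTD cannot express; an empty list means valid."""
    counts = {}
    for axis in ("X", "Y"):
        array = elem.find(f"./XARRAY[@DICTNAME='{axis}']")
        has_text = array is not None and array.text
        counts[axis] = len(array.text.split()) if has_text else 0
    problems = []
    if counts["X"] != counts["Y"]:
        problems.append(
            f"{counts['X']} x-coordinates but {counts['Y']} y-coordinates"
        )
    return problems


GOOD = (
    '<MOL><XARRAY DICTNAME="X">0.0 1.0</XARRAY>'
    '<XARRAY DICTNAME="Y">0.0 0.5</XARRAY></MOL>'
)
BAD = (
    '<MOL><XARRAY DICTNAME="X">0.0 1.0</XARRAY>'
    '<XARRAY DICTNAME="Y">0.0</XARRAY></MOL>'
)

print(validate_molecule(ET.fromstring(GOOD)))  # []
print(validate_molecule(ET.fromstring(BAD)))   # ['2 x-coordinates but 1 y-coordinates']
```

Both sample documents are well-formed and would pass any DTD with #PCDATA
content for the arrays; only the code can tell them apart.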


> "hard coded" knowledge of the semantics (with no doctype it doesn't know

If there is a resolution mechanism as above then the semantics don't have
to be hardcoded and can be added automagically by downloading Java classes.

> what namespace the gis are from). "Alternate" stylesheets (e.g. text to
> speech) are no longer useful (same problem). Search engines cannot
> depend on the meta-data to be accurate. In short, I've hobbled the
> interoperability of the document. We are all hurt by "tag soup", but
                                                        ^^^^^^^^^

We are only hurt if we don't approach it collaboratively.  IMO information
components will become more important than DTDs and if we can stop
them clashing, then I think XML has an enormous future for DTD-less working.

> perhaps the visually impaired are the most hurt by it. I'm sure they
> don't want to surf the web by continually reconfiguring their browsers
> from scratch because authors have not properly declared their
> namespaces.
> 
> In my mind this is where the rubber meets the road. The rubber is
> generic markup, where the needs of the data are paramount: add an
> element if you need it. The road is the Internet where interoperability
> is paramount. It seems to me that in XML it is too hard to balance these
> factors.
> 
> I don't think that we can make it easy to combine DTDs without changing
> SGML. But maybe we can figure out a way to declare a namespace for
> elements: to "import" element names in a standard way. You wouldn't be
> able to validate the document but at least it would be clear what the
> elements MEAN.

I think we are in complete agreement here - and it's an exciting possibility.

	P.


-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/
Received on Saturday, 3 May 1997 16:27:06 UTC