- From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
- Date: Sat, 03 May 1997 20:08:12 GMT
- To: w3c-sgml-wg@w3.org
In message <336B24F3.CA442B10@calum.csclub.uwaterloo.ca> Paul Prescod writes:
> Let's say that I am an XML user. I am happy with, say, the DocBook DTD,
> but need to insert a chemical formula in CML format. Or perhaps I want
> to insert something more mundane, some small element that does not have
> an expression in DocBook: <GRADE> for a student's grade on a project. In
> the SGML world I would combine the two DTDs manually. This is probably a
> painful process of examining content models and parameter entities and
> finding the right place to shoe-horn in my element type.

This is an extremely important question to me - I'm not sure whether it's
on topic, so please shut me up if appropriate... It is an opportunity to
describe where I have arrived at with CML and see whether it represents a
practical way forward.

CML actually consists of three DTDs:

    HTML2.0 (3.2 when I get around to writing some rendering code or,
      better, finding someone else who has already done it).
    TecML (orig. XML, no relation :-)
    MOL

They have been designed so that they can be extended in several ways.
(Before everyone tells me I should have been using Architectural Forms and
HyTime for this, please remember I'm still learning about them. I also have
to use something that is understandable by my community.)

<AXIOM>
Every tag has to have code to process its semantics. The greater the
variety of tags the greater the problem.
</AXIOM>

[Of course there are ways of reducing this, such as stylesheets and
inheritance. However stylesheets primarily deal with one sort of tag -
text - from which everything else is inherited. An object like <REACTION>
requires more than current stylesheets deliver.]

The consequence is that there is a direct interaction between the tagset
and the postprocessing code - the DTD is largely irrelevant. I started off
with a moderately hard-coded DTD which I kept relaxing. For example, can a
MOLECULE contain a PERSON? (ans. quite easily - the person who made it, or
sells it).
Can a PERSON contain a MOLECULE? (yes, we're made up of them). Indeed I
could not think of useful content models which some application didn't
require me to break later. Therefore one logical conclusion is that the DTD
looks like:

<!ELEMENT OBJECT (#PCDATA|OBJECT)*>
<!ATTLIST OBJECT NAME CDATA #REQUIRED>

This has a LISP-like feel so I suppose it would have its supporters.

This is too generic for my community so I have tried to construct a few
generic *Information Components* which apply to the scientific community.
These are, roughly:
    - (hyper) text (inc XML HREF and some XML-LINK)
    - figures (i.e. structured diagrams with internal semantics)
    - images (bitmaps)
    - parsable graphs
    - parsable tables (unlike HTML3.2 which is often a set of trees, not
      a table)
    - bibliography
    - terminology
    - numeric data of various dimensionality, including UNITS, fuzziness
      etc.
    - relations between objects (XML-LINK)
    - math
and then these would be extended by discipline-specific DTDs (which have to
have their own processing code) like MOL, SMDL, etc.

My hope is that the community will see the benefit of supporting these
components. For example a decent HTML3.2 renderer is a key tool. (I hope
that HTML3.2(XML) will be tweaked in the near future to allow XML-LINK).
Ideally CGM would serve as a tool for structured diagrams (but I haven't
seen a strong groundswell for CGM tools). The various MATH DTDs (which for
me must include *parsable* math) could be included (though I still have to
work out how to include symbolic variables in text and tables - that is
dependent on the MATH DTDs).

So TecML would include lines like:

<!ENTITY % isomath 'IGNORE'>
<!ENTITY % w3math 'IGNORE'>
<![ %isomath; [
<!ENTITY % mathdtd "+//ISO//DTD MATH//EN">  <!-- I don't know the name -->
%mathdtd;
]]>
<!-- and similarly for the others if they are to be used as alternatives-->

XML has no mechanism for multiple namespaces (I assume SUBDOC and AFs cater
for this?).
Therefore the scientific/technical community could try to avoid namespace
collisions *now*. It may be impossible without more powerful tools, but
it's worth a try. If each DTD is limited to (say) about 20 tags it should
be possible. I started CML by creating two namespaces, e.g.:

    CML.MOL
    XML.VAR         <!-- *my* XML :-) -->

and this avoided the clash with HTML's VAR, P, TITLE, etc. I see no reason
why this can't be done, but it's a people problem not a technical one.
*If* XML has a namespace registry, hierarchical tagsets could be very
productive.

<PROPOSAL>
Within the XML/W3C effort, hierarchical tag namespaces should be
investigated.
</PROPOSAL>

[Now, I have been through the effort of trying to get MIME types registered
and fallen at this hurdle, so it's not trivial. In fact it's quite
similar.]

The result could be that CML tags are of the form:
    CML.SYMMETRY
whilst maths ones could be
    W3MATH.SYMMETRY
If we don't mind long tagnames (and the article of faith says we don't) we
can use URNs, FPIs, or any other resolving mechanisms in tagnames.

<FOOTNOTE>
A worry that I have with XML is that it won't become liberating except for
very generic textual applications (where HTML does quite a good job
anyway). This leads to orthogonal communities using independent (and
doubtless clashing) tagsets. E.g. the local store uses RETAIL.DTD, the
bookie uses GAMBLE.DTD, the disco uses MUSIC.DTD and these never intersect.
It comes very close to requiring separate plugins for each of these
applications, and XML is an unexciting part of the transport mechanism.
</FOOTNOTE>

So - are there some brave souls who believe that a significant number of
disciplines can interoperate? My natural grouping would be the major
learned disciplines (though not necessarily with the learned societies in
the vanguard :-( ). Math, chemistry, engineering, medicine, biology, etc.
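
[A minimal Python sketch of the hierarchical tag names proposed above,
assuming a dotted PREFIX.TAG convention. The function names (qualify,
split_qualified, find_clashes) and the tiny tagsets are hypothetical and
chosen only for illustration; they are not part of CML, TecML or any
registry. The sketch shows that bare names such as VAR and SYMMETRY clash
across tagsets, while the qualified names do not.]

```python
from __future__ import annotations

# Hypothetical sketch: qualify tag names with a registered prefix so that
# tags from independent DTDs cannot collide. None of this is CML or JUMBO code.


def qualify(namespace: str, tag: str) -> str:
    """Return a hierarchical tag name such as 'CML.SYMMETRY'."""
    return f"{namespace.upper()}.{tag.upper()}"


def split_qualified(name: str) -> tuple[str, str]:
    """Split 'CML.SYMMETRY' into ('CML', 'SYMMETRY') at the last dot."""
    namespace, _, tag = name.rpartition(".")
    if not namespace or not tag:
        raise ValueError(f"not a qualified tag name: {name!r}")
    return namespace, tag


def find_clashes(tagsets: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each bare tag used by several tagsets to the sorted tagset names."""
    users: dict[str, list[str]] = {}
    for namespace, tags in tagsets.items():
        for tag in tags:
            users.setdefault(tag.upper(), []).append(namespace)
    return {tag: sorted(ns) for tag, ns in users.items() if len(ns) > 1}


# Illustrative tagsets only; the real DTDs define many more element types.
# "XML" here stands for the author's own earlier tagset, not the W3C language.
TAGSETS = {
    "CML": {"MOL", "SYMMETRY"},
    "W3MATH": {"SYMMETRY"},
    "HTML": {"VAR", "P", "TITLE"},
    "XML": {"VAR"},
}

# Bare names that more than one tagset wants to use.
clashes = find_clashes(TAGSETS)

# The same tags, qualified by prefix: every name is now distinct.
qualified = sorted(qualify(ns, tag) for ns, tags in TAGSETS.items() for tag in tags)
```

[Here clashes lists the bare names shared by more than one tagset, and
every entry in qualified is distinct. Deciding who owns each prefix is left
out entirely; that is the people problem rather than the technical one.]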

As a result of my analysis I have created a meta-like approach to
describing scientific information, which uses only about 6 tags (and this
could be reduced to about 2 if critical).
    - a generic container (XLIST). Content model ANY
    - a generic variable (XVAR). Content model #PCDATA
    - a generic n-dimensional array. Content model #PCDATA (essentially
      shorthand for homogeneous XVAR arrays).
and a few support containers (really to help humans, rather than because
they are needed):
    - BIB - a collection of XVARs relating to a citation
    - TERMENTRY - a glossary entry using XVARs from ISO 12620
    - PERSON - a person
Foreign objects (e.g. images) are handled by XNOTATION.

To have any value these objects must have semantics resolved. (*DTDs are
no use at resolving semantics and give no help as to how to do it*.) For my
purposes, therefore, DTDs are largely irrelevant. The basic approach is to
point these objects at *glossaries*, and this is now implemented in JUMBO.
For example, instead of using <GRADE> it would be attractive to refer to
the precise terminology for Grade from an authoritative body. Thus the IEEE
is in the early stages of developing a set of standards for distance
education, so this concept would be ideally represented by something like:

<XVAR UNIT="percent" CONVENTION="IEEE.P1484" DICTNAME="GRADE">67</XVAR>

JUMBO can retrieve the glossary contents (e.g. for UNITS) and apply them.
In this way we have a potentially universal system for XML, which is
scalable and robust, and which is maintained independently and distributed
by people who care. [We are developing this concept in the Virtual
HyperGlossary (which is fundamental to JUMBO)
(http://www.venus.co.uk/vhg).]

The only technical problem is glossary resolution, and that seems to me to
be identical with the problem of stylesheets and CATALOG/PUBLIC/SYSTEM,
etc. View a glossary as a stylesheet - a set of terms with associated
definitions and methods - and the problem seems eminently tractable.
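
[A minimal Python sketch of the glossary lookup described above; it is not
JUMBO's actual Java code. The GLOSSARIES and UNITS tables, the wording of
their entries and the resolve_xvar function are all hypothetical, and a
real system would retrieve each glossary from wherever its CONVENTION
resolves to rather than from an in-memory table.]

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# Hypothetical in-memory glossaries keyed by CONVENTION, then by DICTNAME.
# The definition text below is invented for illustration only.
GLOSSARIES = {
    "IEEE.P1484": {
        "GRADE": {
            "definition": "Mark awarded for a piece of assessed work",
            "type": float,
        },
    },
}

# Hypothetical units glossary: maps a UNIT name to a scale factor.
UNITS = {"percent": 0.01}


def resolve_xvar(xml_text: str) -> dict:
    """Parse one XVAR element and attach the semantics found in the glossaries."""
    elem = ET.fromstring(xml_text)
    if elem.tag != "XVAR":
        raise ValueError(f"expected XVAR, got {elem.tag}")
    convention = elem.get("CONVENTION")
    dictname = elem.get("DICTNAME")
    unit = elem.get("UNIT")
    try:
        entry = GLOSSARIES[convention][dictname]
    except KeyError:
        raise LookupError(f"no glossary entry for {convention}/{dictname}") from None
    # The glossary entry, not the DTD, decides how the content is interpreted.
    value = entry["type"]((elem.text or "").strip())
    return {
        "term": dictname,
        "definition": entry["definition"],
        "value": value,
        "unit": unit,
        "fraction": value * UNITS[unit] if unit in UNITS else None,
    }


grade = resolve_xvar(
    '<XVAR UNIT="percent" CONVENTION="IEEE.P1484" DICTNAME="GRADE">67</XVAR>'
)
```

[The element itself carries only pointers (CONVENTION, DICTNAME, UNIT); the
type, definition and unit handling all come from the glossary side.]
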
IMO it's one of the simplest ways to get the XML-friendly community
working together.

>
> I don't think that we can really expect that to happen often in the XML
> world. People will just remove the DOCTYPE line and depend on
> well-formedness. But having removed the DOCTYPE line, they have now
> taken all responsibility for the semantics of that document upon
> themselves. It can no longer be validated. The user agent cannot use any

IMO validation by DTD is almost irrelevant for my problems - what can it
add? I'm busily removing all my attribute namegroups - I would be arrogant
if I thought I'd got those right for all time - and replacing them by
CDATA. This means that new attribute values are added to the postprocessor
and editor (though not directly to the DTD). **This means that validation
is done by the postprocessor** - in my case the Java class for each
element. For example, a molecule is invalid if it has a different number of
atomic x-coordinates from y-coordinates, and it's impossible for the DTD to
specify that. So complicated Elements have 1500 lines of Java code, of
which quite a lot is 'validation'.

> "hard coded" knowledge of the semantics (with no doctype it doesn't know

If there is a resolution mechanism as above then the semantics don't have
to be hardcoded and can be added automagically by downloading Java classes.

> what namespace the gis are from). "Alternate" stylesheets (e.g. text to
> speech) are no longer useful (same problem). Search engines cannot
> depend on the meta-data to be accurate. In short, I've hobbled the
> interoperability of the document. We are all hurt by "tag soup", but
^^^^^^^^^
We are only hurt if we don't approach it collaboratively. IMO information
components will become more important than DTDs and if we can stop them
clashing, then I think XML has an enormous future for DTD-less working.

> perhaps the visually impaired are the most hurt by it.
> I'm sure they
> don't want to surf the web by continually reconfiguring their browsers
> from scratch because authors have not properly declared their
> namespaces.
>
> In my mind this is where the rubber meets the road. The rubber is
> generic markup, where the needs of the data are paramount: add an
> element if you need it. The road is the Internet where interoperability
> is paramount. It seems to me that in XML it is too hard to balance these
> factors.
>
> I don't think that we can make it easy to combine DTDs without changing
> SGML. But maybe we can figure out a way to declare a namespace for
> elements: to "import" element names in a standard way. You wouldn't be
> able to validate the document but at least it would be clear what the
> elements MEAN.

I think we are in complete agreement here - and it's an exciting
possibility.

    P.

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/
Received on Saturday, 3 May 1997 16:27:06 UTC