- From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
- Date: Fri, 24 Jan 1997 21:58:19 GMT
- To: w3c-sgml-wg@www10.w3.org
Jon Bosak asked if I would like to contribute to this list as I have a particular interest in the success of XML. (I'm a scientist rather than a computer scientist so please forgive inaccurate terminology or reinventing things that were known 20 years ago :-). I believe that XML is precisely what is needed for a wide range of applications that cannot be served by HTML and are not appropriate for complex SGML tools.

The WG has asked the question "What should XML be aiming to do?" and my particular requirement is to carry precise data in precise data structures. (By 'precise' I mean that XML provides tools whereby the sender and the recipient can agree what the data mean and how they are organised.) HTML cannot do this. My own interest has been in transporting molecules and scientific data, but I believe that there are many other domains which have essentially the same meta-requirements as my community. (I have a suggestion of one or two 'killer apps' which I can offer at a later stage.)

I have been using SGML for about 2 years and whilst I am sure it is technically the way forward, I have some suggestions about the way XML should be presented. Outside of the document processing community, SGML is unknown or misunderstood. If 10^6 scientists are using HTML, probably fewer than 1000 understand what a DTD is. Therefore although there is immense potential, the number who could understand and implement the XML-spec _formally_ is very small. The correct implementation of links is of fundamental importance to us but the distillation of the ERB will need to be presented in a manner where relative non-specialists can get started in a gentle manner.

[I think it is extremely important that XML is available to the 'average hacker'. One of the great strengths of HTML was that it was easy to learn and was forgiving. Once someone got started, they could then go on to create hairy things in CGI, client-side helpers, etc.
For example, MIME has been extremely valuable in chemistry, allowing very complex operations for which few people understand the full basis. There has been an implication that XML is only going to be useful if major browser manufacturers implement it. I certainly hope that it will be possible to develop many applications without having to rely on this. We found that our subject was not important enough for the browser manufacturers to be interested in.]

I believe that whatever the theoretical basis of XML it should be possible for people to start creating documents without having to understand it all (just as for HTML). Documents can be created with very simple syntactic constructs ("All start-tags must have a balanced end-tag and all attributes must have balanced quotes" covers most of it). People are not frightened of HTML by now, and I would hope that there can be a simple learning process for XML starting from HTML. (I understand and agree that XML is SGML-, not HTML++ :-), but it may sometimes help if it is introduced like this).

Although for many XML applications there will be automatic creation of documents I hope that they can also be manually authored if required (with WF-checking of course). Here is a simple one:

    <MOL TITLE="Water">
      <XVAR UNITS="Kelvin" DICTNAME="BoilingPoint" CONVENTION="IUPAC">373.18</XVAR>
    </MOL>

This carries precise information, and gives precise pointers to the definition of the information (e.g. that the IUPAC definition of Boiling Point was being used and that the units were in kelvin rather than °C). It's clearly human readable. I also believe that it's possible to write a spec for authoring such documents where the SGML terminology is minimal. (BTW The DTD for all examples is Chemical Markup Language (CML: http://www.venus.co.uk/omf/cml/) - don't worry, you don't need to understand any chemistry!)

In developing CML I came to several conclusions which I hope may be useful.
(a) I could not foresee what people might wish to do with CML, so it had to be a forgiving meta-language, capable of precision if required and flexibility if not.

(b) It is far easier to create a language than implement it. I therefore decided that I had to always have technology capable of prototyping my ideas and seeing where they led.

(c) I developed my own scheme for links and addressing, which I'll outline below. Personally I find coding for more than one level of indirection very hard work. My recommendation is that you should not come up with a spec which is so complex that only gurus can write implementations.

(d) Everything in the DTD requires *someone* to write code in the postprocessor/application/renderer. If you are solely concerned with rendering text there may be generic mechanisms (e.g. stylesheets). However many of my specialist ELEMENTs have 1000+ lines of Java. In general if you hardcode an attribute with a list of values, then someone has to write code to implement it. (For that reason I'm not very excited about the hardcoded relations that are being suggested).

(e) CML is bound to evolve, and I think this may be true for many XML applications - certainly if they want the vibrant evolution of HTML. CML/XML has the advantage that an application can be prototyped easily with relatively little concern about semantic checking (e.g. in my example above if CONVENTION is omitted, it defaults to DONTCARE so that people can create uncontrolled vocabularies and see how they work out). I would expect that when an application is useful and needs to be standardised then a new XML DTD might be created, but probably using fragments of the CML DTD.

(f) Much of the semantics is added through links. Thus in the above example the semantics of "BoilingPoint" are obtained through a controlled vocabulary. These vocabularies/glossaries can also supply methods (so that in CML it is allowed to convert between units defined in a glossary).
Because links are so critical to CML I have provided considerable flexibility.

(g) Most CML 'documents' are likely to be hyperdocuments, possibly involving non-SGML components. I have therefore had to create a typed link tool indicating the type of the link target. For example there is an attribute MIME which specifies the MIME type of the target.

(h) Data structures are extremely important. CML needs to transport (at least) lists, arrays, structs and tables. I use a generic container XLIST with appropriate attributes to control its behaviour. I also have a need for inheritance rather than containment (e.g. for biological taxonomy) and whilst this must use the hierarchical containment syntax it would be useful to have attributes that implied the ISA mechanism was operating.

(i) Addressing within the document is important and may descend below the granularity of ELEMENTs. CML can address every (chemical) atom in a document even though these are normally not individually tagged.

(j) Document fragments are also important. It is likely that complex CML documents will be mounted on servers and people/programs will wish to access part of the information (e.g. just the molecule, or just the authors' e-mail addresses).

(k) Addresses and relations are sometimes meta-data, but are also sometimes real data. For example an author may say "These 3 atoms [...addresses ...] react with those 4 [... more addresses ...]". This is data, and I provide for XVAR having a content of type ADDRESS.

(l) I came to the conclusion that I needed an ELEMENT <RELATION> which could contain 1:1, 1:n and n:m relations. This is not _fully_ worked out or tested.

(m) I started using SGML ID/IDREF as the mechanism for links and relations. However I so frequently created imperfect documents that I decided to borrow the NAME and HREF attributes from HTML. These are available to all my ELEMENTs, though in practice I use only NAME, and put URLs and internal links in content (this allows it to be typed).
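Points (e) and (f) can be made concrete with a short sketch in Python. It is only an illustration under stated assumptions, not how CML itself is implemented: the second XVAR (MeltingPoint) and the GLOSSARY table are hypothetical, and the DONTCARE default is applied in code because Python's ElementTree parser does not apply attribute defaults declared in an external DTD.

```python
import xml.etree.ElementTree as ET

# The "Water" fragment from this message, plus a hypothetical second XVAR
# that omits CONVENTION so that the default described in (e) is exercised.
DOC = """
<MOL TITLE="Water">
  <XVAR UNITS="Kelvin" DICTNAME="BoilingPoint" CONVENTION="IUPAC">373.18</XVAR>
  <XVAR UNITS="Kelvin" DICTNAME="MeltingPoint">273.15</XVAR>
</MOL>
"""

# Hypothetical glossary: a (from_units, to_units) pair maps to a conversion
# method, standing in for the glossary-supplied methods mentioned in (f).
GLOSSARY = {
    ("Kelvin", "Celsius"): lambda k: k - 273.15,
    ("Celsius", "Kelvin"): lambda c: c + 273.15,
}


def read_xvars(xml_text):
    """Return one dict per XVAR element, defaulting CONVENTION to DONTCARE."""
    root = ET.fromstring(xml_text)
    records = []
    for xvar in root.iter("XVAR"):
        records.append({
            "name": xvar.get("DICTNAME"),
            "units": xvar.get("UNITS"),
            # Default applied here in code (assumption: in CML proper the
            # default would come from the DTD).
            "convention": xvar.get("CONVENTION", "DONTCARE"),
            "value": float(xvar.text),
        })
    return records


def convert(record, to_units):
    """Return a copy of record expressed in to_units, using the glossary."""
    if record["units"] == to_units:
        return dict(record)
    method = GLOSSARY[(record["units"], to_units)]  # KeyError if unknown
    converted = dict(record)
    converted["units"] = to_units
    converted["value"] = method(record["value"])
    return converted


if __name__ == "__main__":
    for rec in read_xvars(DOC):
        print(rec, "->", convert(rec, "Celsius"))
```

An unknown pair of units raises KeyError rather than guessing, which matches the idea that precision is available when required while uncontrolled vocabularies still parse.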
Here are a few examples:

    <XVAR TYPE="URL" MIME="text/plain">readme.txt</XVAR>

(This is a hyperlink to a separate document whose type is required before we decide how it is to be processed).

    <RELATION TITLE="Reaction">
      <ARRAY TYPE="ADDRESS" DICTNAME="Reactant">MOL1:3 MOL2:4</ARRAY>
      <ARRAY TYPE="ADDRESS" DICTNAME="Product">MOL3:7 MOL5:2</ARRAY>
    </RELATION>

(Here MOL1, MOL2, etc. are elements identified by NAME attributes such as NAME="MOL1"; ARRAY is a generic array capable of holding several types of which ADDRESS is one).

    <XLIST TITLE="Important citations">
      <XVAR TYPE="ADDRESS">CITATION1</XVAR>
      <XVAR TYPE="URL">http://www.foo.bar</XVAR>
    </XLIST>

(This is a compilation of data items, not just links)

This was all before XML was announced. (I may have reinvented some wheels but I couldn't find other suitable places to start - the TEI pointer wasn't quite what I was looking for and HyTime required a time and financial investment that was beyond me.) I would be delighted to modify or scrap what I'd done if the WG can help with the following:

- addressing of elements by attribute, context and possibly by content
- (sub)addressing of content (e.g. by token or by a larger 'natural component' such as a quoted string)
- typing documents in URLs
- an extensible means of typing relations
- coordinating mechanisms for common data structures

Peter

--
Peter Murray-Rust, (domestic net connection)
Virtual School of Molecular Sciences, Nottingham University, UK
http://www.ccc.nottingham.ac.uk/~pazpmr/
Received on Friday, 24 January 1997 18:54:25 UTC