XML for beginners

Jon Bosak asked if I would like to contribute to this list as I have a 
particular interest in the success of XML. (I'm a scientist rather than a 
computer scientist so please forgive inaccurate terminology or reinventing
things that were known 20 years ago :-).

I believe that XML is precisely what is needed for a wide range of 
applications that cannot be served by HTML and are not appropriate for
complex SGML tools.  The WG has asked the question "What should XML
be aiming to do?" and my particular requirement is to carry precise data 
in precise data structures.  (By 'precise' I mean that XML provides
tools whereby the sender and the recipient can agree what the data mean and
how they are organised.) HTML cannot do this.

My own interest has been in transporting molecules and scientific data, 
but I believe that there are many other domains which have essentially the
same meta-requirements as my community.  (I have a suggestion of one or
two 'killer apps' which I can offer at a later stage.)  I have been using
SGML for about 2 years and whilst I am sure it is technically the way
forward, I have some suggestions about the way XML should be presented.

Outside of the document processing community, SGML is unknown or 
misunderstood. If 10^6 scientists are using HTML, probably fewer than 1000
understand what a DTD is.  Therefore although there is immense potential,
the number who could understand and implement the XML-spec _formally_
is very small.  The correct implementation of links is of fundamental
importance to us, but the distillation of the ERB will need to be presented
in a way that lets relative non-specialists get started gently.

[I think it is extremely important that XML is available to the 'average
hacker'.  One of the great strengths of HTML was that it was easy to
learn and was forgiving.  Once someone got started, they could then go on to
create hairy things in CGI, client-side helpers, etc.  For example, MIME
has been extremely valuable in chemistry, allowing very complex operations
for which few people understand the full basis.  There has been an 
implication that XML is only going to be useful if major browser m'facturers
implement it.  I certainly hope that it will be possible to develop many
applications without having to rely on this.  We found that our subject was
not important enough for the browser m'facturers to be interested in.]

I believe that whatever the theoretical basis of XML it should be possible
for people to start creating documents without having to understand it
all (just as for HTML).  Documents can be created with very simple 
syntactic constructs ("All start-tags must have a balanced end-tag and all 
attributes must have balanced quotes" covers most of it).   People are not
frightened of HTML by now, and I would hope that there can be a simple 
learning process for XML starting from HTML.  (I understand and agree that 
XML is SGML-, not HTML++ :-), but it may sometimes help if it is introduced
like this).
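
As a rough illustration of the two rules quoted above (balanced end-tags,
balanced quotes), here is a minimal sketch of a well-formedness check using
the expat parser from the Python standard library.  The function name
check_well_formed is invented for this sketch, and the empty MOL element
is a cut-down stand-in, not a real CML document.

```python
import xml.parsers.expat


def check_well_formed(text):
    """Return (True, None) if text parses as XML, else (False, reason)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as err:
        return False, str(err)
    return True, None


# Illustrative inputs only (assumption: an empty MOL element).
good = '<MOL TITLE="Water"></MOL>'
no_end_tag = '<MOL TITLE="Water">'        # start-tag never closed
bad_quotes = '<MOL TITLE="Water></MOL>'   # attribute quote never closed

for sample in (good, no_end_tag, bad_quotes):
    ok, reason = check_well_formed(sample)
    print(ok, reason)
```

A hand-authored document that passes this check is not necessarily valid
against any DTD; it only obeys the simple syntactic rules.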

Although for many XML applications there will be automatic creation of
documents I hope that they can also be manually authored if required (with
WF-checking of course).  Here is a simple one:
<MOL TITLE="Water">
This carries precise information, and gives precise pointers to the 
definition of the information (e.g. that the IUPAC defintion of Boiling
Point was being used and that the units were in &deg;K rather than &deg;C.
It's clearly human readable.  I also believe that it's possible to write
a spec for authoring such documents where the SGML terminology is minimal.
(BTW The DTD for all examples is Chemical Markup Language 
(CML: http://www.venus.co.uk/omf/cml/) - don't worry, you don't need to
understand any chemistry!)

In developing CML I came to several conclusions which I hope may be useful.

(a) I could not foresee what people might wish to do with CML, so it had
to be a forgiving meta-language, capable of precision if required and 
flexibility if not.

(b) It is far easier to create a language than implement it.  I therefore
decided that I had to always have technology capable of prototyping my
ideas and seeing where they led.  

(c) I developed my own scheme for links and addressing, which I'll outline
below.  Personally I find coding for more than one level of indirection
very hard work.  My recommendation is that you should not come up with
a spec which is so complex that only gurus can write implementations.

(d) Everything in the DTD requires *someone* to write code in the 
postprocessor/application/renderer.  If you are solely concerned with 
rendering text there may be generic mechanisms (e.g. stylesheets).  However
many of my specialist ELEMENTs  have 1000+ lines of Java.  In general if
you hardcode an attribute with a list of values, then someone has to 
write code to implement it.  (For that reason I'm not very excited
about the hardcoded relations that are being suggested).

(e) CML is bound to evolve, and I think this may be true for
many XML applications - certainly if they want the vibrant evolution of
HTML.  CML/XML has the advantage that an application can be  prototyped
easily with relatively little concern about semantic checking
(e.g. in my example above if CONVENTION is omitted, it defaults to
DONTCARE so that people can create uncontrolled vocabularies and see how
they work out).  I would expect that when an application is useful and
needs to be standardised then a new XML DTD might be created, but
probably using fragments of the CML DTD.

(f) Much of the semantics is added through links.  Thus in the above
example the semantics of "BoilingPoint" are obtained through a controlled
vocabulary.  These vocabularies/glossaries can also supply methods (so
that in CML it is allowed to convert between units defined in a glossary).
Because links are so critical to CML I have provided considerable 

(g) Most CML 'documents' are likely to be hyperdocuments, possibly involving
non-SGML components.  I have therefore had to create a typed link tool
indicating the type of the link target.  For example there is an
attribute MIME which specifies the MIME type of the target.

(h) Data structures are extremely important.  CML needs to transport
(at least) lists, arrays, structs and tables.  I use a generic container
XLIST with appropriate attributes to control its behaviour.  I also
have a need for inheritance rather than containment (e.g. for biological
taxonomy) and whilst this must use the hierarchical containment syntax
it would be useful to have attributes that implied the ISA mechanism
was operating.

(i) Addressing within the document is important and may descend below the
granularity of ELEMENTs.  CML can address every (chemical) atom in a 
document even though these are normally not individually tagged.

(j) Document fragments are also important.  It is likely that complex
CML documents will be mounted on servers and people/programs will wish to
access part of the information (e.g. just the molecule, or just the 
authors' e-mail addresses).

(k) Addresses and relations are sometimes meta-data, but are also sometimes
real data.  For example an author may say "These 3 atoms [...addresses ...]
react with those 4 [... more addresses ...]".  This is data, and I
provide for XVAR having a content of type ADDRESS.

(l) I came to the conclusion that I needed an ELEMENT <RELATION> which
could contain 1:1, 1:n and n:m relations.  This is not _fully_ worked out
or tested.

(m) I started using SGML ID/IDREF as the mechanism for links and relations.
However I so frequently created imperfect documents that I decided to 
borrow the NAME and HREF attributes from HTML.  These are available to
all my ELEMENTs, though in practice I use only NAME, and put URLs and
internal links in content (this allows it to be typed).
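
Points (k) and (m) can be sketched roughly as follows: internal links are
carried as content of XVAR TYPE="ADDRESS" and resolved against NAME
attributes, and a dangling link is reported rather than making the whole
document unusable.  This is only a sketch under assumptions: the document
below is invented, the form of an address (a bare NAME value) is a guess,
and resolve_addresses is a hypothetical helper, not part of CML.

```python
import xml.etree.ElementTree as ET

# Invented document: only NAME, XVAR and TYPE="ADDRESS" come from the
# text above; everything else is made up for the sketch.
DOC = """
<XLIST>
  <MOL NAME="MOL1" TITLE="Water"></MOL>
  <MOL NAME="MOL2" TITLE="Ethanol"></MOL>
  <XVAR TYPE="ADDRESS">MOL1</XVAR>
  <XVAR TYPE="ADDRESS">MOL9</XVAR>
</XLIST>
"""


def resolve_addresses(root):
    """Map each ADDRESS to its target element, collecting dangling ones."""
    # Index every element that carries a NAME attribute.
    by_name = {el.get("NAME"): el for el in root.iter() if el.get("NAME")}
    resolved, dangling = {}, []
    for var in root.iter("XVAR"):
        if var.get("TYPE") != "ADDRESS":
            continue
        name = (var.text or "").strip()
        if name in by_name:
            resolved[name] = by_name[name]
        else:
            dangling.append(name)  # tolerated, but reported
    return resolved, dangling


resolved, dangling = resolve_addresses(ET.fromstring(DOC))
print(sorted(resolved), dangling)
```

Because the address is ordinary content, it can itself be typed and
treated as data, which is the point made in (k).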

Here are a few examples:
<XVAR TYPE="URL" MIME="text/plain">readme.txt</XVAR>
(This is a hyperlink to a separate document whose type is required
before we decide how it is to be processed).
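
A minimal sketch of how a program might use the MIME attribute to decide
on processing before fetching anything.  The handler table and the
function describe_link are assumptions made for illustration, and
chemical/x-pdb is used only as an example of a chemical MIME type.

```python
import xml.etree.ElementTree as ET


def describe_link(element):
    """Decide how a typed XVAR link would be handled, without fetching it."""
    if element.get("TYPE") != "URL":
        return "not a link"
    target = (element.text or "").strip()
    mime = element.get("MIME")
    if mime is None:
        return f"{target}: type unknown, would have to ask the server"
    handlers = {  # hypothetical handler table
        "text/plain": "show as text",
        "chemical/x-pdb": "pass to a molecule viewer",
    }
    # Unrecognised types fall back to a generic action.
    action = handlers.get(mime, "offer to save")
    return f"{target}: {action}"


link = ET.fromstring('<XVAR TYPE="URL" MIME="text/plain">readme.txt</XVAR>')
print(describe_link(link))  # prints "readme.txt: show as text"
```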

(Here MOL1-MOL4 are elements identified by NAME="MOL1" attributes; 
ARRAY is a generic array capable of holding several types of which 
ADDRESS is one).

<XLIST TITLE="Important citations">
<XVAR TYPE="URL">http://www.foo.bar</XVAR>
(This is a compilation of data items, not just links)
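
To show what "a compilation of data items" might look like to a program,
here is a small sketch that reads an XLIST into a list of (type, value)
pairs.  The closing tag and the second item are added for illustration,
TYPE="STRING" is a hypothetical value, and read_xlist is an invented
helper.

```python
import xml.etree.ElementTree as ET

# The closing tag and the second item are added for illustration;
# TYPE="STRING" is a hypothetical value, not taken from CML.
DOC = """
<XLIST TITLE="Important citations">
<XVAR TYPE="URL">http://www.foo.bar</XVAR>
<XVAR TYPE="STRING">A note that is data, not a link</XVAR>
</XLIST>
"""


def read_xlist(element):
    """Return the list title and its items as (type, value) pairs."""
    items = [(var.get("TYPE"), (var.text or "").strip())
             for var in element.findall("XVAR")]
    return element.get("TITLE"), items


title, items = read_xlist(ET.fromstring(DOC))
print(title)
for kind, value in items:
    print(kind, value)
```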

This was all before XML was announced.  (I may have reinvented some wheels 
but I couldn't find other suitable places to start - the TEI pointer 
wasn't quite what I was looking for and HyTime required a time and financial
investment that was beyond me.)  I would be delighted to modify or scrap
what I'd done if the WG can help with the following:

	- addressing of elements by attribute, context and possibly
		by content
	- (sub)addressing of content (e.g. by token or larger 'natural
		component', e.g. quoted string)
	- typing documents in URLs
	- an extensible means of typing relations
	- coordinating mechanisms for common data structures
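
To make the first two requests concrete, here is a sketch of addressing by
attribute, by context and by content, followed by sub-addressing of content
by token.  It uses the limited path syntax of Python's ElementTree purely
as an illustration (the content predicate needs Python 3.7 or later); it is
not a proposal for what the WG should adopt, and the document and
TYPE="STRING" are invented.

```python
import xml.etree.ElementTree as ET

# Invented document for illustration only.
DOC = """
<XLIST TITLE="Examples">
  <MOL NAME="MOL1" TITLE="Water"></MOL>
  <XLIST TITLE="Important citations">
    <XVAR TYPE="URL">http://www.foo.bar</XVAR>
    <XVAR TYPE="STRING">alpha beta gamma</XVAR>
  </XLIST>
</XLIST>
"""
root = ET.fromstring(DOC)

# By attribute: any MOL whose TITLE is "Water".
by_attribute = root.findall(".//MOL[@TITLE='Water']")

# By context: XVAR elements that are direct children of an XLIST
# which is itself a direct child of the root.
by_context = root.findall("./XLIST/XVAR")

# By content: XVAR elements whose complete text is a given string.
by_content = root.findall(".//XVAR[.='http://www.foo.bar']")

# Sub-addressing of content: pick the second whitespace-separated token.
tokens = by_context[1].text.split()
second_token = tokens[1]

print(len(by_attribute), len(by_context), len(by_content), second_token)
```

Splitting on whitespace is the crudest possible tokeniser; a 'natural
component' such as a quoted string would need a rule of its own.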


Peter Murray-Rust, (domestic net connection)
Virtual School of Molecular Sciences, Nottingham University, UK