Re: Entity and Element Addressing from Peter Murray-Rust on 1997-01-30 (w3c-sgml-wg@w3.org from January 1997)

From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
Date: Thu, 30 Jan 1997 16:51:26 GMT
To: w3c-sgml-wg@www10.w3.org
Message-Id: <3006@ursus.demon.co.uk>

I hope this isn't going over solved ground, but I'd like to check up
about entities and the interpretation of 4.3.  The spec uses the word
'include' which I take to mean 'copy the entire contents of the
external entity' (normally a file) 'into the space vacated after removing
the &...; string'.

The spec emphasises the use of modularity in authoring (which I strongly
support).  Therefore a simple example would be:
<!DOCTYPE CML SYSTEM "cml.dtd" [
<!ENTITY bib1 SYSTEM "bib1.cml">
<!ENTITY bib2 SYSTEM "bib2.cml">
<!ENTITY mol1 SYSTEM "mol1.cml">
<!ENTITY mol2 SYSTEM "mol2.cml">
]>

<CML>
<XLIST TITLE="bibliography">
&bib1;
&bib2;
</XLIST>
<XLIST TITLE="molecules">
&mol1;
&mol2;
</XLIST>
</CML>

Assume that the files have structures like:
<BIB>
...
</BIB>
and
<MOL>
...
</MOL>

and are 'valid', then the whole document is a valid CML document.
The subsidiary files are not valid CML (they have no DOCTYPE) but they
are well-formed (WF) and in all other respects valid.  They are therefore
valuable reusable components (but see below).
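
A minimal sketch of this reading of 'include' (plain textual substitution of
the entity's contents for the &...; string) might look like the following
Python.  The dictionary standing in for the external files, the regular
expression for names, and the function name expand_entities are all
assumptions made for illustration; none of them comes from the spec.

```python
import re

# Hypothetical stand-in for the external files named by the SYSTEM identifiers.
ENTITIES = {
    "bib1": "<BIB>...</BIB>",
    "mol1": "<MOL>...</MOL>",
}

# Simplified pattern for an entity reference; real name rules are richer.
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.-]*);")


def expand_entities(text, entities):
    """Replace each &name; with the entity's full contents.

    References to names that are not in `entities` are left untouched.
    """
    def substitute(match):
        name = match.group(1)
        return entities.get(name, match.group(0))

    return ENTITY_REF.sub(substitute, text)


doc = '<XLIST TITLE="molecules">&mol1;</XLIST>'
expanded = expand_entities(doc, ENTITIES)
print(expanded)
```

Under this reading the including document never sees a file boundary, only
the copied text.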

This seems to be the intention of the draft.  However it is also possible
to create a valid document as (say)
...
<!ENTITY molfrag1 SYSTEM "molfrag1.txt">
]>
<CML>
<MOL>
&molfrag1;
</CML>

where molfrag1.txt contains something like:
<ATOMS> <!--* valid atom content here *--> </ATOMS>
</MOL>

i.e. the starttag is in one file and the endtag in another.  Whilst this
is horrible, it is the sort of thing that a mindless text processor might
do when sending chunks to a mailer with size restrictions.

It would also be possible to have both the start and the endtags in the
main document.  I am not an expert on NOTATION but it seems that this
is required if including a foreign file, e.g.
<FIGURE NOTATION="gif">
&mygif;
</FIGURE>

It therefore becomes difficult to say whether a document is or is not
WF without looking at the entities.
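
The molfrag1 case can be illustrated with a small Python sketch.  It assumes
that 'include' means textual substitution and uses Python's standard
ElementTree parser as a stand-in well-formedness check; the helper name
is_well_formed is hypothetical, and the strings are cut-down versions of the
example above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as one well-formed element tree."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


main_doc = "<CML><MOL>&molfrag1;</CML>"
fragment = "<ATOMS><!-- atom content --></ATOMS></MOL>"

# The literal markup of the main document, ignoring the entity reference.
main_alone = main_doc.replace("&molfrag1;", "")
# The document after the entity's contents have been copied in.
combined = main_doc.replace("&molfrag1;", fragment)

print("main alone:", is_well_formed(main_alone))
print("fragment alone:", is_well_formed(fragment))
print("combined:", is_well_formed(combined))
```

Neither piece balances on its own, yet the combination does, so the verdict
depends on the entity's contents.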

My motivation is that such document fragments may be useful both as
entities and as link-ends - 'the things the pointy bits point to' (if I 
have that correct).  In other words I might also wish to write something 
like:
<XLIST TITLE="Molecules">
<-XML-LINK HREF="mol1.cml"></-XML-LINK>
<-XML-LINK HREF="mol2.cml"></-XML-LINK>
</XLIST>
to reference the molecules.

However the semantics are different.  The second assumes that the application
will find something with a well defined structure of _some_ sort in the
files.  I'm still not clear how it knows precisely what that structure _is_.
If mol1.cml is a complete valid CML file (i.e. has a DOCTYPE statement and an
accessible DTD) then I know, otherwise I have to guess.  I _hate_ using file
suffixes, and my preference would be to include a MIME type somewhere in the
LINK.  The nearest I can get is the HRTYPE attribute - but this isn't clear in
the spec.  If it's allowed, I'd suggest:
<-XML-LINK HREF="mol1.cml" HRTYPE="application/x-cml"></-XML-LINK>

However, if this is allowed then the use of entities fails (since the
files all must have DOCTYPEs in them).  So I really want to be able to omit
the DOCTYPEs and use HRTYPE (or some other tool) to tell the application
'what is at the end of the pointy bit is a WF CML file and its nature
is determined by its first element.  Just assume there is a
DOCTYPE at the front'.
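
As a rough sketch of what 'assume there is a DOCTYPE at the front' could mean
for an application, the Python below reads the first element of a
DOCTYPE-less file and synthesises a declaration from it.  The mapping from
the HRTYPE value to a DTD, the choice of "cml.dtd" for fragments, and the
function name with_assumed_doctype are assumptions for illustration only;
the spec does not define any of this.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from the MIME type given in HRTYPE to a DTD.
DTD_FOR_TYPE = {"application/x-cml": "cml.dtd"}


def with_assumed_doctype(text, hrtype, dtd_for_type=DTD_FOR_TYPE):
    """Prefix a WF, DOCTYPE-less file with a DOCTYPE named after its first element.

    Raises ET.ParseError if `text` is not well-formed and KeyError if the
    type is unknown.
    """
    root_name = ET.fromstring(text).tag      # the file's first element
    system_id = dtd_for_type[hrtype]         # DTD chosen from the MIME type
    return f'<!DOCTYPE {root_name} SYSTEM "{system_id}">\n{text}'


mol1 = "<MOL>...</MOL>"
print(with_assumed_doctype(mol1, "application/x-cml"))
```

The same DOCTYPE-less file could then still be pulled in as an entity, since
the declaration is added by the application and never stored in the file.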

Of course I have to make sure that the context into which the file is
imported is sensible, but that's my problem!

 	P.

[BTW I am not a supporter of punctuation in NAMEs if it can be avoided.
For example, I create Java classes for most of my Elements directly
from their names.  -XML-LINK.java is illegal, and probably has to be
contracted to XMLLINK.java.  The (obvious?) underscore character doesn't
seem to be used in XML names (?)]
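
The contraction mentioned in the aside can be sketched in Python as below.
The function name java_class_name is hypothetical, and the rule (drop every
character that is not an ASCII letter, digit, underscore or dollar sign) is
only one possible convention; it ignores Java reserved words and the chance
that two element names collapse to the same class name.

```python
import re


def java_class_name(element_name):
    """Derive a legal Java identifier from an element name by dropping punctuation."""
    name = re.sub(r"[^A-Za-z0-9_$]", "", element_name)
    # A Java identifier cannot be empty or start with a digit.
    if not name or name[0].isdigit():
        name = "_" + name
    return name


print(java_class_name("-XML-LINK"))
```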

Peter Murray-Rust, (domestic net connection)
Virtual School of Molecular Sciences, Nottingham University, UK
http://www.ccc.nottingham.ac.uk/~pazpmr/
Received on Thursday, 30 January 1997 12:41:17 UTC