SD5 - Namespaces (Implementation questions) from Peter Murray-Rust on 1997-05-18 (w3c-sgml-wg@w3.org from May 1997)

From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
Date: Sun, 18 May 1997 10:03:46 GMT
To: w3c-sgml-wg@w3.org
Message-Id: <6780@ursus.demon.co.uk>
The publication of the MathML DTD paper has catalysed my thinking on 
namespaces.  As an implementer I'd like to ask some factual questions and 
also to present what I see as *possible* (not necessarily desirable)
solutions.  It may then be possible to see in future discussion whether
we already have solutions, whether we have potential solutions but no
software, or whether we have to invent new syntax and write software for
it.  I regard namespaces as *the* critical problem of the 5 questions - the
others can all be managed in some way at present; namespaces can't.  If this
problem isn't addressed quickly, the next six months will see de facto
namespaces in XML that will take years to recover from, if at all.

This (long) posting builds on earlier postings on this thread and proposes
a simple, trivially implementable, solution to the problem, at least for 
the interim.

<VECTOR XML-TYPE="AXIOM">
Any solution must be SGML-compatible <SEP/> 
Any solution must work BOTH for documents with and without DOCTYPEs or
validation <SEP/> 
Any solution must be reasonably cheap to implement
</VECTOR>

<Q>Can SGML, in all its glory, solve this problem *in practice* at present?</Q>
I really have no idea of the answer to this.  My understanding is that there
are potentially two solutions:

<SOLUTION>
SUBDOC.  From my reading this allows one to switch DTDs in mid-processing.
I understand that not all parsers support it.  Having been weaned on sgmls
I find it supported there, but no explicit documentation on how to use it.
I also have no idea what the output is.
</SOLUTION>

<SOLUTION>
AFs and/or HyTime.  I am repeatedly assured by the evangelists - and I believe
them - that AFs and HyTime will solve all our problems.  This is clearly a 
millenium solution for SGML, but not for July 1, 1997.  Currently it ('it' 
refers to the universal solution) has the following drawbacks.
(a) It's not easy to understand, and I don't. [Several people have both 
talked to me and posted explanations for which many thanks, but I'm a slow 
learner.]
(b) There is no simple introdcution with examples.
(c) There is no inexpensive working system that I can play around with.
(d) I am frightened that the *output* of a HyTime engine (or whatever) is
sufficiently complex that the post-parsing/application engines must take on a
further level of sophistication.
</SOLUTION>

<API>
I am concerned that we are still no further forward with an 'API' for parser
'output'.  The three options seem to be: ESIS, ElementTree and groves.  I
assume that some of the proposed solutions may require other APIs and
be incompatible with the ESIS approach, and possibly the ElementTree.  (I
take on faith that you can do anything with groves).  If I'm right, then 
we need to know this up front, because otherwise we are beavering away
writing software that is doomed.
</API>

<Q>
If SUBDOC or AFs are required as the solution, how much work in MCSG-months
(1997 revised level) is required?  
</Q>

<A XML-TYPE="hypothesis">
A lot.  And then some more.
</A>

If SUBDOC/AFs are going to be an essential requirement for managing XML
namespaces we'd better hear the bad news now and not some months down the
track.  We don't have to *do* anything in V1.0 except:
	- tell people to be patient
	- stop them developing horrible kludges
	- work out how much effort it's going to be
	- try to create a timescale
	
Here are some other approaches which seem to be possible.  None are without
problems.

<PROBLEM>
An earlier posting suggested that the problem was primarily one of semantics. 
It's not - it involves syntax as well.  Consider (from MathML):
<![CDATA[
<!Element PRODUCT (%MathExp;)*>
]]>

and (hypothetically) from CML

<![CDATA[
<!Element PRODUCT (MOL)*>
]]>

The content models and ATTLISTs (not shown) of these are quite different and 
so any document combining both of these will be invalid.  That's the 
unavoidable syntactic problem *if validation is required*.
</PROBLEM>

<SOLUTION>
Accept that there will be WF documents that cannot be validated however
well-intentioned the authors are.  I do not like this (although I'll
probably practise it, especially when evangelising XML).  However of all
the solutions, it's one that works today.
</SOLUTION>

<SOLUTION XML-TYPE="recommended">
Require that all documents which include DTD fragments use FullyQualifiedNames.
As an example, 'Java in a Nutshell'  (p.19) asserts that all O'Reilly Java
hackers use the form:
	com.ora.writers.david.widgets.Barchart.display().
for their Java classes.  (BTW - ORA denizens - does it work OK?)
This is effectively a URL and therefore unique in the world.  In XML terms
this would mean that the MathML PRODUCT would become:
	org.w3.mathml.PRODUCT
To me this is very attractive because:
	- it requires no new technology
	- is guaranteed to work
	- is automatically linked to semantics and other behaviours
	- gives newcomers to SGML the idea they need to be thoughtful
		about this.
Its only drawbacks are:
	- it's long-winded (but we don't care about that, do we?).
	- it requires an agreed naming convention within XML (i.e. that
		any ElementType starting with (com, edu, country-name, org,
		etc.) and followed by '.' is taken as a fully qualified name
	- until URNs come along some FQNs may change as people move (but
		it can't be worse than *not* doing this).
In summary, even if we merely *encourage* people to use this we are solving
99% of the potential problems.  None of you will *complain* if I write:
	<uk.ac.nottingham.cml.MOL>
I suspect that even if we adopt other solutions there is still a role for
FQNs undet the surface.  The parser would output an ESIS stream that looked 
something like:
(uk.ac.nottingham.cml.CML
(uk.ac.nottingham.cml.PRODUCT
(uk.ac.nottingham.cml.MOL
(uk.ac.nottingham.cml.ATOMS
...
)uk.ac.nottingham.cml.ATOMS
)uk.ac.nottingham.cml.MOL
)uk.ac.nottingham.cml.PRODUCT
(org.w3.mathml.PRODUCT
(org.w3.mathml.EXPR
...
)org.w3.mathml.EXPR
)org.w3.mathml.PRODUCT
)uk.ac.nottingham.cml.CML

Similar things would be possible for ElementTrees.

The application might decide that for presentation purposes it could strip
off the FQN leaders without losing information, e.g. it could label tags
with the short form and pretty icons linked to the FQNs.  It would however 
preserve *all* of the namespace in the document using *tools that exist at 
present*.

It should be attractive to HTML-hackers who are quite accustomed to 
filling their pages with 'http://www.w3.org/some/where' and don't complain.
So the only real problem is how to make the documents authorable.

</SOLUTION>

<PROPOSAL TYPE="ancillary">
It's becoming clearer daily that there is a requirement for a pre-parser
for manipulating documents before they go into the parser.  MathML has 
already lifted the lid on this by suggesting that there should be
macros to insert length boilerplate (the boilerplate being legal XML).
(I suspect this cannot easily be done with entities since there is argument
substitution, but I haven't looked in detail.)  The MathML documents 
before macroprocessing are presumably not validatable but may be WF - I 
haven't checked.

This is  therefore a proposal for a pre-processor for XML documents which
cannot be validated, but are WF and which *will be validatable after 
preprocessing*.  It will require (minor) changes to the
language and (minor) changes to current processing software which validates.
It requires no changes for documents and tools dealing with WF documents.

<PROPOSAL>
There should be a PI which expands GIs within a scope to (more) fully
qualified names.  This PI may be ignored by WF processors, but they are 
advised to pass it to the application.  For validation the parser is required
to perform this expansion.  
</PROPOSAL>

Example
<![CDATA[

<!DOCTYPE cml SYSTEM "cml.dtd" [ <!-- unqualified, just as an example -->
<!-- an additional version the math dtd *using* FQNames; i.e. long-winded -->
<!entity % mathml SYSTEM "http://www.w3.org/some/where/fqmathml.dtd">
%mathml;
]>
<CML>
<MOL Title="benzene">
...
<XLIST TITLE="symmetry operators">
<?XML-DOCTYPE DOCTYPE="org.w3.mathml"?>
<MATRIX>
<MATRIXROW>1<SEP/>0<SEP/>0<SEP/></MATRIXROW>
<MATRIXROW>0<SEP/>0<SEP/>1<SEP/></MATRIXROW>
<MATRIXROW>0<SEP/>1<SEP/>0<SEP/></MATRIXROW>
</MATRIX>
<?XML-DOCTYPE?>  <!-- exists from math scope -->
...
</MOL>
</CML>
]]>

Note that this is *valid WF XML as of today*  (assuming no typos).  It's 
potentially invalid XML because MATRIX might collide with CML (actually
it doesn't).  However, after pre-processing to stick the 'org.w3.mathml'
on the front it's guaranteed to be validatable (assuming we reserve
the DomainName space for the FQNs).

All that is required is for the XML-ERB to (a) reserve this syntax for 
GIs (b) reserve the namespace for *the first component* of XML GIs.  The
only implementors who need to worry are those who write validating parsers
and this should be a trivial addition to their task.

</IMPLEMENTATION>

	P.


-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/
Received on Sunday, 18 May 1997 07:03:46 UTC