Re: SD4 - Schema format from Peter Murray-Rust on 1997-05-16 (w3c-sgml-wg@w3.org from May 1997)

From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
Date: Fri, 16 May 1997 13:29:31 GMT
To: w3c-sgml-wg@w3.org
Message-Id: <6650@ursus.demon.co.uk>

This is a general response to the three proposals already in (and the US
isn't fully awake yet :-) -

In message <9374.199705161115@grogan.cogsci.ed.ac.uk> "Henry S. Thompson" writes:
> 
> Constructing the full vanilla DTD is left as an exercise for the
> reader :-).
> 
Constructing the full processing software is left as an exercise for the
MCSGS...

[... rest of proposal omitted ...]

I read these proposals from the point of view of an *implementor*, albeit
one with no CS training, and would like proposers to keep implementation very
much in their minds.  I still believe in the idea that *an* XML 
infrastructure can be built by individuals in the virtual XML community and,
indeed, that this is critical in working out the language as we go along.
Looking at the proposals so far, I feel some way from being clear how they
would be implemented in conjunction with what we have at present.  That's
not to say they aren't the right way forward, but simply that we must bear 
this in mind (I've been involved with other language developments that
crashed because they were too expansive).  There are semantics in XML-LINK
(which I have implemented in JUMBO) which have not been fully tested and
I, at least, don't find completely trivial.

It's worth remembering that apparently simple concepts such as parameter
entities and whitespace take a LOT of care to define precisely.  If they 
aren't so designed, it's highly probable that different implementations will
produce different results.

<TERMINOLOGY>
Are the proposals (SD[1-5]) seen as part of XML-lang, or are they a new
XML-name? If the former, we have a *lot* of work to do before XML-lang
is finalised; if the latter, then documents written to these proposals 
may break XML-lang software and will in any case require additional 
processing.
</>

Let me suggest the architecture which will be necessary for a *generic*
XML processing system (i.e. without a browser, without stylesheet, without
any domain-specific stuff.) [In JUMBO this is localised in *part* of a 
package called jumbo.sgml.  It might be possible to separate it further].

Some of the solutions proposed appear to be incompatible with SGML and/or
to produce documents which may break current XML (and SGML) parsers.
An obvious example is namespace collisions between elementTypes in DTDs - I
shall focus on this example.  At present I see three modules as being 
required in a generic vanilla XML system (e.g. w/o browser):

pre-parser -> parser -> post-parser

<NOTE>
If this model is simplistic or inappropriate, please say so and suggest 
another :-)
</>
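
To make the three-stage model concrete, here is a minimal sketch in Python.
It is an illustration only, not a description of how JUMBO or any of the
parsers named in this message work.  The names run_pipeline,
identity_pre_parser and collect_element_types are hypothetical, and the
standard-library ElementTree parser stands in for the middle stage even
though it does not validate against a DTD.

```python
import xml.etree.ElementTree as ET
from typing import Callable, TypeVar

T = TypeVar("T")


def run_pipeline(
    raw: str,
    pre_parser: Callable[[str], str],
    post_parser: Callable[[ET.Element], T],
) -> T:
    """Chain the three stages: pre-parser -> parser -> post-parser."""
    xml_text = pre_parser(raw)       # must yield text the parser accepts
    root = ET.fromstring(xml_text)   # stand-in parser (non-validating)
    return post_parser(root)         # checks/transformations on the tree


def identity_pre_parser(raw: str) -> str:
    """Trivial pre-parser for input that is already well-formed."""
    return raw


def collect_element_types(root: ET.Element) -> list:
    """Trivial post-parser: list the distinct element types in the tree."""
    return sorted({element.tag for element in root.iter()})


result = run_pipeline(
    "<MOL><VAR>1.2</VAR><VAR>3.4</VAR></MOL>",
    identity_pre_parser,
    collect_element_types,
)
print(result)
```

The point of passing the outer stages in as functions is that a proposal
can then be described by saying what each one has to do, while the parser
in the middle stays unchanged.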

The parser is the simplest to start with and can be exemplified by (n)sgmls,
Lark or NXP.  It takes an XML-compliant document as described in the spec
and may validate it.  (Throughout this discussion I shall assume that 
validation at various levels is a requirement).  It then may produce output
which is completely undefined by the spec, but three examples are ESIS
(NXP), abstract tree (Lark) and groves.  Please correct me, but I believe
that all three cover the same space in the pipeline above.

If a document is prepared to specifications like the later proposals, then 
it *may* not be XML- or even SGML-compliant.  [This is something that needs
elaborating.]  If it is not XML-compliant, then either (a) the parsers need
redefining and may require a context-sensitive grammar or (b) a pre-parser
is required that converts the input to be XML-compliant.  Assuming the
latter, it might take multiple DTDs and expand their GIs to be more fully 
qualified, e.g.

<!-- part of CML DTD -->
<!Element VAR (#PCDATA)>

gets preprocessed to

<!Element cml.VAR (#PCDATA)>

before it is read in to the parser.  This may be manageable, but it's not
trivial to keep it all together.  
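
The renaming step above could be sketched roughly as follows, in Python.
This is an assumption-laden illustration, not a proposed implementation:
the function name qualify_element_decls is hypothetical, and the sketch
only rewrites the name being declared.  Names inside content models,
attribute-list declarations and the tags of the document instance would
need the same treatment, and a regular expression like this one cannot
tell a real declaration from one that is commented out.

```python
import re

# Start of an element declaration, capturing the declared name (GI).
# Matching is case-insensitive so that both "<!ELEMENT" and the
# "<!Element" spelling used in this message are accepted.
_ELEMENT_DECL = re.compile(r"(<!ELEMENT\s+)([A-Za-z_][\w.\-]*)", re.IGNORECASE)


def qualify_element_decls(dtd_text: str, prefix: str) -> str:
    """Prefix each declared element name with `prefix` and a dot."""
    return _ELEMENT_DECL.sub(
        lambda m: f"{m.group(1)}{prefix}.{m.group(2)}",
        dtd_text,
    )


cml_fragment = "<!-- part of CML DTD -->\n<!Element VAR (#PCDATA)>"
print(qualify_element_decls(cml_fragment, "cml"))
```

Even this small case shows why keeping it all together is not trivial:
every place that mentions VAR has to be found and rewritten consistently,
or the qualified DTD and the document no longer agree.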

The post-parser seems essential.  We are seeing proposals for additional
elements, PIs, multiple attributes etc, which have to be processed at a 
considerable level of complexity.  XML-link defines an inheritance mechanism
and it's critical that everyone does this the same way - at present I'm
not aware of any implementations of XML-link other than JUMBO, so
I can't check my interpretation.  Proposals for multiple inheritance
worry me ('Java in a Nutshell' says (p77) "Multiple inheritance opens up
a can of worms") and I would probably give up if it were a requirement.

The post-parser has a great deal to do.  XML-link defines a number of
syntactic constructs that will not be checked by the DTD but which any
reputable processor should nonetheless check (shouldn't it?).  Then there are
the transformations, and the new proposals are implicitly describing a complex
set of these.  
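
As an illustration of the kind of check that falls to a post-parser, here
is a minimal Python sketch.  The rule it enforces is invented purely for
the example and is not taken from XML-link or from any of the proposals:
a hypothetical LIST element carries a count attribute that must equal the
number of its ITEM children, a constraint a DTD cannot express.  The
function name check_declared_counts is likewise hypothetical.

```python
import xml.etree.ElementTree as ET


def check_declared_counts(root: ET.Element) -> list:
    """Return one message per LIST whose count attribute is wrong.

    Hypothetical rule: LIST/@count must equal the number of direct
    ITEM children.  LIST elements without a count are not checked.
    """
    errors = []
    for lst in root.iter("LIST"):
        declared = lst.get("count")
        if declared is None:
            continue
        actual = len(lst.findall("ITEM"))  # direct children only
        if not declared.isdigit() or int(declared) != actual:
            errors.append(
                f"LIST declares count={declared!r} but has {actual} ITEM children"
            )
    return errors


doc = ET.fromstring(
    '<DOC><LIST count="2"><ITEM/><ITEM/></LIST>'
    '<LIST count="3"><ITEM/></LIST></DOC>'
)
for message in check_declared_counts(doc):
    print(message)
```

A document can pass the parser, validation included, and still fail a
check like this, which is why it matters that each proposal states what
the post-parser is expected to verify.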

So, are we still agreed that it should be easy to implement XML-name?  Or 
are we expecting that it needs teams of programmers and will be left to one 
or two major enterprises to do it?  

If it's possible, could I suggest that proposals in this area try to answer
the following questions and post the answers - it would certainly help me.

(A) Can parsers compliant with the 9704 XML spec parse the suggested documents 
(including all DTDs)? If not,
(A1) is it proposed that the parsers be altered to allow this?
or
(A2) is some pre-parsing software proposed?  If so, what basic operations
must it carry out?

(B) Will the proposal require a post-parsing process?  If so
(B1) what mechanisms (e.g. inheritance, parsing of PIs, etc.) will be 
required?
(B2) what validation will the post-parser be expected to carry out?

I appreciate that this may not be suitable for all proposals, but any help
will be appreciated.

	P.

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/
Received on Friday, 16 May 1997 09:17:46 UTC