Re: ERB meeting, 30 October 1996

Michael Sperberg-McQueen <U35395@UICVM.UIC.EDU> wrote:
> Agreed unanimously:
>  - to use the string '/>' as the NET delimiter in the SGML
>    declaration of XML documents

Joe English wrote:
> I strongly urge the ERB to reconsider this.
> The /> NET trick is extremely fragile.  Most current
> SGML parsers cannot even implement it, and those that do
> cannot enforce its proper usage.


Note also that allowing <e> means that either
(1) every XML parser has to be able to cope with parsing a DTD,
    merely to deal with <e>, or

(2) sending an XML file without a DTD is an error if it contains an
    element that was declared with content model EMPTY.
    However, an XML parser that doesn't handle a DTD won't be able
    to cope with this error easily -- it will simply assume
    containment, and get to the end of the document.  Probably it
    will then close all open elements, maybe issue a "document may
    have been truncated' error, and cheerfully continue, with
    a *completely wrong* interpretation of the document.
    Or, if you prefer, with a dramatically different ESIS.

    This is unacceptable.

I realise <@e> wasn't very popular.  But it has neither of these failure
cases, and, unlike the NET proposal, is robust against error -- it would
be an error to put an @ in front of an element not declared EMPTY,
because until SGML's syntax can be improved, the @ would be in the
element declaration.  Afterwards, it could be implied -- you'd set the

Is error checking really so hard to figure out?
Ask yourself
(a) what happens if I forget to use the special marker?
    (1) with <e/ or <e/>:
	If there's a DTD, and we're in an SGML application nothing.
	NET is always optional.  If we're an XML application using
	NSGMLs or some other ESIS-providing system (but can't be SGMLS,
	as it won't let you do the NET trick I think), it cannot be
	detected, since ESIS doesn't include minimisation information.

	An XML-specific application convention might say that an element
	using the NET feature must declared EMPTY.  In that case, when we
	reach the end of the document, we declare a parse error and exit.
	(see (c) for the no DTD case)

    (2) with @
	If there's a DTD, we detect the error immmediately, with today's
	unmodified SGML.
	    although the @ is part of the element name (a kludge, I admit),
	    it is not polluting the reference concrete namespace -- there
	    cannot be any existing document types using the RFC in which
	    GIs began with "@".]

    (3) with <e>
	If you put a </e> in the document, and there's a DTD, it's an error
	and is detected.

(b) what happens if I use the special marker where I shouldn't?
    (1) with <e/ or <e/>:
	It's legal in SGML.  People using SGML tools will get very
	confused.  But maybe this will help discourage that practice,
	when XML tools become available.
	In XML with a DTD, you'll get an error.
	If the XML parser is just an SGML parser and uses ESIS, NET
	information is not available to it, and it is no longer an error
	(since it's legal in SGML).

    (2) with @
	If there is a DTD (whether ESIS or no, whether XML or SGML) an
	error is immediately given at the correct point of the error.

    (3) with <e>
	Meaningless in this case.  No error.
(c) what happens without a DTD, when I don't make an error?
    (1) with <e/ or <e/>:
	With XML and the application convention, everything is OK.
	You can't use NSGMLS and ESIS to implement this, though, since
	the fact tht NET was used is not included in ESIS.  (Using the
	grove mechanism is probably harder than writing a parser in C
	from scratch for most non-SGML programmers, I would guess)

    (2) with @
	The @ is part of the element name (with unmodified SGML) and gets
	passed through to the ESIS and the application.  Hence, this
	approach works in all cases.  It might not work with SGMLS, as
	it's not pure reference concrete syntax any more.  But it should
	work with NSGMLS out-of-the-box (yes?  James?)

    (3) with <e>
	This doesn't work without a DTD>  You're hosed.  Broken.
	Back to the drawing board.  We have two languages if this is
	allowed.  Language (1) is only for documents with a DTD.
	Language (2), XML, does not have empty elements.

Note that allowing both <e> and <e/ is harder to implement than only
allowing one of them -- it's harder than either of them alone.
And allowing <e> at all means that every XML parser has just got a lot
harder to write.

If you allow <e>, you might as well not bother with <e/>, as it doesn't
save the parser writer anything (still has to deal with old-fashioned
EMPTY elements).

Worse, neither of the two mechanisms accepted by the ERB has good
error handling characteristics (as Joe English pointed out and I have
attempted to enumerate).

Oh well.  If we stick with this, I propose to implement _only_ the
DTDless XML, with no support for EMPTY at all, for my own use.
It'll be _much_ quicker to write the code.  If my implementation
gets spread around quickly enough, it will become the defacto
standard.  Later, maybe I'll add #include nd other neat features I like.