Re: revised restatement of the RE rules from lee@sq.com on 1996-09-25 (w3c-sgml-wg@w3.org from September 1996)

From: <lee@sq.com>
Date: Tue, 24 Sep 96 20:38:17 EDT
To: U35395@UICVM.CC.UIC.EDU, w3c-sgml-wg@w3.org
Message-Id: <9609250038.AA22254@sqrex.sq.com>
You know, if XML had the compiler-theory concept of phases, there would
be no problem whatsoever with &#RE;<!-- comment --> because the lexical
analyser would eat the comments entirely, so that the parser wouldn't
ever see comments at all.

This is trivial to implement and is how almost all modern languages do it.

It is unfortunate that <! is overloaded in SGML, making comment recognition
much harder than it need have been.  Had there been a separate comment
delimiter, e.g. <-- ..... -->, for use within the document, comments would
have been much simpler.

E.g. consider a parser reading a token at a time...

getToken(startMode)
{
    -- handle comments:
    if input is "<--" then        -- start of comment
        eat up to and including "-->"
    endif

    if inTextMode then            -- handle text
        if input is MDO then
            do stuff
        endif
    else if inMarkupMode then     -- handle stuff in markup mode
        ...
    else
        ...
    endif
}

With this model, which I hope is clear from the incomplete code sketch
(the mode names are not meant to correspond to SGML modes in the example),
the parser would call getToken() repeatedly to read a file... but comments
are never returned, and hence not seen by the parser.

Obviously an editor can't use this model directly -- just as editors
for the C programming language have to retain comments today (e.g. Visual C,
Turbo C++, Brief, etc.).

I suggest that it be made clear that in XML, all comments are elided
entirely before other parsing begins.  This does not mean that you have
to have multiple passes; it can be implemented concurrently very efficiently
using the mechanism I've outlined above.
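
For anyone who wants something they can actually run, here is a minimal
Python sketch of the single-pass idea.  It assumes the hypothetical
<-- ... --> comment delimiter from my example (not SGML's real one), and
the names get_tokens, COMMENT_OPEN and COMMENT_CLOSE are mine, purely
for illustration.  It knows nothing about modes, entities or
declarations; it only shows that the caller never sees a comment.

```python
from typing import Iterator

# Hypothetical comment delimiters, as in the sketch above (not SGML's).
COMMENT_OPEN = "<--"
COMMENT_CLOSE = "-->"


def get_tokens(text: str) -> Iterator[str]:
    """Yield tags and runs of text in order; comments are never yielded."""
    pos = 0
    n = len(text)
    while pos < n:
        # Handle comments first: skip them without producing a token.
        if text.startswith(COMMENT_OPEN, pos):
            end = text.find(COMMENT_CLOSE, pos + len(COMMENT_OPEN))
            if end == -1:
                raise ValueError("unterminated comment")
            pos = end + len(COMMENT_CLOSE)
            continue
        if text[pos] == "<":
            # Anything else starting with "<" is treated as one tag token.
            end = text.find(">", pos)
            if end == -1:
                raise ValueError("unterminated tag")
            yield text[pos:end + 1]
            pos = end + 1
        else:
            # A run of text extends to the next "<" or to end of input.
            end = text.find("<", pos)
            if end == -1:
                end = n
            yield text[pos:end]
            pos = end
```

Note that text on either side of a comment comes back as two separate
tokens in this sketch; a real tokeniser might prefer to join them.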

Now, 
<P>
A
<?XXX>
B
</P>

is clearly exactly the same as

<P>
A

B
</P>

which I propose be considered the same as
<P>
A B
</P>
by considering any non-zero amount of whitespace to be exactly the
same as a single space -- again, as in just about all other modern languages.

Whitespace is any sequence of space, tab, form feed, newline, carriage return,
and vertical tab (does anyone still use that??).

Note that this makes
<P>
A
</P>
distinct from
<P>A</P>
but that is a necessary consequence of not requiring the use of a DTD.
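
The whitespace rule is small enough to sketch in a few lines of Python.
The function name collapse_whitespace is only an illustration, and the
character set is exactly the one listed above: space, tab, form feed,
newline, carriage return and vertical tab.

```python
import re

# One or more of: space, tab, form feed, newline, carriage return, vertical tab.
_WHITESPACE_RUN = re.compile(r"[ \t\f\n\r\v]+")


def collapse_whitespace(text: str) -> str:
    """Replace every non-empty run of whitespace with a single space."""
    return _WHITESPACE_RUN.sub(" ", text)
```

Applied to "<P>\nA\n\nB\n</P>" this gives "<P> A B </P>", and applied to
"<P>\nA\n</P>" it gives "<P> A </P>", which still differs from
"<P>A</P>", as noted above.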

It would be possible to use a syntactic distinction between what I shall
call formes and sticks.  By a forme I mean an element that cannot contain
text directly, rather like element context in SGML.  By a stick, I mean
an element that can contain text directly.  For example only, suppose
that the names of sticks are to be in lower case (in SGML, you'd have
to change NAMECASE (yes?) in the SGML declaration to do this).

<SEC>
<p>An
<emph>important</emph>
distinction</p>
</SEC>

becomes

<SEC><p>An <emph>important</emph> distinction</p></SEC>

I am not proposing this mechanism for XML.  I am describing it so as to
make clear that if you don't read the DTD, you have to be able to deduce
all the context you need by inspection.  If you need varying behaviour
from your tokeniser or parser in different contexts, those contexts must
be visible syntactically.

But it is simplest to do away with such things altogether.
Have a style sheet property telling you to discard leading/trailing
spaces in an element if you must, and make the rules the same anywhere.

Note that, just as in TeX, you would not be able to include unquoted
program source code with this proposal.

main()
{
    print("hello world\n");
}

might become

<program-listing>
main()&nl;
{&nl;
&tab;print("hello world\n");&nl;
}&nl;
</program-listing>

But this is not so bad.  How many troff-set books have I seen where
this program is printed as
main()
{
    print("hello world0);
}

because the \n wasn't escaped?
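
Quoting a listing in this way need not be done by hand.  A rough Python
sketch follows, assuming the &nl; and &tab; entity names used in the
example above (they are illustrative, not defined anywhere) and a
made-up function name, quote_listing.  It only handles tabs and line
ends; leading spaces and characters such as & and < are left alone.

```python
def quote_listing(source: str) -> str:
    """Mark tabs and line ends explicitly so whitespace collapsing cannot lose them."""
    # Tabs become the illustrative &tab; entity.
    quoted = source.replace("\t", "&tab;")
    # Each line end gets an explicit &nl;, keeping the real newline for readability.
    return quoted.replace("\n", "&nl;\n")
```

The backslash-n inside the string literal of the program is two
ordinary characters, so it passes through untouched.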

Lee
Received on Tuesday, 24 September 1996 20:38:32 UTC