
Delimited pseudoelements

From: Joe English <jenglish@crl.com>
Date: Tue, 01 Oct 1996 14:35:30 -0700
Message-Id: <199610012135.AA01379@mail.crl.com>
To: w3c-sgml-wg@w3.org


Regarding the proposal to require that all data content
in XML be delimited:

As I understand it, this proposal attempts to solve
two problems: that of distinguishing element content from
mixed content, and that of RS/RE processing.

Since ISO 8879:1986 has no PELO/PELC ("pseudoelement open"/
"pseudoelement close") delimiter roles, the issue of
SGML compatibility must be addressed.

The following options are available for processing
an XML document as SGML:

    1a) use a native XML parser to build a grove or ESIS stream;
    1b) use an XML-to-SGML translator;
    1c) wait for ISO 8879:199X (and tools that support it);
    1d) use SHORTREF tricks in the DTD;
    1e) something else that I've missed.

<myopinion>
(1a), (1b), and to a lesser extent (1c) would severely limit
XML's utility.  If XML can't be directly processed with
existing SGML tools, there is little point in putting
an SGML-like notation on the Web.  If a separate XML parser
is required, we'd be better off using a completely new
syntax that is *significantly* easier to parse than something
SGMLoid;  Lisp S-expressions might be a good choice.
</myopinion>

(1d) would allow XML instances to be parsed as SGML,
but introduces the new problem of DTD incompatibility.
(I'm still operating under the assumption that *some*
XML applications will need to examine the DTD.)

We can get around this by stating in the XML spec
something to the effect of "Ignore <!SHORTREF> and
<!USEMAP> declarations; they're black magic intended
for consumption by SGML parsers."  Inelegant, but
workable.
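To make the "black magic" rule concrete, here is a rough sketch (in
Python, which is purely illustrative here; the function name and the
regex-based approach are my own assumptions, not anything specified)
of an XML processor discarding those declarations before reading the
rest of the DTD:

```python
import re

# Hypothetical sketch: an XML processor simply drops <!SHORTREF ...>
# and <!USEMAP ...> declarations on the floor, leaving them for SGML
# parsers.  Regex-based, so it assumes no ">" occurs inside the
# declarations' literals (true of the examples in this message).
def drop_shortref_magic(dtd_text):
    return re.sub(r"<!(SHORTREF|USEMAP)\b[^>]*>\s*", "", dtd_text)
```

A real processor would do this while tokenizing declarations, of
course, rather than with a textual pass.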

It is unclear (to me) what the SHORTREF black magic
should be.  If I understand correctly, something like the
following was proposed:


    <!-- where "pel" is an element type name not defined elsewhere
	 in the DTD, and the DTD uses "pel" instead of #PCDATA in all
	 other content models.
    -->

    <!ELEMENT pel - - (#PCDATA) >

    <!ENTITY sr.pelo STARTTAG pel >
    <!ENTITY sr.pelc ENDTAG pel >
    <!ENTITY sr.escaped-pelc CDATA '"'>

    <!SHORTREF element-content
	'"'	sr.pelo
    >
    <!SHORTREF data-content
	'"'	sr.pelc
	'\"'	sr.escaped-pelc	-- NB: won't work with RCS --
    >
    <!USEMAP data-content pel>
    <!USEMAP element-content ...everything else...>


With this scheme, an XML parser will *not* produce the same
output as an SGML parser: the latter will construct a grove
with all PEL nodes inside "wrapper" EL nodes that will
be absent in the grove constructed by the former.

Presumably this can be handled by an application convention
on the SGML side that "pel" elements are to be ignored,
or with another rule for XML that the PELO/PELC delimiters
are equivalent to "pel" start- and end- tags -- in other words,
requiring XML to interpret (or act as if it interprets) 
the short references after all.
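The SGML-side convention amounts to splicing out the "pel" wrapper
nodes.  As a sketch only (the tuple-based tree below is a toy stand-in
I've invented for illustration, not ESIS or any real grove API):

```python
# Hypothetical sketch: remove "pel" wrapper nodes from a parse tree
# so it matches what a native XML parser would build.  A tree node
# is (gi, children); character data is a plain string.
def strip_pel(node):
    if isinstance(node, str):
        return [node]
    gi, children = node
    flat = [c for child in children for c in strip_pel(child)]
    if gi == "pel":
        return flat          # splice the data up into the parent
    return [(gi, flat)]

# e.g. (foo (pel "text") (bar)) becomes (foo "text" (bar))
```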

This scheme doesn't solve all of the RS/RE weirdness:

    <foo>
    "
    This data contains two delimited REs
    that are discarded by SGML parsers.
    "
    </foo>

Maybe:

    <!ENTITY sr.pelo "<pel>&#RE;"  >
    <!ENTITY sr.pelc "&#RE;</pel>" >

would do the trick.


OK... so I've convinced myself at least that option (1d) is
workable from a technical standpoint (though it excludes SGML
parsers that don't support extending the short reference
delimiter set, or don't support SHORTREF at all).  Is
there a better solution that I've missed?


This leaves the problem in the hands of XML producers.

It seems to me that the best way to produce XML will be to
author in "full" SGML and down-translate.  Ideally the
down-translation will be a simple normalization process,
and with any luck most "native" SGML editors already
save files in a form that will be suitably normalized
for consumption as XML.

The only restriction on XML-able DTDs so far is that they cannot
contain EMPTY elements or #CONREF attributes (unless we can come
up with a way for DTD-less parsers to handle those too).

However, very few (if any) existing DTDs utilize the short reference
tricks required by option (1d); if these are required to make
a DTD XML-able, it places a significant burden on XML producers.

Do we expect producers to use XML-able DTDs for all of their data,
even in its "native" format?  Is there a way to automatically convert
existing DTDs to add the required SHORTREF declarations
and perform #PCDATA substitutions?  Do we require that providers
maintain two copies of all their DTDs, one for use with their
local SGML tools and one for publication as XML?
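On the automatic-conversion question: the mechanical part at least
looks tractable.  A very rough sketch (regexes and the names below
follow the earlier example in this message; a real converter would
need an actual DTD parser to cope with parameter entities, marked
sections, and pre-existing short reference maps):

```python
import re

# Hypothetical sketch of an XML-ifying DTD rewrite: substitute the
# "pel" element type for #PCDATA in content models, then append the
# SHORTREF boilerplate.  Appended last, so the (#PCDATA) in pel's
# own declaration is untouched by the substitution.
SHORTREF_BOILERPLATE = """
<!ELEMENT pel - - (#PCDATA) >
<!ENTITY sr.pelo STARTTAG pel >
<!ENTITY sr.pelc ENDTAG pel >
<!SHORTREF element-content '"' sr.pelo >
<!SHORTREF data-content    '"' sr.pelc >
<!USEMAP data-content pel >
"""

def xmlify_dtd(dtd_text):
    converted = re.sub(r"#PCDATA", "pel", dtd_text)
    return converted + SHORTREF_BOILERPLATE
```

Even if something like this works, it still leaves providers with two
divergent DTDs to keep in sync, which is the real burden.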

I'm not sure how to address this issue.


<myopinion> <![ TEMP [

I think that the problem of distinguishing element content
from mixed content is best solved by a markup convention
forbidding SEPCHARs in element content.  The cost of this
solution -- that XML input cannot be formatted for editing
the way most users would like -- is less than the cost of
adopting option (1d).

I think that the problem of RS/RE processing is best handled
simply by keeping the rules from ISO 8879.  If the rules can be
explained in 14 lines of prose and implemented with a 4-state
DFA, then RS/RE processing is at most a minor annoyance; it's *not* a
major problem.
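For concreteness, here is roughly what I have in mind -- a sketch,
NOT a faithful ISO 8879 implementation, of the commonly quoted
simplification: RS is never data; the RE just after the start-tag
and the RE just before the end-tag are suppressed; every other RE is
data.  Three states suffice below; recognizing markup within the
content supplies the fourth.

```python
# Codes from the reference concrete syntax: RS is LF, RE is CR.
RS, RE = "\n", "\r"

def filter_records(content):
    """Yield the data characters of one element's content with the
    simplified RS/RE suppression applied (sketch only)."""
    state = "start"                   # no data seen yet
    for ch in content:
        if ch == RS:
            continue                  # RS is always discarded
        if ch == RE:
            if state == "data":
                state = "pending"     # hold it: data only if more follows
            elif state == "pending":
                yield RE              # earlier held RE turns out to be data
            # in "start", an RE before any data is dropped
            continue
        if state == "pending":
            yield RE
        yield ch
        state = "data"
    # an RE still pending at the end-tag is dropped
```

Run over the <foo> example above (as "\r\nThis data...\r\n..."), the
two delimited REs around the data survive or vanish exactly as the
state machine dictates, in a handful of lines.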

I think that "RE delenda est" is the other best way to handle
the RS/RE problem.  Moreover, if we adopt the markup convention above
then Charles' objections concerning the ontogeny of individual record
ends are moot: if authors are forbidden to use whitespace to
format element content, why should data content be any different?

]]> </myopinion>



--Joe English

  jenglish@crl.com