Re: Mixed vs. element content (Was Re: RS/RE, again (sorry)) from W. Eliot Kimber on 1996-12-17 (w3c-sgml-wg@w3.org from December 1996)

From: W. Eliot Kimber <eliot@isogen.com>
Date: Tue, 17 Dec 1996 12:00:02 -0900
To: Derek Denny-Brown <ddb@criinc.com>, Paul Prescod <papresco@calum.csclub.uwaterloo.ca>
Cc: w3c-sgml-wg@w3.org
Message-Id: <3.0.32.19961217115947.00a98bd4@uu10.psi.com>
At 09:33 AM 12/17/96 -0800, Derek Denny-Brown wrote:
>    ...I am just worried that the proposals being brought forth will break
>HyTime when applied to XML.  Given that I was actively involved with
>drafting the forthcoming HyTime TC, it is important to me that HyTime is not
>completely abandoned, when it need not be.  

I don't think it's a HyTime-specific issue, both because the problems are
not unique to HyTime and because the use of HyTime is not dependent on how
the parsing process is defined.

All location addressing, whether HyTime-defined or not, operates on an
abstraction of the data, not the original source data.  This means that you
have to choose your abstraction carefully, which is what we're really
talking about in this whole RS/RE fracas.  

There are two levels of abstraction that we usually work with: 

1. The immediate result of parsing.  
2. The result of applying application-specific semantics to the
   results of parsing.

There may be more levels of abstraction, but we haven't exposed those yet
in our discussions of XML processing.

Abstraction (1) is what HyTime and DSSSL call the "SGML document grove" or
the "pGrove" (for parse grove).  What can occur in this grove is completely
defined by the SGML property set (published in the DSSSL standard and soon
to be published again in the HyTime TC) and reflects simply applying the
SGML parsing rules to the input document.  It is roughly equivalent to
"ESIS" except that the grove may be more complete and you have a formal way
to say what you want to be in the grove (the "grove plan").

Abstraction (2) is what HyTime calls the "extended SGML document grove", or
"epGrove".  This is a new grove with HyTime-specific semantics applied.  It
uses the same property set as the first but may suppress or remove some
things, or may modify the content to reflect HyTime-specific semantics.

Any application is free to create its own extended document grove.  XML
processors will, presumably, provide their own XML-specific extended
document groves to reflect XML-specific semantics (for example, that
whitespace is collapsed when the -xml-space attribute is in effect).

The -xml-space attribute is a good example of how this works in practice:
an XML parser parses a document and creates a pGrove that contains all the
data characters it found.  If the document has no DTD, then this means all
white space characters, not just those in what we know to be mixed content
(white space that is not taken as data is held in "markup" properties,
which are not, by definition of the SGML property set, content of the
objects that exhibit them--thus these characters may be in the grove but
they aren't part of the content of the elements in which they occur).

An XML processor then operates on the pGrove to produce an XML epGrove in
which the rules for -xml-space are applied, i.e., lists of white space
character nodes in the content of elements where white space gets collapsed
get replaced with single space character nodes. (Notice I didn't say
"characters get replaced": the operations are on nodes in groves, not
characters in strings, and each character is a node.)  
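To make the pGrove-to-epGrove step concrete, here is a minimal sketch in
Python.  It is an illustration under stated assumptions, not anything
defined by the SGML property set: the CharNode and ElementNode classes, the
collapse_space flag (standing in for "-xml-space is in effect for this
element"), and the build_ep_grove function are all hypothetical names.  The
point it shows is that a new grove is built in which each run of white space
character nodes becomes one space character node, while the pGrove is left
untouched.

```python
from dataclasses import dataclass, field
from typing import List, Union

WHITESPACE = " \t\r\n"


@dataclass
class CharNode:
    """One data character; in a grove each character is its own node."""
    char: str


@dataclass
class ElementNode:
    """An element with ordered content (hypothetical, much simplified)."""
    gi: str
    content: List[Union["ElementNode", CharNode]] = field(default_factory=list)
    # Assumption: stands in for "white space collapsing applies here".
    collapse_space: bool = False


def build_ep_grove(node: ElementNode) -> ElementNode:
    """Return a new element tree; the input pGrove is not modified."""
    new_content: List[Union[ElementNode, CharNode]] = []
    in_run = False  # True while inside a run of white space character nodes
    for child in node.content:
        if isinstance(child, CharNode):
            if node.collapse_space and child.char in WHITESPACE:
                if not in_run:
                    # The whole run is represented by one space node.
                    new_content.append(CharNode(" "))
                    in_run = True
                continue
            in_run = False
            new_content.append(CharNode(child.char))
        else:
            in_run = False
            new_content.append(build_ep_grove(child))
    return ElementNode(node.gi, new_content, node.collapse_space)


def text_of(node: ElementNode) -> str:
    """Concatenate the character nodes in document order."""
    return "".join(
        c.char if isinstance(c, CharNode) else text_of(c) for c in node.content
    )
```

Because build_ep_grove returns a fresh tree, both groves remain available
afterwards, which is what lets an address say which of the two it is
applied against.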

Any location addressing applied against XML documents would, presumably, be
applied against the XML epGrove (or possibly a location-method-specific
grove derived from the epGrove), not against the pGrove.

Of course, the problem of knowing how to produce the XML epGrove
consistently remains.  However, having these two stages can make it clearer
where the processing can happen and *why* using attributes to control it is
not necessarily a hack because the attributes are *not* feeding back into
the base parsing process (at least conceptually)--they are affecting the
construction of application-specific groves and applications are free to
use any information at their disposal to control grove construction.

Note also that grove plans are not sufficient to solve this problem because
grove plans only include or exclude entire classes of object or entire
property values--they can't selectively exclude things: that requires a
specific grove construction process.
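A small sketch of why a grove plan alone falls short.  Here a grove plan is
modeled, as an assumption, as nothing more than a set of class names to
include; the (class name, value) tuples and the apply_grove_plan function
are hypothetical, and the class names are illustrative strings only.

```python
from typing import List, Set, Tuple

# Hypothetical flat model of parsed nodes: (class name, value).
Node = Tuple[str, str]


def apply_grove_plan(nodes: List[Node], included_classes: Set[str]) -> List[Node]:
    """Keep every node whose class is in the plan; drop whole classes otherwise."""
    return [node for node in nodes if node[0] in included_classes]


parsed: List[Node] = [
    ("element", "p"),
    ("datachar", "a"),
    ("datachar", " "),
    ("datachar", " "),
    ("datachar", "b"),
]

# Including the class keeps both spaces; excluding it also loses "a" and "b".
with_chars = apply_grove_plan(parsed, {"element", "datachar"})
without_chars = apply_grove_plan(parsed, {"element"})
```

Neither choice yields "a", one space, "b": the decision is made per class,
never per node, so collapsing the run needs a separate construction step.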

Note that there's absolutely no requirement that applications actually
perform the grove constructions described above as discrete steps--most XML
processors will go directly from source data to XML epGrove without first
constructing the intermediate pGrove.  

But note also that HyTime (and DSSSL) can operate with equal ease on either
grove and it could be possible to have both available and indicate which
you actually want to address when doing addressing. (Whether this is
practically useful or not, I wouldn't want to speculate at this point.)

Finally, I'd like to point out that from a HyTime perspective (in the new
grove-based world) any addressing notation that can be defined in terms of
node lists selected from groves can be naturally integrated into a
HyTime-based system.  For example, TEI locators, whose grove-based results
should be obvious given knowledge of the grove plan used, could be easily,
meaningfully, and usefully used in conjunction with other HyTime-defined
location addresses.
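As a rough illustration of "node lists selected from groves", the sketch
below defines two unrelated ways of locating nodes and combines their
results.  Everything in it is hypothetical: the Node class, by_child_path,
by_predicate, and intersect are made-up names, and neither function
implements actual TEI locator or SDQL syntax.  It only shows that once each
notation yields a node list over the same grove, the results compose.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(eq=False)  # eq=False: nodes compare by identity, not by value
class Node:
    gi: str
    children: List["Node"] = field(default_factory=list)


def by_child_path(root: Node, path: List[int]) -> List[Node]:
    """Follow 1-based child positions from root; yields zero or one node."""
    current = root
    for position in path:
        if position < 1 or position > len(current.children):
            return []
        current = current.children[position - 1]
    return [current]


def by_predicate(root: Node, wanted: Callable[[Node], bool]) -> List[Node]:
    """Return, in document order, every node for which wanted(node) is true."""
    found = [root] if wanted(root) else []
    for child in root.children:
        found.extend(by_predicate(child, wanted))
    return found


def intersect(first: List[Node], second: List[Node]) -> List[Node]:
    """Keep the nodes of first that also occur (by identity) in second."""
    return [n for n in first if any(n is m for m in second)]
```

For example, intersecting the result of a child path with the result of a
predicate on element type gives the nodes that both methods agree on, with
no need for either method to know how the other was expressed.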

Thus, it's not really useful to talk about "HyTime addressing" versus other
forms of address: it's all the same stuff at its core and the problems
posed by the data abstractions we're creating are the same.  Thus the issue
of, for example, whether we should prefer TEI locators over SDQL queries is
an issue of appropriate syntax and user interface, not functionality [for
what it's worth, I will probably end up preferring TEI locators over SDQL
for XML use because they were specifically designed to meet the requirements
we presume the main XML audience to have].

Cheers,

E.
--
W. Eliot Kimber (eliot@isogen.com) 
Senior SGML Consulting Engineer, Highland Consulting
2200 North Lamar Street, Suite 230, Dallas, Texas 75202
+1-214-953-0004 +1-214-953-3152 fax
http://www.isogen.com (work) http://www.drmacro.com (home)
"Rats in the morning, rats in the afternoon...if they don't go away, I'll be
re-educated soon..."                 --Austin Lounge Lizards, "1984 Blues"
Received on Tuesday, 17 December 1996 14:09:16 UTC