Mixed vs. element content (Was Re: RS/RE, again (sorry)) from Derek Denny-Brown on 1996-12-17 (w3c-sgml-wg@w3.org from December 1996)

From: Derek Denny-Brown <ddb@criinc.com>
Date: Mon, 16 Dec 1996 17:48:37 -0800
To: w3c-sgml-wg@w3.org
Message-Id: <2.2.32.19961217014837.006e953c@MAILHOST.criinc.com>
Joe English wrote:
>Christopher R. Maden <crm@ebt.com> wrote:
>> 3) A dichotomy between "DTD-ful" and DTD-less parsing will make any
>>    sibling-based relationship difficult at best; this will affect some
>>    TEI or HyQ based hyperlinks, as well as sibling-based stylistic
>>    decisions.
>
>However, it will _not_ cause a problem for a large class of
>applications; namely those in which extraneous whitespace
>is irrelevant (full-text indexers, web-walkers, etc.) or
>is dealt with separately by the application (CSS-based
>formatters, any formatter based on TeX or TeX-like algorithms,
>etc.)
>
>Extraneous whitespace _can_ cause a problem for DSSSL-based formatters,
>but I don't think that's a big problem either (please see below).
>
>> 4) The only way to avoid the dichotomy is to preserve these whitespace
>>    nodes even when a DTD is present.
>
>That's one way to avoid it, but it's not the _only_ way.
>I propose the following:
>
>    * If the author fails to provide a formal document type
>      definition, then all whitespace is assumed to be significant
>      (as per (1), above).
>
>    * If the author does provide a formal document type definition,
>      then applications must examine the DTD to determine where
>      whitespace is significant.
>
>I think this will work.  A large class of applications simply
>_do not care_ about the distinction between element and mixed
>content: they will produce the same results in either case.
>Applications in that class do not need to examine the DTD.

I see this issue raising a distinction not unlike the issue addressed by the
'Required Markep Declaration' already provided for in the spec.
<http://www.w3.org/pub/WWW/TR/WD-xml-961114.html#NT-RMDecl>

Where the RMD says what is required to parse this document properly (or at
all), we are now talking about what is required to properly parse
white-space.  I don't really like the -xml-space attribute, it does a
reasonable job os the simple cases.  (It still is a bit rough though, as
this whole discussion has pointed out.)   When a document (or it's DTD) is
too complex for this simple tool to handle it, the document should just be
able to demand that the DTD be processed in order for white space to be
parsed properly.  A DSSSL display engine would tell it's XML parser that it
cares about white space, and the XML parser might thus be required to
retrieve the DTD when it might not otherwise have needed it.  Similarly, a
server which servers complex XML documents, might normalize the documents
such that white space processing was not neccessary (assuming it has access
to the DTD).

Somewhere, someone has to take into account white-space/RE handling.  The
indication is that the user can not always be expect to take on this burden,
so someone else must carry it, in those cases where is significant.

>Applications that do care about the distinction will need
>to examine the DTD.  I do not see this as an onerous burden
>for any of the applications in this class that have been
>mentioned so far: all the ones listed so far will need to
>examine the DTD anyway (in order to process general entity
>declarations), and compared to the effort involved in (say)
>writing a DSSSL-O engine, parsing content models to see if
>they contain a #PCDATA token is just not that difficult.

This assumes that the issue with DTD's is the complexity of parsing and
processing the DTD.  If the issue is the bandwidth required for passing the
DTD, then these arguements do not hold.  Unfortunately, I have encoundered
many documents where the DTD is an order of magnitude larger than the
instance.  (Whether this is bad design on the part of SGML or the document
architect is irrelevant.  I still believe it may be common)

I am concerned that a document parsed as XML (as opposed to SGML) will
require different HyTime location addressing, or a different DSSSL style
sheet, than if the same document was parsed as SGML.  Such a situation
defeats a number of the goals of XML, mainly it's SGML compatibility.

-derek
"that which is not slightly distorted lacks sensible appeal: from which it
follows
 that irregularity - that is to say, the unexpected, surprise, and astonishment,
    are an essential part and characteristic of beauty" - Charles Baudelaire
Received on Monday, 16 December 1996 20:52:54 UTC