Mixed vs. element content (Was Re: RS/RE, again (sorry))

Christopher R. Maden <crm@ebt.com> wrote:

> 1) In the absence of a DTD, we must assume mixed content for
>    everything.[*]

With you so far.


> 2) This could create whitespace nodes in element content.

I don't think so: if all element types have mixed content, then
there is no element content in which whitespace nodes could appear!


Actually there are two cases to consider:

    (A) The document creator does not provide a formal document type
	definition.

In this case, the semantics stated in point (1) should apply, and by my
reasoning there can be no whitespace in element content to worry
about.

    (B) The document owner provides a formal document type definition,
	but the document consumer chooses to ignore it.

This is the only case where I see a problem, but I don't think
it's a very big one.  I'll get to that in a moment...


> 3) A dichotomy between "DTD-ful" and DTD-less parsing will make any
>    sibling-based relationship difficult at best; this will affect some
>    TEI or HyQ based hyperlinks, as well as sibling-based stylistic
>    decisions.

However, it will _not_ cause a problem for a large class of
applications; namely those in which extraneous whitespace
is irrelevant (full-text indexers, web-walkers, etc.) or
is dealt with separately by the application (CSS-based
formatters, any formatter based on TeX or TeX-like algorithms,
etc.)

Extraneous whitespace _can_ cause a problem for DSSSL-based formatters,
but I don't think that's a big problem either (please see below).

> 4) The only way to avoid the dichotomy is to preserve these whitespace
>    nodes even when a DTD is present.

That's one way to avoid it, but it's not the _only_ way.
I propose the following:

    * If the author fails to provide a formal document type
      definition, then all whitespace is assumed to be significant
      (as per (1), above).

    * If the author does provide a formal document type definition,
      then applications must examine the DTD to determine where
      whitespace is significant.

I think this will work.  A large class of applications simply
_do not care_ about the distinction between element and mixed
content: they will produce the same results in either case.
Applications in that class do not need to examine the DTD.

Applications that do care about the distinction will need
to examine the DTD.  I do not see this as an onerous burden
for any of the applications in this class that have been
mentioned so far: all the ones listed so far will need to
examine the DTD anyway (in order to process general entity
declarations), and compared to the effort involved in (say)
writing a DSSSL-O engine, parsing content models to see if
they contain a #PCDATA token is just not that difficult.

Back to case (A): if authors choose not to include a formal
document type definition, then it will be up to them to
take care of extraneous whitespace.  There are several options;
they could include whitespace-discarding rules in all relevant
stylesheets, format their documents so that no extraneous
whitespace appears in the first place, or provide a DTD,
to name three.


> 5) Since SEPCHAR is thrown away in element content, every element must
>    be made mixed content, and any element declaration without #PCDATA
>    is illegal.
>
> This is clearly unacceptable.

Agreed, especially considering the production for "Mixed-content
declaration" in section 3.2.1 of the XML spec, which limits
content models containing #PCDATA to repeatable OR groups.


--Joe English

  jenglish@crl.com

Received on Monday, 16 December 1996 18:52:10 UTC