Whitespace from Peter Murray-Rust on 1997-05-07 (w3c-sgml-wg@w3.org from May 1997)

From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
Date: Wed, 07 May 1997 11:34:16 GMT
To: w3c-sgml-wg@w3.org
Message-Id: <6269@ursus.demon.co.uk>
We have been asked to concentrate on XML-lang and xml-link ... so.  

I am finding myself thoroughly confused by whitespace handling.  Although
I suspect that the draft is consistent and the ERB/WG all agree on what
it's meant to do, please treat the following as a typical webhacker confusion.
It may be that special explanation is required in the draft, because otherwise
the HTML2XML community will be thoroughly confused.

I will use a very simplified subset of CML as illustration.  (I will omit
minimisation flags to save typing - otherwise all examples should be 
interchangeable between SGML and XML).  Please also forgive errors.

The document:
<?XML VERSION="1.0"?>
<!DOCTYPE CML [
<!ELEMENT CML (XVAR)*>
<!ELEMENT XVAR (#PCDATA)>
]>
<CML>
<XVAR>
A variable
</XVAR>
</CML>

parses with sgmls to give a CML element which contains an XVAR element
whose content is 'A variable'.  There are no other #PCDATA elements.

I can include as much whitespace (space, newline) between tags as I like and
the result is the same.

If I write
<!ELEMENT CML (#PCDATA|XVAR)*>
instead, it also validates, but gives a different result, with additional
#PCDATA elements (content '\n') on either side of the XVAR element.  If I
use ANY as the content model of CML it does the same.

NXP appears to do the same as sgmls on a cursory inspection, even without
validation switched on.  

Let's move to WF mode...

If the ELEMENT declarations are removed from the DTD subset (or there is no
DTD at all), then the elements are assumed to have a content model of ANY.
***This will result in additional PCDATA nodes in the tree***.
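
To make the effect concrete, here is a minimal sketch using Python's standard xml.dom.minidom, which never drops text nodes on the basis of a content model and so behaves like the DTD-less case described above. The helper name child_summary is hypothetical, and the XML declaration and DOCTYPE are left out of the instance so that no DTD is in play.

```python
from xml.dom import minidom

# Same instance as above, without the XML declaration or DOCTYPE (no DTD).
DOC = "<CML>\n<XVAR>\nA variable\n</XVAR>\n</CML>"


def child_summary(xml_text):
    """Return (kind, value) pairs for the children of the root element."""
    root = minidom.parseString(xml_text).documentElement
    summary = []
    for node in root.childNodes:
        if node.nodeType == node.TEXT_NODE:
            summary.append(("text", node.data))
        elif node.nodeType == node.ELEMENT_NODE:
            summary.append(("element", node.tagName))
    return summary


# The line breaks between tags show up as text nodes around XVAR.
print(child_summary(DOC))
# With no whitespace between tags, only the XVAR element remains.
print(child_summary("<CML><XVAR>A variable</XVAR></CML>"))
```

The first call reports a '\n' text node on either side of XVAR; the second reports XVAR alone.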

This is doubtless not news to any of you, but it's a shock to me, that
WF documents and validated documents ***GIVE DIFFERENT OUTPUT***.  I am sure
that this will be a rich source of confusion.

Ideally I would like to add an XML-LANG option that 'fixed' the problem, but 
I'm not sure it's fixable :-)  CML is straightforward in that PCDATA only 
occurs in two elements in the DTD, and those can only have #PCDATA.  I can 
throw out 'spurious' PCDATA nodes later, but that seems to me to be 
DTD-dependent and we don't have a flag that can signal this.  So 
(expecting the answer 'no')

**is there any way to modify XML-lang to suppress PCDATA elements having
only whitespace content in certain contexts?**  (I thought that was the 
original intent of the first draft).
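
The "throw it out later" workaround might look like the sketch below. It is only an illustration of why the step is DTD-dependent: the caller has to supply the set of elements known to have element-only content (here the hypothetical argument element_content), which is exactly the information a WF document without its DTD does not carry. The function name drop_whitespace_nodes is also hypothetical.

```python
from xml.dom import minidom


def drop_whitespace_nodes(node, element_content):
    """Remove whitespace-only text children of elements named in element_content."""
    for child in list(node.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            drop_whitespace_nodes(child, element_content)
        elif (child.nodeType == child.TEXT_NODE
              and node.nodeType == node.ELEMENT_NODE
              and node.tagName in element_content
              and child.data.strip(" \t\r\n") == ""):
            node.removeChild(child)


doc = minidom.parseString("<CML>\n<XVAR>\nA variable\n</XVAR>\n</CML>")
# Assumption: the application knows, from the DTD, that CML has element content.
drop_whitespace_nodes(doc.documentElement, element_content={"CML"})
print([c.nodeName for c in doc.documentElement.childNodes])
```

The text inside XVAR is left alone because XVAR is not in the supplied set; only the whitespace nodes directly under CML are removed.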

So far we agree on the ***present output for the parser*** if we can't 
change XML-lang.  It differs according to whether the parser uses some
or all of the DTD, even for a WF document.  What do we do with what we get?
The spec is not very much help.  PRESERVE says 'take exactly what you get.
That's what the author+DTD wants you to have'.  DEFAULT says 'up to the
application', which doesn't help the implementer.  I still find the terms
'application', 'parser' and 'processor' are not clear in my mind, and it
is further confused by the common usage 'HTML is an *application* of SGML'.

[BTW - has the hanging sentence in 2.8 been modified?]

I am assuming that the 'application' is a program, distinct from the 
parser (which is a 'processor'?) and that JUMBO is an application (a generic
one).  Therefore it's up to **JUMBO** what it does with DEFAULT, right?
This is independent of the DTD, and the DTD author, and author of the WF
document can have no control over the way DEFAULT is implemented.

It's quite possible that some applications could decide to throw out (delete
from tree) the 'spurious' PCDATA, while others might collapse it to a single
space, others to a null string and some simply use PRESERVE (as JUMBO does).
The author and the DTD have no control over this.  
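
As a rough illustration of how far applications could diverge, the sketch below applies the four behaviours just listed to the whitespace-only text nodes under the root element. The policy names ("delete", "collapse", "empty", "preserve") and the function apply_default are hypothetical labels for this example, not terms from the draft.

```python
from xml.dom import minidom

DOC = "<CML>\n<XVAR>\nA variable\n</XVAR>\n</CML>"


def apply_default(xml_text, policy):
    """Return the root's text children after one hypothetical DEFAULT policy."""
    root = minidom.parseString(xml_text).documentElement
    result = []
    for child in root.childNodes:
        if child.nodeType != child.TEXT_NODE:
            continue
        data = child.data
        if data.strip(" \t\r\n") == "":  # whitespace-only node
            if policy == "delete":
                continue  # node is dropped from the tree
            if policy == "collapse":
                data = " "  # collapsed to a single space
            elif policy == "empty":
                data = ""  # kept, but as a null string
            # "preserve" leaves the data exactly as parsed
        result.append(data)
    return result


for policy in ("delete", "collapse", "empty", "preserve"):
    print(policy, apply_default(DOC, policy))
```

The same well-formed document yields four different sets of text nodes, depending only on which policy the application happened to choose.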

This has serious consequences for WF documents because although this is 
strictly logical it's anything but intuitive.  It means that a document
like the one above is highly dangerous without its DTD, which seems a pity
because it's eminently useful.  If all CML documents have to be presented as

<CML><XVAR>A variable</XVAR></CML>

this is unworkable, since a document can run to tens of thousands of characters
without a line break, and that breaks text editors.  And remember
(Article 6) 'human-legible and reasonably clear'.

Ideally we need a fix for this.  If none is possible, then we need a VERY
clear exposition of this.  It also means that non-validating parsers
(or at least parsers which cannot read the DTD) will give different outputs
from validating parsers.  I have run Lark briefly over the first document
above, Tim, and my impression is that Lark puts in the 'spurious' PCDATA
nodes, whilst NXP doesn't.  [Forgive me if I'm wrong here, Tim].  This would
imply that the output can differ in four ways: between parsers, between
validating and non-validating modes, with and without a DTD subset, and in
the treatment of DEFAULT by browsers.  [This is with a properly
well-formed document with balanced quotes, tags, and the rest :-). ]

IMO this gives people an awful lot of places to go wrong.  However the 
solution is not to get rid of WF documents, as has been suggested, but to
make these aspects of behaviour much clearer.

	P.

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/


Received on Wednesday, 7 May 1997 06:40:41 UTC