Re: RS/RE: Yet Another Proposal from Paul Prescod on 1996-10-03 (w3c-sgml-wg@w3.org from October 1996)

From: Paul Prescod <papresco@calum.csclub.uwaterloo.ca>
Date: Thu, 3 Oct 1996 13:57:29 -0400 (EDT)
To: crm@ebt.com (Christopher R. Maden)
Cc: w3c-sgml-wg@w3.org
Message-Id: <199610031757.NAA19902@calum.csclub.uwaterloo.ca>

While I could live with the proposal to treat all whitespace as significant
(which isn't unprecedented...think of Fortran), I really feel that we've
got to be clear that you either lose the ability to format with whitespace
or you lose control over your "true data". This is the last time, I promise.

> The newlines are really there, as "true data".  They are not
> *displayed*, per application convention dictating whitespace
> normalization after parsing (e.g., in the DSSSL grove).
...
> The newlines go into the database.  All of them.  They are part of the
> data.

The spaces between my table cells cannot be considered part of the True
Data. "True data" is what I have in my head, before I sit down at the 
computer. "True data is meaning." Spaces for formatting are NOT part of that 
true data. They are meaningless. I put them as part of the encoding
process as an encoding convenience.  They have no meaning in ANY output
format or as the result of ANY database query. They are meaningless in
all but two contexts:

#1. My raw text editor, where they were created.
#2. An XML application that treats them as meaningful because it doesn't have
a stylesheet, doesn't know what "application conventions" to apply and doesn't 
know what to do with them.

> o All new lines are data, except those known to be in element content
>   (by virtue of SEPCHAR).

I don't understand. How can I know which ones are in element content?

> o When formatting XML (for display, transformation into RTF, or
>   printing):
> ...

I think that any syntactic specification that prescribes behaviour for
a _particular application_ is in trouble. You've specified a mechanism for 
a particular class of applications to "figure out" what was meant, but not for 
the larger set of ALL applications.

This is acceptable if we do not expect people to ever encode their data in
XML, but not otherwise. If XML is really, truly _delivery and display only_
then this is fair. Otherwise, not.

> The problem is that true content, without one hack or another, is
> different between an SGML parse and an XML parse.  Quoting is going to
> make XML unusable, IMO.  By making *all* newlines data, handling is
> unambiguous.  

Making ALL newlines (outside of verbatim elements) NOT data is also 
unambiguous, but preserves the SGML/HTML convention of using whitespace for
formatting without affecting the parse tree.

> An SGML (or XML with DTD) parse will not be ESIS-
> identical to an XML parse without DTD, but after application
> conventions are applied, the result will be identical.  Isn't that
> what matters?

The problem is what you describe as "application conventions" are conventions
at the level ABOVE the XML parser (as opposed to implementing XML as a set
of SGML application conventions). So you are depending on "smart applications".
A "stupid" application should be able to take the outputs of a rigorous parse
and work on them without fear of extraneous, meaningless data.

> Alternatives:
> 1) Implement a shortref-based hack that won't work in most current
>    SGML systems and complicates the markup, for a reason that won't be
>    explainable to most users or implementers.
> 2) Define a simple application convention that won't work in most
>    current SGML systems, simplifies markup, and is easy to explain.

3) Make all newlines and tabs meaningless except those that are lexically
distinguished either through a section delimiter, an escape character, an
entity reference or a simple GI. (we have many options of how to make them
obvious) Make all spaces meaningful. 

 * Most existing documents will parse immediately. 
 * The output of SGML normalizers will parse immediately. 
 * The output of almost all SGML tools will parse immediately. 
 * The markup looks basically identical to that which the SGML community is 
   accustomed to.

cons: 
 * incautious use of space characters as formatting or newlines "as" space
characters will fail.
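
A minimal sketch of how alternative 3 might treat a run of character data.
The message leaves the "lexically distinguished" mechanism open, so the
entity-style escapes &nl; and &tab; below are purely hypothetical names
chosen for illustration, as is the function name normalize.

```python
import re

# Hypothetical escapes for an intended newline or tab. The proposal only says
# such characters must be "lexically distinguished"; these names are assumed.
ESCAPES = {"&nl;": "\n", "&tab;": "\t"}


def normalize(text: str) -> str:
    """Apply the alternative-3 rule to a run of character data."""
    # Literal newlines and tabs are formatting only, so drop them.
    stripped = re.sub(r"[\r\n\t]", "", text)
    # Explicit escapes become real characters. Spaces are never touched,
    # because every space is meaningful under this rule.
    return re.sub(r"&(?:nl|tab);", lambda m: ESCAPES[m.group(0)], stripped)
```

Under this sketch, "one \ntwo" comes out as "one two", while "one\ntwo"
comes out as "onetwo", which is exactly the failure listed under cons.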

BUT, where and when was it written that a newline and a space are equivalent
anyhow? Isn't that just a bad habit, a subtle abuse of markup? I think that
Lee made this point before. If you want a space, put in a space. If you want
a newline, put in a newline.

> > editors do this for you automatically. On both Windows and the Mac,
> > the standard text editor widgets Do the Right Thing.
> 
> They do?  I've never had Notepad, Write, Wordpad, SimpleText, or
> Claris Works insert spaces at the end of lines for me.  I think I
> would be upset if they did when I wasn't editing XML.  And I don't see
> them implementing an XML mode any time soon.

Sorry I wasn't more explicit. If you insert a newline meaning to insert a 
space, none of these editors will put in another space for you. But why
would you do that? 

If, on the other hand, you keep typing past the end of a line, all of these 
tools will word wrap for you. Some of them will insert newlines. Some will
not. NONE will STRIP the space you just typed, before you knew (or cared)
that the line was going to wrap. Therefore, unless you treat your text
editor as a typewriter (which I admit, I do, in vi, because it does not
do automatic word wrapping) you will be okay. Emacs seems to also do 
the right thing. Distressingly (and surprisingly), fmt does not. vi users 
would have to be careful and put spaces before every newline, but anyone using 
a text editor written in the last two decades shouldn't have a problem.

Lee pointed out that mail programs often strip trailing spaces (I don't know
why they would, but...). This is a further restriction.

One other interesting thing: this is VERY easy to warn the user about in a 
validating parser (with DTD): "Warning, there is no space between lines 20 and 
21 in mixed content. Lines will be concatenated."
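
The check could be sketched roughly as follows. This is a simplification,
not a validating parser: it assumes the caller has already determined (from
the DTD) that the given lines sit in mixed content, and the function name
missing_space_warnings is made up for illustration.

```python
def missing_space_warnings(lines):
    """Return warnings for adjacent lines that would be joined without a space.

    Assumes every line passed in is already known to be in mixed content.
    Line numbers in the messages are 1-based.
    """
    warnings = []
    for i in range(len(lines) - 1):
        cur, nxt = lines[i], lines[i + 1]
        if not cur or not nxt:
            continue  # a blank line has no text to run together
        if not cur.endswith(" ") and not nxt.startswith(" "):
            warnings.append(
                f"Warning, there is no space between lines {i + 1} and "
                f"{i + 2} in mixed content. Lines will be concatenated."
            )
    return warnings
```

For example, given the three lines "one " / "two" / "three", only the
boundary between lines 2 and 3 is reported.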

If we don't trust editors, there are other alternatives. We could make
the start tags for mixed content elements lexically distinct. We could make a
rule that says that a newline becomes a space if the element has had data 
content in it already.

That fails in obscure cases like this:

<P><EM>This</EM>
<STRONG>is</STRONG> a document.

but the solution is simple: put a space in, instead of (or in addition to)
a newline. Again, this can be easily checked.
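
A rough sketch of that rule, to show why the example above fails. It assumes
attribute-free tags, and it assumes that only non-whitespace characters count
as "data" for the purposes of the rule; the message does not settle either
point, and the function name apply_rule is invented here.

```python
import re

# Matches a simple start or end tag (no attributes), or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z][\w.-]*)>|([^<]+)")


def apply_rule(doc: str) -> str:
    """Turn a newline into a space only if the current element already has data."""
    out = []
    seen_data = [False]  # one flag per open element; index 0 is the document
    for m in TOKEN.finditer(doc):
        closing, name, text = m.groups()
        if name:
            if closing:
                if len(seen_data) > 1:
                    seen_data.pop()
            else:
                seen_data.append(False)
            continue
        for ch in text:
            if ch == "\n":
                if seen_data[-1]:
                    out.append(" ")  # element already had data: newline -> space
                # otherwise the newline is formatting only and is dropped
            else:
                out.append(ch)
                if not ch.isspace():  # assumption: whitespace is not "data"
                    seen_data[-1] = True
    return "".join(out)
```

Run on the example, P has no data of its own when the newline after </EM> is
reached, so the newline is dropped and the text comes out as "Thisis a
document." With a space typed before the newline, the result is "This is a
document."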

 Paul Prescod
Received on Thursday, 3 October 1996 13:57:46 UTC