Re: RS/RE: basic questions from Charles F. Goldfarb on 1996-09-20 (w3c-sgml-wg@w3.org from September 1996)

From: Charles F. Goldfarb <Charles@SGMLsource.com>
Date: Fri, 20 Sep 1996 02:10:02 GMT
To: Tim Bray <tbray@textuality.com>
Cc: w3c-sgml-wg@w3.org
Message-ID: <3241f046.12343325@mail.alink.net>
On Thu, 19 Sep 1996 13:30:51 +0000, Tim Bray <tbray@textuality.com> wrote:

>Here are two questions, both stated as challenges to assumptions that have
>been stated here but never defended.  I am not necessarily saying that
>the assumptions fail, but that to me at least, their validity is not
>self-evident and that they need some reasoned rhetoric in their support.
>
>1. Why ignore whitespace?
>
>Given really simplistic RS/RE handling, it would be the case that the two 
>following "P" elements would parse differently.
>
><p>Listen to my heart beat.</p>
><p>
>Listen to my heart beat.
></p>
>
>The position has been advanced by both Charles Goldfarb and James Clark
>that this would be A Bad Thing.  Obviously, it would complicate the
>problem of achieving compatibility with 8879.  Aside from that, 
>WHY IS THIS A PROBLEM?
A principal objective of SGML is that all applications should receive the same
"true information" about the document. When an SGML document is created with an
editor that preserves line breaks (which SGML calls "record" breaks to avoid
confusion with formatted output lines), the possibility exists that some record
breaks are not part of the "true information". For example, in 

<p>Listen to my heart beat.
<?DIRECTOR: audio on>
And beat and beat and beat.</p>

the true information is: 

"Listen to my heart beat.
And beat and beat and beat."

because the record end after the PI is not part of the data.

Similarly, if the user chose to set the tags off clearly by putting them in
their own records, as in

<p>
Listen to my heart beat.
<?DIRECTOR: audio on>
And beat and beat and beat.
</p>

the true information still would be

"Listen to my heart beat.
And beat and beat and beat."

With a mechanism like SGML's RS/RE handling (properly implemented), the parser
always gives the identical "true information" to the application, regardless of
the user's input style. Without intelligent record handling, in the last example
the application instead sees:

"
Listen to my heart beat.

And beat and beat and beat.
"

These two are very different character strings, so there is no guarantee that
two different products, asked to do the identical processing, will produce
anything close to the same results. Even if the products would have produced
identical results given the same character strings, they cannot do so now.
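
A minimal sketch of the difference, in Python. It is not the ISO 8879 RS/RE
algorithm: it assumes one deliberately simplified rule (a record that holds
only markup contributes no record end, and the final record end of the
element is dropped), and the function names naive_data and record_aware_data
are hypothetical, chosen only for illustration.

```python
import re

# Assumption: anything between "<" and ">" is markup (tags and PIs alike).
MARKUP = re.compile(r"<[^>]*>")


def naive_data(doc: str) -> str:
    """Delete markup and pass every other character, record ends included."""
    return MARKUP.sub("", doc)


def record_aware_data(doc: str) -> str:
    """Simplified record handling (an illustration, not the 8879 rules).

    A record containing markup and no data contributes nothing, not even
    its record end. Joining the surviving records with "\n" also means no
    record end follows the last piece of data.
    """
    kept = []
    for record in doc.splitlines():
        data = MARKUP.sub("", record)
        if data == "" and MARKUP.search(record):
            continue  # markup-only record: its record end is not data
        kept.append(data)
    return "\n".join(kept)


# The two input styles from the examples above.
style_a = (
    "<p>Listen to my heart beat.\n"
    "<?DIRECTOR: audio on>\n"
    "And beat and beat and beat.</p>"
)
style_b = (
    "<p>\n"
    "Listen to my heart beat.\n"
    "<?DIRECTOR: audio on>\n"
    "And beat and beat and beat.\n"
    "</p>"
)

# Same "true information" from both styles with record handling.
print(record_aware_data(style_a) == record_aware_data(style_b))
# Different strings from the two styles without it.
print(naive_data(style_a) == naive_data(style_b))
```

Under these assumptions both input styles yield the same two-line string
from record_aware_data, while naive_data returns a different string for
each style.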

Making it an "application convention" to strip what appears to be extraneous
whitespace (i.e., to figure out what is the "true information") just shifts the
burden from a few parsers to all applications and increases the chance of
inconsistent treatment. Alternatively, telling the user that he can't put
markup or an included element on a line by itself just shifts the burden to him,
with even more chance of error if he doesn't have a validating editor.

By delimiting all data (which is all that "eliminating mixed content" really
means), you solve all the problems and you don't have to explain complicated
rules to anyone.

>
>It makes it idiotically, wonderfully, easy to explain to programmers 
>*and authors* exactly what is markup and what is data.  It makes it 
>ridiculously easy to implement.  
With my proposal, if it is in quotes it is data, otherwise it is not. Rules
don't get any simpler, or easier to implement.

>
>2. Why should XML try to solve the record problem.?
>
>Personally, I've been writing programs for almost 20 years which routinely
>dealt with the fact that there might be NL or CR or CR/NL sequences in the
>data, and maybe my experience is not shared, but this has never been a big
>problem.  Is it necessary for XML to abstract the problem away in the way
>that SGML tries to do, especially if it's going to be hard to do in XML?
Yes, it is. But it will be very easy to do in XML, much easier than in SGML.

>
>In fact, the practice in UNIX and Microsoft operating systems of storing
>text in chunks of 80 bytes or less, separated by artefacts of typewriter
>technology, is simply a historical anomaly, and I'm not sure that we should
>pander to it, particularly when (and here's the real challenge I guess) it
>doesn't seem, in practice, to be a big problem.
I'm not sure this rant is even relevant. Anyway, perhaps now you see why it is,
in fact, a fatal problem, as it goes to the core of the purpose of SGML: If a
markup language can't preserve and interchange the true information, it is
useless.

Best regards,

Charles
--
Charles F. Goldfarb * Information Management Consulting * +1(408)867-5553
           13075 Paramount Drive * Saratoga CA 95070 * USA
  International Standards Editor * ISO 8879 SGML * ISO/IEC 10744 HyTime
 Prentice-Hall Series Editor * CFG Series on Open Information Management
--
Received on Thursday, 19 September 1996 22:07:53 UTC