Re: RS/RE: basic questions from Charles F. Goldfarb on 1996-09-23 (w3c-sgml-wg@w3.org from September 1996)

From: Charles F. Goldfarb <Charles@SGMLsource.com>
Date: Mon, 23 Sep 1996 23:17:31 GMT
To: Michael Sperberg-McQueen <U35395@UICVM.CC.UIC.EDU>
Cc: Paul Prescod <papresco@calum.csclub.uwaterloo.ca>, W3C SGML Working Group <w3c-sgml-wg@w3.org>
Message-ID: <324715e1.13665565@mail.alink.net>

On Mon, 23 Sep 96 09:26:37 CDT, Michael Sperberg-McQueen
<U35395@UICVM.CC.UIC.EDU> wrote:

>Are the RE rules *essential* to good representation of textual data?
>
>I haven't heard anyone argue this position, only other positions:

Intelligent RE handling is *essential* to good (=accurate) document
representation because it is the only way to distinguish the true information
content from the "source document formatting"; that is, from the rendition of
the document that is created in the text editor when you enter the document.

A principal objective of SGML is that all applications should receive the same
"true information" about the document. When an SGML document is created with an
editor that preserves line breaks (which SGML calls "record" breaks to avoid
confusion with formatted output lines), the possibility exists that some record
breaks are not part of the "true information". For example, in 

<p>Listen to my heart beat.
<?DIRECTOR: audio on>
And beat and beat and beat.</p>

the true information is: 

"Listen to my heart beat.
And beat and beat and beat."

because the record end after the PI is not part of the data.

Similarly, if the user chose to set the tags off clearly by putting them in
their own records, as in

<p>
Listen to my heart beat.
<?DIRECTOR: audio on>
And beat and beat and beat.
</p>

the true information still would be

"Listen to my heart beat.
And beat and beat and beat."

With a mechanism like SGML's RS/RE handling (properly implemented), the parser
always gives the identical "true information" to the application, regardless of
the user's input style. Without intelligent record handling, in the last example
the application instead sees:

"
Listen to my heart beat.

And beat and beat and beat.
"

These two are very different character strings, so there is no guarantee that
two different products, asked to do the identical processing, will produce
anything close to the same results. Even if the products would have produced
identical results given the same character strings, they cannot do so now.
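
The contrast between the two input styles can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not the full
ISO 8879 record-boundary rules: it handles a single element with no nested
elements, treats anything between "<" and ">" as markup, and ignores a record
end only when its record holds markup and no data, or when it is the last
record end before the end-tag. The names true_information and naive_strip are
hypothetical and do not come from any SGML parser.

```python
import re

# Simplified assumption: tags and processing instructions are "<...>".
MARKUP = re.compile(r"<[^>]*>")


def true_information(source: str) -> str:
    """Return the data of one element using simplified record-end handling."""
    records = source.split("\n")
    pieces = []
    for index, record in enumerate(records):
        data = MARKUP.sub("", record)
        has_markup = data != record
        pieces.append(data)
        if index == len(records) - 1:
            continue  # the final record has no record end
        if has_markup and data == "":
            continue  # record holds only markup: its record end is not data
        pieces.append("\n")
    text = "".join(pieces)
    # The last record end before the end-tag is not followed by data.
    if text.endswith("\n"):
        text = text[:-1]
    return text


def naive_strip(source: str) -> str:
    """Remove markup but keep every record end, as a naive parser would."""
    return MARKUP.sub("", source)


STYLE_A = (
    "<p>Listen to my heart beat.\n"
    "<?DIRECTOR: audio on>\n"
    "And beat and beat and beat.</p>"
)
STYLE_B = (
    "<p>\n"
    "Listen to my heart beat.\n"
    "<?DIRECTOR: audio on>\n"
    "And beat and beat and beat.\n"
    "</p>"
)

# Both input styles yield the same data with record handling ...
assert true_information(STYLE_A) == true_information(STYLE_B)
# ... but different character strings without it.
assert naive_strip(STYLE_A) != naive_strip(STYLE_B)
```

With record handling both styles reduce to the same two-line string; without
it, each style hands the application a different string.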

Making it an "application convention" to strip what appears to be extraneous
whitespace (i.e., to figure out what is the "true information") just shifts the
burden from a few parsers to all applications and increases the chance of
inconsistent treatment. Alternatively, telling the user that he can't put
markup or an included element on a line by itself just shifts the burden to him,
with even more chance of error if he doesn't have a validating editor.

--
Charles F. Goldfarb * Information Management Consulting * +1(408)867-5553
           13075 Paramount Drive * Saratoga CA 95070 * USA
  International Standards Editor * ISO 8879 SGML * ISO/IEC 10744 HyTime
 Prentice-Hall Series Editor * CFG Series on Open Information Management
--

Received on Monday, 23 September 1996 19:15:37 UTC