- From: Charles F. Goldfarb <Charles@SGMLsource.com>
- Date: Mon, 23 Sep 1996 23:17:31 GMT
- To: Michael Sperberg-McQueen <U35395@UICVM.CC.UIC.EDU>
- Cc: Paul Prescod <papresco@calum.csclub.uwaterloo.ca>, W3C SGML Working Group <w3c-sgml-wg@w3.org>
On Mon, 23 Sep 96 09:26:37 CDT, Michael Sperberg-McQueen <U35395@UICVM.CC.UIC.EDU> wrote:

>Are the RE rules *essential* to good representation of textual data?
>
>I haven't heard anyone argue this position, only other positions:

Intelligent RE handling is *essential* to good (=accurate) document representation because it is the only way to distinguish the true information content from the "source document formatting"; that is, from the rendition of the document that is created in the text editor when you enter the document.

A principal objective of SGML is that all applications should receive the same "true information" about the document. When an SGML document is created with an editor that preserves line breaks (which SGML calls "record" breaks to avoid confusion with formatted output lines), the possibility exists that some record breaks are not part of the "true information". For example, in

    <p>Listen to my heart beat.
    <?DIRECTOR: audio on>
    And beat and beat and beat.</p>

the true information is:

    "Listen to my heart beat. And beat and beat and beat."

because the record end after the PI is not part of the data.

Similarly, if the user chose to set the tags off clearly by putting them in their own records, as in

    <p>
    Listen to my heart beat.
    <?DIRECTOR: audio on>
    And beat and beat and beat.
    </p>

the true information still would be

    "Listen to my heart beat. And beat and beat and beat."

With a mechanism like SGML's RS/RE handling (properly implemented), the parser always gives the identical "true information" to the application, regardless of the user's input style.

Without intelligent record handling, in the last example the application instead sees:

    " Listen to my heart beat. And beat and beat and beat. "

These two are very different character strings, so there is no guarantee that two different products, asked to do the identical processing, will produce anything close to the same results.
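As a rough illustration of the point above, the following Python sketch compares a parser that applies record-end handling with one that merely strips markup. It is a deliberately simplified approximation, not the full ISO 8879 record-boundary rules: the function names `true_data` and `naive_data` are hypothetical, a newline stands in for RE (with RS treated as implicit and always ignored), and only three cases are handled (an RE right after a start-tag, an RE right before an end-tag, and an RE ending a record that holds only processing instructions).

```python
import re

# Tokens: processing instruction, end-tag, start-tag, record end, or a run of data.
TOKEN = re.compile(r"<\?[^>]*>|</[^>]*>|<[^>]*>|\n|[^<\n]+")


def kind(tok):
    """Classify a token produced by TOKEN."""
    if tok == "\n":
        return "re"
    if tok.startswith("<?"):
        return "pi"
    if tok.startswith("</"):
        return "etag"
    if tok.startswith("<"):
        return "stag"
    return "data"


def true_data(source):
    """Return the data with (simplified) intelligent record-end handling."""
    toks = TOKEN.findall(source)
    out = []
    for i, tok in enumerate(toks):
        k = kind(tok)
        if k == "data":
            out.append(tok)
        elif k == "re":
            # Look back past any PIs that share this record.
            j = i - 1
            while j >= 0 and kind(toks[j]) == "pi":
                j -= 1
            prev = kind(toks[j]) if j >= 0 else None
            nxt = kind(toks[i + 1]) if i + 1 < len(toks) else None
            pi_only_record = j < i - 1 and prev in ("re", None)
            after_start_tag = prev == "stag"
            before_end_tag = nxt == "etag"
            if not (pi_only_record or after_start_tag or before_end_tag):
                out.append("\n")  # this record end is genuine data
    return "".join(out)


def naive_data(source):
    """Strip markup only; every record end is passed through as data."""
    return re.sub(r"<[^>]*>", "", source)


compact = ("<p>Listen to my heart beat.\n"
           "<?DIRECTOR: audio on>\n"
           "And beat and beat and beat.</p>")
spaced = ("<p>\nListen to my heart beat.\n"
          "<?DIRECTOR: audio on>\n"
          "And beat and beat and beat.\n</p>")

print(repr(true_data(compact)), repr(true_data(spaced)))
print(repr(naive_data(compact)), repr(naive_data(spaced)))
```

Under these assumptions, `true_data` yields the same string for both input styles, keeping only the record end between the two sentences (shown here as a newline), while `naive_data` yields two different strings that depend on where the user happened to break the lines.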
Even if the products would have produced identical results given the same character strings, they cannot do so now.

Making it an "application convention" to strip what appears to be extraneous whitespace (i.e., to figure out what is the "true information") just shifts the burden from a few parsers to all applications and increases the chance of inconsistent treatment.

Alternatively, telling the user that he can't put markup or an included element on a line by itself just shifts the burden to him, with even more chance of error if he doesn't have a validating editor.

--
Charles F. Goldfarb * Information Management Consulting * +1(408)867-5553
13075 Paramount Drive * Saratoga CA 95070 * USA
International Standards Editor * ISO 8879 SGML * ISO/IEC 10744 HyTime
Prentice-Hall Series Editor * CFG Series on Open Information Management
--
Received on Monday, 23 September 1996 19:15:37 UTC