- From: Paul Prescod <papresco@calum.csclub.uwaterloo.ca>
- Date: Fri, 13 Dec 1996 20:11:45 -0500
- To: "'w3c-sgml-wg@w3.org'" <w3c-sgml-wg@w3.org>
At 10:30 AM 12/13/96 -0800, Jean Paoli wrote:
>I propose :
>
>1/ XML Parser output: use RE Delenda Est in order to respect the data
>integrity (and for example to permit to full-text indexers to know where
>everything is by byte offset)

I don't understand this byte offset concern. There are other places to put information that is totally meaningless: whitespace within tags and all data within comments, for example. Full-text indexers have totally different needs than "applications" in the parser-client sense. Most parser-clients want a simplification of the data, not just a tokenization of it. If we take support for full-text indexers (and other tools that want all of the bytes) to the extreme, then we must build a grove that contains all of the whitespace within tags, etc.

Tim Bray says that it is useful to pass on all of the content that is not markup, but I don't see how. What does this "buy?" How is it any easier to work with an unnormalized version of this:

    <P> Whitespace </P>

than this:

    <P><!-- -->Comments<!-- --></p>

If we can agree that they are equally difficult to use in a text processing system, then we can agree that whitespace removal can be handled in the same way that comment removal is, by an optional "extra" communication line between the parser and the application.

Paul Prescod
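One way to picture the optional "extra" communication line is the rough sketch below. It is an illustration only, not any real parser's API: the `parse` function, the `on_data` and `on_extra` callbacks, and the event names are all hypothetical, and the regular expression is a toy tokenizer that handles only simple tags, comments, and character data. The main channel receives tags and whitespace-trimmed text; comments and untrimmed text reach the application only if it supplies the second callback.

```python
import re

# Toy tokenizer (an assumption for illustration): a comment, a tag,
# or a run of character data. Not a conforming SGML/XML parser.
TOKEN = re.compile(r"<!--.*?-->|<[^>]+>|[^<]+", re.DOTALL)


def parse(text, on_data, on_extra=None):
    """Report simplified events on on_data; report comments and
    untrimmed text on the optional on_extra channel."""
    for match in TOKEN.finditer(text):
        token = match.group(0)
        if token.startswith("<!--"):
            # Comments never reach the main channel.
            if on_extra is not None:
                on_extra("comment", token)
        elif token.startswith("<"):
            on_data("tag", token)
        else:
            trimmed = token.strip()
            # Only report the raw form when trimming changed something.
            if on_extra is not None and trimmed != token:
                on_extra("raw_text", token)
            if trimmed:
                on_data("text", trimmed)


def main_channel(text):
    """Collect only the events an ordinary parser client would see."""
    events = []
    parse(text, lambda kind, value: events.append((kind, value)))
    return events


def extra_channel(text):
    """Collect only the events sent on the optional extra channel."""
    extras = []
    parse(text, lambda kind, value: None,
          lambda kind, value: extras.append((kind, value)))
    return extras
```

Under these assumptions, both sample fragments produce the same shape of events on the main channel (an open tag, one text event, a close tag), while a full-text indexer that wants every byte could subscribe to the extra channel.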
Received on Friday, 13 December 1996 20:08:42 UTC