Whitespace and full-text indexers

At 10:30 AM 12/13/96 -0800, Jean Paoli wrote:
>I propose :
>
>1/ XML Parser output: use RE Delenda Est in order to respect the data
>integrity (and for example to permit to full-text indexers to know where
>everything is by byte offset)  

I don't understand this byte offset concern.

There are other places to put information that is totally meaningless:
whitespace within tags and all data within comments, for example. Full-text
indexers have totally different needs than "applications" in the
parser-client sense. Most parser-clients want a simplification of the data,
not just a tokenization of it. If we take support for full-text indexers
(and other tools that want all of the bytes) to the extreme, then we must
build a grove that contains all of the whitespace within tags, etc.

Tim Bray says that it is useful to pass on all of the content that is not
markup, but I don't see how. What does this "buy"? How is it any easier to
work with an unnormalized version of this:

<P>


Whitespace


</P>

than this:

<P><!--

-->Comments<!--

--></P>

If we can agree that they are equally difficult to use in a text processing
system, then we can agree that whitespace removal can be handled in the same
way that comment removal is, by an optional "extra" communication line
between the parser and the application.
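
To make the "optional extra line" idea concrete, here is a minimal sketch in
Python. Everything in it is an assumption for illustration: the (kind, data)
event tuples, the filter_events name, and the two flags are hypothetical and
do not come from any existing parser. Treating leading and trailing
whitespace in a text chunk as removable is also a simplification; which
whitespace counts as meaningless is exactly the open question.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical event model: (kind, data), where kind is one of
# "start", "end", "text", or "comment".
Event = Tuple[str, str]


def filter_events(
    events: Iterable[Event],
    drop_comments: bool = True,
    strip_whitespace: bool = True,
) -> Iterator[Event]:
    """Optional layer between a parser and an application.

    With both flags off, every event passes through untouched, which is
    what a tool that wants all of the content would ask for.
    """
    for kind, data in events:
        if kind == "comment" and drop_comments:
            continue  # comment removal
        if kind == "text" and strip_whitespace:
            data = data.strip()  # assumed rule: trim leading/trailing whitespace
            if not data:
                continue  # chunk was whitespace only; drop it entirely
        yield kind, data


# The two examples above, written out as the events a parser might report.
whitespace_example = [
    ("start", "P"),
    ("text", "\n\n\nWhitespace\n\n\n"),
    ("end", "P"),
]
comment_example = [
    ("start", "P"),
    ("comment", "\n\n"),
    ("text", "Comments"),
    ("comment", "\n\n"),
    ("end", "P"),
]

simplified_whitespace = list(filter_events(whitespace_example))
simplified_comments = list(filter_events(comment_example))
```

With the defaults, both examples reduce to a start event, one text event,
and an end event. An application that wants everything passes
drop_comments=False and strip_whitespace=False and gets the parser's events
back unchanged.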

 Paul Prescod

Received on Friday, 13 December 1996 20:08:42 UTC