RE: RS/RE, again (sorry) from Jean Paoli on 1996-12-13 (w3c-sgml-wg@w3.org from December 1996)

From: Jean Paoli <jeanpa@microsoft.com>
Date: Fri, 13 Dec 1996 10:30:48 -0800
To: "'w3c-sgml-wg@w3.org'" <w3c-sgml-wg@w3.org>
Message-ID: <c=US%a=_%p=msft%l=RED-16-MSG-961213183048Z-7215@INET-04-IMC.itg.microsoft.com>
The problem I have with RE Delenda is the one pointed out by Prescod:
there is no mechanism provided for having totally meaningless
whitespace.
It is a fact that editors or batch applications which read XML,
manipulate it and then save it need a way to freely insert CR/LF inside
the XML stream in order to cut it into lines, because a line has a
limited number of characters.
They also need a place to freely insert whitespace for indentation
purposes.
They also need to know at read time that those characters can be
removed: if not, the document will continue to grow indefinitely.
This is the tribute we have to pay to enable an XML document to be text,
which means human-readable without tools.
In principle, the only safe place to insert or delete such characters is
in element content, but element content cannot be detected in a DTD-less
environment, so we have a problem.

Instead of talking immediately about what a parser should do, I will
discuss application behavior a little and come back later to the parser
output.

In the following, each * represents a CR/LF/whitespace character, and
they are numbered *1, *2, and so on:

<UL>*1*2<LI><PARA><B></B>*3*4<I></I></PARA><PRE>foo*5*6bar</PRE></LI></UL>

*1 and *2 are meaningless
*3 and *4 are whitespace in mixed content.
*5 and *6 are whitespace used for formatting purposes.

When an application saves XML, we can see that it *could* be acceptable
to collapse *3 and *4, or to add or delete CR/LF near *3 or *4, in order
to do pretty-printing or to cut the XML stream into lines.
It *could* also be acceptable to collapse *1 and *2, or to add or delete
CR/LF near *1 and *2.
But it is certainly unacceptable to let the application modify anything
at *5 and *6, or insert CR/LF inside the <PRE> in order to cut the XML
stream into lines.
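
To make the save-time behavior above concrete, here is a minimal sketch in
Python. It is only an illustration, not part of any XML tool: the function
name collapse_on_save, the PROTECTED set, and the regex-based tag splitting
are all assumptions, and the tokenizer ignores comments, CDATA and
self-closing tags. It collapses whitespace runs such as *1*2 and *3*4 to a
single space and leaves *5*6 inside <PRE> untouched.

```python
import re

# Hypothetical list of elements whose content must be kept verbatim.
PROTECTED = {"PRE"}

# Split the stream into tags and the text between them (tags are kept).
TOKEN = re.compile(r"(<[^>]+>)")


def collapse_on_save(xml, protected=PROTECTED):
    """Collapse whitespace runs to one space, except inside protected elements."""
    out = []
    depth = 0  # how many protected elements we are currently inside
    for tok in TOKEN.split(xml):
        if not tok:
            continue
        if tok.startswith("<"):
            name = tok.strip("</>").split()[0]
            if name in protected:
                depth += -1 if tok.startswith("</") else 1
            out.append(tok)
        elif depth > 0:
            out.append(tok)  # *5 / *6: never touched
        else:
            out.append(re.sub(r"\s+", " ", tok))  # *1*2 and *3*4: collapsed
    return "".join(out)


doc = "<UL>\n\n<LI><PARA><B></B> \n<I></I></PARA><PRE>foo\n  bar</PRE></LI></UL>"
saved = collapse_on_save(doc)
```

Note that the sketch has to be told which elements are like <PRE>; nothing
in the stream itself carries that information.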

I propose:

1/ XML parser output: use RE Delenda Est in order to respect data
integrity (and, for example, to let full-text indexers know where
everything is by byte offset).
2/ Change the current meaning of -XML-SPACE: instead of having -XML-SPACE
change the output of the parser, let us define -XML-SPACE as information
for the application (basically XML is an application profile of SGML, so
hey, we go!). We define a single value, -XML-SPACE=PRESERVE, to indicate
to the application that it should not mess *arbitrarily* with the content
of the element for pretty-printing or line-cutting purposes.
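
As a rough sketch of how an application might consult such a hint, assume
the parser hands over every character unchanged and exposes attributes on
a simple in-memory element. The Element class and the may_reformat helper
below are hypothetical names invented for this illustration. The rule that
the hint also protects descendants is my assumption; the proposal itself
does not say so.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Element:
    """Hypothetical minimal element model; not any real parser's API."""
    name: str
    attrs: dict = field(default_factory=dict)
    parent: Optional["Element"] = None


def may_reformat(elem):
    """Return False if elem or any ancestor carries -XML-SPACE=PRESERVE.

    Assumption: the hint is inherited by descendants.
    """
    node = elem
    while node is not None:
        if node.attrs.get("-XML-SPACE") == "PRESERVE":
            return False
        node = node.parent
    return True


ul = Element("UL")
pre = Element("PRE", {"-XML-SPACE": "PRESERVE"}, parent=ul)
bold = Element("B", parent=pre)
```

Here may_reformat(ul) is true, so the application may re-indent or re-wrap
around <UL>, while may_reformat(pre) and may_reformat(bold) are false. The
parser output is the same in both cases; only the application's behavior
at save time differs.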

Styles are not appropriate for carrying this information. Such
information is related to the content and has nothing to do with the
presentation.
Using styles in this context means that when you apply two different
style sheets to the same XML, some whitespace/CR/LF could be inserted
arbitrarily inside a <PRE>-like tag, depending on which style sheet you
happened to be using at save time.
This is wrong. Such information should be in the XML data itself.

Note:

My take is that this proposal is less convenient than the current XML
spec.
It gives full access to byte offsets out of the parser, but it permits
different applications (even if there are some hints to protect regions)
to collapse whitespace inside PCDATA (such as *3*4 in my example) in a
ton of different ways. The only remaining problem, in my eyes, with the
current spec is that whitespace/RS/RE like *1*2 is collapsed and not
deleted. But style sheets could do that, and there is no perfect
solution anyway.
Received on Friday, 13 December 1996 13:30:54 UTC