RE: RS/RE, again (sorry)
The problem I have with RE Delenda is the one pointed out by Prescod:
there is no mechanism provided for having totally meaningless
It is a fact that editors or batch applications that read XML
and then save it need a way to freely insert CR/LF inside the XML stream
in order to cut it into lines, because a line has a limited number of
They also need a place to freely insert whitespace for indentation.
They also need to know at read time that those characters can be removed;
if not, the document will continue to grow indefinitely.
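
To make the growth problem concrete, here is a minimal sketch in Python using the standard xml.etree.ElementTree module. The function names naive_save and careful_save are hypothetical, invented only for this illustration. The second one assumes that whitespace-only text is removable, which is only known to be safe in element content.

```python
import xml.etree.ElementTree as ET

INDENT = "\n  "  # one line break plus two spaces, purely illustrative


def naive_save(xml_text: str) -> str:
    """Add indentation on save without removing what an earlier save added."""
    root = ET.fromstring(xml_text)
    for parent in root.iter():
        children = list(parent)
        if not children:
            continue
        parent.text = (parent.text or "") + INDENT
        for child in children:
            child.tail = (child.tail or "") + INDENT
    return ET.tostring(root, encoding="unicode")


def careful_save(xml_text: str) -> str:
    """Replace whitespace-only runs instead of appending to them.

    Assumption: a whitespace-only run between tags is meaningless, which
    cannot be verified without knowing the element has element content.
    """
    root = ET.fromstring(xml_text)
    for parent in root.iter():
        children = list(parent)
        if not children:
            continue
        if (parent.text or "").strip() == "":
            parent.text = INDENT
        for child in children:
            if (child.tail or "").strip() == "":
                child.tail = INDENT
    return ET.tostring(root, encoding="unicode")


source = "<a><b/><c/></a>"

# Three read/save round trips with the naive routine.
doc = source
naive_sizes = []
for _ in range(3):
    doc = naive_save(doc)
    naive_sizes.append(len(doc))

# Two round trips with the routine that knows the whitespace is removable.
once = careful_save(source)
twice = careful_save(once)
```

With the naive routine each round trip makes the document longer, while the careful routine reaches a fixed point after the first save. The catch is exactly the one discussed here: the careful routine is only correct if it may treat those characters as removable.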
This is the price we have to pay for XML documents to be text, which is
human-readable without tools.
In principle, the only safe place to insert or delete such characters is
element content, but element content cannot be detected in a DTD-less
environment, so we have a problem.
Instead of talking immediately about what a parser should do, I will discuss
application behavior a little and come back later to the parser output.
In the following, each * represents a CR/LF/whitespace; they are numbered *1,
*1 and *2 are meaningless
*3 and *4 are white space in mixed content.
*5 and *6 are white spaces used for formatting purposes.
When an application saves XML, we can see that it *could* be acceptable
to collapse *3 and *4, or to add or delete CR/LF near *3 or *4, in order
to do pretty printing or to cut the XML stream into lines.
It *could* also be acceptable to collapse *1 and *2 or to add or delete
CR/LF near *1 and *2.
But it is certainly unacceptable to let the application modify anything
for *5 and *6, or insert CR/LF inside the <PRE>, in order to cut the
XML stream into lines.
I propose:
1/ XML parser output: use RE Delenda Est in order to respect data
integrity (and, for example, to let full-text indexers know where
everything is by byte offset).
2/ Change the current -XML-SPACE meaning: instead of having -XML-SPACE
the output of the parser, let us define -XML-SPACE as information for
(basically XML is an application profile of SGML, so hey, we go!): we
define a single value
-XML-SPACE=PRESERVE in order to indicate to the application that it
should not mess *arbitrarily* with the content of the element for
pretty printing or line
Styles are not appropriate to carry this information. Such information belongs
to the content and does not have anything to do with the presentation.
Using styles in this context means that when you apply two different style
sheets to the same XML, some whitespace/CR/LF could be inserted arbitrarily
inside a <PRE>-like tag, depending on which style sheet you happened to be
using at save time.
This is wrong. Such information should be in the XML data itself.
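
As a rough illustration of what "information for the application, carried in the XML data itself" could look like at save time, here is a minimal sketch in Python. It assumes the attribute is spelled xml:space="preserve", as in the later XML 1.0 spelling, rather than -XML-SPACE, and that the setting is inherited by descendants. Both are assumptions of this sketch, not part of the proposal as written. The helper names collapse and normalize are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes the predeclared xml: prefix under this namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def collapse(text):
    """Collapse each run of space/tab/CR/LF to a single space."""
    if not text:
        return text
    return re.sub(r"[ \t\r\n]+", " ", text)


def normalize(elem, preserve=False):
    """Collapse whitespace in character data, except inside protected regions.

    Assumption: xml:space="preserve" protects the element's own content and
    everything nested inside it.
    """
    if elem.get(XML_SPACE) == "preserve":
        preserve = True
    if not preserve:
        elem.text = collapse(elem.text)
    for child in elem:
        normalize(child, preserve)
        # A child's tail is character data of *this* element, so it follows
        # this element's setting, not the child's.
        if not preserve:
            child.tail = collapse(child.tail)


doc = (
    "<doc>"
    "<p>some   <b>mixed</b>\n   content</p>"
    '<pre xml:space="preserve">a\n   b   c</pre>'
    "</doc>"
)
root = ET.fromstring(doc)
normalize(root)

p = root.find("p")
b = p.find("b")
pre = root.find("pre")
```

After normalize runs, the whitespace inside the mixed-content paragraph has been collapsed, while the text of the protected element is byte-for-byte what was read. No style sheet is involved in the decision.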
My take is that this proposal is less convenient than the current XML
It gives full access to byte offset out of the parser but permit to
-even if there are some hints to protect regions- to collapse in a ton of
white spaces (such as *3*4 in my example) inside of PCDATA. The only
problem in my eyes with the current spec is that white spaces/RS/RE like
*1*2 are collapsed
and not deleted. But style sheets could do that and there is no perfect