- From: Jean Paoli <jeanpa@microsoft.com>
- Date: Fri, 13 Dec 1996 10:30:48 -0800
- To: "'w3c-sgml-wg@w3.org'" <w3c-sgml-wg@w3.org>
the problem I have with RE Delenda is the one pointed by Prescod: there is no mechanism provided for having totally meaningless whitespaces. It is a fact that editors or batch applications which read XML, manipulate it and then save it need a way to freely insert CR/LF inside the XML stream in order to cut it in lines because a line has a limited number of characters. It need also a place to freely insert whitespaces for indentation purpose. It need also to know at read-time that those characters could be removed : if not, the document will continue to grow indefinitely. This is the tribute we have to pay to enable XML document be text, which means human readable without tools. In principle, the only safe place to insert or delete such characters is in element-content but element-content cannot be detected in a DTD-less environment so we have a problem. Instead of talking immediately on what a parser should do, I will discus a little bit about application behavior and come back later to the parser output. In the following, * represent a CR/LF/whitespace and are numbered *1, *2 : <UL>*1*2<LI><PARA><B></B>*3*4<I></I></PARA><PRE>foo*5*6bar</PRE></LI></U L> *1 and *2 are meaningless *3 and *4 are white space in mixed content. *5 and *6 are white spaces used for formatting purpose. When an application saves XML, we can realize that it *could* be acceptable to collapse *3 and *4 or to add or delete CR/LF near *3 or *4 in order to do pretty printing or to cut the XML stream in lines. It *could* also be acceptable to collapse *1 and *2 or to add or delete CR/LF near *1 and *2. But it is certainly inacceptable to let the application modify anything for *5 and *6 or insert CR/LF inside of the <PRE> in order to cut the XML stream in lines. I propose : 1/ XML Parser output: use RE Delenda Est in order to respect the data integrity (and for example to permit to full-text indexers to know where everything is by byte offset) 2/ Change the current -XML-SPACE meaning: instead of having -XML-SPACE changing the output of the parser, let us define -XML-SPACE as information for the application (basically XML is an application profile of SGML, so hey, we go!): we define a single value -XML-SPACE=PRESERVE in order to indicate to the application that it should not mess *arbitrarely* with the content of the element for prettyprinting or line cut purpose. Styles are not appropriate to carry this information. Such information is related to the content and does not have anything to do with the presentation. Using styles in this context means that when you apply two different styles sheets to the same XML, some whitespaces/CR/LF could be inserted arbitrarely inside a <PRE>-like tag, depending of the style-sheet it happened you were using at save time. This is wrong. Such information should be in the XML data itself. Note: My take is that this proposal is less convenient than the current XML spec. It gives full access to byte offset out of the parser but permit to different applications -even if there is some hints to protect regions- to collapse in a ton of different ways white spaces (such as *3*4 in my example) inside of PCDATA.The only remaining problem in my eyes to the current spec is that white spaces/RS/RE like *1*2 are collapsed and not deleted. But style sheets could do that and there is no perfect solution anyway.
Received on Friday, 13 December 1996 13:30:54 UTC