- From: Arkin <arkin@trendline.co.il>
- Date: Wed, 24 Feb 1999 11:39:38 -0500
- To: www-dom@w3.org
A question that still buffles me, so comments on the following are more than welcome: 1. An HTML processor is a very specific case of an XML processor and must know something about HTML to process it right. This information falls outside the XML DTD. For example, a Web browser may assume that any text inside a table belongs in some row, thus, <TABLE> <TR> Explicit row </TR> Implicit row <TABLE> is equivalent to <TABLE><TR>Explicit row</TR><TR>Implicit row</TR></TABLE> 2. PRE, STYLE and SCRIPT are specific cases in HTML, unlike other elements. They are whitespace preserving and do not process elements in their content. Per the specification, PRE, STYLE and SCRIPT ignore the first and last whitespace in the contents, which is considered to be human readable whitespace, but preserve all other whitespaces inside. 3. All other HTML elements ignore the first and last whitespace and consolidate mutliple whitespaces into a single space (). Thus, if you need multiple whitespaces, you have to provide them as character reference (&#A0;). Newlines are always translated into a space in HTML outside of PRE/STYLE/SCRIPT, you have to use <BR> to force a new line. 4. With a validating XML processor, XML elements should preserve whitespaces only if the 'xml:space' attribute has a value of 'preserve', otherwise they may lose whitespaces by ignoring the trailing and leading whitespaces and consolidating multiple whitespaces to a single space (). Again, whitespace is assumed to be for human readbility. Thus, <tree> <node> Some text </node> </tree> is the hand written equivalent of, <tree><node>Some text</node></tree> and not equivalent to: <tree>�a;<node>�a;Some text.... 5. With a validating XML processor, XML elements that have non-mixed content type (only elements, no text) should ignore all whitespaces and flag an error for any other text that appears in between elements. 6. Without a validating XML processor, XML elements should attempt to ignore as much whitespace as possible, regarding it as human readable whitespace. The last rule is important if you consider that: a) hand written XML documents will contain much redundant whitespace, b) in 9 out of 10 cases this whitespace has no meaning, c) most DOM applications will not handle whitespace correctly, d) whitespace can always be introduced explicitly with character references. Arkin > The question is: what happens to white spaces like newlines? > According to a DTD I can decide to ignore them. > But if I consider an element like PRE in HTML then newlines and spaces > are important, and should be accessible via a Text node. > > In fact xml4j creates with its (validating) DOMParser a structure like > > Element TABLE > Text "\n" > Element TBODY > Text "\n" > Element TR > ... > Text "\n" > Element TR > ... > Text "\n" > Text "\n" > > My first opinion was that those Text nodes "\n" are unnecessary. > But after thinking it over I am a little bit in doubt ... > > Any comments? > > Regards, > Oliver > > /-------------------------------------------------------------------\ > | ob|do Dipl.Inf. Oliver Becker | > | --+-- E-Mail: obecker@informatik.hu-berlin.de | > | op|qo WWW: http://www.informatik.hu-berlin.de/~obecker | > \-------------------------------------------------------------------/
Received on Wednesday, 24 February 1999 11:45:36 UTC