Re: "Empty" Text Nodes

A question that still buffles me, so comments on the following are more
than welcome:

1. An HTML processor is a very specific case of an XML processor and
must know something about HTML to process it right. This information
falls outside the XML DTD. For example, a Web browser may assume that
any text inside a table belongs in some row, thus,

<TABLE>
  <TR>
    Explicit row
  </TR>
  Implicit row
<TABLE>

is equivalent to

<TABLE><TR>Explicit row</TR><TR>Implicit row</TR></TABLE>

2. PRE, STYLE and SCRIPT are specific cases in HTML, unlike other
elements. They are whitespace preserving and do not process elements in
their content. Per the specification, PRE, STYLE and SCRIPT ignore the
first and last whitespace in the contents, which is considered to be
human readable whitespace, but preserve all other whitespaces inside.

3. All other HTML elements ignore the first and last whitespace and
consolidate mutliple whitespaces into a single space (&#20;). Thus, if
you need multiple whitespaces, you have to provide them as character
reference &nbsp; (&#A0;). Newlines are always translated into a space in
HTML outside of PRE/STYLE/SCRIPT, you have to use <BR> to force a new
line.

4. With a validating XML processor, XML elements should preserve
whitespaces only if the 'xml:space' attribute has a value of 'preserve',
otherwise they may lose whitespaces by ignoring the trailing and leading
whitespaces and consolidating multiple whitespaces to a single space
(&#20;). Again, whitespace is assumed to be for human readbility.

Thus,

<tree>
  <node>
    Some text
  </node>
</tree>

is the hand written equivalent of,

<tree><node>Some text</node></tree>

and not equivalent to:

<tree>&#0a;&#20;&#20;<node>&#0a;&#20;&#20;&#20;&#20;Some text....

5. With a validating XML processor, XML elements that have non-mixed
content type (only elements, no text) should ignore all whitespaces and
flag an error for any other text that appears in between elements.

6. Without a validating XML processor, XML elements should attempt to
ignore as much whitespace as possible, regarding it as human readable
whitespace.

The last rule is important if you consider that: a) hand written XML
documents will contain much redundant whitespace, b) in 9 out of 10
cases this whitespace has no meaning, c) most DOM applications will not
handle whitespace correctly, d) whitespace can always be introduced
explicitly with character references.


Arkin

> The question is: what happens to white spaces like newlines?
> According to a DTD I can decide to ignore them.
> But if I consider an element like PRE in HTML then newlines and spaces
> are important, and should be accessible via a Text node.
> 
> In fact xml4j creates with its (validating) DOMParser a structure like
> 
> Element TABLE
>    Text "\n"
>    Element TBODY
>       Text "\n"
>       Element TR
>          ...
>       Text "\n"
>       Element TR
>          ...
>       Text "\n"
>    Text "\n"
> 
> My first opinion was that those Text nodes "\n" are unnecessary.
> But after thinking it over I am a little bit in doubt ...
> 
> Any comments?
> 
> Regards,
> Oliver
> 
> /-------------------------------------------------------------------\
> |  ob|do        Dipl.Inf. Oliver Becker                             |
> |  --+--        E-Mail: obecker@informatik.hu-berlin.de             |
> |  op|qo        WWW:    http://www.informatik.hu-berlin.de/~obecker |
> \-------------------------------------------------------------------/

Received on Wednesday, 24 February 1999 11:45:36 UTC