HTML output method and whitespace

I have a few issues with the XSLT PR's HTML output method section.

http://www.w3.org/TR/xslt#section-HTML-Output-Method says:

"The version attribute indicates the version of the HTML. The default value
is 4.0, which specifies that the result should be output as HTML conforming
to the HTML 4.0 Recommendation."

and 

"If the indent attribute has the value yes, then the html output method may
add or remove whitespace as it outputs the result tree, so long as it does
not change how an HTML user agent would render the output."

This is somewhat vague for a normative reference.


1. How is one supposed to know how an HTML user agent *would* render the
output? The implication is that if different user agents treat the same
whitespace differently, then an XSLT processor is non-conformant merely for
writing the whitespace the way it does.

The conformance of an XSLT processor that implements the html output method
should not depend on both the conformance of the HTML to the appropriate
specification *and* the largely unpredictable behavior of existing user
agents.

So, changing "would" to "should" would be helpful. But this change alone is
inadequate because the HTML 4.0 Recommendation, for example, contains both
normative requirements and non-normative recommendations for good practice
among user agents.

2. How *should* an HTML user agent render the output? Well, the HTML 4.0
Recommendation indicates that whitespace may be rendered differently
depending on the output script. More on this below.

3. What is whitespace? There are differences between what is considered
whitespace in XML/XSLT and what is considered whitespace in HTML.

http://www.w3.org/TR/xslt#strip says this about "whitespace":

"As in XML, a whitespace character is #x20, #x9, #xD or #xA."

This is a subset of what HTML considers whitespace. The HTML spec says at
http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1 and
http://www.w3.org/TR/REC-html40/struct/text.html#line-breaks that "white
space" is the individual characters #x20, #x9, #xC and #x200B, and that a
line break is #xD, #xA, or the #xD #xA pair.
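
To make the mismatch concrete, here is a minimal sketch in Python that
compares the two character sets. The set names are mine, and the code points
are only those listed in the XSLT and HTML 4.0 sections cited above; this is
an illustration, not a complete treatment of Unicode white space.

```python
# Code points as listed in the two specifications cited above.
XML_WHITESPACE = {0x20, 0x09, 0x0D, 0x0A}      # XSLT/XML: space, tab, CR, LF
HTML_WHITE_SPACE = {0x20, 0x09, 0x0C, 0x200B}  # HTML 4.0 section 9.1
HTML_LINE_BREAKS = {0x0D, 0x0A}                # HTML 4.0: CR, LF (or the CR LF pair)

# Everything HTML 4.0 treats as white space or as part of a line break.
html_all = HTML_WHITE_SPACE | HTML_LINE_BREAKS

# Characters that are white space to HTML but not to XML/XSLT.
html_only = sorted(html_all - XML_WHITESPACE)
print([f"#x{cp:X}" for cp in html_only])
```

The two characters it prints are the form feed and the zero-width space,
which XSLT's definition of whitespace does not cover.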

I realize that an XSLT processor is not likely to emit a zero-width word
separator or an ASCII form-feed character, but it is worth noting this
incongruity in the definition of whitespace because it is so significant in
HTML. The HTML 4.0 Recommendation goes on to say:

"For all HTML elements except PRE, sequences of white space separate "words"
(we use the term "word" here to mean "sequences of non-white space
characters"). When formatting text, user agents should identify these words
and lay them out according to the conventions of the particular written
language (script) and target medium. 

This layout may involve putting space between words (called inter-word
space), but conventions for inter-word space vary from script to script. For
example, in Latin scripts, inter-word space is typically rendered as an
ASCII space (&#x0020;), while in Thai it is a zero-width word separator
(&#x200B;). In Japanese and Chinese, inter-word space is not typically
rendered at all. 

Note that a sequence of white spaces between words in the source document
may result in an entirely different rendered inter-word spacing (except in
the case of the PRE element). In particular, user agents should collapse
input white space sequences when producing output inter-word space."
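
As a rough illustration of the collapsing described in that passage, here is
a minimal Python sketch. It assumes Latin-script rendering outside PRE, where
each run of HTML white space becomes a single ASCII space; the function name
is hypothetical, and real user agents are free to do something else, which is
rather the point.

```python
import re

# One or more HTML 4.0 white space or line-break characters:
# space, tab, form feed, zero-width space, CR, LF.
HTML_WS_RUN = re.compile("[\u0020\u0009\u000C\u200B\u000D\u000A]+")

def collapse_html_white_space(text: str) -> str:
    """Collapse each run of HTML white space into one ASCII space.

    This models only the Latin-script convention; the HTML 4.0 text says
    other scripts may render inter-word space differently or not at all.
    """
    return HTML_WS_RUN.sub(" ", text)

print(collapse_html_white_space("one \n\t two\u200bthree"))
```

Note that the zero-width space is collapsed here too, even though XSLT would
not count it as whitespace.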


So, it is impossible to test whether and how the inclusion of whitespace
"should" affect the rendering of the HTML that an XSLT processor is putting
out.


I suggest changing http://www.w3.org/TR/xslt#output:

"indent specifies whether the XSLT processor may add additional whitespace
when outputting the result tree; the value must be yes or no"

to

"indent specifies whether the XSLT processor may add additional HTML white
space when outputting the result tree; the value must be yes or no"

However I can see why this would be confusing if HTML white space is not
defined, or if different HTML specifications define white space differently.


I suggest changing http://www.w3.org/TR/xslt#section-HTML-Output-Method:

"If the indent attribute has the value yes, then the html output method may
add or remove whitespace as it outputs the result tree, so long as it does
not change how an HTML user agent would render the output. The default
value is yes."

to

"If the indent attribute has the value yes, then the html output method may
add or remove whitespace as it outputs the result tree, so long as it does
not significantly change how an HTML user agent that conforms to the
appropriate specification and follows the specification's non-normative
recommendations for good practice should render the output. The default
value is yes."

Perhaps this isn't the best rephrasing, but it's a place to start.


4. The HTML 4.0 Recommendation is referenced several times, perhaps even
normatively, but it is not listed in the References sections at all.


What brought all this to my attention was XT's behavior of adding newlines
after inline images, objects and applets. According to HTML 4.0, these
inline elements are supposed to be rendered such that their bottom edge is
even with the baseline of "adjacent text" (which might be before or after
the element, your guess is as good as mine).

If viewing with a Western script, the newline is collapsed to a printing
space character, which is rendered in the appropriate font at that location.
The baseline is above the bottom edge of the space allocated for that "row"
of text (a notion further complicated by vertical scripts, I'm sure) since
some glyphs need room for descenders. The next row of text is aligned
vertically with respect to the preceding row. Vertical alignment of an
inline image, object or applet will follow.

Consequently, <img><br><img> and
<img>
<br>
<img>
are two different things and can result in different renderings, as
demonstrated in some examples at
http://www.skew.org/xml/misc_demos/whitespace/
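
For anyone who wants to see the difference without a browser, here is a
minimal sketch using Python's standard html.parser module. It only shows that
the second form carries extra text between the elements; it says nothing
about how any given user agent will render that text. The class and function
names are mine.

```python
from html.parser import HTMLParser

class EventRecorder(HTMLParser):
    """Record start tags and character data in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("text", data))

def events(markup):
    recorder = EventRecorder()
    recorder.feed(markup)
    recorder.close()  # flush any buffered trailing text
    return recorder.events

compact = "<img><br><img>"
indented = "<img>\n<br>\n<img>"

print(events(compact))
print(events(indented))
```

The indented form yields newline text between the tags, and that text is what
a user agent may collapse to a rendered space next to the inline image.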

Lovely, eh?

-Mike

Received on Monday, 8 November 1999 22:26:12 UTC