HTML output method: indenting limitations

XSLT 2.0 section 20.3 says:

  "If the indent attribute has the value yes, then the html output method
  may add or remove whitespace as it outputs the result tree, so long as it
  does not change how an HTML user agent would render the output."

As I mentioned in a previous message (XSLT 1.1 era) [1], this imposes an
undue burden on the XSLT processor to anticipate how an HTML user agent
"should" (I disagree with the use of the word "would") render the output.
To follow the guideline to the letter, the XSLT processor must know which
script (writing system) is in use where the whitespace would be added -- an
impossible feat, given the overlap of character repertoires among scripts --
so that it can guess at how the resulting inter-word space would be
rendered. See http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1 for
details.

Along the same lines, I also mentioned:

  There is some ambiguity [in the HTML spec] about how to determine where
  an inter-word space needs to be rendered (for example, if it appears on
  one side of an inline image), so HTML user agents are not entirely
  consistent in this regard.
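
To illustrate why added whitespace is not always invisible, here is a
minimal Python sketch. The function rough_rendered_text is a hypothetical
helper, not part of any XSLT processor or HTML user agent; it assumes a
deliberately crude rendering model (drop tags, collapse runs of ASCII white
space to one space) and ignores script, CSS, and everything else discussed
in this message.

```python
import re

def rough_rendered_text(html):
    """Crude model of rendering: drop tags, collapse white space runs."""
    text = re.sub(r"<[^>]*>", "", html)           # remove markup
    text = re.sub(r"[ \t\n\r\f]+", " ", text)     # collapse white space runs
    return text.strip()

# The same two inline elements, serialized without and with indentation.
compact = "<p><b>inter</b><i>word</i></p>"
indented = "<p>\n  <b>inter</b>\n  <i>word</i>\n</p>"

print(repr(rough_rendered_text(compact)))
print(repr(rough_rendered_text(indented)))
```

Under this model the indented form gains an inter-word space between the
two inline elements, which is exactly the kind of change an indenting
serializer has to predict and avoid.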

Further, HTML's notion of what constitutes "white space" differs from XML's
(note the, ahem, space between the words there). HTML 4.01 section 9.1
treats these as "white space" characters:
  #x9 (tab)
  #xA (line feed)
  #xC (form feed; disallowed in XML!)
  #xD (carriage return)
  #x20 (space)
  #x200B (zero-width space)
...so you may want to clarify whether the "whitespace" that can be added to
HTML output includes the FF and ZWS characters.
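
The difference between the two definitions can be sketched as a set
comparison in Python. The names HTML_WHITE_SPACE, XML_WHITE_SPACE, and
html_only_white_space are made up for this illustration; the HTML set is
the list given above and the XML set is the S production of XML 1.0.

```python
# HTML 4.01 section 9.1 white space characters, as listed in this message.
HTML_WHITE_SPACE = {"\u0009", "\u000A", "\u000C", "\u000D", "\u0020", "\u200B"}

# XML 1.0 white space (production S): space, tab, CR, LF.
XML_WHITE_SPACE = {"\u0009", "\u000A", "\u000D", "\u0020"}

def html_only_white_space():
    """Return code points that are white space in HTML but not in XML."""
    extra = HTML_WHITE_SPACE - XML_WHITE_SPACE
    return {f"#x{ord(ch):X}" for ch in extra}

print(sorted(html_only_white_space()))
```

The result is the form feed and the zero-width space, the two characters
whose status the spec wording leaves open.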

In addition, HTML output rendering is also affected by CSS, scripting, and
non-HTML elements and processing instructions. These factors should be
mentioned.

Finally, the HTML spec makes various non-normative recommendations of
"good practice" for user agents to follow, as mentioned in the third
paragraph of HTML 4.01 section 4. Some of these recommendations affect
rendering, so you may want to clarify whether they should be taken into
account in the determination of how an HTML user agent "should" render
the output.

In conclusion, it would leave less room for interpretation, and less
variance in the output produced by different XSLT processors, if the
guidelines for indenting HTML were changed to something like the following:

  If the indent attribute has the value yes, then the html output method
  may add or remove HTML "white space" characters as defined in HTML 4.01
  section 9.1 as it outputs the result tree, so long as it does not
  significantly change how an HTML user agent should render the output.

  The html output method should assume that the user agent will follow the
  HTML 4.01 recommendations for good practice, and will follow the SGML
  line break rules mentioned in HTML 4.01 section B.3.1. The method should
  disregard unpredictable rendering factors such as CSS, client-side
  scripting, and the effects of non-HTML elements and processing
  instructions.

  It is not recommended that the form feed (#xC) character, which is
  disallowed in XML, be used for indenting, even though it is allowed in
  HTML.

  The default value of the indent attribute is yes.

Thanks.

[1] http://lists.w3.org/Archives/Public/xsl-editors/2000JulSep/0041.html

Mike

-- 
  Mike J. Brown   |  http://skew.org/~mike/resume/
  Denver, CO, USA |  http://skew.org/xml/

Received on Thursday, 13 February 2003 02:22:22 UTC