repost/summary of some outstanding output issues for XSLT 1.1 from Mike Brown on 2000-08-30 (xsl-editors@w3.org from July to September 2000)

From: Mike Brown <mike@skew.org>
Date: Tue, 29 Aug 2000 22:54:21 -0600 (MDT)
To: xsl-editors@w3.org
CC: xsl-list@mulberrytech.com
Message-Id: <200008300454.WAA10239@skew.org>

I have reported to xsl-editors a few output-related issues that I would
like to see receive some attention in XSLT 1.1. When I mentioned these
before, no discussion followed. I will summarize and restate them here.

I believe these issues all fall into the realm of more fully specifying
non-erroneous behaviors. Addressing these issues would serve to
standardize features already implemented in several XSLT 1.0 processors,
and thus they are within the scope of the stated general requirements.
I understand that these are lower priority than the other issues in the
requirements doc.

On to the issues...

1. What HTML calls "white space" and what the XML and XSLT recommendations
call "whitespace" are two different things. XML has 4 whitespace
characters (space, tab, carriage return, and line feed); HTML has 6
characters (those 4 plus form feed and zero-width space) and 1 pair of
characters (CR LF) that are considered white space.
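
To make the difference concrete, here is a minimal sketch in Python. The
character lists are my reading of production S in XML 1.0 and of section
9.1 of HTML 4.01; the variable names are only illustrative.

```python
# XML 1.0 production S: space, tab, carriage return, line feed.
XML_WHITESPACE = {"\u0020", "\u0009", "\u000D", "\u000A"}

# HTML 4.01 section 9.1 (as read here): space, tab, form feed,
# zero-width space, plus the line break characters CR and LF.
HTML_WHITE_SPACE = {"\u0020", "\u0009", "\u000C", "\u200B", "\u000D", "\u000A"}

# The one pair of characters that HTML treats as a single line break.
HTML_LINE_BREAK_PAIR = "\u000D\u000A"

# Characters that are white space to HTML but not whitespace to XML.
html_only = HTML_WHITE_SPACE - XML_WHITESPACE
print(sorted(f"U+{ord(c):04X}" for c in html_only))  # ['U+000C', 'U+200B']
```

Note that form feed (U+000C) is not even a legal character in an XML 1.0
document, so it can only reach HTML output through the serializer.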

Care is needed when inserting white space characters for indentation,
because in most HTML elements, sequences of consecutive white space
characters are collapsed into a single inter-word space, which is rendered
according to the appropriate human language script for the adjacent spans
of text. There is some ambiguity about how to determine where an
inter-word space needs to be rendered (for example, if it appears on one
side of an inline image), so HTML user agents are not entirely consistent
in this regard.
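
A rough sketch of the hazard, in Python. The rendered_text() helper is
hypothetical and only approximates a user agent: it drops tags and
collapses runs of HTML white space into one space. Real user agents apply
more rules than this.

```python
import re

# HTML 4.01 white space as assumed here: space, tab, form feed,
# zero-width space, carriage return, line feed.
HTML_WS = re.compile("[ \t\f\u200b\r\n]+")
TAG = re.compile(r"<[^>]*>")

def rendered_text(markup: str) -> str:
    """Crude stand-in for rendering: drop tags, collapse white space runs."""
    return HTML_WS.sub(" ", TAG.sub("", markup)).strip()

compact = "<p><b>foo</b><i>bar</i></p>"
indented = "<p>\n  <b>foo</b>\n  <i>bar</i>\n</p>"

print(rendered_text(compact))   # foobar
print(rendered_text(indented))  # foo bar
```

Indenting between the two inline elements introduces an inter-word space
that was not in the result tree, which is exactly the kind of change the
indent guideline is supposed to rule out.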

Rewording the following guideline for indenting HTML would leave less room
for interpretation, and less variance in the output produced by different
XSLT processors. I suggest changing this passage in
http://www.w3.org/TR/xslt#section-HTML-Output-Method:

"If the indent attribute has the value yes, then the html output method
may add or remove whitespace as it outputs the result tree, so long as it
does not change how an HTML user agent would render the output. The
default value is yes."

to

"If the indent attribute has the value yes, then the html output method
may add or remove HTML white space as it outputs the result tree, as long
as it does not significantly change how an HTML user agent following the
HTML specification's informative recommendations for good practice should
render the output. The default value is yes."


2. The XSLT recommendation says that the "html" output method should not
perform escaping for the content of <script> and <style> elements, but it
makes no recommendation for "script data"-type attribute values --
attributes whose content model appears as %Script; in the HTML DTDs. There
are too many of these to enumerate here, but they should be covered by the
recommendation. Again, this affects portability of stylesheets, because
processors are free to escape attribute values that contain script.


3. XSLT document authors often want to construct URI strings with
XPath/XSLT functions and put them in certain attributes. This need is not
limited to HTML; there are also various applications where URIs must be
used as the values of reserved attributes in XML-based languages.

RFC 2396 mandates that URI strings be escaped per certain conventions.
Using pure XSLT there is no way to effectively achieve proper escaping
when constructing the URI strings.* Consequently, a demand exists for XSLT
processors to make some effort to perform URI escaping on the values of
certain attributes, at least when the output method is "html".

Implementors and users of XSLT processors have been debating how to
achieve this. The resulting implementations differ, which in turn makes
stylesheets less portable: output may be useless if a processor mangles
all the href and src attributes.

If a pure XSLT solution for performing URI escaping on a given string
(intended to be used while constructing URI strings, not after the fact)
cannot be achieved in this next revision, then an informative statement
should be added to the Output section of the XSLT 1.1 spec, saying
something like this:

   Escaping of URI strings

   URI strings are by definition already escaped; if a string
   contains characters that are not allowed to exist in a URI,
   then it is not a URI. It is the responsibility of the document
   author to perform the appropriate escaping when constructing
   the string. Since XSLT does not offer a convenient mechanism
   for performing URI escaping, extension functions are necessary
   to achieve this goal.

   As a workaround, XSLT processors may, but are not required to,
   attempt to perform some degree of URI escaping, as specified
   in [RFC 2396], when outputting the values of certain attributes
   that are required to be URIs. For example, when the output
   method is "html", attributes whose content model appears as
   %URI; in the appropriate HTML DTD may be escaped upon output.

   Because such an attribute value may already be a properly
   escaped URI, double escaping may occur, possibly changing the
   meaning of the URI. Therefore, if an XSLT processor can perform
   automatic escaping, it should also provide a mechanism for
   disabling this behavior.

Perhaps this suggestion, too, is insufficient?
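
The double escaping concern in the proposed text can be sketched in Python.
Here urllib.parse.quote stands in for a processor's automatic escaping;
that is an assumption for illustration only, and quote follows RFC 3986
rather than RFC 2396. The file name is made up.

```python
from urllib.parse import quote, unquote

raw = "my report.html"   # not yet a URI: contains a space
once = quote(raw)        # what the stylesheet author should produce
twice = quote(once)      # what an auto-escaping processor then emits

print(once)              # my%20report.html
print(twice)             # my%2520report.html

# Unescaping the processor's output once no longer yields the original
# name, so the URI now identifies a different resource.
print(unquote(twice) == raw)  # False
```

This is why a mechanism for disabling automatic escaping matters: a
stylesheet that already escapes correctly is broken by a processor that
escapes again.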


Original posts where these issues are explained further:

http://lists.w3.org/Archives/Public/xsl-editors/1999OctDec/0033.html
http://lists.w3.org/Archives/Public/xsl-editors/2000AprJun/0069.html

However, please consider the suggestions as they are worded in this
message to be more current than those in the old messages.

Thanks and respect,

   - Mike
____________________________________________________________________
Mike J. Brown, software engineer at         My XML/XSL resources:
webb.net in Denver, Colorado, USA           http://www.skew.org/xml/


* Actually it is not impossible to do URI escaping in pure XSLT, but after
experimenting a bit I came to the conclusion that it would require
building a lookup string consisting of all 1.1 million or so characters that can
be in an XML document. Their relative positions in the string could then
be used to deduce their Unicode scalar values, from which a UTF-8 octet
sequence can be derived and converted to %xx escapes. A unicode-scalar()
function for converting the first character of a given string to a number
that is its Unicode scalar value would be most helpful, as would a hex()
function for converting a number to a hexadecimal string equivalent. Then
it would be a matter of pretty simple arithmetic to convert the scalar
values to the appropriate %xx sequences...
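
The arithmetic might look like the following Python sketch. The names
unicode_scalar() and hex2() stand in for the hypothetical unicode-scalar()
and hex() functions wished for above; UTF-8 as the octet encoding is the
assumption made in this footnote, and the unreserved set is taken from
RFC 2396 section 2.3.

```python
import string

# RFC 2396 "unreserved" characters: alphanumerics plus the "mark" set.
UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")

def unicode_scalar(s: str) -> int:
    """Hypothetical unicode-scalar(): scalar value of the first character."""
    return ord(s[0])

def hex2(n: int) -> str:
    """Hypothetical hex(): one octet as two uppercase hex digits."""
    return format(n, "02X")

def utf8_octets(cp: int) -> bytes:
    """Derive the UTF-8 octet sequence from a scalar value by arithmetic."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def escape_component(text: str) -> str:
    """Escape every character outside the unreserved set as %XX octets."""
    out = []
    for ch in text:
        if ch in UNRESERVED:
            out.append(ch)
        else:
            out.extend("%" + hex2(o) for o in utf8_octets(unicode_scalar(ch)))
    return "".join(out)

print(escape_component("café au lait"))  # caf%C3%A9%20au%20lait
```

This escapes reserved characters such as "/" and "?" too, so it is meant
for individual pieces of a URI while it is being constructed, not for a
complete URI after the fact.
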
Received on Wednesday, 30 August 2000 00:54:34 UTC