Attribute normalization

Hello,

     Consider applying the following stylesheet to any input XML document.
Note the end-of-line that is part of the content of the xsl:text element, so
that the value of the attribute is a TAB character (U+0009) followed by a
line feed (U+000A).

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:template match="/">
    <out>
      <xsl:attribute name="attr">
        <xsl:text>&#9;
</xsl:text>
      </xsl:attribute>
    </out>
  </xsl:template>
</xsl:stylesheet>

     Which of the following are ways in which a processor should serialize 
the "attr" attribute?  The form "[U+xxxx]" indicates that the actual 
Unicode character appears at that point in the serialized result, as 
opposed to a character reference.

(i)   attr="[U+0009]&#10;"
(ii)  attr="&#9;&#10;"

     According to Section 7.1.3 of XSLT 1.0 [1], "Note:  When an
xsl:attribute contains a text node with a newline, then the XML output must
contain a character reference. . . .  This is because XML 1.0 requires
newline characters in attribute values to be normalized into spaces but
requires character references to newline characters not to be normalized."

     Is this note intended to be an exhaustive list of the situations in
which character references must be used because of the Attribute-Value
Normalization rules of XML 1.0 [2]?  Some read it as exhaustive: since it
mentions only newlines, they believe that either serialized form for the
attribute is admissible.  Others read it as simply an example, and believe
that only the form marked (ii) should be used to serialize the result.  On
that reading, form (i) is wrong because a parser would normalize the literal
TAB to a space, so reparsing would not yield a document with the same
Infoset.
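
     To make the Infoset concern concrete, here is a minimal sketch that
reparses the two serialized forms.  It assumes Python's standard
xml.etree.ElementTree (which uses expat) as a stand-in for any conforming
XML 1.0 parser; the helper name parsed_attr is made up for this
illustration and is not part of any XSLT processor.

```python
import xml.etree.ElementTree as ET

# Form (i): a literal TAB character followed by a character reference to LF.
form_i = '<out attr="\t&#10;"/>'
# Form (ii): character references for both TAB and LF.
form_ii = '<out attr="&#9;&#10;"/>'


def parsed_attr(doc):
    """Hypothetical helper: parse doc and return the value of 'attr'."""
    return ET.fromstring(doc).get("attr")


# Attribute-value normalization replaces the literal TAB with a space,
# but leaves characters written as character references untouched.
value_i = parsed_attr(form_i)
value_ii = parsed_attr(form_ii)

print(repr(value_i))
print(repr(value_ii))
```

     With a parser that applies the normalization rules in [2], only form
(ii) should give back the TAB followed by the line feed that the stylesheet
put into the result tree.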

     Would the answer be different for a stylesheet like the following, 
which has no xsl:attribute element?

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:template match="/">
    <out attr="&#9;&#10;"/>
  </xsl:template>
</xsl:stylesheet>


     Section 4 of the XSLT 2.0 and XQuery 1.0 Serialization draft [3], of
course, is explicit, stating in part that "certain whitespace characters
should be output as character references, to ensure that they survive the
round trip through serialization and parsing.  Specifically, CR characters
in text nodes should be written as &#xD; or an equivalent; while CR, NL,
and TAB characters in attribute nodes should be output respectively as
&#xD;, &#xA;, and &#x9;, or their equivalents."  But it's not clear whether
that is a change in behaviour from XSLT 1.0 or a clarification of something
that XSLT 1.0 did not describe clearly.
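
     As a rough sketch of what the rule quoted from [3] asks of a
serializer, the following escapes CR, NL and TAB in an attribute value and
then checks the round trip.  The function escape_attr is a hypothetical
helper written for this message, not the API of Xalan or of any other
processor, and it covers only the characters named in the quotation plus the
usual markup characters.

```python
import xml.etree.ElementTree as ET


def escape_attr(value):
    """Hypothetical sketch: escape an attribute value for XML output."""
    # Escape markup characters first, so that the '&' characters introduced
    # by the character references below are not escaped a second time.
    out = value.replace("&", "&amp;").replace("<", "&lt;").replace('"', "&quot;")
    # Write CR, NL and TAB as character references, as in the quoted text.
    for ch, ref in (("\r", "&#xD;"), ("\n", "&#xA;"), ("\t", "&#x9;")):
        out = out.replace(ch, ref)
    return out


# The attribute value from the stylesheets above: TAB followed by LF.
serialized = '<out attr="%s"/>' % escape_attr("\t\n")
round_tripped = ET.fromstring(serialized).get("attr")

print(serialized)
print(repr(round_tripped))
```

     Escaping all three characters unconditionally corresponds to form (ii)
for both of the stylesheets above.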

Thanks,

Henry

[1] http://www.w3.org/TR/xslt#creating-attributes
[2] http://www.w3.org/TR/2000/REC-xml-20001006#AVNormalize
[3] http://www.w3.org/TR/xslt-xquery-serialization/#xml-output
------------------------------------------------------------------
Henry Zongaro      Xalan development
IBM SWS Toronto Lab   T/L 969-6044;  Phone +1 905 413-6044
mailto:zongaro@ca.ibm.com

Received on Monday, 8 September 2003 16:21:50 UTC