[Bug 2459] [Serialization] Phases of Character Expansion

http://www.w3.org/Bugs/Public/show_bug.cgi?id=2459

           Summary: [Serialization] Phases of Character Expansion
           Product: XPath / XQuery / XSLT
           Version: Candidate Recommendation
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Serialization
        AssignedTo: scott_boag@us.ibm.com
        ReportedBy: mike@saxonica.com
         QAContact: public-qt-comments@w3.org


Serialization section 4, list item 2, contains the rule:

The substitution processes that apply are listed below, in priority order: a
character that is handled by one process in this list will be unaffected by
processes appearing later in the list, except that a character affected....

My question is, what does "handled" mean here? Does it mean the same as
"affected"?

Consider this example (a Saxon bug report today). The result tree contains

<a href="mailto:sales@backbase.com">sales@backbase.com</a>

and there is a character map that translates "@" to "(at)".

Saxon is doing the character mapping for the text node but not for the
attribute node, because the characters in the attribute node are all
"handled" by the URI escaping phase, even though they are unchanged by that
phase. Is this a correct interpretation? It isn't an interpretation that
makes much sense for this use case, and I can't really think of a use case
where it does make sense. So perhaps "handled" should be "affected", or even
"altered".

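To make the two readings concrete, here is a minimal Python sketch. It is not
Saxon's code and not the spec's algorithm. The function names and the
"reading" parameter are invented for illustration, and the URI-escaping rule
is simplified to "percent-encode non-ASCII characters as UTF-8", which is an
assumption. Under the "handled" reading every character of a URI attribute
is claimed by the escaping phase, so the character map never sees the "@".
Under the "affected" reading only characters that the escaping phase
actually changes are claimed.

```python
CHAR_MAP = {"@": "(at)"}  # the character map from the example above


def uri_escape_char(ch):
    """Simplified URI escaping (assumption): ASCII passes through unchanged,
    anything else is percent-encoded as UTF-8 bytes."""
    if ord(ch) < 128:
        return ch
    return "".join("%{:02X}".format(b) for b in ch.encode("utf-8"))


def serialize_attribute(value, is_uri_attribute, reading):
    """Serialize an attribute value under one reading of the priority rule.

    reading == "handled":  URI escaping claims every character of a URI
                           attribute, changed or not.
    reading == "affected": URI escaping claims only the characters it changes.
    """
    out = []
    for ch in value:
        escaped = uri_escape_char(ch) if is_uri_attribute else ch
        if reading == "handled":
            claimed = is_uri_attribute
        else:
            claimed = escaped != ch
        # A claimed character skips the later character-mapping phase.
        out.append(escaped if claimed else CHAR_MAP.get(ch, ch))
    return "".join(out)


href = "mailto:sales@backbase.com"
print(serialize_attribute(href, True, "handled"))   # mailto:sales@backbase.com
print(serialize_attribute(href, True, "affected"))  # mailto:sales(at)backbase.com
```
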
This leads me to question these rules from first principles. The rules have
become increasingly messy. Let's look at all the interactions between
phases. For reference, the phases are:

a   URI escaping
b   character mapping
c   Unicode normalization
d   CDATA sections
e   ampersand escaping

Looking at all possible pairs of phases, let's ask the question "should a
character that's changed by the first phase also be changed by the second?"

ab  - unlikely to affect practical use cases
ac  - makes no difference (we have recently added a new rule to normalize
before URI escaping)
ad  - makes no difference
ae  - makes no difference
bc  - probably yes, though currently no
bd  - makes no difference (we have a special rule here that elements
specified as cdata-section-elements are not affected by character mapping)
be  - definitely no
cd  - definitely yes (the exception to the general rule is already stated)
ce  - definitely yes (the exception to the general rule is already stated)
de  - definitely no

This is far from the blanket "no" that the general rule implies. I think we
could make the whole thing a lot simpler by inverting the general rule, so
that characters output by one phase act as input to the next, with stated
exceptions:

(1) an & or < character produced as a result of string replacement in a
character map is not ampersand-escaped in step (e)

(2) steps (d) and (e) are alternatives, rather than being sequential: a text
node is either processed by (d) or by (e).
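
As a rough illustration of the inverted rule for text nodes, the following
Python sketch chains phases (b), (c) and then (d) or (e), with the output of
each phase feeding the next. All function names are hypothetical. The sketch
also makes several simplifying assumptions: phase (e) escapes only "&" and
"<", normalization is applied separately to each run of mapped or unmapped
text (so sequences spanning a run boundary are not composed), and a "]]>"
inside a CDATA section is handled by splitting the section.

```python
import unicodedata


def apply_char_map(text, char_map):
    """Phase (b): return (string, from_map) runs; from_map marks map output."""
    runs = []
    for ch in text:
        from_map = ch in char_map
        piece = char_map[ch] if from_map else ch
        if runs and runs[-1][1] == from_map:
            # Merge with the previous run of the same provenance.
            runs[-1] = (runs[-1][0] + piece, from_map)
        else:
            runs.append((piece, from_map))
    return runs


def escape_markup(s):
    """Phase (e), simplified: escape only '&' and '<'."""
    return s.replace("&", "&amp;").replace("<", "&lt;")


def serialize_text(text, char_map, in_cdata_element=False, form="NFC"):
    if in_cdata_element:
        # Exception (2): (d) replaces (e). Character mapping is skipped,
        # following the existing rule for cdata-section-elements.
        body = unicodedata.normalize(form, text)
        return "<![CDATA[" + body.replace("]]>", "]]]]><![CDATA[>") + "]]>"
    out = []
    for s, from_map in apply_char_map(text, char_map):
        s = unicodedata.normalize(form, s)  # (c) also sees map output
        # Exception (1): '&' and '<' produced by the map are not escaped.
        out.append(s if from_map else escape_markup(s))
    return "".join(out)


print(serialize_text("a & b @ c", {"@": "<at/>"}))  # a &amp; b <at/> c
print(serialize_text("a & b", {}, in_cdata_element=True))  # <![CDATA[a & b]]>
```

Tracking provenance per run, rather than per phase, is what keeps exception
(1) narrow: only the escaping step needs to know where a character came from.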

Michael Kay
previously raised at 
http://lists.w3.org/Archives/Member/w3c-xsl-query/2005Oct/0010.html

Received on Friday, 4 November 2005 16:44:57 UTC