- From: <bugzilla@wiggum.w3.org>
- Date: Fri, 04 Nov 2005 16:44:50 +0000
- To: public-qt-comments@w3.org
- Cc:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=2459

Summary: [Serialization] Phases of Character Expansion
Product: XPath / XQuery / XSLT
Version: Candidate Recommendation
Platform: PC
OS/Version: Windows XP
Status: NEW
Severity: normal
Priority: P2
Component: Serialization
AssignedTo: scott_boag@us.ibm.com
ReportedBy: mike@saxonica.com
QAContact: public-qt-comments@w3.org

Serialization section 4, list item 2, contains the rule:

  The substitution processes that apply are listed below, in priority order: a character that is handled by one process in this list will be unaffected by processes appearing later in the list, except that a character affected....

My question is, what does "handled" mean here? Does it mean the same as "affected"?

Consider this example (a Saxon bug report today). The result tree contains

  <a href="mailto:sales@backbase.com">sales@backbase.com</a>

and there is a character map that translates "@" to "(at)".

Saxon is doing the character mapping for the text node but not for the attribute node, because the characters in the attribute node are all "handled" by the URI escaping phase, even though they are unchanged by that phase.

Is this a correct interpretation? It isn't an interpretation that makes much sense for this use case, and I can't really think of a use case where it does make sense. So perhaps "handled" should be "affected", or even "altered".

This leads me to question these rules from first principles. The rules have become increasingly messy.
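The two readings of "handled" in the attribute example above can be made concrete with a small Python sketch. It is not Saxon's code and not the specification's algorithm: the function names are hypothetical, and URI escaping is simplified here to percent-encoding non-ASCII characters only, which is an assumption made for illustration.

```python
# Illustrative model of phases (a) URI escaping and (b) character mapping
# applied to a URI-valued attribute. All names here are hypothetical.

# The character map from the example: "@" is replaced by "(at)".
CHARACTER_MAP = {"@": "(at)"}


def uri_escape(ch: str) -> str:
    """Phase (a), simplified assumption: percent-encode non-ASCII only."""
    if ord(ch) < 128:
        return ch  # ASCII characters pass through unchanged
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))


def character_map(ch: str) -> str:
    """Phase (b): replace a character if the map has an entry for it."""
    return CHARACTER_MAP.get(ch, ch)


def serialize_handled(value: str) -> str:
    """'Handled' reading: phase (a) claims every character of the
    attribute, so phase (b) never sees any of them."""
    return "".join(uri_escape(ch) for ch in value)


def serialize_altered(value: str) -> str:
    """'Altered' reading: phase (b) still applies to any character
    that phase (a) left unchanged."""
    out = []
    for ch in value:
        escaped = uri_escape(ch)
        out.append(escaped if escaped != ch else character_map(ch))
    return "".join(out)


href = "mailto:sales@backbase.com"
print(serialize_handled(href))
print(serialize_altered(href))
```

Under the "handled" reading the "@" in the href survives untouched, which matches the behaviour reported for the attribute node; under the "altered" reading it is mapped to "(at)", as it is for the text node.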
Let's look at all the interactions between phases. For reference, these are:

a URI escaping
b character mapping
c Unicode normalization
d CDATA sections
e ampersand escaping

Looking at all possible pairs of phases, let's ask the question "should a character that's changed by the first phase also be changed by the second?"

ab - unlikely to affect practical use cases
ac - makes no difference (we have recently added a new rule to normalize before URI escaping)
ad - makes no difference
ae - makes no difference
bc - probably yes, though currently no
bd - makes no difference (we have a special rule here that elements specified as cdata-section-elements are not affected by character mapping)
be - definitely no
cd - definitely yes (the exception to the general rule is already stated)
ce - definitely yes (the exception to the general rule is already stated)
de - definitely no

This is far from the blanket "no" that the general rule implies. I think we could make the whole thing a lot simpler by inverting the general rule, so that characters output by one phase act as input to the next, with stated exceptions:

(1) an & or < character produced as a result of string replacement in a character map is not ampersand-escaped in step (e)

(2) steps (d) and (e) are alternatives, rather than being sequential: a text node is either processed by (d) or by (e).

Michael Kay

previously raised at http://lists.w3.org/Archives/Member/w3c-xsl-query/2005Oct/0010.html
Received on Friday, 4 November 2005 16:44:57 UTC