- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Wed, 24 Oct 2012 16:15:19 -0600
- To: Michael Kay <mike@saxonica.com>
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, xmlschema-dev@w3.org
On Oct 24, 2012, at 7:32 AM, Michael Kay wrote:

> The Xerces parser is reporting the value of the "value" attribute to
> Saxon as two spaces. (The debugger also shows a private field
> indicating that the unnormalized value of the attribute is
> "&CR;&LF;" without the spaces.)
>
> So it's XML attribute value normalization that's to blame.

Yes.

> If you wrote value="&#13;&#10;" then the value would not be
> normalized; I'm not sure why that isn't true if you use named entity
> references, but I'm sure someone has studied the small print.

For the record (and because I can't resist a good entity-expansion
problem) ...

When the entity declarations

  <!ENTITY CR "&#13;">
  <!ENTITY LF "&#10;">

are processed, the parser should end up with entities named CR and
LF, each containing a string of length 1; the first containing
character U+000D, the second containing U+000A.

When the XSD pattern element

  <xs:pattern value="&CR;&LF;"/>

is parsed, the 'value' attribute is processed as described in section
3.3.3 of the XML spec (http://www.w3.org/TR/xml/#AVNormalize):

- there are no literal carriage returns or line feeds in the
  unparsed text, so nothing happens there.

- the entity reference to CR is expanded to a literal carriage return
  (second bullet item in step 3 of the algorithm), and step 3 of the
  normalization algorithm is applied recursively to its replacement
  text.

- the literal carriage return is translated into a space (third
  bullet under step 3 of the normalization algorithm).

- the entity replacement text for CR has now been fully normalized
  and we pop back up a level.

- the entity reference to LF is expanded to a literal linefeed
  (bullet 2 of step 3, again), and the normalization algorithm recurs
  again.

- the literal linefeed in the entity replacement text produces a
  space in the normalized text (again, third bullet of step 3).

- the LF entity's replacement text is now finished, and we pop a
  level.

- the attribute value is now finished, and we are done.
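For anyone who wants to watch the algorithm run, here is a minimal sketch. It assumes Python's xml.etree.ElementTree (which sits on expat, not Xerces); the element name r and the attribute names viaEntity and viaCharRef are made up for the illustration. It compares a reference to an entity whose replacement text is a literal CR or LF with a direct numeric character reference, and also tries declarations in which the ampersand is itself escaped as &#38;, so that the replacement text is a character reference rather than the character.

```python
import xml.etree.ElementTree as ET


def attr_values(cr_decl, lf_decl):
    """Parse a tiny document and return its two normalized attribute values.

    cr_decl and lf_decl are the entity value literals to put in the
    declarations of CR and LF (hypothetical test harness, not from the
    original schema).
    """
    doc = (
        '<!DOCTYPE r [\n'
        f'  <!ENTITY CR "{cr_decl}">\n'
        f'  <!ENTITY LF "{lf_decl}">\n'
        ']>\n'
        '<r viaEntity="&CR;&LF;" viaCharRef="&#13;&#10;"/>'
    )
    root = ET.fromstring(doc)
    return root.get("viaEntity"), root.get("viaCharRef")


def codepoints(s):
    """Render a string as a list of U+XXXX code points for printing."""
    return [f"U+{ord(c):04X}" for c in s]


# Replacement text is a literal CR / LF: each is normalized to a space.
plain = attr_values("&#13;", "&#10;")

# Replacement text is the character reference itself ("&#13;" / "&#10;"),
# because "&#38;" expands to "&" when the declaration is processed.
escaped = attr_values("&#38;#13;", "&#38;#10;")

print("plain   viaEntity :", codepoints(plain[0]))
print("plain   viaCharRef:", codepoints(plain[1]))
print("escaped viaEntity :", codepoints(escaped[0]))
```

With a conforming parser the first line should show two U+0020 characters, and the other two should show U+000D followed by U+000A.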
The construct

  <xs:pattern value="&CR;&LF;"/>

is thus just a circuitous way of writing

  <xs:pattern value="  "/>

that is, a pattern consisting of two space characters.

If numeric character references are used instead of general entity
references, the second and third bullets of step 3 do not fire;
instead the first bullet fires, and the result of attribute-value
normalization is a string containing a carriage return and a
linefeed.

It would also work to change the entity declarations to

  <!ENTITY CR "&#38;#13;">
  <!ENTITY LF "&#38;#10;">

so that the replacement text of each entity is itself a character
reference, which is resolved by the first bullet of step 3 when the
entity is expanded inside the attribute value.

>> In an instance document the value of <CRLF> should be a carriage
>> return followed by a line feed, right?

Nope.

>> ... The pattern facet clearly specifies "\r\n"

Nope.

-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com
* http://cmsmcq.com/mib
* http://balisage.net
****************************************************************
Received on Wednesday, 24 October 2012 22:15:45 UTC