Re: Restrict <CRLF> to the value \r\n ... the instance is <CRLF>\r\n</CRLF> ... error on validation -- why?

On Oct 24, 2012, at 7:32 AM, Michael Kay wrote:

> The Xerces parser is reporting the value of the "value" attribute to
> Saxon as two spaces. (The debugger also shows a private field
> indicating that the unnormalized value of the attribute is
> "&CR;&LF;" without the spaces.

> So it's XML attribute value normalization that's to blame.

Yes.

> If you wrote value="&#13;&#10;" then the value would not be
> normalized; I'm not sure why that isn't true if you use named entity
> references, but I'm sure someone has studied the small print.


For the record (and because I can't resist a good 
entity-expansion problem) ...

When the entity declarations

  <!ENTITY CR "&#13;">
  <!ENTITY LF "&#10;">

are processed, the parser should end up with entities named CR and LF,
each containing a string of length 1; the first containing character
U+000D, the second containing U+000A.
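
One way to check this is to ask a parser what it stored for each entity. The sketch below uses Python's bundled expat binding. The wrapper document and the helper name declared_entities are made up for illustration, and other parsers may expose this information differently or not at all.

```python
from xml.parsers import expat

# Hypothetical wrapper document; only the two declarations come from the thread.
DOC = '<!DOCTYPE d [<!ENTITY CR "&#13;"><!ENTITY LF "&#10;">]><d/>'

def declared_entities(doc):
    """Collect internal general entities and the text expat reports for them."""
    found = {}

    def on_entity_decl(name, is_parameter, value, base,
                       system_id, public_id, notation):
        # value is None for external entities; keep internal general ones only
        if value is not None and not is_parameter:
            found[name] = value

    parser = expat.ParserCreate()
    parser.EntityDeclHandler = on_entity_decl
    parser.Parse(doc, True)
    return found

entities = declared_entities(DOC)
for name, text in entities.items():
    # print each entity's name, its length, and its code points
    print(name, len(text), [hex(ord(c)) for c in text])
```

Each entity should be reported with length 1, since the character reference in the declaration is expanded when the declaration is processed.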

When the XSD pattern element 

  <xs:pattern value="&CR;&LF;"/>

is parsed, the 'value' attribute is processed as described in section
3.3.3 of the XML spec (http://www.w3.org/TR/xml/#AVNormalize):

- there are no literal carriage returns or line feeds in the unparsed
  text, so nothing happens there.

- the entity reference to CR is expanded to a literal carriage return
  (second bullet item in step 3 of the algorithm), and step 3 of the
  normalization algorithm is applied recursively to its replacement
  text.

- the literal carriage return is translated into a space (third bullet
  under step 3 of the normalization algorithm).

- the entity replacement text for CR has now been fully normalized and
  we pop back up a level.

- the entity reference to LF is expanded to a literal linefeed (bullet
  2 of step 3, again), and the normalization algorithm recurs again.

- the literal linefeed in the entity replacement text produces a space
  in the normalized text (again, third bullet of step 3).

- the LF entity's replacement text is now finished, and we pop a
  level.

- the attribute value is now finished, and we are done.
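
The walk through step 3 above can be modelled in a few lines of Python. This is a toy sketch, not how any real parser is written: the names TOKEN, ENTITIES and normalize are mine, the ENTITIES table holds the replacement text of each entity (character references in the declaration already expanded), and error handling and the predefined entities (amp, lt, ...) are left out.

```python
import re

# One token per match: hex char ref, decimal char ref, entity ref, or any char.
TOKEN = re.compile(
    r"&#x([0-9A-Fa-f]+);|&#([0-9]+);|&([A-Za-z_:][\w:.\-]*);|(.)",
    re.DOTALL,
)

# Replacement text of the two entities from the thread.
ENTITIES = {"CR": "\r", "LF": "\n"}

def normalize(text, entities):
    """Toy model of step 3 of attribute-value normalization (CDATA attributes)."""
    out = []
    for match in TOKEN.finditer(text):
        hex_ref, dec_ref, name, char = match.groups()
        if hex_ref is not None:        # bullet 1: character reference
            out.append(chr(int(hex_ref, 16)))
        elif dec_ref is not None:      # bullet 1: character reference
            out.append(chr(int(dec_ref)))
        elif name is not None:         # bullet 2: recurse into replacement text
            out.append(normalize(entities[name], entities))
        elif char in " \r\n\t":        # bullet 3: white space becomes a space
            out.append(" ")
        else:                          # bullet 4: any other character
            out.append(char)
    return "".join(out)

print(repr(normalize("&CR;&LF;", ENTITIES)))
print(repr(normalize("&#13;&#10;", ENTITIES)))
```

The first call takes the entity-reference path and ends in bullet 3 twice; the second never leaves bullet 1.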

The construct <xs:pattern value="&CR;&LF;"/> is thus just a circuitous
way of writing <xs:pattern value=" "/> or <xs:pattern
value="&#x20;&#x20;"/>.

If numeric character references are used instead of general entity
references, the second and third bullets of step 3 do not fire;
instead the first bullet fires, and the result of attribute-value
normalization is a string containing a carriage return and a linefeed.
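
The difference is easy to observe with a real parser. The following sketch uses Python's xml.etree.ElementTree (which sits on expat); the element name p and the helper pattern_value are invented for the test, and I am assuming other conforming parsers report the same values.

```python
import xml.etree.ElementTree as ET

# Internal subset with the two entity declarations from the thread.
DTD = '<!DOCTYPE p [<!ENTITY CR "&#13;"><!ENTITY LF "&#10;">]>'

def pattern_value(attr_text):
    """Return the normalized 'value' attribute as the parser reports it."""
    doc = DTD + '<p value="' + attr_text + '"/>'
    return ET.fromstring(doc).get("value")

via_entities = pattern_value("&CR;&LF;")     # general entity references
via_char_refs = pattern_value("&#13;&#10;")  # numeric character references

print(repr(via_entities))
print(repr(via_char_refs))
```

The entity-reference form should come back as two spaces, the character-reference form as a carriage return followed by a linefeed.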

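It would also work to change the entity declarations to

  <!ENTITY CR "&#38;#13;">
  <!ENTITY LF "&#38;#10;">

so that the replacement text of each entity is itself a character
reference, and the first bullet of step 3 fires when the entity is
expanded. (The ampersand needs to be written as &#38; here, not
&amp;: general entity references in an entity value are not expanded
when the declaration is processed, so "&amp;#13;" would end up in the
attribute as the five characters &#13; rather than as a carriage
return.)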
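
A quick way to try the redeclaration idea is sketched below, again with Python's xml.etree.ElementTree. Note the assumption: the sketch escapes the ampersand with the character reference &#38;, so that the entity's replacement text is the character reference itself. The element name p and the helper pattern_value are hypothetical.

```python
import xml.etree.ElementTree as ET

def pattern_value(cr_literal, lf_literal):
    """Declare CR and LF with the given literals; return value="&CR;&LF;" as parsed."""
    doc = ('<!DOCTYPE p [<!ENTITY CR "' + cr_literal + '">'
           '<!ENTITY LF "' + lf_literal + '">]>'
           '<p value="&CR;&LF;"/>')
    return ET.fromstring(doc).get("value")

# Original declarations: replacement text is the literal white space character.
plain = pattern_value("&#13;", "&#10;")

# Double-escaped declarations: replacement text is "&#13;" and "&#10;".
escaped = pattern_value("&#38;#13;", "&#38;#10;")

print(repr(plain))
print(repr(escaped))
```

With the double-escaped declarations the attribute value should survive normalization as carriage return plus linefeed, because the white space only appears once the character reference inside the replacement text is resolved.
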

>> In an instance document the value of <CRLF> should be a carriage
>> return followed by a line feed, right?

Nope.

>> ... The pattern facet clearly specifies "\r\n"

Nope.


-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com 
* http://cmsmcq.com/mib                 
* http://balisage.net
****************************************************************

Received on Wednesday, 24 October 2012 22:15:45 UTC