Re: C14N canonicalization from John Boyer on 2005-10-05 (w3c-ietf-xmldsig@w3.org from October to December 2005)

From: John Boyer <boyerj@ca.ibm.com>
Date: Wed, 5 Oct 2005 09:20:42 -0700
To: Szabó Áron <aron@ik.bme.hu>
Cc: w3c-ietf-xmldsig@w3.org, w3c-ietf-xmldsig-request@w3.org
Message-ID: <OF08A4A7F9.B5095F81-ON88257091.00585BA7-88257091.0059C94F@ca.ibm.com>

The one thing you have to be careful about when reading the spec is to
make sure you're interpreting the sentences in the context in which they
appear.

In this case, you extracted a sentence that appears in the
canonicalization of *attribute* nodes, but none of your example questions
below pertain to the canonicalization of attributes.
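
For the text node in the example below, a rough Python sketch may help.
It assumes the C14N 1.0 text-node rule (escape &, <, > and #xD only) and
relies on the parser having already normalized line ends, as XML 1.0
requires. The helper name escape_text_node is hypothetical and not taken
from any library; the standard-library ElementTree parser is used only
for illustration.

```python
import xml.etree.ElementTree as ET


def escape_text_node(value: str) -> str:
    """Escape a text node's string value (assumed C14N 1.0 text-node rule)."""
    # Ampersand goes first so the references added later are not re-escaped.
    for raw, escaped in (("&", "&amp;"), ("<", "&lt;"),
                         (">", "&gt;"), ("\r", "&#xD;")):
        value = value.replace(raw, escaped)
    return value


# The input from the question: CR LF plus three spaces before <e1/>.
source = "<doc>\r\n   <e1/>\r\n</doc>"
doc = ET.fromstring(source)

# The parser's line-end normalization has already turned CR LF into LF,
# so the text node handed to canonicalization contains no #xD at all.
canonical_text = escape_text_node(doc.text)
print(canonical_text.encode("utf-8").hex(" ").upper())  # 0A 20 20 20
```

Under these assumptions a &#xD; would appear in the output only if the
source document had written the carriage return as a character reference.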

If you were trying to canonicalize an attribute, though, you would find
that, for example, carriage return and newline characters that are
presented to the data model would be output as &#xD;&#xA;
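
A minimal Python sketch of the attribute rule in the sentence quoted from
the spec, for illustration only. The function name
escape_attribute_value is hypothetical; it covers just the character
replacements in that sentence, not the rest of attribute canonicalization.

```python
def escape_attribute_value(value: str) -> str:
    """Apply the replacements listed in the quoted C14N attribute rule."""
    # Ampersand goes first so the references added later are not re-escaped.
    replacements = [
        ("&", "&amp;"),
        ("<", "&lt;"),
        ('"', "&quot;"),
        ("\x09", "&#x9;"),   # TAB
        ("\x0A", "&#xA;"),   # newline
        ("\x0D", "&#xD;"),   # carriage return
    ]
    for raw, escaped in replacements:
        value = value.replace(raw, escaped)
    return value


# A carriage return and newline that reached the data model:
print(escape_attribute_value("a\r\nb"))  # a&#xD;&#xA;b
```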

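Of course, the reason for this character reference encoding has to do with
how you managed to get carriage return and newline sequences past the
attribute value normalization and into the data model (infoset) in the
first place. A literal line break in an attribute value is normalized to
a space by the parser, so these characters reach the data model only if
the source document wrote them as character references.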
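
A small Python sketch of attribute value normalization, using the
standard library's expat-based xml.etree.ElementTree parser purely as an
illustration. Other conforming parsers are assumed, not verified here, to
behave the same way. The element and attribute names are made up.

```python
import xml.etree.ElementTree as ET

# Literal CR LF inside the attribute value: the parser normalizes the
# line break to a single space before the value reaches the data model.
literal = ET.fromstring('<e a="x\r\ny"/>')

# CR and LF written as character references: they bypass normalization
# and arrive in the data model as real #xD and #xA characters.
referenced = ET.fromstring('<e a="x&#xD;&#xA;y"/>')

print(repr(literal.get("a")))     # 'x y'
print(repr(referenced.get("a")))  # 'x\r\ny'
```

Only the second value contains characters that canonicalization then has
to write back out as &#xD;&#xA;.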

Cheers.

John M. Boyer, Ph.D.
Senior Product Architect/Research Scientist
Workplace, Portal and Collaboration Software
IBM Victoria Software Lab
E-Mail: boyerj@ca.ibm.com  http://www.ibm.com/software/





Szabó Áron <aron@ik.bme.hu>
Sent by: w3c-ietf-xmldsig-request@w3.org
10/05/2005 08:16 AM

To: <w3c-ietf-xmldsig@w3.org>
cc:
Subject: C14N canonicalization

Dear Members,

I'm testing several parsers and C14N canonicalization implementations to
ensure interoperability between applications. I noticed inconsistent
behaviour, so I re-read the W3C C14N standard, but I couldn't work out
which behaviour is correct. Could you please help me interpret the text
of the standard?

What exactly does this sentence mean?

"The string value of the node is modified by replacing all ampersands (&)
with &amp;, all open angle brackets (<) with &lt;, all quotation mark
characters with &quot;, and the whitespace characters #x9, #xA, and #xD,
with character references. The character references are written in
uppercase hexadecimal with no leading zeroes (for example, #xD is
represented by the character reference &#xD;)."
(http://www.w3.org/TR/xml-c14n)

The following example was given as input for parsing and C14N
canonicalization:

<doc>
   <e1/>
</doc>

which contains the byte sequence (in hex)

"0D 0A 20 20 20"

between the <doc> start tag and the <e1/> tag.

I got outputs (produced by several applications) that contained, e.g.:

"0A 20 20 20" (here the #xD character has been dropped rather than
escaped, but I think this is the correct output)

"0A 09" (the three hex "20" bytes have been converted to hex "09", which
is TAB)

"26 23 78 44 3B 0A 20 20 20" (in which "26 23 78 44 3B" is "&#xD;")

Which is the correct one? Any idea?

Best regards,
Aron

----------------------------------------------------
Aron Szabo, M. Sc.
Research Associate,
Center of Information Technology
Budapest University of Technology and Economics

Received on Wednesday, 5 October 2005 16:20:58 UTC