Re: pfps-04 (why the thread is germane to pfps-04)

Hello Peter,

At 07:42 03/07/28 -0400, Peter F. Patel-Schneider wrote:

>From: Martin Duerst <duerst@w3.org>

> > >The examples in Section 2 of
> > >http://www.w3.org/TR/2002/REC-xml-exc-c14n-20020718/ give canonical XML
> > >documents as if they were sequences of Unicode characters.  This indicates
> > >that octets are Unicode characters.
> >
> > There is an explicit counterexample at
> > http://www.w3.org/TR/2001/REC-xml-c14n-20010315#Example-UTF8.
> > Is this good enough, or not? If not, I'll ask the XML Signature
> > people to add a clarification.
> >
> > I can understand that this may be a bit confusing. But in some
> > way, it's the same as for integers: We can't directly visualize
> > integers. So we use strings of digits (characters) to show them.
> > The same with octets: We can't directly visualize sequences of
> > octets. So we use sequences of characters to show them.
> >
> >
> > Regards,    Martin.
>
>I think that if anything this example, and the others in the same section,
>point in the other direction.
>
>In each of the examples, the canonical form is presented in the same way as
>the input document, indicating that the canonical form shares
>characteristics with the input document.
>
>In the specific example, the difference pointed out is between ``the string
>#C2#A9'' and ``the two octets whose hexadecimal values are C2 and C9''.
>The first is definitely a six-character Unicode string.  The second is
>definitely not a six-character Unicode string, but still might be a
>two-character Unicode string.

It is the representation of a one-character Unicode string. There
is a note at the end of the example saying:

Note: The content of the doc element is NOT the string #xC2#xA9 but
rather the two octets whose hexadecimal values are C2 and A9, which
is the UTF-8 encoding of the UCS codepoint for the copyright sign ((c)).


>This possibility is enhanced by the rest of
>the example.
>
>If the example also said
>         ... is also NOT the two Unicode characters whose code points are
>         hex C2 and hex A9 ...

Well, it says that it's the representation for the copyright sign,
and the codepoint of the copyright sign is U+00A9. The fact that
the second octet in the UTF-8 representation of the copyright sign
is identical to the hexadecimal value of the codepoint of the
copyright sign is coincidental. For example, the character e-acute
has codepoint U+00E9 but is represented in UTF-8 by an octet C3
followed by an octet A9.
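
To make the distinction concrete, here is a small sketch (Python is my
choice for illustration, and the helper name utf8_hex is made up; none
of this comes from the C14N spec) that prints the UTF-8 octets for the
two characters mentioned above and shows that the octets C2 A9 decode
to one character, not two:

```python
# Illustrative sketch only: compare Unicode code points with their UTF-8 octets.
# utf8_hex is a hypothetical helper name, not part of any specification.

def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of ch as space-separated hex octets."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))

copyright_sign = "\u00A9"   # U+00A9 COPYRIGHT SIGN
e_acute = "\u00E9"          # U+00E9 LATIN SMALL LETTER E WITH ACUTE

print(f"U+{ord(copyright_sign):04X} -> {utf8_hex(copyright_sign)}")  # U+00A9 -> C2 A9
print(f"U+{ord(e_acute):04X} -> {utf8_hex(e_acute)}")                # U+00E9 -> C3 A9

# The two octets C2 A9, read as UTF-8, are ONE character (the copyright sign).
octets = bytes([0xC2, 0xA9])
one_char = octets.decode("utf-8")

# The same octets read as two separate code points (here via Latin-1)
# would be a different, two-character string: U+00C2 followed by U+00A9.
two_chars = octets.decode("latin-1")

print(len(one_char), len(two_chars))  # 1 2
```

The last line is the misreading the note in the example is meant to
rule out: the canonical form is the octet sequence, and its UTF-8
interpretation is a single character.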

Hope this helps.

Regards,    Martin.

>then the example would be very explicit that the canonical form is not a
>Unicode string.
>
>peter

Received on Monday, 28 July 2003 13:26:52 UTC