- From: Martin Duerst <duerst@w3.org>
- Date: Mon, 28 Jul 2003 10:16:31 -0400
- To: "Peter F. Patel-Schneider" <pfps@research.bell-labs.com>
- Cc: bwm@hplb.hpl.hp.com, jjc@hplb.hpl.hp.com, phayes@ai.uwf.edu, www-rdf-comments@w3.org, w3c-i18n-ig@w3.org, msm@w3.org
Hello Peter,

At 07:42 03/07/28 -0400, Peter F. Patel-Schneider wrote:
>From: Martin Duerst <duerst@w3.org>
> > >The examples in Section 2 of
> > >http://www.w3.org/TR/2002/REC-xml-exc-c14n-20020718/ give canonical XML
> > >documents as if they were sequences of Unicode characters. This indicates
> > >that octets are Unicode characters.
> >
> > There is an explicit counterexample at
> > http://www.w3.org/TR/2001/REC-xml-c14n-20010315#Example-UTF8.
> > Is this good enough, or not? If not, I'll ask the XML Signature
> > people to add a clarification.
> >
> > I can understand that this may be a bit confusing. But in some
> > way, it's the same as for integers: We can't directly visualize
> > integers. So we use strings of digits (characters) to show them.
> > The same with octets: We can't directly visualize sequences of
> > octets. So we use sequences of characters to show them.
> >
> >
> > Regards, Martin.
>
>I think that if anything this example, and the others in the same section,
>point in the other direction.
>
>In each of the examples, the canonical form is presented in the same way as
>the input document, indicating that the canonical form shares
>characteristics with the input document.
>
>In the specific example, the difference pointed out is between ``the string
>#C2#A9'' and ``the two octets whose hexadecimal values are C2 and C9''.
>The first is definitely a six-character Unicode string. The second is
>definitely not a six-character Unicode string, but still might be a
>two-character Unicode string.

It is the representation of a one-character Unicode string.
There is a note at the end of the example saying:

   Note: The content of the doc element is NOT the string #xC2#xA9
   but rather the two octets whose hexadecimal values are C2 and A9,
   which is the UTF-8 encoding of the UCS codepoint for the
   copyright sign ((c)).

>This possibility is enhanced by the rest of
>the example.
>
>If the example also said
>    ... is also NOT the two Unicode characters whose code points are
>    hex C2 and hex A9 ...

Well, it says that it's the representation for the copyright sign,
and the codepoint of the copyright sign is U+00A9. The fact that the
second octet in the UTF-8 representation of the copyright sign matches
the hexadecimal representation of the codepoint of the copyright sign
is coincidental. For example, the character e-acute has codepoint
U+00E9 but is represented in UTF-8 by an octet C3 followed by an
octet A9.

Hope this helps.

Regards, Martin.

>then the example would be very explicit that the canonical form is not a
>Unicode string.
>
>peter
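
The difference between a code point and its UTF-8 octets, as discussed in the message above, can be checked with a minimal sketch. It assumes Python 3; the variable names are illustrative only and do not come from the canonicalization specifications.

```python
# Code points versus UTF-8 octets for the two characters in the message.
copyright_sign = "\u00A9"  # U+00A9 COPYRIGHT SIGN
e_acute = "\u00E9"         # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# Each is a single character that UTF-8 encodes as two octets.
copyright_octets = copyright_sign.encode("utf-8")
e_acute_octets = e_acute.encode("utf-8")
print(copyright_octets.hex())  # c2a9
print(e_acute_octets.hex())    # c3a9

# Reading the octets C2 A9 back as UTF-8 yields one character, not two.
decoded = bytes([0xC2, 0xA9]).decode("utf-8")
print(len(decoded))  # 1

# The two-character string with code points U+00C2 and U+00A9 is a
# different thing: its UTF-8 form is four octets long.
two_chars_octets = "\u00C2\u00A9".encode("utf-8")
print(two_chars_octets.hex())  # c382c2a9
```

The second octet is A9 in both encodings although the code points differ (A9 versus E9), which is the coincidence the message describes.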
Received on Monday, 28 July 2003 13:26:52 UTC