Re: pfps-04 (why the thread is germane to pfps-04)

From: Martin Duerst <duerst@w3.org>
Subject: Re: pfps-04 (why the thread is germane to pfps-04)
Date: Sun, 27 Jul 2003 21:59:31 -0400

> At 21:33 03/07/25 -0400, Peter F. Patel-Schneider wrote:
> 
> >From: Martin Duerst <duerst@w3.org>
> 
> > > At 07:54 03/07/25 -0400, Peter F. Patel-Schneider wrote:
> 
> > > >However other answers are harder to determine.
> > > >
> > > >1/ When is an XML literal equal to a plain RDF literal?  A plain RDF
> > > >literal is a Unicode string (sequence of Unicode characters), so this
> > > >question boils down to whether octets and Unicode characters are disjoint.
> > > >I found it difficult to answer this question, because of hints in the
> > > >exclusive canonicalization document that they are not.
> > >
> > > Can you point to the places where you saw such hints. If there are
> > > such hints, then they definitely have to be fixed, and I'll make
> > > sure that this happens.
> >
> >The examples in Section 2 of
> >http://www.w3.org/TR/2002/REC-xml-exc-c14n-20020718/ give canonical XML
> >documents as if they were sequences of Unicode characters.  This indicates
> >that octets are Unicode characters.
> 
> There is an explicit counterexample at
> http://www.w3.org/TR/2001/REC-xml-c14n-20010315#Example-UTF8.
> Is this good enough, or not? If not, I'll ask the XML Signature
> people to add a clarification.
> 
> I can understand that this may be a bit confusing. But in some
> way, it's the same as for integers: We can't directly visualize
> integers. So we use strings of digits (characters) to show them.
> The same with octets: We can't directly visualize sequences of
> octets. So we use sequences of characters to show them.
> 
> 
> Regards,    Martin.

I think that, if anything, this example and the others in the same section
point in the other direction.

In each of the examples, the canonical form is presented in the same way as
the input document, indicating that the canonical form shares
characteristics with the input document.  

In the specific example, the difference pointed out is between ``the string
#C2#A9'' and ``the two octets whose hexadecimal values are C2 and A9''.
The first is definitely a six-character Unicode string.  The second is
definitely not a six-character Unicode string, but still might be a
two-character Unicode string.  The rest of the example makes this
possibility more plausible.

If the example also said
	... is also NOT the two Unicode characters whose code points are
	hex C2 and hex A9 ... 
then the example would be very explicit that the canonical form is not a
Unicode string. 
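
The three readings at issue can be made concrete with a small Python
sketch.  The variable names are illustrative assumptions, not taken from
either canonicalization document; the sketch only shows that the two octets,
the six-character string, and the two-character string are distinct values.

```python
# Illustrative sketch only; all names here are hypothetical.
copyright_sign = "\u00a9"                   # the single character U+00A9

# Reading 1: the two octets C2 A9 (the UTF-8 encoding of U+00A9).
octets = copyright_sign.encode("utf-8")

# Reading 2: the six-character Unicode string "#C2#A9".
six_chars = "#C2#A9"

# Reading 3: the two Unicode characters with code points C2 and A9.
two_chars = "\u00c2\u00a9"

assert octets == b"\xc2\xa9" and len(octets) == 2
assert len(six_chars) == 6
assert len(two_chars) == 2

# The two-character string is not the copyright sign; encoding it as
# UTF-8 yields four octets rather than two.
assert two_chars != copyright_sign
assert two_chars.encode("utf-8") == b"\xc3\x82\xc2\xa9"

# Treating each octet as a code point (as Latin-1 decoding does) is what
# turns the two octets into the two-character string.
assert octets.decode("latin-1") == two_chars
```

Only an explicit statement rules out the third reading, since the octets and
the two-character string look identical when written as hex digits.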

peter

Received on Monday, 28 July 2003 07:44:04 UTC