RE: Character Encoding Question from John Boyer on 2000-11-29 (w3c-ietf-xmldsig@w3.org from October to December 2000)

From: John Boyer <jboyer@PureEdge.com>
Date: Tue, 28 Nov 2000 17:20:15 -0800
To: "Tom Gindin" <tgindin@us.ibm.com>
Cc: "Martin J. Duerst" <duerst@w3.org>, <w3c-ietf-xmldsig@w3.org>
Message-ID: <BFEDKCINEPLBDLODCODKIEKMCGAA.jboyer@PureEdge.com>
Hi Tom,

UTF-8 and UTF-16 are both encodings of UCS, whether UCS-2 or UCS-4.  If I
understand correctly, UCS-n is a character domain used during processing,
and UTF-n is used for input and output.

I am certain that UTF-8 character sequences can encode UCS-4, and I know
that UCS-2 is a two-byte-per-character domain for use in processing
scenarios where only the BMP is required.  This would be all scenarios right
now (according to the Unicode 3.0 manual) because nothing is yet defined
outside of the BMP, although ISO 10646-2 is likely to change that (again,
according to the Unicode 3.0 manual).
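
As a minimal sketch of the point above (UTF-8 and UTF-16 can both carry
characters that do not fit in two-byte UCS-2), the following compares
encoded lengths for one BMP character and one character outside the BMP.
The two code points and the helper name encoded_lengths are arbitrary
choices for illustration, not anything taken from the specification.

```python
# Illustrative code points only; any BMP / non-BMP pair would do.
bmp_char = "\u00e9"          # U+00E9, inside the BMP
non_bmp_char = "\U00010400"  # U+10400, outside the BMP

def encoded_lengths(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UCS-4 (UTF-32)."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),  # big-endian, no byte order mark
        len(ch.encode("utf-32-be")),  # four bytes per character, like UCS-4
    )

print(encoded_lengths(bmp_char))      # (2, 2, 4)
print(encoded_lengths(non_bmp_char))  # (4, 4, 4)
```

The non-BMP character needs four bytes in UTF-16 (a surrogate pair), so it
cannot be held in a single two-byte UCS-2 unit, while UTF-8 and UTF-16 both
represent it without loss.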

One thing I don't know for sure is whether Unicode == UCS-2?

If so, then our current sentence is certainly wrong because I'm sure we don't
mean that NFC should be applied to UTF-8 and UTF-16 encodings of UCS-n.

If Unicode != UCS-2, then A) what's the difference, and B) it would be
helpful if someone would confirm whether UCS-n is ever used for
transportation of character data, or whether this is done solely by UTF-n
formats.  If so, will it continue in this fashion in the future?

Finally, I don't know whether anyone reads or writes UCS-n data directly,
but I do know that our intent was that UTF-n data would not have NFC applied
to it.

Thanks,
John Boyer
Team Leader, Software Development
Distributed Processing and XML
PureEdge Solutions Inc.
Creating Binding E-Commerce
v: 250-479-8334, ext. 143  f: 250-479-3772
1-888-517-2675   http://www.PureEdge.com



-----Original Message-----
From: w3c-ietf-xmldsig-request@w3.org
[mailto:w3c-ietf-xmldsig-request@w3.org]On Behalf Of Tom Gindin
Sent: Tuesday, November 28, 2000 4:53 PM
To: John Boyer
Cc: Martin J. Duerst; w3c-ietf-xmldsig@w3.org
Subject: Re: Character Encoding Question



     Is what is meant "... from an encoding which is neither a UCS-n
encoding nor a UTF-n encoding"?  That would seem to cover UCS-2, UCS-4,
UTF-8, and UTF-16 (along with UTF-7 for good measure).  If UTF-8 is not
included, although the NFC transformation would seem to have no effect on
it, just replace "UTF-n" by "UTF-16" in the sentence above.

          Tom Gindin

"John Boyer" <jboyer@PureEdge.com>@w3.org on 11/28/2000 05:39:27 PM

Sent by:  w3c-ietf-xmldsig-request@w3.org


To:   "Martin J. Duerst" <duerst@w3.org>, <w3c-ietf-xmldsig@w3.org>
cc:
Subject:  Character Encoding Question



Hi Martin and group,

I received a letter today from Jeff Cochran (JCochran@docutouch.com)
regarding a tweak that would appear to be needed regarding c14n and xml
signature.

The I18N group asked us to include a sentence along the lines of "REQUIRED
to use Normalization Form C [NFC] when converting an XML document to the
UCS character domain from a non-Unicode encoding".
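
One possible reading of that sentence, sketched under stated assumptions:
data arriving in a legacy (non-Unicode) encoding is normalized to NFC after
decoding, while data already in a UTF-n encoding is decoded and left alone.
The function name decode_to_ucs, the "utf-" name test, and the choice of
windows-1258 as the legacy encoding are all hypothetical and only serve the
illustration.

```python
import codecs
import unicodedata

def decode_to_ucs(data: bytes, encoding: str) -> str:
    """Decode data; apply NFC only if the source encoding is not UTF-n."""
    canonical_name = codecs.lookup(encoding).name  # e.g. "UTF8" -> "utf-8"
    text = data.decode(encoding)
    if canonical_name.startswith("utf-"):
        # Assumption: UTF-n input is passed through without normalization.
        return text
    return unicodedata.normalize("NFC", text)

# windows-1258 stores "a" plus a combining acute accent as two bytes.
legacy_bytes = "a\u0301".encode("cp1258")

print(len(legacy_bytes.decode("cp1258")))            # 2 (decomposed)
print(len(decode_to_ucs(legacy_bytes, "cp1258")))    # 1 (composed by NFC)
```

The same decomposed pair supplied as UTF-8 comes back unchanged, which is
the behaviour the open question in this thread is trying to pin down.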

Apparently this is not exactly what is meant since UCS-4 character planes
outside of the BMP are technically non-Unicode.  The point Jeff makes is
that he doesn't know whether to apply NFC to UCS data that appears outside
of the BMP.

Question:  Should the statement be rewritten?  If so, how?

Thanks,
John Boyer
Team Leader, Software Development
Distributed Processing and XML
PureEdge Solutions Inc.
Creating Binding E-Commerce
v: 250-479-8334, ext. 143  f: 250-479-3772
1-888-517-2675   http://www.PureEdge.com



-----Original Message-----
From: w3c-ietf-xmldsig-request@w3.org
[mailto:w3c-ietf-xmldsig-request@w3.org]On Behalf Of Martin J. Duerst
Sent: Friday, November 24, 2000 6:17 PM
To: w3c-ietf-xmldsig@w3.org
Cc: lilley@w3.org
Subject: Fwd: I18N problem in XML canonicalisation


Chris Lilley just pointed out the following problem
in C14N. I think this at least has to be explained
much more clearly in the notes.

>http://www.w3.org/TR/xml-c14n#Example-UTF8
>
>Demonstrates using *two* NCRs for a single UTF-8 character (because it uses
>two bytes in UTF8)!!!

It's not really NCRs. It's a special notation to stand in for byte values.


>I suspect you may have a problem with that..... given that even surrogates
>use a single NCR not two. Also, it's not clear the result is even
>well-formed!

There needs to be a much better note to make very clear that (unlike
the other examples) this example is not really intended to be XML
and cannot be used directly in a test. It would also be advisable
to provide an actual file that contains the real bytes, or to point
to it if that's already around.
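
To make the confusion concrete, here is a small sketch contrasting a real
numeric character reference with a notation that spells out UTF-8 byte
values. U+00A9 is picked for illustration, and the variable names are
hypothetical; html.unescape is used only as a convenient way to resolve
numeric character references.

```python
import html

char = "\u00a9"                    # one character, U+00A9
real_bytes = char.encode("utf-8")  # two bytes: 0xC2 0xA9

# A genuine NCR names the code point, so one reference is enough.
single_ncr = "&#xA9;"

# The byte notation writes one "reference" per UTF-8 byte instead.
byte_notation = "".join(f"&#x{b:02X};" for b in real_bytes)

print(byte_notation)                            # &#xC2;&#xA9;
print(html.unescape(single_ncr) == char)        # True
print(len(html.unescape(byte_notation)))        # 2
```

Read as XML, the byte notation resolves to two characters (U+00C2 followed
by U+00A9) rather than the single intended one, which is why it cannot be
used directly as test input.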

Regards,    Martin.
Received on Tuesday, 28 November 2000 20:20:27 UTC