RE: Character Encoding Question

Hi Jeff,

So far, everything I've run into in the Unicode manual indicates that
UCS-2, Unicode and UTF-16x are equivalent.  In particular, they all encode
precisely the same thing, namely the 2^16 code points of the basic
multilingual plane (BMP).  Therefore, I interpret the note in RFC-2279 as
reflecting the fact that the Unicode standard says a lot of things beyond
just defining the characters in the basic multilingual plane, and that
these are things that are useful to programmers, like info about converting
among UTF-n formats, info on byte order marks, historical information, etc.

In particular, the end of the sentence you cited is the strongest statement
of their equivalence: "changes in Unicode and amendments to ISO/IEC 10646
[which defines UCS and the BMP] have tracked each other, so that the
character repertoires and code point assignments have remained in sync."

Still, your question is valid because UCS-4 contains code points outside of
the BMP, and UTF-8 is capable of encoding them, while Unicode/UCS-2/UTF-16x
is not.  While nothing is currently defined outside the BMP, I think
ISO/IEC 10646-2 is supposed to change that, so it would be helpful for us to
change our sentence about the conditions under which we expect the
application of Normalization Form C to occur.

As far as I know, the intent of the I18N folks was to have NFC applied to
transcodings of data in formats other than UCS-4, UTF-8, and Unicode (which,
as I said above, appears to be the same as saying
"Unicode/UCS-2/UTF-16/UTF-16BE/UTF-16LE").

In conclusion, it would be helpful to know whether anyone thinks UTF-7
(http://www.ietf.org/rfc/rfc2152.txt) should be included since it does claim
to be a format for encoding Unicode characters.

Thanks,
John Boyer
  -----Original Message-----
  From: w3c-ietf-xmldsig-request@w3.org
[mailto:w3c-ietf-xmldsig-request@w3.org]On Behalf Of Jeff Cochran
  Sent: Wednesday, November 29, 2000 10:36 AM
  To: 'John Boyer'; Tom Gindin
  Cc: Martin J. Duerst; w3c-ietf-xmldsig@w3.org
  Subject: RE: Character Encoding Question


  John, Tom, others,

  I don't know if this helps or hurts, but my original comment stemmed from
my understanding that UCS-2 != Unicode. From RFC-2279 (which is actually the
UTF-8 RFC, but has this nice description), F. Yergeau:

  "It is noteworthy that the same set of characters [UCS-2, current BMP] is
defined by the Unicode standard [UNICODE], which further defines additional
character properties and other application details of great interest to
implementors, but does not have the UCS-4 encoding.  Up to the present time,
changes in Unicode and amendments to ISO/IEC 10646 have tracked each other,
so that the character repertoires and code point assignments have remained
in sync."

  My understanding is that for character point assignment, UCS-2 == Unicode
evaluates TRUE and will continue to evaluate TRUE. UCS-4 == Unicode will
always evaluate FALSE. "UCS" is probably the term desired in the Canonical
XML Version 1.0 specification. The fact that most implementors will use
Unicode to generate UCS character point data appears to be an implementation
detail. By using UCS as the specifier for character data, when characters
are added to the BMP and when those characters are outside of UCS-2/Unicode
and in UCS-4, the specification will continue to have meaning.

  Thanks,

  Jeff

  -----Original Message-----
  From: John Boyer [mailto:jboyer@PureEdge.com]
  Sent: Tuesday, November 28, 2000 5:20 PM
  To: Tom Gindin
  Cc: Martin J. Duerst; w3c-ietf-xmldsig@w3.org
  Subject: RE: Character Encoding Question



  Hi Tom,

  UTF-8 and UTF-16 are both encodings of UCS, whether UCS-2 or UCS-4.  If I
  understand correctly, UCS-n is a character domain used during processing,
  and UTF-n is used for input and output.

  I am certain that UTF-8 character sequences can encode UCS-4, and I know
  that UCS-2 is a two byte per char character domain for use in processing
  scenarios where only the BMP is required.  This would be all scenarios right
  now (according to the Unicode 3.0 manual) because nothing is yet defined
  outside of the BMP, although ISO 10646-2 is likely to change that (again,
  according to the Unicode 3.0 manual).
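
  A minimal Python sketch of the size difference described above.  The two
  code points are arbitrary illustrations, and the helper `fits_in_ucs2` is a
  hypothetical name, not part of any standard or library.

```python
# Assumption: these two code points are arbitrary illustrations.
bmp_char = "\u00e9"          # a code point inside the BMP
beyond_bmp = chr(0x10000)    # the first code point outside the BMP


def fits_in_ucs2(ch: str) -> bool:
    """True if a single 16-bit UCS-2 unit can hold this code point."""
    return ord(ch) <= 0xFFFF


for ch in (bmp_char, beyond_bmp):
    utf8 = ch.encode("utf-8")
    ucs4 = ch.encode("utf-32-be")  # UCS-4 as four big-endian bytes
    print(f"U+{ord(ch):04X} utf8={utf8.hex()} ucs4={ucs4.hex()} "
          f"ucs2_ok={fits_in_ucs2(ch)}")
```

  UTF-8 needs two bytes for the first code point and four for the second,
  while only the first fits in one two-byte UCS-2 unit.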

  One thing I don't know for sure is whether Unicode == UCS-2?

  If so, then our current sentence is certainly wrong because I'm sure we don't
  mean that NFC should be applied to UTF-8 and UTF-16 encodings of UCS-n.

  If Unicode != UCS-2, then A) what's the difference, and B) it would be
  helpful if someone would confirm whether UCS-n is ever used for
  transportation of character data, or whether this is done solely by UTF-n
  formats.  If so, will it continue in this fashion in the future?

  Finally, I don't know whether anyone reads or writes UCS-n data directly,
  but I do know that our intent was that UTF-n data would not have NFC applied
  to it.

  Thanks,
  John Boyer
  Team Leader, Software Development
  Distributed Processing and XML
  PureEdge Solutions Inc.
  Creating Binding E-Commerce
  v: 250-479-8334, ext. 143  f: 250-479-3772
  1-888-517-2675   http://www.PureEdge.com <http://www.pureedge.com/>




  -----Original Message-----
  From: w3c-ietf-xmldsig-request@w3.org
  [mailto:w3c-ietf-xmldsig-request@w3.org]On Behalf Of Tom Gindin
  Sent: Tuesday, November 28, 2000 4:53 PM
  To: John Boyer
  Cc: Martin J. Duerst; w3c-ietf-xmldsig@w3.org
  Subject: Re: Character Encoding Question




       Is what is meant "... from an encoding which is neither a UCS-n
  encoding nor a UTF-n encoding"?  That would seem to cover UCS-2, UCS-4,
  UTF-8, and UTF-16 (along with UTF-7 for good measure).  If UTF-8 is not
  included, although the NFC transformation would seem to have no effect on
  it, just replace "UTF-n" by "UTF-16" in the sentence above.

            Tom Gindin

  "John Boyer" <jboyer@PureEdge.com>@w3.org on 11/28/2000 05:39:27 PM

  Sent by:  w3c-ietf-xmldsig-request@w3.org



  To:   "Martin J. Duerst" <duerst@w3.org>, <w3c-ietf-xmldsig@w3.org>
  cc:
  Subject:  Character Encoding Question




  Hi Martin and group,

  I received a letter today from Jeff Cochran (JCochran@docutouch.com)
  regarding a tweak that would appear to be needed regarding c14n and xml
  signature.

  The I18N group asked us to include a sentence along the lines of "REQUIRED
  to use Normalization Form C [NFC] when converting an XML document to the UCS
  character domain from a non-Unicode encoding".
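
  As a minimal Python sketch of one reading of that sentence: decode the
  input and apply NFC only when the source encoding is not a UCS/UTF form.
  The names `to_ucs` and `UNICODE_ENCODINGS` are hypothetical, and which
  encodings count as "Unicode" is an assumption, not something the sentence
  settles.

```python
import codecs
import unicodedata

# Assumption: encodings treated as "already Unicode", so NFC is not applied.
# These are the canonical codec names Python reports via codecs.lookup().
UNICODE_ENCODINGS = {
    "utf-8", "utf-16", "utf-16-be", "utf-16-le",
    "utf-32", "utf-32-be", "utf-32-le",
}


def to_ucs(data: bytes, encoding: str) -> str:
    """Decode bytes to code points; apply NFC only for legacy encodings."""
    text = data.decode(encoding)
    if codecs.lookup(encoding).name in UNICODE_ENCODINGS:
        return text  # UCS/UTF input is passed through unchanged
    return unicodedata.normalize("NFC", text)


# Decomposed "e" + COMBINING ACUTE ACCENT arriving as UTF-8 is left alone.
assert to_ucs("e\u0301".encode("utf-8"), "UTF-8") == "e\u0301"
# ISO-8859-1 is not in the set, so its decoded text goes through NFC.
assert to_ucs(b"caf\xe9", "iso-8859-1") == "caf\u00e9"
```

  Under this reading, widening or narrowing the set of exempt encodings is a
  one-line change to the set.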

  Apparently this is not exactly what is meant since UCS-4 character planes
  outside of the BMP are technically non-Unicode.  The point Jeff makes is
  that he doesn't know whether to apply NFC to UCS data that appears outside
  of the BMP.

  Question:  Should the statement be rewritten?  If so, how?

  Thanks,
  John Boyer
  Team Leader, Software Development
  Distributed Processing and XML
  PureEdge Solutions Inc.
  Creating Binding E-Commerce
  v: 250-479-8334, ext. 143  f: 250-479-3772
  1-888-517-2675   http://www.PureEdge.com <http://www.pureedge.com/>




  -----Original Message-----
  From: w3c-ietf-xmldsig-request@w3.org
  [mailto:w3c-ietf-xmldsig-request@w3.org]On Behalf Of Martin J. Duerst
  Sent: Friday, November 24, 2000 6:17 PM
  To: w3c-ietf-xmldsig@w3.org
  Cc: lilley@w3.org
  Subject: Fwd: I18N problem in XML canonicalisation



  Chris Lilley just pointed out the following problem
  in C14N. I think this at least has to be explained
  much more clearly in the notes.

  >http://www.w3.org/TR/xml-c14n#Example-UTF8
  >
  >Demonstrates using *two* NCRs for a single UTF-8 character (because it uses
  >two bytes in UTF8)!!!

  It's not really NCRs. It's a special notation to stand in for byte values.
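
  To make the distinction concrete, here is a small Python sketch.  The
  choice of U+00A9 (COPYRIGHT SIGN) and the "#x" byte notation are
  assumptions for illustration only: a numeric character reference names one
  code point, while the UTF-8 form of that code point is two bytes.

```python
ch = "\u00a9"  # assumed example character: COPYRIGHT SIGN

# A numeric character reference names the code point, so one NCR suffices.
ncr = f"&#{ord(ch)};"

# The UTF-8 encoding of the same character is two bytes.
utf8_bytes = ch.encode("utf-8")

# A per-byte notation (not an NCR) that stands in for the raw octets.
byte_notation = "".join(f"#x{b:02X}" for b in utf8_bytes)

print(ncr)            # &#169;
print(utf8_bytes)     # b'\xc2\xa9'
print(byte_notation)  # #xC2#xA9
```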



  >I suspect you may have a problem with that..... given that even surrogates
  >use a single NCR not two. Also, it's not clear the result is even
  >wellformed!

  There needs to be a much better note to make very clear that (different
  to the other examples), this example is not really intended to be XML
  and cannot be used directly in a test. It would also be advisable
  to provide an actual file that contains the real bytes, or to point
  to it if that's already around.

  Regards,    Martin.

Received on Wednesday, 29 November 2000 15:19:20 UTC